FP reciprocal sqrt extension proposal


Jacob Lifshay

Jul 11, 2019, 6:45:38 AM
to RISC-V ISA Dev, Libre-RISCV General Development
I propose a Zfrsqrt extension that consists of the frsqrt.s, frsqrt.d,
frsqrt.q, and frsqrt.h instructions, where the 32-bit, 64-bit,
128-bit, and 16-bit versions require the corresponding F, D, Q, etc.
extensions.
If only the F and Zfrsqrt extensions are supported, then only the
frsqrt.s instruction is supported.
If only the F, D, and Zfrsqrt extensions are supported, then only the
frsqrt.s and the frsqrt.d instructions are supported. Likewise for
frsqrt.q and frsqrt.h requiring the corresponding extensions enabled.

The operation implemented by frsqrt.* is a correctly-rounded
implementation of the IEEE 754-2008 rSqrt operation, with all the
usual FP rounding modes supported.
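[Editor's note: a non-normative sketch of the proposed semantics. It handles round-to-nearest-even only, the function name `rsqrt_ref` is illustrative, and it relies on 60 decimal digits of working precision being enough to avoid double rounding when converting back to binary64 (overwhelmingly safe in practice, not formally proven).]

```python
import math
from decimal import Decimal, getcontext

def rsqrt_ref(x: float) -> float:
    """Correctly rounded 1/sqrt(x) for binary64, round-to-nearest-even only.

    Special cases per IEEE 754-2008 rSqrt: rSqrt(+0) = +Inf,
    rSqrt(-0) = -Inf, rSqrt(+Inf) = +0, x < 0 -> NaN (invalid).
    """
    if math.isnan(x):
        return float("nan")
    if x == 0.0:
        return math.copysign(math.inf, x)   # signed zero -> signed infinity
    if x < 0.0:
        return float("nan")                 # invalid operation
    if math.isinf(x):
        return 0.0
    getcontext().prec = 60                  # far beyond binary64's precision
    exact = 1 / Decimal(x).sqrt()           # Decimal(x) is exact for floats
    return float(exact)                     # one final correct rounding
```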

For the encoding, I think using an encoding similar to both the
fsqrt.* and fdiv.* encodings is a good idea, since frsqrt is similar
to both fdiv and fsqrt. Therefore, as an initial proposal, I suggest
using a funct7 value of 0111100 with the rest of the instruction
identical to fsqrt, since, as far as I'm aware, that doesn't conflict
with anything currently.

We (libre-riscv.org) are currently planning on implementing frsqrt in
our libre GPU, since frsqrt is a common operation in 3D graphics (used
for vector normalization, among other things).

Comments, modifications, etc. welcome.

Jacob Lifshay

Bill Huffman

Jul 11, 2019, 1:25:09 PM
to Jacob Lifshay, RISC-V ISA Dev, Libre-RISCV General Development
Avoiding doing both a square root and a divide to get this is a
worthwhile goal. I do wonder if it's better to have a slightly looser
accuracy requirement. The instruction can be considerably faster if
it's only required to be within 1 ulp than if it's required to be correctly
rounded. On the other hand, that means different implementations get
different answers, which is not so good.

Bill

Guy Lemieux

Jul 11, 2019, 4:42:30 PM
to Bill Huffman, Jacob Lifshay, Libre-RISCV General Development, RISC-V ISA Dev
1/sqrt(a) is a single-operand instruction.

might there be more performance value in making it dual-operand to make better use of available read ports, eg:

a/sqrt(b)
  or
1/sqrt(a+b)

both are common forms of usage. i suppose these could be formed by chaining, but if that’s the case there’s little need for rsqrt if you have both div and sqrt.

guy


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/bb41761b-4fc4-f636-cf77-c0dd216d41b2%40cadence.com.

Dan Petrisko

Jul 11, 2019, 4:58:53 PM
to Guy Lemieux, Bill Huffman, Jacob Lifshay, Libre-RISCV General Development, RISC-V ISA Dev
might there be more performance value in making it dual-operand to make better use of available read ports, eg:
a/sqrt(b)
  or
1/sqrt(a+b) 

The FSQRT instruction in the base F extension is only single-operand.  However, from a quick skim I don't see any commentary as to why that was chosen as opposed to something analogous to your suggestion, Guy.  Presumably, it's because the extra arithmetic complexity outweighs the wasted read port (and it's not much of a waste, since that read can be gated).

Perhaps someone who was involved in the original decision could weigh in?

Best,
Dan Petrisko

Tommy Murphy

Jul 11, 2019, 5:05:10 PM
to Libre-RISCV General Development, RISC-V ISA Dev
Apologies if this is slightly off topic but

(1) does support for the D extension imply support for the F extension?
(2) does support for the Q extension imply support for both D and F extensions?

Or can they all be independent - e.g. support for Q but not D or F etc.?

Andrew Waterman

Jul 11, 2019, 5:09:47 PM
to Dan Petrisko, Bill Huffman, Guy Lemieux, Jacob Lifshay, Libre-RISCV General Development, RISC-V ISA Dev
On Thu, Jul 11, 2019 at 1:58 PM Dan Petrisko <petr...@cs.washington.edu> wrote:
might there be more performance value in making it dual-operand to make better use of available read ports, eg:
a/sqrt(b)
  or
1/sqrt(a+b) 

The FSQRT instruction in the base F extension is only single-operand.  However, from a quick skim I don't see any commentary as to why that was chosen that as opposed to something analogous to your suggestion, Guy.  Presumably, it's because the extra arithmetic complexity outweighs the wasted read port. (and it's not much of a waste, since that read can be gated).

It’s not straightforward to correctly implement sqrt(x) using something like sqrt(x+y), because the addition messes up the sign of zero for sqrt(-0). You’d need to use 2-3 instructions to get the IEEE 754-mandated value (load zero into y; copy the sign of x onto y; then perform sqrt(x+y)).

So, in addition to being more complex to implement, it’s also a less useful instruction than plain-old sqrt.
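[Editor's note: Andrew's sign-of-zero point is easy to reproduce on any IEEE 754 host; a small Python sketch:]

```python
import math

x = -0.0
# Under round-to-nearest, -0.0 + (+0.0) evaluates to +0.0, so a fused
# sqrt(x + y) called with y = +0.0 silently loses the sign of zero:
assert math.copysign(1.0, x + 0.0) == 1.0        # sum is +0.0
# IEEE 754 sqrt itself preserves it: sqrt(-0) = -0
assert math.copysign(1.0, math.sqrt(-0.0)) == -1.0
# The workaround Andrew describes: copy x's sign onto y before the
# fused op, since -0.0 + -0.0 is -0.0 in every rounding mode.
y = math.copysign(0.0, x)
assert math.copysign(1.0, x + y) == -1.0
```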

Jacob Lifshay

Jul 11, 2019, 5:12:07 PM
to Guy Lemieux, Bill Huffman, Libre-RISCV General Development, RISC-V ISA Dev
On Thu, Jul 11, 2019, 13:42 Guy Lemieux <glem...@vectorblox.com> wrote:
1/sqrt(a) is a single-operand instruction.

might there be more performance value in making it dual-operand to make better use of available read ports, eg:

a/sqrt(b)
  or
1/sqrt(a+b)

both are common forms of usage. i suppose these could be formed by chaining, but if that’s the case there’s little need for rsqrt if you have both div and sqrt.
the reason for not just chaining two instructions is that sqrt followed by div gives different results due to rounding twice. div followed by sqrt also gives different results: it rounds twice, returns NaN instead of -Inf for -0 inputs, and very small inputs overflow to +Inf, whereas rsqrt won't ever overflow or underflow.
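[Editor's note: the rounding-twice effect is easy to observe by comparing chained double-precision sqrt-then-divide against a correctly rounded reference. A sketch using Python's `decimal` at excess precision; the mismatch count is illustrative only and depends on the input distribution.]

```python
import math
import random
from decimal import Decimal, getcontext

getcontext().prec = 60   # plenty of headroom beyond binary64

def rsqrt_correct(x: float) -> float:
    # One rounding: compute 1/sqrt(x) at 60 digits, then round once.
    return float(1 / Decimal(x).sqrt())

def rsqrt_chained(x: float) -> float:
    # Two correctly rounded ops in sequence -> rounded twice.
    return 1.0 / math.sqrt(x)

random.seed(42)
samples = [random.uniform(1.0, 4.0) for _ in range(10_000)]
mismatches = sum(rsqrt_correct(x) != rsqrt_chained(x) for x in samples)
# A noticeable fraction of inputs land one ulp away from the
# correctly rounded answer.
assert 0 < mismatches < len(samples)
```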

1/sqrt(a+b) additionally gives different results for special cases, assuming the addition is defined the same as fadd, since the sign of the result is different for different rounding modes when adding +0 and -0, whereas (if I recall correctly) IEEE 754-2008 defines rsqrt(+0) to give +Inf and rsqrt(-0) to give -Inf for all rounding modes.

I think having a frsqrt instruction that doesn't need an additional input will be useful since the compiler doesn't have to load the constant 1 (for a/sqrt(b)) or +0 or -0 (for 1/sqrt(a+b)).

Additionally, frsqrt is a commonly implemented operation, so already working and verified single-input frsqrt HW is readily available, whereas the modified versions that have 2 inputs aren't as likely to be available, and it would be quite complex to implement and verify a new FP unit, greatly restricting who can implement the Zfrsqrt extension.

I have no problems modifying the encoding to permit 2-input frsqrt, I just think that should be an additional extension on top of Zfrsqrt.


Jacob

Andrew Waterman

Jul 11, 2019, 5:14:37 PM
to Tommy Murphy, Libre-RISCV General Development, RISC-V ISA Dev


Jacob Lifshay

Jul 11, 2019, 5:14:54 PM
to Tommy Murphy, Libre-RISCV General Development, RISC-V ISA Dev


On Thu, Jul 11, 2019, 14:05 Tommy Murphy <tommy_...@hotmail.com> wrote:
Apologies if this is slightly off topic but

(1) does support for the D extension imply support for the F extension?
(2) does support for the Q extension imply support for both D and F extensions?
yes and yes.

Jacob

Guy Lemieux

Jul 11, 2019, 5:15:05 PM
to Dan Petrisko, Bill Huffman, Jacob Lifshay, Libre-RISCV General Development, RISC-V ISA Dev
Dan, I'm not talking about the original sqrt, i'm talking about rsqrt
(reciprocal).

Regular sqrt is used many many different ways.

Reciprocal sqrt adds an implicit division. Usually, it is implemented
as its own function (not by doing separate sqrt followed by divide),
often with an initial lookup table and Newton-Raphson iteration to get
the rest of the bits correct. However, if it is treated as separate
sqrt followed by divide, you can get a/sqrt(b) without doing the extra
1/sqrt(a) lookup table.

As for 1/sqrt(a+b), maybe I'll follow Andrew W's note that IEEE rules
are just weird (eg +0 and -0) and drop the suggestion. (Although, the
anarchist in me wants to say that handling those special cases needs
to be done in software anyways, so even more advantage to putting it
all in hardware!)

Guy


Dan Petrisko

Jul 11, 2019, 5:15:11 PM
to Jacob Lifshay, Guy Lemieux, Bill Huffman, Libre-RISCV General Development, RISC-V ISA Dev
Or can they all be independent - e.g. support for Q but not D or F etc.?

Chapter 12: D

The D extension depends on the base single-precision instruction subset F. 

Chapter 13: Q

The quad-precision binary floating-point instruction-set extension is named “Q”; it depends on the double-precision floating-point extension D. 


So no, Q requires D requires F. 


Tommy Murphy

Jul 11, 2019, 5:16:45 PM
to Libre-RISCV General Development, RISC-V ISA Dev
Thanks all for the clarification on the Q/D/F dependencies and apologies for not realising that they were already covered in the specs.

Guy Lemieux

Jul 11, 2019, 5:20:28 PM
to Jacob Lifshay, Bill Huffman, Libre-RISCV General Development, RISC-V ISA Dev
Just because "that's the way it's always been done" is not a good
reason to justify its continuance.

1/sqrt(a) has been done as single-operand because it's an easy,
independent table-lookup operation, followed by iteration to get the
desired precision. it converges nicely.

however, in real software, the function 1/sqrt(a) almost never stands
alone. it is used for normalization, so it is almost always followed
by a multiplication, ie a/sqrt(b), or preceded by an addition, ie
1/sqrt(a+b).

saying it is "subjected to rounding twice" isn't really fair. if done
as separate operations, it is subjected to rounding twice. when done
as an atomic operation, you can arrange extended precision and round
only once.

sorry, i'm not trying to overly advocate for this, just trying to
encourage that we keep an open mind.

guy

Dan Petrisko

Jul 11, 2019, 5:26:23 PM
to Guy Lemieux, Jacob Lifshay, Bill Huffman, Libre-RISCV General Development, RISC-V ISA Dev
Dan, I'm not talking about the original sqrt, i'm talking about rsqrt
(reciprocal).

Yes, I understand.  My assumption was that there would be similar motivations for having a single operand, since the instructions are similar.  

However, if it is treated as separate
sqrt followed by divide, you can get a/sqrt(b) without doing the extra
1/sqrt(a) lookup table.

This is interesting, though. I'm not solid on FPU implementations, so forgive my ignorance for the following:

1/sqrt(a) has been done as single-operand because it's an easy,
independent table-lookup operation, followed by iteration to get the
desired precision. it converges nicely.

Is this a hardware implementation you're describing or an FP software emulation library? If it's the former, then it doesn't sound like a/sqrt(b) is easily implementable.  If it ends up being a LUT * a in hardware anyway, we haven't saved much.

Best,
Dan Petrisko
 


Jacob Lifshay

Jul 11, 2019, 5:50:28 PM
to Guy Lemieux, Bill Huffman, Libre-RISCV General Development, RISC-V ISA Dev
On Thu, Jul 11, 2019, 14:20 Guy Lemieux <glem...@vectorblox.com> wrote:
Just because "that's the way it's always been done" is not a good
reason to justify its continuance.
that's true, but conversely, we should take into account the additional HW investment needed.

1/sqrt(a) has been done as single-operand because it's an easy,
independent table-lookup operation, followed by iteration to get the
desired precision. it converges nicely.
Yes.

We (libre-riscv.org) are currently planning on using a different algorithm: a binary search to solve 1 == x * x * a, which gives x = 1/sqrt(a). The remainder (1 - x * x * a) is 0 if the result is exact, allowing exact rounding by just generating 2 additional result bits and checking if the remainder is zero to give the guard, round, and sticky bits. We have modified the binary search into a radix-8 search, giving 3 result bits per stage.

This algorithm could be simply modified to support the a/sqrt(b) case if needed, by replacing the lhs of the equation being solved with the dividend.
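[Editor's note: a scalar sketch of that search in exact integer arithmetic. Illustrative only: radix-2 where the real pipeline is radix-8, the function name `rsqrt_fixed` is hypothetical, and `a` is passed as the rational a_num/a_den.]

```python
def rsqrt_fixed(a_num: int, a_den: int, nbits: int):
    """Binary search on the result bits of x = 1/sqrt(a), solving
    lhs == x*x*a as described above.  Returns (x, rem) with
    x = floor(2**nbits / sqrt(a)); rem == 0 exactly when the result
    is exact, which is what supplies the sticky bit.
    """
    lhs = (1 << (2 * nbits)) * a_den        # the "1", scaled to integers
    x = 0
    for i in reversed(range(nbits + 1)):    # decide result bits MSB-first
        trial = x | (1 << i)
        if trial * trial * a_num <= lhs:    # keep the bit while x*x*a <= 1
            x = trial
    return x, lhs - x * x * a_num           # exact remainder

# 1/sqrt(4) = 0.5 exactly: remainder is 0
assert rsqrt_fixed(4, 1, 8) == (128, 0)
# 1/sqrt(2) is inexact: a nonzero remainder would set the sticky bit
x, rem = rsqrt_fixed(2, 1, 8)
assert x == 181 and rem != 0                # floor(256/sqrt(2)) == 181
```

Replacing `lhs` with a scaled dividend gives the a/sqrt(b) variant with essentially no extra logic, which is Jacob's point.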

however, in real software, the function 1/sqrt(a) almost never stands
alone. it is used for normalization, so it is almost always followed
by a multiplication, ie a/sqrt(b), or preceeded by an addition, ie
1/sqrt(a+b).
one point to note: the a/sqrt(b) form may not help when more than 1 component of the normalized vector is needed. for 3D vectors, all 3 components are usually needed, meaning that a/sqrt(b) has to be used with a == 1, and 3 separate multiplications by the rsqrt result need to be done to get the 3 components.

saying it is "subjected to rounding twice" isn't really fair. if done
as separate operations, it is subjected to rounding twice.  when done
as an atomic operation, you can arrange extended precision and round
only once.
That was my reasoning for why a separate frsqrt instruction is needed rather than just fusing two separate instructions (fsqrt followed by fdiv) into a single operation: instruction fusion is required to have the same effects as executing the instructions one at a time. This doesn't affect defining frsqrt to compute a/sqrt(b), since frsqrt is still a single instruction.

Jacob

Bruce Hoult

Jul 11, 2019, 6:35:51 PM
to Dan Petrisko, Guy Lemieux, Bill Huffman, Jacob Lifshay, Libre-RISCV General Development, RISC-V ISA Dev
When you want 1/sqrt(x) you usually want to multiply three or four
different things by the result, e.g. to normalise the length of a
vector.


Guy Lemieux

Jul 11, 2019, 8:02:32 PM
to Bruce Hoult, Dan Petrisko, Bill Huffman, Jacob Lifshay, Libre-RISCV General Development, RISC-V ISA Dev
ah yes, that's right.... sorry my brain is only half operational today.

g

lkcl

Jul 12, 2019, 2:45:37 AM
to RISC-V ISA Dev, bruce...@sifive.com, petr...@cs.washington.edu, huf...@cadence.com, program...@gmail.com, libre-r...@lists.libre-riscv.org
http://www.acsel-lab.com/arithmetic/arith15/papers/ARITH15_Takagi.pdf

http://bugs.libre-riscv.org/show_bug.cgi?id=44

Some context above; apologies, I'm using a phone to type, which makes it quite awkward to keep thread replies.

Vulkan's accuracy requirements are extreme. Error is only allowed in the last 2 bits of the mantissa.

3D GPU requirements are also extreme: one DIV and one ISQRT per pixel, no compromises allowed. This is for normalisation, typically 1/sqrt(x^2 + y^2 + z^2).

Also there are power requirements to meet.

This eliminates Newton-Raphson and other iterative methods, as there is no guaranteed completion time; plus, if providing enough engines to do so in a reasonable timeframe (higher radix), the number of multipliers, and in particular their increased size, will kill all chance of meeting the power budget.

We therefore had to research pipelined designs ONLY, and Jacob found the above paper. It uses on-the-fly conversion (OTFC) as well as a redundant carry-save format between pipeline stages; this saves hugely on gate count.

The fascinating bit is that the OTFC outputs BOTH sqrt AND isqrt from the SAME hardware. This is because it needs the partial results from each to make decisions on what to do within each stage.

Unfortunately the paper is extremely obtuse, like many academic papers, and there is no verilog source. Sigh.

So in the meantime we go with a simpler design; at least we have something. Jacob has worked out that there are adjustable magic constants so that DIV, SQRT and ISQRT can be covered by at least the same algorithm, if not the actual same hardware, with very little extra gate count.

Summary:

1. For 3D we absolutely need isqrt, this is going to go ahead.

2. Lookup tables and Newton Raphson are off the table for us.

3. There exist algorithms that give ISQRT "for free".

4. Love the idea, Guy, of the add; however, we may need more than 2 operands: 3 adds would be more useful. Perhaps a separate opcode?

5. We have no problem with a spec requiring less accuracy; however, it is something that other implementors may come to regret, particularly when it comes to testing. We use softfloat python bindings on DIV, SQRT, MUL and ADD, perform direct comparisons, and it works extremely well.

L.

lkcl

Jul 12, 2019, 3:16:29 AM
to RISC-V ISA Dev, huf...@cadence.com, program...@gmail.com, libre-r...@lists.libre-riscv.org
On Friday, July 12, 2019 at 4:42:30 AM UTC+8, glemieux wrote:

> might there be more performance value in making it dual-operand to make better use of available read ports, eg:
>
>
> a/sqrt(b)
>   or
> 1/sqrt(a+b)

The hybrid combination of divide and isqrt (or multiply and isqrt): I have not seen any hardware out there that does this. I would be concerned about the increase in gate count; it is 2 complex special-purpose blocks, back to back.

Also I would be concerned about the rounding, just working it out (let alone implementing it).

In our SoC we are very deliberately avoiding innovation like this, in such specialist areas, relying instead heavily on other people's excellent pathfinding, and on existing standards, where practical.

Although the same tricks as FMAC cannot be applied, add, on the other hand, is extremely simple, relatively speaking; and since we actually need 3 operands to be added for normalisation (per pixel), even a 2-op isqrt(a+b) would save one instruction in a critical loop.

L.

lkcl

Jul 12, 2019, 3:30:04 AM
to RISC-V ISA Dev, petr...@cs.washington.edu, huf...@cadence.com, glem...@vectorblox.com, program...@gmail.com, libre-r...@lists.libre-riscv.org
On Friday, July 12, 2019 at 5:09:47 AM UTC+8, andrew wrote:

> It’s not straightforward to correctly implement sqrt(x) using something like sqrt(x+y), because the addition messes up the sign of zero for sqrt(-0). You’d need to use 2-3 instructions to get the IEEE 754-mandated value (load zero into y; copy the sign of x onto y; then perform sqrt(x+y)).

Assuming this is isqrt rather than sqrt we are discussing.

So what you are saying Andrew is that FP exceptions on the add part would make a hybrid operation much more complex. Two possible exceptions could occur, and I assume the same +/- zero issues arise?

FMUL on the other hand, exceptions etc. have been thought through and the add has not been problematic.

With the possibility of a FP HW Exception extension to be created, it would be even more important to get this right.

> So, in addition to being more complex to implement, it’s also a less useful instruction than plain-old sqrt.

Isqrt. Definitely needed for 3D.

Thank you for this insight Andrew it cuts off a lot of development time potentially expended unnecessarily.

L.

Jacob Lifshay

Jul 12, 2019, 4:28:09 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Bill Huffman, libre-r...@lists.libre-riscv.org
On Fri, Jul 12, 2019, 00:16 lkcl <luke.l...@gmail.com> wrote:
On Friday, July 12, 2019 at 4:42:30 AM UTC+8, glemieux wrote:

> might there be more performance value in making it dual-operand to make better use of available read ports, eg:
>
>
> a/sqrt(b)
>   or
> 1/sqrt(a+b)

The hybrid combibation of divide and isqrt (or, multiply and isqrt), I have not seen any hardware out there that does this. I would be concerned about the increase in gate count, it is 2 complex special purpose blocks, back to back.

it barely increases complexity over what we already have:
for the DivPipeCore* classes I've been writing for libre-riscv's gpu (they handle the mantissa for fdiv, fsqrt, and frsqrt, as well as unsigned integer div/rem), supporting a/sqrt(b) is as simple as assigning divisor to compare_lhs instead of 1.0:

Also I would be concerned about the rounding, just working it out (let alone implementing it).
rounding uses the exact same algorithm: generate 2 more bits of quotient/root for guard and round, then compare the remainder to zero to generate the sticky bit.
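[Editor's note: for round-to-nearest-even, that recipe (2 extra quotient/root bits plus a remainder-derived sticky bit) reduces to the usual guard/round/sticky decision. A sketch; the function name is illustrative.]

```python
def round_rne(q: int, guard: int, rnd: int, sticky: int) -> int:
    """Round-to-nearest-even: q is the truncated significand, guard and
    rnd are the 2 extra computed result bits, and sticky is 1 iff the
    final remainder was nonzero."""
    s = rnd | sticky                  # below the guard bit all that
                                      # matters is "anything nonzero?"
    if guard and (s or (q & 1)):      # > half an ulp, or a tie with odd lsb
        return q + 1
    return q

assert round_rne(0b100, 1, 0, 0) == 0b100   # exact tie, even lsb: keep
assert round_rne(0b101, 1, 0, 0) == 0b110   # exact tie, odd lsb: round up
assert round_rne(0b100, 1, 1, 0) == 0b101   # more than half an ulp: up
assert round_rne(0b100, 0, 1, 1) == 0b100   # less than half an ulp: keep
```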

Jacob

Andrew Waterman

Jul 12, 2019, 4:47:56 AM
to lkcl, RISC-V ISA Dev, glem...@vectorblox.com, huf...@cadence.com, libre-r...@lists.libre-riscv.org, petr...@cs.washington.edu, program...@gmail.com
On Fri, Jul 12, 2019 at 12:30 AM lkcl <luke.l...@gmail.com> wrote:
On Friday, July 12, 2019 at 5:09:47 AM UTC+8, andrew wrote:

> It’s not straightforward to correctly implement sqrt(x) using something like sqrt(x+y), because the addition messes up the sign of zero for sqrt(-0). You’d need to use 2-3 instructions to get the IEEE 754-mandated value (load zero into y; copy the sign of x onto y; then perform sqrt(x+y)).

Assuming this is isqrt rather than sqrt we are discussing.

So what you are saying Andrew is that FP exceptions on the add part would make a hybrid operation much more complex. Two possible exceptions could occur, and I assume the same +/- zero issues arise?

Yeah, and though I was talking about sqrt rather than isqrt, similar issues will apply.



FMUL on the other hand, exceptions etc. have been thought through and the add has not been problematic.

With the possibility of a FP HW Exception extension to be created, it would be even more important to get this right.

> So, in addition to being more complex to implement, it’s also a less useful instruction than plain-old sqrt.

Isqrt. Definitely needed for 3D.

Thank you for this insight Andrew it cuts off a lot of development time potentially expended unnecessarily.

L.


lkcl

Jul 12, 2019, 4:52:42 AM
to RISC-V ISA Dev, luke.l...@gmail.com, huf...@cadence.com, libre-r...@lists.libre-riscv.org
On Friday, July 12, 2019 at 4:28:09 PM UTC+8, Jacob Lifshay wrote:
> On Fri, Jul 12, 2019, 00:16 lkcl <luke.l...@gmail.com> wrote:

>
> > a/sqrt(b)

> The hybrid combination of divide and isqrt (or, multiply and isqrt), I have not seen any hardware out there that does this. I would be concerned about the increase in gate count, it is 2 complex special purpose blocks, back to back.
>
>
> it barely increases complexity over what we already have:
>
>
>
> for the DivPipeCore* classes I've been writing for libre-riscv's gpu (they handle the mantissa for fdiv, fsqrt, and frsqrt, as well as unsigned integer div/rem), supporting a/sqrt(b) is as simple as assigning divisor to compare_lhs instead of 1.0:
> https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/div_rem_sqrt_rsqrt/core.py;h=e6a0b9b9d848d93cbef2b35b371befb2d946d7a2;hb=HEAD#l253

Indeed: that is for that particular implementation. An alternative implementation, such as the one in the Takagi paper, specifically and only works if the divisor has been arranged to be within the range 0.25 to 0.5 (something like that).

With the Takagi implementation being so efficient, due to the use of redundant CSA, it would be... unwise shall we say, to make even potential adoption of that algorithm less attractive, not just for our own ASIC but for other implementors as well.

(the Takagi implementation cannot use the same trick, so a separate multiplier would be needed).

>
>
>
>
> Also I would be concerned about the rounding, just working it out (let alone implementing it).
>
> rounding uses the exact same algorithm, generate 2 more bits of quotient/root for guard and round, then compare remainder to zero to generate sticky bit.

What I mean is, the implications of rounding due to interactions between the a and the b operands in a/sqrt(b).

This is beyond my ability to assess with any degree of confidence, and I would only be happy if it was assessed and found to be correct (no complications) by consensus of several peers with significant long term expertise in IEEE754 FP.

Andrew raised one concern already (+/- zero); I raised another (exception interaction); there may be others. My concern is that it adds months if not years to our schedule, whereas FP-RSQRT on its own is very easy to justify, as the use case is extremely clear.

Standards are quite challenging, so many factors to take into account!

L.

Bill Huffman

Jul 12, 2019, 11:07:14 AM
to Jacob Lifshay, Luke Kenneth Casson Leighton, RISC-V ISA Dev, libre-r...@lists.libre-riscv.org

The rounding isn't difficult in an N-bit-at-a-time algorithm that doesn't have a redundant result representation.  For a Newton-Raphson implementation or a redundant result implementation, rounding is more difficult.

      Bill


lkcl

Jul 13, 2019, 2:28:56 AM
to RISC-V ISA Dev, program...@gmail.com, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Friday, July 12, 2019 at 11:07:14 PM UTC+8, Bill Huffman wrote:
> The rounding isn't difficult in an N-bit at a time algorithm that doesn't have a redundant result representation.  For a Newton-Raphson implementation or a redundant result implementation, rounding is more difficult.

Thank you Bill (also Andrew) for the insights.

L.

Aneesh Raveendran

Jul 13, 2019, 1:30:46 PM
to lkcl, RISC-V ISA Dev, program...@gmail.com, libre-r...@lists.libre-riscv.org
Hi all,
    I am Aneesh Raveendran. I have worked on a RISC-V floating-point co-processor, and I have a few questions regarding floating-point reciprocal square-root.

1. Which applications/benchmarking suites would make use of floating-point reciprocal square-root operations?
2. If this instruction is proposed, what would the possible instruction formats be (opcode, funct7, funct5 field values)?
3. Are any test suites available to verify the functional correctness of the module?

Thanks and regards
Aneesh Raveendran




Jacob Lifshay

Jul 14, 2019, 2:39:03 AM
to Aneesh Raveendran, lkcl, RISC-V ISA Dev, Libre-RISCV General Development
On Sat, Jul 13, 2019 at 10:30 AM Aneesh Raveendran <anees...@gmail.com> wrote:
>
> Hi all,
>     Myself Aneesh Raveendran. I worked on RISC-V floating point co-processor. I have few doubts regarding floating point reciprocal square-root.
>
> 1. In which application/bench marking suites will infer floating point reciprocal square-root operations?
reciprocal sqrt is used a lot in 3D graphics for normalizing vectors -- the pseudocode for normalizing a 3D vector is:
fn normalize(x: float, y: float, z: float) -> (float, float, float) {
    let sum_of_squares = x * x + y * y + z * z;
    let factor = rsqrt(sum_of_squares);
    return (factor * x, factor * y, factor * z);
}
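[Editor's note: a runnable rendering of the pseudocode above, in Python. Since the host has no rsqrt instruction, `1.0 / math.sqrt(v)` stands in for the proposed frsqrt here.]

```python
import math

def normalize(x: float, y: float, z: float):
    # Stand-in for the proposed frsqrt: without the instruction, a host
    # can only approximate rsqrt as a divide after a sqrt.
    rsqrt = lambda v: 1.0 / math.sqrt(v)
    factor = rsqrt(x * x + y * y + z * z)
    return (factor * x, factor * y, factor * z)

assert normalize(2.0, 0.0, 0.0) == (1.0, 0.0, 0.0)
nx, ny, nz = normalize(3.0, 4.0, 0.0)
assert math.isclose(nx * nx + ny * ny + nz * nz, 1.0)
```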

It can also be used in machine learning to normalize one-hot output vectors, though it would not be particularly performance-critical for that particular use case.

> 2. If this instruction is proposing, what could be the possible instruction formats? (opcodes, f7, f5 field values )
The proposed instructions are:
+----------+---------+-------+-----+--------+----+---------+
| Mnemonic | funct7  | rs2   | rs1 | funct3 | rd | opcode  |
+==========+=========+=======+=====+========+====+=========+
| fsqrt.s  | 0111100 | 00000 | rs1 | rm     | rd | 1010011 |
+----------+---------+-------+-----+--------+----+---------+
| fsqrt.d  | 0111101 | 00000 | rs1 | rm     | rd | 1010011 |
+----------+---------+-------+-----+--------+----+---------+
| fsqrt.q  | 0111110 | 00000 | rs1 | rm     | rd | 1010011 |
+----------+---------+-------+-----+--------+----+---------+
| fsqrt.h  | 0111111 | 00000 | rs1 | rm     | rd | 1010011 |
+----------+---------+-------+-----+--------+----+---------+

> 3. Any testsuites are available to verify the functional correctness of the module?
mpfr implements reciprocal sqrt (mpfr_rec_sqrt); however, it doesn't support all of RISC-V's rounding modes and may be missing support for other features needed for testing.
Softfloat doesn't currently implement rsqrt.
I have not researched other softfloat libraries yet.

Jacob Lifshay

Jacob Lifshay

Jul 14, 2019, 2:41:15 AM
to RISC-V ISA Dev, anees...@gmail.com, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
All of the mnemonics in the table should have started with frsqrt instead of fsqrt.

lkcl

Jul 14, 2019, 3:26:01 AM
to RISC-V ISA Dev, anees...@gmail.com, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Sunday, July 14, 2019 at 7:39:03 AM UTC+1, Jacob Lifshay wrote:
On Sat, Jul 13, 2019 at 10:30 AM Aneesh Raveendran <anees...@gmail.com> wrote:
>
> Hi all,
>     Myself Aneesh Raveendran. I worked on RISC-V floating point co-processor. I have few doubts regarding floating point reciprocal square-root.
>
> 1. In which application/bench marking suites will infer floating point reciprocal square-root operations?
reciprocal sqrt is used a lot in 3D graphics for normalizing vectors -- the pseudocode for normalizing 3D a vector is:
fn normalize(x: float, y: float, z: float) -> (float, float, float) {
    let sum_of_squares = x * x + y * y + z * z;
    let factor = rsqrt(sum_of_squares);
    return (factor * x, factor * y, factor * z);
}

It can also be used in machine learning to normalize 1-hot output vectors, though would not be particularly performance critical for that particular usecase.

where for 3D it definitely is (even FDIV has to be pipelined)
 
> 2. If this instruction is proposing, what could be the possible instruction formats? (opcodes, f7, f5 field values )
The proposed instructions are:
+----------+---------+-------+-----+--------+----+---------+
| Mnemonic | funct7  | rs2   | rs1 | funct3 | rd | opcode  |
+==========+=========+=======+=====+========+====+=========+
| frsqrt.s | 0111100 | 00000 | rs1 | rm     | rd | 1010011 |
+----------+---------+-------+-----+--------+----+---------+
...
...

i.e. *exactly* the same format as FSQRT... just with a new funct7.


> 3. Any testsuites are available to verify the functional correctness of the module?
mpfr implements reciprocal sqrt, however it doesn't support all of RISC-V's rounding modes and may be missing support for other features needed for testing.
Softfloat doesn't currently implement rsqrt.
I have not researched other softfloat libraries yet.

the key one that we're using is softfloat-3 (custom-compiled to enable RISC-V mode), via manually-compiled python bindings (sfpy) because if you install the debian package, of course it uses the *INTEL*-compiled softfloat-3 library, which is precisely what you absolutely do not want.  instructions to do this are here:


for the nmigen IEEE754 FPU we started from this code:

the majority of that code is the unit testing, except for multiplier.v itself which is an FSM (extremely compact, fits really well into a very small FPGA as long as it has a decent on-board DSP; performance is absolutely dreadful due to one-bit-at-a-time shifting in the normalisation phase.  replacing that with single-cycle shifting was interesting).

you can see in the c_test directory that jon checked in a binary executable (do NOT run it, it is clearly unsafe to do so), and next to it is the source test.cpp.  clearly this code uses the STANDARD C FP LIBRARY on whatever platform it is compiled on.  this is just as clearly NOT WHAT YOU WANT, because if compiled on an intel x86 system, the unit tests will only validate RTL against intel x86 FP behaviour.

this is precisely why we use sfpy [compiled specifically for RISC-V].

jon's unit test code has "morphed" and become extremely generic:

examples of how it is used are here - see test_fpmul_pipe_16.py:

they're dead-simple, at this level.  note that sfpy.Float16 and operator.mul are passed in to the test function: those are the two key parameters (that and "width") needed to verify the RTL.  we're testing FPMUL, therefore we pass in operator.mul.  we're testing FP16, therefore we pass in sfpy.Float16.  duh :)

the key function run_pipe_fp yields the unit test cases to cover a full random range of specialist combinations (corner cases) that are highly likely to fail, and, only after covering those, full arbitrary random numbers are generated.

i can strongly recommend developing *generic* RTL that is *FULLY PARAMETERISEABLE*, and testing FP16 *FIRST*.  the reason is really simple: FP16, by way of having much smaller bitwidths for both the exponent and mantissa, results in far better coverage of corner cases, which, in FP32 and FP64, have simply too low a probability of occurring through pseudo-random monte-carlo testing.

however if the RTL is fully parameterisable, guess what? when it comes to FP32 and FP64, you already tested the corner-cases of the exact same code that generated FP16, so the probability of correctness may be deemed much higher.
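the FP16-first argument can be made concrete: binary16 has only 65536 bit patterns, so every value (every NaN, infinity, zero, subnormal and normal) can be enumerated exhaustively, which is hopeless for FP32/FP64.  a quick sketch classifying all of them:

```python
# enumerate every binary16 bit pattern and classify it by exponent/mantissa
counts = {'nan': 0, 'inf': 0, 'zero': 0, 'subnormal': 0, 'normal': 0}
for bits in range(1 << 16):
    exp = (bits >> 10) & 0x1F       # 5-bit exponent field
    mant = bits & 0x3FF             # 10-bit mantissa field
    if exp == 0x1F:
        counts['nan' if mant else 'inf'] += 1
    elif exp == 0:
        counts['subnormal' if mant else 'zero'] += 1
    else:
        counts['normal'] += 1
print(counts)
# {'nan': 2046, 'inf': 2, 'zero': 2, 'subnormal': 2046, 'normal': 61440}
```

a full exhaustive sweep of a single-operand op like FRSQRT at FP16 is therefore only 64k test vectors.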


later, we will add in formal mathematical proofs, using symbiyosys.  this as an entirely separate project.  we do not really trust random testing, not even on corner-cases.


if anyone has verilog FP RTL that they need testing, and would like to use the above unit test infrastructure, that's dead easy: investigate cocotb.  cocotb is a python wrapper around icarus verilog, and is extremely nifty.  cocotb compiles up VERILOG and inserts instrumentation into the datastream (at the dut level) which allows it to set and read VERILOG parameters from PYTHON.

the cocotb unit test at which i went "holy cow that's awesome" was one which used python's PIL (imaging library) to decode a JPEG... and then compared it directly against the output from a libre licensed verilog JPEG decoder.  all from python.


jacob points out however that because sfpy does not have an FRSQRT function, we cannot use it.  therefore we will need to write our own (python-based) FRSQRT soft-emulation routine.  once written, it gets called exactly like this:

everything else that we need, softfloat-3 has; therefore (with the exception of RISC-V tininess bindings) the sfpy python bindings also have everything we need.

this gives us the confidence that, by testing against a WELL TESTED floating-point emulation library, we have similar confidence in the correctness of the libre risc-v CPU/GPU IEEE754 FPU.

using bigfloat to perform the reciprocal-square-root in a much higher precision will cover the requirement to provide accurate FPSQRT.  however the corner-cases (at the extreme limits of the exponent, and when the mantissa's MSB is zero) are going to be a bundle of fun.
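to illustrate the "compute in much higher precision, then round" idea without assuming bigfloat is installed, here is the same shape of computation using only stdlib decimal (the function name and digit count are arbitrary); note the caveat in the comment, which is exactly the double-rounding concern:

```python
from decimal import Decimal, getcontext

def rsqrt_extended(x: float, digits: int = 50) -> float:
    """Sketch of 'compute rsqrt in much higher precision, then round':
    50 decimal digits, then one conversion back to binary64.  NOTE:
    the decimal rounding followed by the float() rounding is a double
    rounding, so this is NOT guaranteed correctly rounded -- it only
    illustrates the approach."""
    getcontext().prec = digits
    d = Decimal(x)              # binary64 -> Decimal is exact
    return float(1 / d.sqrt())

print(rsqrt_extended(4.0))      # 0.5
```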

the issue is: as we will be running an UNTESTED (unproven) soft-emulation against an UNTESTED (unproven) hardware simulation, we have zero confidence in either.  exactly how to deal with this will be the subject of intensive further investigation.

l.

Jacob Lifshay

unread,
Jul 14, 2019, 3:39:48 AM7/14/19
to lkcl, RISC-V ISA Dev, Aneesh Raveendran, Libre-RISCV General Development
On Sun, Jul 14, 2019 at 12:26 AM lkcl <luke.l...@gmail.com> wrote:
> using bigfloat to perform the reciprocal-square-root in a much higher precision will cover the requirement to provide accurate FPSQRT. however the corner-cases (at the extreme limits of the exponent, and when the mantissa's MSB is zero) are going to be a bundle of fun.
>
Note that mpfr has code to emulate fixed-size floating point numbers,
including handling denormal numbers. It also has mostly (see caveats)
the same special-case semantics (+-0, +-Inf, NaN) as IEEE 754, so that
should make it a lot easier to use.

MPFR is used in gcc to evaluate floating-point expressions at compile
time, so it is well-tested and likely to be correct.

mpfr does, however, have some notable caveats: see
https://www.mpfr.org/mpfr-current/mpfr.html#MPFR-and-the-IEEE-754-Standard

Jacob

lkcl

unread,
Jul 14, 2019, 4:27:37 AM7/14/19
to RISC-V ISA Dev, luke.l...@gmail.com, anees...@gmail.com, libre-r...@lists.libre-riscv.org


On Sunday, July 14, 2019 at 8:39:48 AM UTC+1, Jacob Lifshay wrote:
On Sun, Jul 14, 2019 at 12:26 AM lkcl <luke.l...@gmail.com> wrote:
> using bigfloat to perform the reciprocal-square-root in a much higher precision will cover the requirement to provide accurate FPSQRT.  however the corner-cases (at the extreme limits of the exponent, and when the mantissa's MSB is zero) are going to be a bundle of fun.
>
Note that mpfr has code to emulate fixed-size floating point numbers,
including handling denormal numbers. It also has mostly (see caveats)
the same special-case semantics (+-0, +-Inf, NaN) as IEEE 754, so that
should make it a lot easier to use.


that's interesting.  some IEEE754-like contexts are provided.
  

MPFR is used in gcc to evaluate floating-point expressions at compile
time, so it is well-tested and likely to be correct.

excellent
 

mpfr does, however, have some notable caveats: see
https://www.mpfr.org/mpfr-current/mpfr.html#MPFR-and-the-IEEE-754-Standard

under-normalisation not supported, because, obviously, with an arbitrary-sized exponent you don't need to lose accuracy in the mantissa.

>>> from bigfloat import BigFloat
>>> b = BigFloat(1.0)
>>> b
BigFloat.exact('1.0000000000000000', precision=53)
>>> b.hex()
'0x0.80000000000000p+1'
>>> b._exponent()
1
>>> b._format_to_fixed_precision(1)
(False, '10', -1)
>>> b = BigFloat(52.39)
>>> b._format_to_fixed_precision(24)
(False, '52390000000000000568434189', -24)
>>> b.hex()
'0x0.d18f5c28f5c290p+6'

ah ha!  perfect!  that's exactly what we need.  and there's a "BigFloat.fromhex()" function as well, so it will be possible to transfer numbers in and out of bigfloat, in order to perform conversions to (full) IEEE754 FP16/32/64 format.

i am still puzzled as to how a *soft* FP rsqrt implementation may be verified and found to be a correct implementation of the IEEE754  standard.

l.

Jacob Lifshay

unread,
Jul 14, 2019, 5:05:35 AM7/14/19
to lkcl, RISC-V ISA Dev, Aneesh Raveendran, Libre-RISCV General Development
On Sun, Jul 14, 2019 at 1:27 AM lkcl <luke.l...@gmail.com> wrote:
>
>> mpfr does, however, have some notable caveats: see
>> https://www.mpfr.org/mpfr-current/mpfr.html#MPFR-and-the-IEEE-754-Standard
>
>
> under-normalisation not supported, because, obviously, with an arbitrary-sized exponent you don't need to lose accuracy in the mantissa.
use mpfr_subnormalize as described in the docs. bigfloat appears to
have built-in code to use mpfr_subnormalize, though I'll have to check
that we don't end up with double rounding.

>
> >>> from bigfloat import BigFloat
> >>> b = BigFloat(1.0)
> >>> b
> BigFloat.exact('1.0000000000000000', precision=53)
> >>> b.hex()
> '0x0.80000000000000p+1'
> >>> b._exponent()
> 1
> >>> b._format_to_fixed_precision(1)
> (False, '10', -1)
> >>> b = BigFloat(52.39)
> >>> b._format_to_fixed_precision(24)
> (False, '52390000000000000568434189', -24)
> >>> b.hex()
> '0x0.d18f5c28f5c290p+6'
>
> ah ha! perfect! that's exactly what we need. and there's a "BigFloat.fromhex()" function as well, so it will be possible to transfer numbers in and out of bigfloat, in order to perform conversions to (full) IEEE754 FP16/32/64 format.
I submitted a bug report to allow conversion to/from python3 int
(arbitrary-precision int) without being limited to a C long or needing
conversion to/from strings:
https://github.com/mdickinson/bigfloat/issues/80
>
> i am still puzzled as to how a *soft* FP rsqrt implementation may be verified and found to be a correct implementation of the IEEE754 standard.
All that's needed is that it provides the correct answers, since the
IEEE 754-2008 standard defines exactly what the answer should be in
all cases (ignoring different NaN encodings).

Jacob

lkcl

unread,
Jul 14, 2019, 5:22:25 AM7/14/19
to RISC-V ISA Dev, luke.l...@gmail.com, anees...@gmail.com, libre-r...@lists.libre-riscv.org
On Sunday, July 14, 2019 at 10:05:35 AM UTC+1, Jacob Lifshay wrote:
> i am still puzzled as to how a *soft* FP rsqrt implementation may be verified and found to be a correct implementation of the IEEE754  standard.
All that's needed is that it provides the correct answers, since the
IEEE 754-2008 standard defines exactly what the answer should be in
all cases (ignoring different NaN encodings).


i should have been clearer: how do we *prove* (or demonstrate, with a high degree of confidence) that the answers (from a bigfloat-based frsqrt) are correct? this is what aneesh's question comes down to.  when we're using sfpy (softfloat-3) it's dead-easy: use a hex-conversion of two operands into sfpy.Float32 (or whatever), do a mul/add/div (using operator.mul, operator.add etc.), and get the hex bits out of the result instance.

done.  single-operand (fsqrt) works exactly the same way.

except there _is_ no softfloat fRsqrt to do that trick.

does anyone happen to know of a soft-implementation of an IEEE754 Reciprocal Square-Root anywhere?

l.

lkcl

unread,
Jul 14, 2019, 5:26:17 AM7/14/19
to RISC-V ISA Dev, luke.l...@gmail.com, anees...@gmail.com, libre-r...@lists.libre-riscv.org

Jacob Lifshay

unread,
Jul 14, 2019, 11:00:37 PM7/14/19
to lkcl, RISC-V ISA Dev, Aneesh Raveendran, Libre-RISCV General Development
I found a project that is designed for building implementations of
math operations compliant to the IEEE 754 spec:
http://www.metalibm.org/ (kinda successor to crlibm)

From what I can tell, it supports converting a mathematical expression
(on real numbers) to a implementation of that expression in code that
produces results matching IEEE 754 along with a proof of correctness.

This looks like it would be ideal for verifying frsqrt implementations
along with other math operations.

Jacob

Allen Baum

unread,
Jul 15, 2019, 12:43:55 AM7/15/19
to Jacob Lifshay, lkcl, RISC-V ISA Dev, Aneesh Raveendran, Libre-RISCV General Development
Does it handle DP as well as SP?

-Allen
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAC2bXD42h4wKEetVY%3DoPwSV_iaOGZ6hU1t9MqMGNqo45JWRxUg%40mail.gmail.com.

Jacob Lifshay

unread,
Jul 15, 2019, 12:47:06 AM7/15/19
to Allen Baum, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Aneesh Raveendran, Libre-RISCV General Development
it handles (if I'm reading the right part of the docs):
half
single
double
quad
double-extended (80-bit?)
double-double
double-double-double

lkcl

unread,
Jul 15, 2019, 1:03:22 AM7/15/19
to RISC-V ISA Dev, allen...@esperantotech.com, luke.l...@gmail.com, anees...@gmail.com, libre-r...@lists.libre-riscv.org
https://github.com/kalray/metalibm/blob/master/metalibm_functions/ml_isqrt.py

Fascinating. It's a code generator / compiler. With the IEEE754 arithmetic algorithms encoded as python objects, in an Abstract Syntax Tree, that may be handed to any code-generator backend.

Wow.

So there is no reason why, for example, a Chisel3 backend should not be created.

Or a verilog one.

Or a nmigen one.

That would basically AUTOGENERATE the RTL needed to be conformant with the IEEE754 spec in any one of the required algorithms.

It is also quite likely to be able to autogenerate the unit tests as well (might need some work)

Mind you, that recip sqrt is designed to autogenerate a Newton Raphson algorithm, which some people will not be happy with.

Still, it is pretty awesome. Good find Jacob.

L.

lkcl

unread,
Jul 25, 2019, 4:27:15 AM7/25/19
to RISC-V ISA Dev, allen...@esperantotech.com, luke.l...@gmail.com, anees...@gmail.com, libre-r...@lists.libre-riscv.org, Jeffrey Osier-Mixon
so, with thanks to andrew for pointing out the difference between fsqrt and fRsqrt in the special cases handling, that saved some time in writing the RTL:

        with m.If(self.i.ctx.op == 2): # RSQRT
            # if a is +/- zero return NaN
            with m.If(a1.is_zero):
                m.d.comb += self.o.z.nan(0)
            # -ve number is NaN
            with m.Elif(a1.s):
                m.d.comb += self.o.z.nan(0)
            # if a is inf return zero (-ve already excluded, above)
            with m.Elif(a1.is_inf):
                m.d.comb += self.o.z.zero(0)
            # if a is NaN return NaN
            with m.Elif(a1.is_nan):
                m.d.comb += self.o.z.nan(0)
            # Denormalised Number checks next, so pass a/b data through
            with m.Else():
                ....

however one key difference: where fpsqrt(-ve 0) returns -ve inf, fpRsqrt(-ve 0) still returns canonical NaN.  i.e. i don't believe it's quite exactly the same (swapping 0-test and Inf-test).

does that look reasonable?


secondly: aware that things are contentious at the moment (please be mindful of jeffrey's request), however we still have to raise this question, and still need to make people aware that, in the past, there has been no response - at all - to similar questions (raised similarly in good faith).  the question is this:

there is clearly interest in a floating-point reciprocal square-root.  by what process may *outsiders* - those excluded for legitimate business reasons and other reasons - submit proposals for *official* inclusion in RISC-V?

l.

Jacob Lifshay

unread,
Jul 25, 2019, 4:45:08 AM7/25/19
to lkcl, RISC-V ISA Dev, Allen Baum, Aneesh Raveendran, Libre-RISCV General Development, Jeffrey Osier-Mixon
On Thu, Jul 25, 2019 at 1:27 AM lkcl <luke.l...@gmail.com> wrote:
>
> so, with thanks to andrew for pointing out the difference between fsqrt and fRsqrt in the special cases handling, that saved some time in writing the RTL:
>
> with m.If(self.i.ctx.op == 2): # RSQRT
>     # if a is +/- zero return NaN
>     with m.If(a1.is_zero):
>         m.d.comb += self.o.z.nan(0)
>     # -ve number is NaN
>     with m.Elif(a1.s):
>         m.d.comb += self.o.z.nan(0)
>     # if a is inf return zero (-ve already excluded, above)
>     with m.Elif(a1.is_inf):
>         m.d.comb += self.o.z.zero(0)
>     # if a is NaN return NaN
>     with m.Elif(a1.is_nan):
>         m.d.comb += self.o.z.nan(0)
>     # Denormalised Number checks next, so pass a/b data through
>     with m.Else():
>         ....
>
> however one key difference: where fpsqrt(-ve 0) returns -ve inf, fpRsqrt(-ve 0) still returns canonical NaN. i.e. i don't believe it's quite exactly the same (swapping 0-test and Inf-test).
>
> does that look reasonable?
The special case values should be (for reciprocal sqrt):
NaN -> NaN (ignoring signaling/quiet)
-Inf -> NaN
-finite -> NaN
-0 -> -Inf (div-by-zero; weird, but this is how ieee 754 defines it)
+0 -> +Inf (div-by-zero)
+finite -> rsqrt
+Inf -> +0

Jacob Lifshay

lkcl

unread,
Jul 25, 2019, 4:54:00 AM7/25/19
to RISC-V ISA Dev, luke.l...@gmail.com, allen...@esperantotech.com, anees...@gmail.com, libre-r...@lists.libre-riscv.org, je...@linuxfoundation.org
On Thursday, July 25, 2019 at 9:45:08 AM UTC+1, Jacob Lifshay wrote:
 
The special case values should be (for reciprocal sqrt):
NaN -> NaN (ignoring signaling/quiet)
-Inf -> NaN
-finite -> NaN
-0 -> -Inf (div-by-zero; weird, but this is how ieee 754 defines it)
+0 -> +Inf (div-by-zero)
+finite -> rsqrt
+Inf -> +0

do you happen to know if that's the exact order in which those tests have to be actioned?  the reason i ask is because i got caught out when doing fpsqrt special cases: i'd placed zero-testing later in the list, tested -ve numbers (all -ve numbers) first to return canonical-NaN, and of course sqrt(-ve zero) is -ve zero.

this one "-0 -> -Inf" kiiinda makes sense if the 1/ is considered to take precedence over sqrt() part.

l.

Jacob Lifshay

unread,
Jul 25, 2019, 5:17:57 AM7/25/19
to lkcl, RISC-V ISA Dev, Allen Baum, Aneesh Raveendran, Libre-RISCV General Development, Jeffrey Osier-Mixon
On Thu, Jul 25, 2019 at 1:54 AM lkcl <luke.l...@gmail.com> wrote:
>
> On Thursday, July 25, 2019 at 9:45:08 AM UTC+1, Jacob Lifshay wrote:
>
>>
>> The special case values should be (for reciprocal sqrt):
>> NaN -> NaN (ignoring signaling/quiet)
>> -Inf -> NaN
>> -finite -> NaN
>> -0 -> -Inf (div-by-zero; weird, but this is how ieee 754 defines it)
>> +0 -> +Inf (div-by-zero)
>> +finite -> rsqrt
>> +Inf -> +0
>
>
> do you happen to know if that's the exact order in which those tests have to be actioned? the reason i ask is because i got caught out when doing fpsqrt special cases: i'd placed zero-testing later in the list, tested -ve numbers (all -ve numbers) first to return canonical-NaN, and of course sqrt(-ve zero) is -ve zero.

Those cases are more like a switch statement in C with breaks in that
each case is independent:

Float frsqrt(Float input)
{
    switch (classify(input))
    {
    case NAN:
    case NEG_INF:
    case NEG_FINITE:
    case NEG_DENORMAL:
        return NaN;
    case NEG_ZERO:
        return -Inf;
    case POS_ZERO:
        return +Inf;
    case POS_FINITE:
    case POS_DENORMAL:
        return calculate_frsqrt(input);
    case POS_INF:
        return +0.0;
    }
    unreachable();
}

So, the order you test them in is whatever order the classify() algorithm uses.

> this one "-0 -> -Inf" kiiinda makes sense if the 1/ is considered to take precedence over sqrt() part.
yeah, but it makes it more annoying since otherwise the sign of the
result (ignoring NaNs) is always positive, like you would expect from
the mathematical limits at 0 and infinity.

If I recall correctly the actual definition of reciprocal sqrt in IEEE
754 is (expanded):
fn frsqrt(v) {
    let temp = sqrt(v);  // IEEE 754 sqrt, except it returns the exact
                         // mathematical result without rounding
    return 1.0 / temp;   // IEEE 754 division; rounds the result
}

the reason frsqrt(-0) returns -Inf is that IEEE 754 sqrt(-0) returns
-0 which then gets converted to -Infinity.
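that definition ("exact sqrt, then one rounded division") can be checked in software with exact rational arithmetic: for positive a, b we have a < b iff a*a < b*b, so the position of the irrational 1/sqrt(x) relative to any representable candidate can be decided without ever computing it.  a brute-force binary64 sketch (round-to-nearest-even only; function names are made up here):

```python
from fractions import Fraction
import math
import struct

def _mantissa_even(v: float) -> bool:
    """True if the binary64 significand of v is even (for ties-to-even)."""
    return struct.unpack('<Q', struct.pack('<d', v))[0] % 2 == 0

def rsqrt_rne(x: float) -> float:
    """Correctly rounded 1/sqrt(x), round-to-nearest-even, for positive
    finite binary64 x.  1.0/math.sqrt(x) is within ~1 ulp of the true
    value, so the answer is that result or one of its two neighbours;
    exact Fraction comparisons against the squared midpoints decide
    which, since a < b iff a*a < b*b for positive a, b."""
    t2 = Fraction(1) / Fraction(x)           # (1/sqrt(x))**2, exactly
    y = 1.0 / math.sqrt(x)
    lo, hi = math.nextafter(y, 0.0), math.nextafter(y, math.inf)
    for a, b in ((lo, y), (y, hi)):
        m = (Fraction(a) + Fraction(b)) / 2  # exact midpoint of neighbours
        if t2 < m * m:
            return a
        if t2 == m * m:                      # exact tie: round to even
            return a if _mantissa_even(a) else b
    return hi
```

slow, obviously, but usable as a test oracle for a hardware implementation.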

Jacob Lifshay

lkcl

unread,
Jul 25, 2019, 6:01:40 AM7/25/19
to RISC-V ISA Dev, luke.l...@gmail.com, allen...@esperantotech.com, anees...@gmail.com, libre-r...@lists.libre-riscv.org, je...@linuxfoundation.org, John Hauser, John Hauser
On Thursday, July 25, 2019 at 10:17:57 AM UTC+1, Jacob Lifshay wrote:
 do you happen to know if that's the exact order in which those tests have to be actioned?  the reason i ask is because i got caught out when doing fpsqrt special cases: i'd placed zero-testing later in the list, tested -ve numbers (all -ve numbers) first to return canonical-NaN, and of course sqrt(-ve zero) is -ve zero.

Those cases are more like a switch statement in C with breaks in that
each case is independent:

ah ok.
 
> this one "-0 -> -Inf" kiiinda makes sense if the 1/ is considered to take precedence over sqrt() part.
yeah, but it makes it more annoying since otherwise the sign of the
result (ignoring NaNs) is always positive, like you would expect from
the mathematical limits at 0 and infinity.

If I recall correctly the actual definition of reciprocal sqrt in IEEE
754 is (expanded):
fn frsqrt(v) {
    let temp = sqrt(v);  // IEEE 754 sqrt, except it returns the exact
                         // mathematical result without rounding
    return 1.0 / temp;   // IEEE 754 division; rounds the result
}

the reason frsqrt(-0) returns -Inf is that IEEE 754 sqrt(-0) returns
-0 which then gets converted to -Infinity.

ok cool.

ah ha!  right!  i have an idea.  if softfloat-3 rounding can be switched *off* (just checked: unfortunately it doesn't look like there's a mode to do that).  or, i know: if it's really that simple, that's really not so hard to add a patch to softfloat-3 to support fprsqrt.

cc'ing john hauser: john, what do you think?

l.

John Hauser

unread,
Jul 25, 2019, 2:16:13 PM7/25/19
to RISC-V ISA Dev
Jacob Lifshay wrote:
> If I recall correctly the actual definition of reciprocal sqrt in IEEE
> 754 is (expanded):
>
> fn frsqrt(v) {
>     let temp = sqrt(v);  // IEEE 754 sqrt, except it returns the exact
>                          // mathematical result without rounding
>     return 1.0 / temp;   // IEEE 754 division; rounds the result
> }

Luke K.C. Leighton:

> i have an idea.  if softfloat-3 rounding can be switched *off* [just
> checked: unfortunately it doesn't look like there's a mode to do
> that).  or, i know: if it's really that simple, that's really not so
> hard to add a patch to softfloat-3 to support fprsqrt.

I'm afraid it's not going to be that easy.

Since most square roots are irrational, it's impossible for any
finite machine to compute a binary square root result that is "the
exact mathematical result without rounding".  SoftFloat computes an
approximate square root with some extra precision, together with an
exact remainder, and uses these together to figure out how to correctly
round the approximation.

Even if you extract this extra-precise-but-not-exact approximation
from the SoftFloat sqrt function and do an extra-precise division (not
exactly supported by SoftFloat, so you're on your own), there's a lot
more to getting a correctly rounded result for reciprocal square root
than you realize.  The best-case scenario would be for your computed
results to be correctly rounded most of the time, but not always.
Getting to "always" requires so much more effort that I'd expect it's
easier to build the reciprocal square-root function from scratch.

Sorry,

    - John Hauser

MitchAlsup

unread,
Jul 25, 2019, 4:43:14 PM7/25/19
to RISC-V ISA Dev
Just some notes I have on RSQRT::

     RSQRT( ±0 ) = ±∞, and signals the DIV ZERO exception

     RSQRT( +∞ ) = +0

     RSQRT( -ve ) = Quiet NaN, and signals the Operation exception.

     RSQRT( NaN ) = Quiet NaN

On the other hand, the thread is about a reciprocal sqrt approximation. Why approximate when the real result is easily calculated in HW where all the
boundary cases can be done with <essentially> no overhead.
There is technology that enables an FMAC unit to calculate RSQRT and produce its IEEE 754-2008 correctly rounded result in the low 20s of clock cycles.
One can get this down to 14 cycles if one can accept faithfully rounded results.

If you want details, e-mail me.

lkcl

unread,
Jul 25, 2019, 5:01:40 PM7/25/19
to RISC-V ISA Dev


On Thursday, July 25, 2019 at 7:16:13 PM UTC+1, John Hauser wrote:
 
than you realize.  The best-case scenario would be for your computed
results to be correctly rounded most of the time, but not always.
Getting to "always" requires so much more effort that I'd expect it's
easier to build the reciprocal square-root function from scratch.


appreciated the insight, john.

l.

lkcl

unread,
Jul 25, 2019, 5:04:54 PM7/25/19
to RISC-V ISA Dev


On Thursday, July 25, 2019 at 9:43:14 PM UTC+1, MitchAlsup wrote:
Just some notes I have on RSQRT::

     RSQRT( ±0  ) = ±∞       , and signals the DIV ZERO exception

     RSQRT( +∞  ) = +0

     RSQRT( -ve ) = Quiet NaN, and signals the Operation exception.

     RSQRT( NaN ) = Quiet NaN

On the other hand, the thread is about a reciprocal sqrt approximation.

for compliance with the Vulkan API, my fuzzy memory recalls that we need accuracy within 1-2 LSBs.
 
Why approximate when the real result is easily calculated in HW where all the
boundary cases can be done with <essentially> no overhead.
There is technology that enables an FMAC unit to calculate RSQRT and produce its IEEE 754-2008 correctly rounded result in low 20's clock cycles.
One can get this down to 14 cycles if one can accept faithfully rounded results.

If you want details, e-mail me.

yes please.  been meaning to ask on comp.arch about this, if you (or anyone else) knows of good hardware algorithms.  we absolutely have to have pipelines due to the hard requirement of 1 RSQRT per pixel [which eliminates anything that does not have guaranteed completion time].

l.

MitchAlsup

unread,
Jul 25, 2019, 5:28:13 PM7/25/19
to RISC-V ISA Dev
This smells of single precision--in any event the good literature on SP <simple> transcendentals is ATI + nVidia:
Piñeiro and Oberman on the nVidia side, Matula and B??? on the ATI side.

In any event, these SP things use Quadratic evaluation and resolve in 5-ish cycles and have about 1.5 ULP accuracy (a few years ago) and are fully pipelined.

High-Speed Function Approximation Using a Minimax Quadratic Interpolator; Piñeiro and Oberman
A High-Performance Area-Efficient Multifunction Interpolator; Oberman and Siu

Should get everyone started.

Bruce Hoult

unread,
Jul 25, 2019, 5:30:14 PM7/25/19
to lkcl, RISC-V ISA Dev
On Thu, Jul 25, 2019 at 2:04 PM lkcl <luke.l...@gmail.com> wrote:
>
> On Thursday, July 25, 2019 at 9:43:14 PM UTC+1, MitchAlsup wrote:
>>>
>>> Just some notes I have on RSQRT::
>>
>>
>> RSQRT( ±0 ) = ±∞ , and signals the DIV ZERO exception
>>
>> RSQRT( +∞ ) = +0
>>
>> RSQRT( -ve ) = Quiet NaN, and signals the Operation exception.
>>
>> RSQRT( NaN ) = Quiet NaN
>>
>> On the other hand, the thread is about a reciprocal sqrt approximation.
>
> for compliance with the Vulkan API, my fuzzy-memory recalls tht we need accuracy within 1-2 LSBs.

Which is FAR easier than correctly rounded, and is easily achievable
with Newton-Raphson improvement from a low precision estimate.
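the Newton-Raphson improvement referred to here is y' = y * (1.5 - 0.5 * x * y * y); each iteration roughly doubles the number of correct bits, so a small-table estimate reaches binary32-class accuracy in one or two steps.  a quick sketch (names are illustrative):

```python
import math

def rsqrt_nr(x: float, y0: float, iters: int = 3) -> float:
    """Refine an estimate y0 of 1/sqrt(x) by Newton-Raphson on
    f(y) = 1/y**2 - x, i.e.  y' = y * (1.5 - 0.5 * x * y * y).
    Convergence is quadratic: correct bits roughly double per step."""
    y = y0
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

# from a deliberately poor starting guess for 1/sqrt(2):
print(abs(rsqrt_nr(2.0, 0.7, 1) - 1 / math.sqrt(2.0)))  # ~1e-4
print(abs(rsqrt_nr(2.0, 0.7, 3) - 1 / math.sqrt(2.0)))  # near machine epsilon
```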

MitchAlsup

unread,
Jul 25, 2019, 6:07:04 PM7/25/19
to RISC-V ISA Dev, luke.l...@gmail.com
For SP transcendentals with 1.5-2.0 ULP accuracy, one can implement a quadratic calculation scheme using a table lookup and pipeline it into no more than 5 stages::

     Result = C0 + C1×X + C2×X^2

where
Index = argument<23:17>
C0 = Table0[index]
C1 = Table1[index]
C2 = Table2[index]
X = argument<16:0>
X^2 is produced by a squaring multiplier tree (highly degenerate)
Matula explains in his patent that C0 will need 26-bits of precision in order for the result to have a chance at 1ULP.
Also note: X does not have any significance in its higher order 7-bits
                 X^2 does not have any significance in its higher order 14-bits

C0 has been adjusted so that the hidden bit of the original argument is taken into account without having to participate in the calculations.
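the structure of that scheme can be sketched in software, for 1/sqrt on the mantissa range [1, 2) only, with a 6-bit index over a 23-bit mantissa rather than Mitch's exact field split, and plain Taylor coefficients at each interval midpoint instead of a proper minimax fit (so accuracy here is a few SP ulps, not 1):

```python
import math

INDEX_BITS = 6                      # table index from the top mantissa bits
FRAC_BITS = 23 - INDEX_BITS         # X = the remaining low mantissa bits
STEP = 1.0 / (1 << INDEX_BITS)

# per-interval coefficients for f(m) = 1/sqrt(m) on [1, 2): Taylor at the
# interval midpoint (real designs use minimax fits for ~1 ulp accuracy)
C0, C1, C2 = [], [], []
for i in range(1 << INDEX_BITS):
    m = 1.0 + (i + 0.5) * STEP
    fm = 1.0 / math.sqrt(m)
    C0.append(fm)                          # f(m)
    C1.append(-0.5 * fm / m)               # f'(m)
    C2.append(0.5 * 0.75 * fm / (m * m))   # f''(m) / 2

def rsqrt_mantissa(mant23: int) -> float:
    """Result = C0 + C1*X + C2*X^2 for the mantissa value 1.mant23."""
    idx = mant23 >> FRAC_BITS
    x = (mant23 & ((1 << FRAC_BITS) - 1)) / (1 << 23) - 0.5 * STEP
    return C0[idx] + C1[idx] * x + C2[idx] * x * x
```

in hardware the three lookups and the squaring happen in parallel, which is why this pipelines so well.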

But you might want to look at my patent, here:: USPTO 9,471,305 for a different strategy with smaller tables.

But I started interacting with this thread thinking people wanted to get this done fast in DOUBLE PRECISION.
Which is where I started and figured out how to do these "easy transcendentals" in "about" the same clock count as DP FDIV.

lkcl

unread,
Jul 25, 2019, 6:44:55 PM7/25/19
to RISC-V ISA Dev, luke.l...@gmail.com
I love patents. They're so obtuse. Oo cubic equations, so the tables can be smaller. I like it. Selecting the larger number to route, based on the exponent, neat trick.

Hey a spelling mistake, "operands with lesser exponents thought the alignment shifters".


>
>
> But I started interacting with this thread thinking people wanted to get this done fast in DOUBLE PRECISION.

For our project, guaranteed completion time is a priority, as is power efficiency. 64 bit is still needed however the priorities are different: efficiency is the (only) priority.

Others here have expressed an interest in fprsqrt and have different application requirements.

> Which is where I started and figured out how to do these "easy transcendentals" in "about" the same clock count as DP FDIV.

Using cubic interpolation for *all* major functions, is pretty damn neat: good reuse of hardware. Microcode makes sense as it's basically moving algorithms from software into hardware, and keeping the efficiency of reprogrammable design. Like it.

L.


MitchAlsup

unread,
Jul 25, 2019, 7:06:00 PM7/25/19
to RISC-V ISA Dev, luke.l...@gmail.com


On Thursday, July 25, 2019 at 5:44:55 PM UTC-5, lkcl wrote:
On Friday, July 26, 2019 at 6:07:04 AM UTC+8, MitchAlsup wrote:

>
> But I started interacting with this thread thinking people wanted to get this done fast in DOUBLE PRECISION.

<snip>64 bit is still needed however the priorities are different: efficiency is the (only) priority.

When you need/want fast accurate DP transcendentals, you know where to look. 
I even have a numerics proof that SIN and COS are correct, including argument reduction.

Rogier Brussee

unread,
Jul 26, 2019, 9:04:28 AM7/26/19
to RISC-V ISA Dev, luke.l...@gmail.com
There is an abstruse hack to compute 1/sqrt(x) with low precision using only integer manipulations that may start your Newton Raphson.



Rogier



Op vrijdag 26 juli 2019 00:44:55 UTC+2 schreef lkcl:

lkcl

unread,
Jul 28, 2019, 8:46:26 AM7/28/19
to RISC-V ISA Dev, luke.l...@gmail.com
On Friday, July 26, 2019 at 2:04:28 PM UTC+1, Rogier Brussee wrote:
There is an abstruse hack to compute 1/sqrt(x) with low precision using only integer manipulations that may start your Newton Raphson.


the "wtf??" algorithm - i love this one.  subtract from a magic constant, it gives an approximation that's close enough to begin NR.

we decided against using it because the number of iterations required gives so many (large) multipliers that it begins to get alarming in the gate count.
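for reference, the magic-constant trick being discussed (the well-known 0x5f3759df hack, here in python for binary32, with one NR step bolted on); this is the approximation, nothing like correctly rounded:

```python
import struct

def quake_rsqrt(x: float) -> float:
    """'Subtract from a magic constant' initial estimate for 1/sqrt(x),
    done entirely with integer manipulation of the binary32 bit pattern,
    plus one Newton-Raphson step.  Good to roughly 0.2% relative error;
    nowhere near IEEE correctly-rounded."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]   # float bits as int
    i = 0x5F3759DF - (i >> 1)                          # the magic constant
    y = struct.unpack('<f', struct.pack('<I', i))[0]   # int bits as float
    return y * (1.5 - 0.5 * x * y * y)                 # one NR iteration
```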

another factor: we're adding early-in and early-out pipeline bypassing (safe to do with a 6600-style dependency matrix system) in order to be able to use the *same* hardware for INT as well as FP.  that in turn means that the bitwidth needs to be 16, 32 and 64, not 10, 23 and 52.

unfortunately that rules out many pre-designed algorithms which have been carefully optimised to IEEE754 FP standard widths, requiring us to do some significant analysis.

we'll be going with a simpler initial design (have something "working") that Jacob has designed to *dynamically* run DIV, SQRT *or* RSQRT, and have done unit tests @ 20,000 random values plus INF/Zero/NaN/Subnormal/Random permutations coverage @ around 10,000 tests, for each of FPDIV, FPSQRT and FPRSQRT, at FP16, FP32 and FP64 (except for FP64 FPSQRT).

no failures, all accurate according to what softfloat-3 produces.

we will be applying for a second NLNet Grant to do "optimisations" as well as formal proofs, at a later date.

with many thanks for everyones' input here, on the technical side, esp. for the references.

l.

lkcl

unread,
Aug 4, 2019, 1:33:48 AM8/4/19
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org


On Thursday, July 11, 2019 at 11:45:38 AM UTC+1, Jacob Lifshay wrote:
 
For the encoding, I think using an encoding similar to both the
fsqrt.* and fdiv.* encodings is a good idea, since frsqrt is similar
to both fdiv and fsqrt; Therefore, as an initial proposal, I think
using a funct7 value of 0111100 and the rest of the instruction
identical to fsqrt is a good idea, since, as far as I'm aware, that
doesn't conflict with anything currently.

ah! rats, i just started looking at the RV32/64G instruction set listings, chap 25 V20190621-draft, and hadn't noticed that FSQRT.S is funct7=0b0101100 where rs2=00000, FSQRT.D is funct7=0b0101101 and FSQRT.Q is funct7=0b0101111

so correspondingly:
FRSQRT.S would need to be funct7=0b0111100
FRSQRT.D would need to be funct7=0b0111101
FRSQRT.Q would need to be funct7=0b0111111

also: where would FSQRT.H go, and likewise FRSQRT.H?  *checks table 11.3* section 11.6.... ok so those would be:

FRSQRT.H would need to be funct7=0b0111110 and
FSQRT.H would need to be funct7=0b0101110

so the low 2 bits of funct7 (the "fmt" field) select S/D/H/Q, and the upper 5 select the "function type".
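i.e. (in python, as a sanity check; the FRSQRT values are of course only the proposal):

```python
FMT = {0b00: 'S', 0b01: 'D', 0b10: 'H', 0b11: 'Q'}

def split_funct7(funct7):
    """Split an OP-FP funct7 into (5-bit function selector, fmt letter)."""
    return funct7 >> 2, FMT[funct7 & 0b11]

for name, f7 in [('FSQRT.S', 0b0101100), ('FSQRT.D', 0b0101101),
                 ('FSQRT.H', 0b0101110), ('FSQRT.Q', 0b0101111),
                 ('FRSQRT.S', 0b0111100), ('FRSQRT.Q', 0b0111111)]:
    print(name, split_funct7(f7))
```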



Jacob Lifshay

unread,
Aug 4, 2019, 1:55:42 AM8/4/19
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development

lkcl

unread,
Aug 4, 2019, 1:58:24 AM8/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Sunday, August 4, 2019 at 6:55:42 AM UTC+1, Jacob Lifshay wrote:

just creating a new table to make it easier to spot the brownfield encodings in the 1010011 (OP-FP) major opcode.
l.

lkcl

unread,
Aug 4, 2019, 2:39:50 AM8/4/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
https://libre-riscv.org/rv_major_opcode_1010011/

okaaay, so that's just done. what i did was to split out the use of funct5 (when either rs2 or funct3 are "specifiers"), and you can see that rs2=00000 is actually used for FSQRT.X. it might therefore be better, instead of using a new funct5, to just use rs2=00001 for FRSQRT.X.

funct3 in both cases is the "rounding mode".
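a decode sketch of that rs2-as-specifier alternative: keep FSQRT's funct7 group (0b01011xx) and let rs2 pick the function. rs2=00001 for FRSQRT is purely this thread's proposal:

```python
def decode_sqrt_family(insn):
    """Decode the (hypothetical) sqrt family where rs2 selects the op.
    Returns (op, fmt, rm) or None if the word isn't in this family."""
    if insn & 0x7F != 0b1010011:          # must be OP-FP
        return None
    funct7 = (insn >> 25) & 0x7F
    if funct7 >> 2 != 0b01011:            # sqrt-family funct5
        return None
    rs2 = (insn >> 20) & 0x1F
    rm  = (insn >> 12) & 0x7              # rounding mode in funct3
    fmt = ('S', 'D', 'H', 'Q')[funct7 & 0b11]
    op  = {0b00000: 'FSQRT', 0b00001: 'FRSQRT'}.get(rs2)
    return (op, fmt, rm) if op else None
```

the appeal is that no new funct7 value is consumed; the cost is that rs2 stops being a don't-care for this group.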

l.

lkcl

unread,
Aug 8, 2019, 2:46:39 AM8/8/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
Just a thought: should this be merged into Ztrans? A case could be made either way.

* The Libre RISCV FPDIV pipeline also computes SQRT and RSQRT
* Some OTFC pipeline designs do RSQRT "for free"
* However, if doing RSQRT, chances are high that it will be part of a design that needs EXP, LOG etc. anyway.
* CORDIC can do a ton of algorithms including SQRT, RSQRT, LOG, SIN etc.

Thoughts?

L.

MitchAlsup

unread,
Aug 8, 2019, 12:01:16 PM8/8/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
CORDIC is SLOW
The Moto CORDIC unit took 80-cycles or more
The Intel CORDIC was similarly slow
Modern SQRT/RSQRT hardware will deliver results in 23-30 cycles. 
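[for context: the speed gap comes from convergence rates. CORDIC retires roughly one result bit per cycle, whereas the table-seed-plus-Newton-Raphson approach typical of fast RSQRT roughly doubles the correct bits per iteration. a generic software illustration of the latter (this is not the libre-riscv design, just the textbook iteration, seeded here with the well-known 0x5F3759DF bit trick):]

```python
import struct

def rsqrt_newton(x, iters=3):
    """Approximate 1/sqrt(x) for positive finite x.
    Seed: halve the exponent via integer bit manipulation, then refine
    with Newton-Raphson steps for f(y) = 1/y**2 - x."""
    i = struct.unpack('I', struct.pack('f', x))[0]
    y = struct.unpack('f', struct.pack('I', 0x5F3759DF - (i >> 1)))[0]
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)   # each step ~doubles correct bits
    return y
```

the seed is good to a few percent; three iterations take it well past FP32 precision, which is why a small lookup table plus two or three refinement steps wins handily over one-bit-per-cycle shift-and-add.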

Allen Baum

unread,
Aug 8, 2019, 12:54:05 PM8/8/19
to MitchAlsup, RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
CORDIC is great for extremely low gate count implementations. But for highest performance, there are likely much better ways.

-Allen