Is there any work being done on decimal floating-point?
Will it follow the IEEE standard? Or are other options possible?
If the same load/store instructions are used for both binary floating-point and decimal floating-point, how is the hardware going to know the difference? The decimal floating-point value needs to be unpacked at load time and packed at store time.
I have been working on decimal floating-point for another project and use the 128-bit DPD format expanded out to 152 bits, performing the calculations in BCD. This means internal registers need to be 152 bits wide; DPD is used only at load/store time.
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/61c9912a-bec1-461f-9662-e1ee23de14e5n%40groups.riscv.org.
\chapter{``L'' Standard Extension for Decimal Floating-Point, Version 0.0}
{\bf This chapter is a draft proposal that has not been ratified by
the Foundation.}
This chapter is a placeholder for the specification of a standard
extension named ``L'' designed to support decimal floating-point
arithmetic as defined in the IEEE 754-2008 standard.
\section{Decimal Floating-Point Registers}
Existing floating-point registers are used to hold 64-bit and 128-bit
decimal floating-point values, and the existing floating-point load
and store instructions are used to move values to and from memory.
\begin{commentary}
Due to the large opcode space required by the fused multiply-add
instructions, the decimal floating-point instruction extension will
require five 25-bit major opcodes in a 30-bit encoding space.
\end{commentary}
I was thinking that the regular floating-point opcode space could be reused for decimal floating-point, with a decimal-mode bit in a floating-point status register somewhere. I do not think binary and decimal floating-point would be mixed in the same application that often. Decimal floating-point already reuses the load/store instructions and registers of binary floating-point, so why not reuse the arithmetic instructions too? Probably one or the other is desired; with multiple cores, floating-point support could be mixed.
On 5/23/2022 5:01 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>
> Wide decimal fixed-point is about 1/50 the cost of DFP, and all you
> need is to figure out how to encode 32-to-256-bit BCD in the ISA and
> how to sequence BCD arithmetic. {256 bits encode 63 decimal digits and
> a sign.} This is enough to calculate the world GDP in the least
> valuable currency {Zimbabwe}.
Yeah...
(Out of scope for the mailing list, but 256-bit BCD ADD in my ISA):
CLRT
BCDADC R20, R4
BCDADC R21, R5
BCDADC R22, R6
BCDADC R23, R7
Exercise for the reader: figure out how to do something similar in an ISA without any status bits.
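For what it's worth, one 64-bit chunk of this can be done without status bits using the classic branch-free packed-BCD add in plain C (a sketch; the function name is mine, and it assumes 15-digit inputs so the 16th digit can absorb the carry):

```c
#include <stdint.h>

/* Branch-free packed-BCD add: bias every digit by 6 so that decimal
 * carries become ordinary binary carries, then subtract the bias back
 * out of the digits that did NOT generate a carry.
 * Assumes both inputs hold at most 15 BCD digits (top nibble clear),
 * so any carry out lands in the 16th digit. */
uint64_t bcd_add64(uint64_t a, uint64_t b)
{
    uint64_t t1 = a + 0x0666666666666666ULL;   /* bias digits 0..14   */
    uint64_t t2 = t1 + b;                      /* plain binary sum    */
    uint64_t t3 = t1 ^ b;
    uint64_t t4 = t2 ^ t3;                     /* inter-digit carries */
    uint64_t t5 = ~t4 & 0x1111111111111110ULL; /* 1 where NO carry    */
    uint64_t t6 = (t5 >> 2) | (t5 >> 3);       /* 6 in those digits   */
    return t2 - t6;                            /* remove unused bias  */
}
```

Four of these, with carry propagation between words, would correspond to the four BCDADC steps above; the status bit is what makes that inter-word carry free in the ASM version.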
Doing it in C would likely end up being a fair bit more expensive, since the C ABI passes 256-bit types by reference, so the operation would effectively also require 4x 128-bit memory loads and 2x 128-bit memory stores. This is unlike 128-bit types, which are passed by value.
In ASM code, one could "bend the rules" in cases like this, though, and treat 256-bit values as pass-by-value rather than pass-by-reference.
Multiply/divide is still a place where "the crap hits the fan", though; it generally seems faster to go BCD->Binary->BCD in this case...
Granted, an operation on a pass-by-reference type means needing to provide storage space for the result, plus the associated loads/stores for the operation itself (and for every operation on that type).
It does save a little cost in that operations on larger types are typically performed via runtime calls. For example, one really doesn't want to inline the code for something like a matrix multiply every time they want to perform a matrix multiply; better to perform the operation via a runtime call and pass/return values by reference, which is what is done in these cases.
>
>
> In ASM code, one could "bend the rules" in cases like this though,
> and
> treat 256-bits as pass-by-value rather than pass-by-reference.
>
>
> Multiply/divide is still a place where "the crap hits the fan"
> though,
> generally seems faster to go BCD->Binary->BCD in this case...
>
> Highly dependent on fast BCD-binary conversion {like a dedicated or
> obvious way to perform repeated strings of '×10 + digit shift', digit
> by digit}. Here you are using a "whole bunch" of multiplies to get to
> the point where you can do an <ahem> multiply.
>
There are various options here, one of the faster being:
  b=(a<<3)+(a<<1);
which computes b=a*10 and can be done in around 2 or 3 clock cycles. Or:
  c=(a<<3)+(a<<1)+b;
for a*10+b in roughly 4 cycles.
A full loop might look something like:
  a=0; b=src;
  for(i=0; i<16; i++)
    { a=(a<<3)+(a<<1)+(b>>60); b<<=4; }
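Fleshed out into a self-contained function (naming is mine; it assumes 16 packed BCD digits with the most significant digit in the top nibble):

```c
#include <stdint.h>

/* Convert 16 packed-BCD digits (most significant digit in the top
 * nibble) to plain binary, one digit per step: a = a*10 + digit,
 * where a*10 is composed as (a<<3)+(a<<1) to avoid a multiply. */
uint64_t bcd_to_bin64(uint64_t src)
{
    uint64_t a = 0, b = src;
    for (int i = 0; i < 16; i++) {
        a = (a << 3) + (a << 1) + (b >> 60);  /* a*10 + top digit */
        b <<= 4;                              /* advance to next digit */
    }
    return a;
}
```

With garbage-free BCD input this is 16 iterations of shift/add, matching the cycle estimates above.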
A generic multiply is less ideal, since the relevant (64-bit) multiply instruction takes around 66 clock cycles in this case.
Doing Binary=>BCD is a lot faster via the ROTCL + BCDADC trick, as this means one can do a 16-digit conversion in around 128 cycles, versus several kilocycles.
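For reference, the ROTCL + BCDADC trick amounts to the double-dabble method; a portable-C rendering might look like the following (a sketch with my naming, valid only for inputs below 10^16 so the result fits in 16 digits):

```c
#include <stdint.h>

/* Binary to 16-digit packed BCD via double-dabble: for each input bit
 * (MSB first), add 3 to every BCD digit >= 5, then shift the whole
 * accumulator left one bit, pulling in the next binary bit.
 * Valid for inputs below 10^16. */
uint64_t bin_to_bcd64(uint64_t bin)
{
    uint64_t bcd = 0;
    for (int i = 63; i >= 0; i--) {
        /* digits >= 5 have bit 3 set after adding 3 to each nibble */
        uint64_t t    = bcd + 0x3333333333333333ULL;
        uint64_t mask = t & 0x8888888888888888ULL;
        bcd += (mask >> 3) * 3;               /* +3 in those digits */
        bcd = (bcd << 1) | ((bin >> i) & 1);  /* shift in next bit */
    }
    return bcd;
}
```

Each adjusted digit (now 8..12) carries its top bit into the next digit on the shift, which is exactly what the ROTCL + BCDADC pair does with the carry flag.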
> Decimal multiplication can be done "on-line" with small lookup tables
> (a pair of 8-bit-in, 8-bit-out) and then use the decimal adders of
> ADD-SUB. {Tables only have 100 entries, not the whole 256.}
>
> I think you should be cognizant of the latency of well implemented DFP;
> a 30 cycle multiply is completely competitive.........
Admittedly, I was thinking more like around 500 cycles versus around 10k cycles...
But having types whose fundamental operations are measured in kilocycles is almost a sure-fire way to ensure that no one uses them.