Decimal Floating-point

Robert Finch

Dec 23, 2020, 10:04:33 PM
to RISC-V ISA Dev

Is there any work being done on decimal floating-point?

Will it follow the IEEE standard? Or are other options possible?

If the same load/store instructions are used to load regular floating-point and decimal floating-point, how is the hardware going to know the difference? The decimal floating-point needs to be unpacked at load time and packed at store time.

I have been working on decimal floating-point for another project and use a 128-bit DPD format expanded out to 152 bits to perform calculations in BCD. This means internal registers need to be 152 bits wide. DPD is used only at load/store time.

K. York

Dec 25, 2020, 10:39:42 AM
to Robert Finch, RISC-V ISA Dev
There are folks seriously exploring Posits in the Alternate Floating Point group. I don't think I've spotted any work on decimal FP yet, though; probably best to search the member archives yourself.

Nick Knight

Dec 25, 2020, 8:33:52 PM
to K. York, Robert Finch, RISC-V ISA Dev
The only prior work on decimal FP that I'm aware of is what appears in Chapter 15 of the ISA Manual (Vol. I):

\chapter{``L'' Standard Extension for Decimal Floating-Point, Version 0.0}

{\bf This chapter is a draft proposal that has not been ratified by
the Foundation.}

This chapter is a placeholder for the specification of a standard
extension named ``L'' designed to support decimal floating-point
arithmetic as defined in the IEEE 754-2008 standard.

\section{Decimal Floating-Point Registers}

Existing floating-point registers are used to hold 64-bit and 128-bit
decimal floating-point values, and the existing floating-point load
and store instructions are used to move values to and from memory.

\begin{commentary}
Due to the large opcode space required by the fused multiply-add
instructions, the decimal floating-point instruction extension will
require five 25-bit major opcodes in a 30-bit encoding space.
\end{commentary}


It looks to have been last touched in August 2018.

Best,
Nick Knight


Robert Finch

Apr 13, 2022, 1:50:48 AM
to RISC-V ISA Dev, Nick Knight, Robert Finch, RISC-V ISA Dev, kane...@gmail.com

I was thinking that the regular floating-point opcode space could be reused for decimal floating-point, with a decimal-mode bit indicator in an FP status register somewhere. I do not think that regular floating-point and decimal floating-point would be mixed in the same application that often. Decimal floating-point already uses the load/store ops and registers of regular floating-point; why not reuse the instructions too? Probably one or the other is desired. With multiple cores, floating-point support could be mixed.

MitchAlsup

May 23, 2022, 6:01:54 PM
to RISC-V ISA Dev, Robert Finch
On Wednesday, December 23, 2020 at 9:04:33 PM UTC-6 Robert Finch wrote:

Is there any work being done on decimal floating-point?

Some people hope so, and some people are working on it. 

Will it follow the IEEE standard? Or are other options possible?

The IEEE standard allows for both the DPD (IBM) and BID (Intel) formats. IBM uses a densely-packed-decimal encoding; Intel uses a binary significand with a decimal exponent (IIRC).

If the same load / store instructions are used to load regular floating-point and decimal floating-point how is the hardware going to know the difference? The decimal floating-point needs to be unpacked at load time and packed at store time.

Why should the LD or ST instruction need to know? It is just bringing in or pushing out containers of bits. Or are you trying to imbue these instructions with some notion of what kind of data they are moving about?

I have been working on decimal floating-point for another project and use an expanded out 128-bit DPD format to 152 bits to perform calculations using BCD. This means internally registers need to be 152 bits wide. DPD is used only at load / store time.

Wide decimal fixed-point is about 1/50 the cost of DFP, and all you need is to figure out how to encode 32-to-256-bit BCD in the ISA and how to sequence BCD arithmetic. {256 bits encodes 63 decimal digits and a sign.} This is enough to calculate the world GDP in the least valuable currency {Zimbabwe}.
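
For a concrete feel of the cost, here is a minimal, well-known SWAR sketch of 16-digit packed-BCD addition on a 64-bit word in plain C (the function name and constants are illustrative, not from any proposal; a decimal carry out of digit 15 is simply lost):

  #include <stdint.h>

  /* 16-digit packed-BCD add using only binary operations: pre-bias every
     digit by 6 so nibble carries coincide with decimal carries, then
     strip the bias from digits that produced no carry-out. */
  uint64_t bcd_add(uint64_t a, uint64_t b)
  {
      uint64_t t1 = a + 0x0666666666666666ULL;  /* bias each digit by 6 */
      uint64_t t2 = t1 + b;                     /* binary sum with bias */
      uint64_t t3 = t1 ^ b;
      uint64_t t4 = t2 ^ t3;                    /* carries across nibbles */
      uint64_t t5 = ~t4 & 0x1111111111111110ULL;
      uint64_t t6 = (t5 >> 2) | (t5 >> 3);      /* 6 where no decimal carry */
      return t2 - t6;
  }

For example, bcd_add(0x0999, 0x0001) yields 0x1000.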

MitchAlsup

May 23, 2022, 6:05:36 PM
to RISC-V ISA Dev, Robert Finch, Nick Knight, RISC-V ISA Dev, kane...@gmail.com
On Wednesday, April 13, 2022 at 12:50:48 AM UTC-5 Robert Finch wrote:

I was thinking that the regular floating-point opcode space could be reused for decimal floating-point with a decimal mode bit indicator in an fp status register somewhere.

Great:: more mode bits.......... 

I do not think that regular floating-point and decimal floating-point would be mixed in the same application that often.

Do you actually HAVE an application for which DFP is the proper solution? And where a slightly wider decimal fixed-point representation would be not only appropriate but faster, with better arithmetic properties?

Decimal floating-point is already using the load store ops and registers of regular floating-point. Why not reuse the instructions too? Probably one or the other is desired. With multiple cores, floating-point support could be mixed.

This brings the requirement that the FP registers need to be saved before the mode bit can be altered in context switching. 

BGB

May 23, 2022, 6:57:31 PM
to isa...@groups.riscv.org
On 5/23/2022 5:01 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>
>
> On Wednesday, December 23, 2020 at 9:04:33 PM UTC-6 Robert Finch wrote:
>
> Is there any work being done on decimal floating-point?
>
> Some people hope so, and some people are working on it.
>
> Will it follow the IEEE standard? Or are other options possible?
>
> The IEEE standard allows for both the DPD (IBM) and BID (Intel) formats.
> IBM uses a densely-packed-decimal encoding; Intel uses a binary
> significand with a decimal exponent (IIRC).

Pros/cons either way.

I will state my doubts that DFP in hardware makes sense from a
cost/benefit POV, though in software it is probably OK. The only ways I
can think of to implement DFP in hardware would likely be implausibly
expensive.

DPD probably makes a little more sense if one has BCD helper
instructions, BID if one does not.


BCD ADD/SUB instructions make a little more sense though, and are not
super expensive.

A 16-digit-in-64-bit BCD adder can also be used to implement a "fast"
binary-to-decimal converter (at least compared with a more traditional
divide-and-modulo-by-10 approach).
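
As a software model of that idea, here is a minimal sketch that reuses the bcd_add() routine sketched earlier in the thread (an assumption, not a standard routine; the input must be below 10^16 so the result fits in 16 digits):

  #include <stdint.h>

  uint64_t bcd_add(uint64_t a, uint64_t b);  /* SWAR sketch from earlier */

  /* Binary -> packed BCD: BCD-double the accumulator once per input bit
     while shifting the next bit in. After a BCD doubling the low digit
     is even, so OR-ing in the new bit can never cause a decimal carry. */
  uint64_t bin_to_bcd(uint64_t x)
  {
      uint64_t acc = 0;
      for (int i = 0; i < 64; i++) {
          acc = bcd_add(acc, acc) | (x >> 63);
          x <<= 1;
      }
      return acc;
  }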

This then leaves a question of DPD<->BCD pack/unpack helper ops, ...


Ironically, of the use-cases, I suspect "using BCD ops to make printf's
%d formatting and similar faster" is probably one of the more
immediately useful ones.
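
As a hypothetical sketch of such a path (fmt_dec() and its contract are my own invention, building on the bin_to_bcd() sketch above; unsigned values below 10^16 only):

  #include <stdint.h>

  uint64_t bin_to_bcd(uint64_t x);  /* from the sketch above */

  /* Convert once to packed BCD, then peel nibbles off as ASCII digits,
     skipping leading zeros. Returns the number of characters written. */
  int fmt_dec(char *dst, uint64_t v)
  {
      uint64_t bcd = bin_to_bcd(v);
      int n = 0;
      for (int shift = 60; shift >= 0; shift -= 4) {
          unsigned d = (unsigned)(bcd >> shift) & 0xF;
          if (d != 0 || n != 0 || shift == 0)
              dst[n++] = (char)('0' + d);
      }
      dst[n] = '\0';
      return n;
  }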


> If the same load / store instructions are used to load regular
> floating-point and decimal floating-point how is the hardware
> going to know the difference? The decimal floating-point needs to
> be unpacked at load time and packed at store time.
>
> Why should the LD or ST instruction need to know ? it is just bringing
> in or pushing out containers of bits. Or are you trying to embed these
> instructions with some notion of knowing what kind of data they are
> moving about ?
>
> I have been working on decimal floating-point for another project
> and use an expanded out 128-bit DPD format to 152 bits to perform
> calculations using BCD. This means internally registers need to be
> 152 bits wide. DPD is used only at load / store time.
>
> Wide Decimal fixed point is about 1/50 the cost of DFP, and all you
> need is to figure out how to encode 32-256-bit BCD in ISA and how to
> sequence BCD arithmetic. {256-bits encodes 63 decimal digits and a
> sign} This is enough to calculate the world GDP in the least valuable
> currency {Zimbabwe}

Yeah...

(Out of scope for the mailing list, but 256-bit BCD ADD in my ISA):
  CLRT
  BCDADC R20, R4
  BCDADC R21, R5
  BCDADC R22, R6
  BCDADC R23, R7

Exercise to the reader to figure out how to do similar in an ISA without
any status bits.


Doing it in C would likely end up being a fair bit more expensive since
the C ABI does 256-bit types as pass-by-reference, so the operation
would effectively also require 4x 128-bit memory loads and 2x 128-bit
memory stores. This is unlike 128-bit value-types which are pass-by-value.

In ASM code, one could "bend the rules" in cases like this though, and
treat 256-bits as pass-by-value rather than pass-by-reference.


Multiply/divide is still a place where "the crap hits the fan" though;
it generally seems faster to go BCD->Binary->BCD in this case...
  For the 128-bit case, this is also a place where having 128-bit ALU
ops and similar can come in handy...
  Still no ideal solution for 256-bit though (these cases will almost
invariably be pretty slow).


MitchAlsup

May 23, 2022, 7:40:30 PM
to RISC-V ISA Dev, cr8...@gmail.com
On Monday, May 23, 2022 at 5:57:31 PM UTC-5 cr8...@gmail.com wrote:
On 5/23/2022 5:01 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>

> Wide Decimal fixed point is about 1/50 the cost of DFP, and all you
> need is to figure out how to encode 32-256-bit BCD in ISA and how to
> sequence BCD arithmetic. {256-bits encodes 63 decimal digits and a
> sign} This is enough to calculate the world GDP in the least valuable
> currency {Zimbabwe}

Yeah...

(Out of scope for the mailing list, but 256-bit BCD ADD in my ISA):
  CLRT
  BCDADC R20, R4
  BCDADC R21, R5
  BCDADC R22, R6
  BCDADC R23, R7


Exercise to the reader to figure out how to do similar in an ISA without
any status bits.

 
My 66000 ISA using binary integer arithmetic

          CARRY     Rt,{{I}{IO}{IO}{O}}
          ADD          R10,R20,R25
          ADD          R11,R21,R26
          ADD          R12,R22,R27
          ADD          R13,R23,R28

CARRY is an instruction-modifier that does what various status bits would/could
do (*), but without any status bits. CARRY can be used with ADD, SUB, MUL, DIV,
SL, SR, FADD, FMUL, FDIV, and SQRT. (*) And without inventing a cartesian
product of OpCodes. In essence, CARRY adds 2 bits to a 32-bit instruction to
allow for a variety of purposes. CARRY has a range of up to the next 8 instructions.

When CARRY is used in FP calculations we get exact FP arithmetic:: all of the
bits of the intermediate result are delivered as the actual result. My 66000
goes out of its way to NOT set the inexact flag when the FP result
contains all of the intermediate calculation bits. Kahan–Babuška summation
in 1 instruction.
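
For readers who want the integer case spelled out, here is a behavioral model in plain C of what the CARRY + four-ADD sequence above computes (the u256 type and names are mine, not part of the ISA):

  #include <stdint.h>

  typedef struct { uint64_t w[4]; } u256;  /* illustrative 256-bit type */

  /* 256-bit add with the carry threaded through all four 64-bit limbs,
     i.e. the result the CARRY-modified ADD chain leaves in registers. */
  u256 add256(u256 a, u256 b)
  {
      u256 r;
      unsigned carry = 0;
      for (int i = 0; i < 4; i++) {
          uint64_t s = a.w[i] + b.w[i];         /* may wrap */
          r.w[i] = s + carry;
          carry = (s < a.w[i]) | (r.w[i] < s);  /* at most one can wrap */
      }
      return r;
  }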

Doing it in C would likely end up being a fair bit more expensive since
the C ABI does 256-bit types as pass-by-reference, so the operation
would effectively also require 4x 128-bit memory loads and 2x 128-bit
memory stores. This is unlike 128-bit value-types which are pass-by-value.
 
My 66000 ABI allows for arguments of up to 8 registers in size to be passed
in registers. 


In ASM code, one could "bend the rules" in cases like this though, and
treat 256-bits as pass-by-value rather than pass-by-reference.


Multiply/divide is still a place where "the crap hits the fan" though,
generally seems faster to go BCD->Binary->BCD in this case...
 
Highly dependent on fast BCD-binary conversion {like a dedicated or
obvious way to perform repeated strings of '×10 + digit shift by digit'.}
Here you are using a "whole bunch" of multiplies to get to the point you 
can do an <ahem> multiply.

Decimal multiplication can be done "on-line" with small lookup tables
(8 bits in, 8 bits out per digit pair) and then using the decimal adders of
ADD-SUB. {The tables only need 100 entries, not the whole 256.}
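
A rough software rendering of that scheme, assuming the bcd_add() sketch from earlier, a 100-entry table holding each digit product as a packed-BCD byte, and a multiplicand limited to 15 significant digits so the tens column cannot overflow:

  #include <stdint.h>

  uint64_t bcd_add(uint64_t a, uint64_t b);  /* SWAR sketch from earlier */

  /* mul_tab[a*10 + b] = a*b as a packed-BCD byte: (tens << 4) | units. */
  static uint8_t mul_tab[100];

  static void init_mul_tab(void)
  {
      for (int a = 0; a < 10; a++)
          for (int b = 0; b < 10; b++)
              mul_tab[a * 10 + b] =
                  (uint8_t)(((a * b / 10) << 4) | (a * b % 10));
  }

  /* One row of a decimal long multiply: multiply a packed-BCD value m
     (at most 15 digits) by one decimal digit d, gathering the units and
     tens columns separately and merging them with a single BCD add. */
  uint64_t bcd_mul_digit(uint64_t m, unsigned d)
  {
      uint64_t units = 0, tens = 0;
      for (int j = 0; j < 15; j++) {
          unsigned p = mul_tab[((m >> (4 * j)) & 0xF) * 10 + d];
          units |= (uint64_t)(p & 0xF) << (4 * j);
          tens  |= (uint64_t)(p >> 4) << (4 * (j + 1));
      }
      return bcd_add(units, tens);
  }

A full multiply then walks the multiplier's digits, shifting each partial product left one nibble per position and accumulating with further BCD adds.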

I think you should be cognizant of the latency of well-implemented DFP;
a 30-cycle multiply is completely competitive.

BGB

May 23, 2022, 9:27:19 PM
to isa...@groups.riscv.org
Yeah.

Carry-chain is pretty rare in my case:
  ADC/SBB (AKA: ADDC/SUBC)
  BCDADC/BCDSBB

And, a few misc ops, like ROTCL and ROTCR.


>
> Doing it in C would likely end up being a fair bit more expensive
> since
> the C ABI does 256-bit types as pass-by-reference, so the operation
> would effectively also require 4x 128-bit memory loads and 2x 128-bit
> memory stores. This is unlike 128-bit value-types which are
> pass-by-value.
>
> My 66000 ABI allows for arguments of up to 8 registers in size to be
> passed
> in registers.

In my case, it is limited to 2 registers.
Everything bigger is pass-by-reference, which is "usually" faster, if
one assumes that the majority of activity on a type is to pass it around
from place-to-place, and only occasionally operate on it.

Granted, an operation on a pass-by-reference type means needing to
provide storage space for the result, and associated loads/stores for
the operation itself (and for every operation on that type).

It does save a little cost in that operations on larger types are
(typically) performed using runtime calls.
For example, one really doesn't want to inline the code for something
like a matrix multiply every time they want to perform a matrix
multiply; better to perform the operation via a runtime call and
pass/return values by reference, which is what is done in these
cases.


>
>
> In ASM code, one could "bend the rules" in cases like this though,
> and
> treat 256-bits as pass-by-value rather than pass-by-reference.
>
>
> Multiply/divide is still a place where "the crap hits the fan"
> though,
> generally seems faster to go BCD->Binary->BCD in this case...
>
> Highly dependent on fast BCD-binary conversion {like a dedicated or
> obvious way to perform repeated strings of '×10 + digit shift by digit'.}
> Here you are using a "whole bunch" of multiplies to get to the point you
> can do an <ahem> multiply.
>

There are various options here, one of the faster options being:
  b=(a<<3)+(a<<1);
Which can be done in ~ 2 or 3 clock cycles.

Or:
  c=(a<<3)+(a<<1)+b;
Roughly 4 cycles.

A full loop might look something like:
  uint64_t a = 0, b = src;  // b holds 16 packed BCD digits, MSD first
  for (int i = 0; i < 16; i++)
    { a = (a << 3) + (a << 1) + (b >> 60); b <<= 4; }  // a = a*10 + digit

Generic multiply is less ideal, since the relevant (64-bit) multiply
instruction takes around 66 clock-cycles in this case.

Doing Binary=>BCD is a lot faster via the ROTCL + BCDADC trick, as this
means one can do a 16-digit conversion in around 128 cycles, vs several
kilocycles.
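
Putting the two directions together, the BCD->Binary->BCD multiply route looks roughly like this (a sketch reusing the bin_to_bcd() routine from earlier; the product must stay below 10^16 to fit back into 16 digits):

  #include <stdint.h>

  uint64_t bin_to_bcd(uint64_t x);  /* from the earlier sketch */

  /* Packed BCD -> binary, MSD first: a = a*10 + digit per step, with the
     *10 done as shifts, as in the loop above. */
  static uint64_t bcd_to_bin(uint64_t b)
  {
      uint64_t a = 0;
      for (int i = 0; i < 16; i++) {
          a = (a << 3) + (a << 1) + (b >> 60);
          b <<= 4;
      }
      return a;
  }

  uint64_t bcd_mul(uint64_t x, uint64_t y)
  {
      return bin_to_bcd(bcd_to_bin(x) * bcd_to_bin(y));
  }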


> Decimal multiplication can be done "on-line" with small look tables,
> (pair of 8-bit in 8-bit out) and then use the decimal adders of ADD-SUB.
> {Tables only have 100 enums not the whole 256}
>
> I think you should be cognizant of the latency of well implemented DFP;
> a 30 cycle multiply is completely competitive.........

Admittedly, I was thinking more like around 500 cycles vs around 10k
cycles...

But having types whose fundamental operations are measured in
kilocycles is almost a sure-fire way to ensure that no one uses them.

MitchAlsup

May 23, 2022, 10:00:46 PM
to RISC-V ISA Dev, cr8...@gmail.com
If the user wants it passed by reference they can use &argument; if by
value, 'argument'.

On the other hand, say a subroutine receives a multi-register argument
and just needs to call another function with it. Careful choice of the
argument list means the second call can be performed without argument
overhead.

Granted, an operation on a pass-by-reference type means needing to
provide storage space for the result, and associated loads/stores for
the operation itself (and for every operation on that type).

My 66000 ABI provides means such that dumping arguments into memory (say,
for varargs) is easily performed during the prologue. When used to support
varargs, the register-passed arguments end up concatenated with the
memory-based arguments as a single vector of arguments.
 
Does save a little cost in that (typically) operations on larger types
are performed using runtime calls.
For example, one really doesn't want to inline the code for something
like a matrix multiply every time they want to perform a matrrix
multiply; better to perform the operation via a runtime call and
pass/return values by reference, which is what is what is done in these
cases.

I guess actually obeying the programmer has fallen out of fashion for compilers. 

>
>
> In ASM code, one could "bend the rules" in cases like this though,
> and
> treat 256-bits as pass-by-value rather than pass-by-reference.
>
>
> Multiply/divide is still a place where "the crap hits the fan"
> though,
> generally seems faster to go BCD->Binary->BCD in this case...
>
> Highly dependent on fast BCD-binary conversion {like a dedicated or
> obvious way to perform repeated strings of '×10 + digit shift by digit'.}
> Here you are using a "whole bunch" of multiplies to get to the point you
> can do an <ahem> multiply.
>

There are various options here, one of the faster options being:
  b=(a<<3)+(a<<1);
Which can be done in ~ 2 or 3 clock cycles.

Or:
  c=(a<<3)+(a<<1)+b;
Roughly 4 cycles.

I am a HW guy: "current = current×10+new;" should be 1 cycle.
That is one reason these should be done in HW.


A full loop might look something like:
  a=0; b=src;
  for(i=0; i<16; i++)
    { a=(a<<3)+(a<<1)+(b>>60); b<<=4; }

Generic multiply is less ideal, since the relevant (64-bit) multiply
instruction takes around 66 clock-cycles in this case.

Even my first CPU design had 32×32 IMUL in 3 cycles. 66 means you aren't
even bothering to try. 


Doing Binary=>BCD is a lot faster via the ROTCL + BCDADC trick, as this
means one can do a 16-digit conversion in around 128 cycles, vs several
kilocycles.

HW can do a 16-digit conversion in 16-17 cycles, depending on how much logic
you want to throw at the problem.

> Decimal multiplication can be done "on-line" with small look tables,
> (pair of 8-bit in 8-bit out) and then use the decimal adders of ADD-SUB.
> {Tables only have 100 enums not the whole 256}
>
> I think you should be cognizant of the latency of well implemented DFP;
> a 30 cycle multiply is completely competitive.........

Admittedly, I was thinking more like around 500 cycles vs around 10k
cycles...

While I am thinking 64 cycles. 


But, having types where fundamental operations are measured in
kilocycles is almost a sure-fire way to make it to where no one uses them.

BTW: you mentioned 1K cycles. My 66000 can perform a context switch
from a task running under Guest OS[0] into a thread running under Guest OS[k]
(k != 0) in 10 cycles. These are measured on x86 CPUs in the 10K-cycle range,
where context switches within a Guest OS are in the 1K-cycle range.

But there is a big unexpected trick required.

Samuel Falvo II

May 25, 2022, 1:46:39 AM
to MitchAlsup, RISC-V ISA Dev, cr8...@gmail.com
Can we please just focus on the technical issues, and not be so
dismissive of others? Whether a 64-bit multiply takes 66 cycles or 3
is a matter of engineering priorities. Maybe 66 is "good enough" for
the application of their chip, and they'd rather devote more
real estate to other things.

--
Samuel A. Falvo II

BGB

May 25, 2022, 4:27:04 AM
to Samuel Falvo II, MitchAlsup, RISC-V ISA Dev
On 5/25/2022 12:46 AM, Samuel Falvo II wrote:
> Can we please just focus on the technical issues, and not be so
> dismissive of others? Whether a 64-bit multiply takes 66 cycles or 3
> is a matter of engineering priorities. Maybe 66 is "good enough" for
> the application of their chip, and they'd rather devote more
> real-estate for other things.

Yeah.

Until fairly recently, there was no 64-bit multiply, nor an integer
divider, and so these operations were being faked in software.
This was "mostly acceptable" as these operations tend to happen
relatively infrequently (so even if slow, their impact tends to be small).

The same unit was also used to implement a 122-cycle FDIV (Binary64)
via a little bit of trickery. Generally, floating-point divide was also
being handled in software (and still is in many cases).

While not particularly fast, the use of a shift-add / shift-subtract unit
is at least reasonably cost-effective.
In my case, I am mostly targeting FPGAs (mostly the Spartan and Artix
families), and resource cost is a concern.


There are decided limits to how much one can fit onto an Artix-7
FPGA, and to what can pass timing constraints at a reasonable clock
frequency (I am mostly stuck at 50 MHz, because one needs to make some
pretty severe compromises to pass timing at 75 or 100 MHz).



In the RISC-V alt-mode, this also allows supporting the 'M' extension;
though much beyond RV64IM, the two ISAs diverge considerably in terms of
basic feature-set.


I do spend resource budget on lots of stuff, but usually stuff that pays
off in some way. This has led to a lot of "kind of wonky" feature
trade-offs though, given the metrics were often "how common is it" and
"how much does it cost in terms of LUT budget and timing" (as opposed to
either "elegance" or "orthogonality").


If an instruction is only eating around 0.05% or 0.01% of the
clock-cycle budget (this is where 64-bit multiply and divide are
currently at), it isn't a high priority.

I have a lot of other arguably more wonky/niche edge cases that eat a
lot more clock cycles than this.


For example, I recently added a special-purpose instruction to load and
extract a texel from a compressed texture (from memory) using an ST
coordinate vector as an index (with the base register encoding the
texture size and format via "tag bits"), collapsing a series of 5
instructions (which already included some special-purpose helper ops)
into a single instruction (3-cycle latency, 1-cycle throughput).

But, as wonky as this instruction is, when doing software-rasterized
OpenGL it still seems to end up hanging around in the top 20 list (in
terms of a ranking of which instructions are eating most of the
clock-cycle budget). (Other common operations in this case being those
related to interpolation, modulation, and blending; mostly SIMD ops).

Though, in general, memory load/store ops are almost always at the top
of this ranking.


Whether or not any of this makes sense for a "general-purpose" ISA is
debatable.

In some cases, one may find that cost/benefit tradeoffs can get a bit
counterintuitive.

James Cloos

May 27, 2022, 9:48:48 AM
to isa...@groups.riscv.org, MitchAlsup, Robert Finch, Nick Knight, kane...@gmail.com
btw,

i suspect a more useful idea is to make L be the first extension to use
48-bit ops. a 32-bit chunk could be set aside for alternate math such as
L and a posit-in-parallel-with-FDQ extension (as opposed to a
posit-instead-of-ieee extension). the notes about yet more state and context
switching suggest a separate opcode range would be better. and a start
into 48-bit can only help hw devs make sure their code works beyond 32-bit
ops.

L, as a concept, does have (at least) some value.

-JimC
--
James Cloos <cl...@jhcloos.com> OpenPGP: 0x997A9F17ED7DAEA6