
Dec 2, 2022, 8:28:55 AM

Suppose you are implementing a BCD multiplier. Google says there are reasonably efficient known circuits for this. Say 8x8 -> 16 digits, and trying for a reasonable compromise between speed and number of logic gates.

And suppose the CPU in which this unit is located, also wants reasonably fast binary 32x32 -> 64 bit multiplication. Same input and output registers, same width operands, as the BCD counterpart.

Can the BCD and binary multipliers share circuitry? I know the answer to the corresponding question for adders is yes, because that was done in the 6502, but I have no idea whether the same is true for multipliers.


Dec 2, 2022, 7:04:01 PM

Yes, and no.

<

Yes, because it is theoretically possible to make a handful of Carry-Save Adders do both 1:15 and 1:9 arithmetic

<

No, because it makes the component from which you built the multiplier tree (array) bigger and slower.

<

Most designs that need both have a tree dedicated to binary multiplication and a tree dedicated to decimal multiplication.

<

Some of these decimal things are optimized for 128-bit Decimal arithmetic, while few binary multipliers are optimized for anything more than IEEE 754 double precision multiplications.

<

And finally, we are in an era where we can put several handfuls of cores on a die, each core can have just about as much calculation logic as the designers can figure out how to utilize. So, why not both--the size of the additional multiplication tree is hardly a rounding error in the die size of a GBOoO design.

Dec 3, 2022, 8:28:52 AM

On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:

> Yes, and no.

> <

> Yes, because it is theoretically possible to make a handful of Carry-Save Adders do both 1:15 and 1:9 arithmetic

> <

> No, because it makes the component from which you built the multiplier tree (array) bigger and slower.

Can you put even approximate numbers on the costs?

Like, say a 32x32 -> 64 binary multiplier takes area 1.0.

And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.

And a combined multiplier takes area 1.6 (compared to the sum of the dedicated ones that would be 2.2) and is 3 gate delays slower.

I just guessed those figures as examples; what would more accurate ones look like?

> And finally, we are in an era where we can put several handfuls of cores on a die, each core can have just about as much calculation logic as the designers can figure out how to utilize. So, why not both--the size of the additional multiplication tree is hardly a rounding error in the die size of a GBOoO design.

Dec 3, 2022, 12:42:04 PM

On Saturday, December 3, 2022 at 7:28:52 AM UTC-6, Russell Wallace wrote:

> On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:

> > Yes, and no.

> > <

> > Yes, because it is theoretically possible to make a handful of Carry-Save Adders do both 1:15 and 1:9 arithmetic

> > <

> > No, because it makes the component from which you built the multiplier tree (array) bigger and slower.

> Can you put even approximate numbers on the costs?

>

> Like, say a 32x32 -> 64 binary multiplier takes area 1.0.

>

> And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.

>

> And a combined multiplier takes area 1.6 (compared to the sum of the dedicated ones that would be 2.2) and is 3 gate delays slower.

<

The first two are "close enough"; good intuition.

<

The combined multiplier takes area 2.0 and is ½ as fast (as a starting point guess).

Dec 3, 2022, 1:01:28 PM

Russell Wallace <russell...@gmail.com> writes:

>On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:
>> Yes, and no.
>> <
>> Yes, because it is theoretically possible to make a handful of Carry-Save
>> Adders do both 1:15 and 1:9 arithmetic
>> <
>> No, because it makes the component from which you built the multiplier
>> tree (array) bigger and slower.
>
>Can you put even approximate numbers on the costs?
>
>Like, say a 32x32 -> 64 binary multiplier takes area 1.0.
>
>And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.
>
>And a combined multiplier takes area 1.6 (compared to the sum of the
>dedicated ones that would be 2.2) and is 3 gate delays slower.
>
>I just guessed those figures as examples; what would more accurate ones
>look like?
>
>> And finally, we are in an era where we can put several handfuls of
>> cores on a die, each core can have just about as much calculation logic
>> as the designers can figure out how to utilize. So, why not both--the
>> size of the additional multiplication tree is hardly a rounding error
>> in the die size of a GBOoO design.
>
>Right, that's definitely true. I'm just thinking about a hypothetical
>situation in which a design is heavily constrained by die area, so it's
>difficult to find enough for both. (Specifically, 1980s alternate
>history; it seems to me that around the mid eighties, transistor count
>would be at the point where a multiplier tree was possible but still
>expensive.)

The Burroughs medium systems line could perform an arithmetic operation
on up to 100 digit/nibble (BCD) operands with a single instruction. This was

implemented originally in discrete (transistor) logic in the B3500 circa

1964-5, then custom SSI logic in the second generation, and

standard TTL dip logic in the third and finally in gate arrays (1984-5)

for the fourth (and final) generation V530 system.

Addition was performed starting from the most significant digit which

allowed it to detect overflow before altering the receiving field. The

algorithm was basically to zero extend the shorter of the two operands

and then start adding each digit starting from the MSD. If the first

digits sum to greater than 9, overflow has occurred and the operation

terminates with the Overflow Flag set. Otherwise, if the value is

nine, a counter is incremented and the algorithm proceeds to the next

digit. If the value is less than 9, the counter is cleared. If

a subsequent digit position sums to greater than 9, and the 9's

counter is non-zero, an overflow has occurred and the operation

is terminated with OFL set. Otherwise the result digit is written

to the receiving field.
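In C, the nines-counter scheme described above might be modeled roughly as follows. This is my own reconstruction for illustration, not the actual B3500 flowchart (see page 51 of the bitsavers manual for that); `bcd_add_msd` and its digit-array interface are invented here, and digits are stored most-significant first:

```c
/* MSD-first decimal addition with a "nines counter" (sketch).
 * A carry generated at position i ripples leftward through the run of
 * 9s just written; if that run reaches back to the MSD, the carry
 * falls off the end and we have overflow.
 * Returns 0 on success, -1 on overflow (OFL). */
static int bcd_add_msd(const int *a, int na, const int *b, int nb,
                       int *out, int n)  /* n = MAX(na, nb) */
{
    int nines = 0;                    /* length of the run of 9s just written */
    for (int i = 0; i < n; i++) {
        /* zero-extend the shorter operand on the left */
        int da = (i < n - na) ? 0 : a[i - (n - na)];
        int db = (i < n - nb) ? 0 : b[i - (n - nb)];
        int s = da + db;
        if (s > 9) {                  /* carry must ripple leftward */
            if (nines == i)           /* all digits to the left are 9s */
                return -1;            /* overflow */
            out[i - nines - 1] += 1;  /* absorbed by the digit before the run */
            for (int j = i - nines; j < i; j++)
                out[j] = 0;           /* the 9s roll over to 0 */
            s -= 10;
            nines = 0;
        }
        out[i] = s;
        nines = (s == 9) ? nines + 1 : 0;
    }
    return 0;
}
```

For example, 199 + 1 ripples a carry through the run of 9s into the leading 1, giving 200, while 99 + 1 in a two-digit field sets overflow.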

ADD AFBF A B C

Where AF is the A operand field length (2 BCD digits) and

BF is the B operand field length. The receiving field length

is MAX(AF, BF).

ADD/SUB/MUL/DIV were all supported for UN (unsigned numeric),

SN (signed number - MSD was sign digit) or UA (unsigned alphanumeric).

UA data was 8-bit (and aligned on even address boundaries); for UA

operands, the arithmetic logic ignored the zone digit and performed

the arithmetic on the numeric digits. The receiving field could be

designated differently from the operand fields. So a UA operand could be

multiplied by a UN operand yielding a SN result. The UA fields

(EBCDIC or ASCII encoding) could then be placed directly to the correct

position in a 132-byte long receiving field. This was, after all,

designed specifically to efficiently support COBOL.

Note that addressing in this system was to the 4-bit digit (nibble)

(middle generations used a 16-bit (4-digit) memory access size, later generations

used a 40-bit (10 digit) memory access size to processor cache).

1025475_B2500_B3500_RefMan_Oct69.pdf is on bitsavers, with the

algorithm flowchart on page 51.

Dec 3, 2022, 1:21:09 PM

Russell Wallace <russell...@gmail.com> writes:

> (Specifically, 1980s alternate history; it seems
> to me that around the mid eighties, transistor count would be at the point
> where a multiplier tree was possible but still expensive.)

The MIPS R3000 (1988) has a 12-cycle non-pipelined multiply (and I
think it inherited this from the R2000 (1986)), so I don't think they
have a multiplier in the modern sense. Early SPARC and HPPA CPUs do
not have an integer multiplier at all. On the 80386 (1985) a 32x32
multiply takes 9-38 cycles, and on a 486 (1989) 13-42 cycles, so they
do not have a modern multiplier, either.

The time for that was slightly later. The 88100 (1988) has a
four-stage multiplier pipeline (for integer and FP), where the integer
part is done after two cycles.

Concerning your question, if you need BCD numbers and need
multiplication of that, the best way IMO is to have a fast BCD->binary
conversion and fast binary->BCD conversion (maybe with hardware
assist), and do the multiplication (and other computations) in binary.

- anton

--

'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'

Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Dec 3, 2022, 1:23:33 PM

On Saturday, December 3, 2022 at 5:42:04 PM UTC, MitchAlsup wrote:

> The combined multiplier takes area 2.0 and is ½ as fast (as a starting point guess).

Ah, then I see what you mean about it being not worthwhile to combine them. Thanks!

Dec 3, 2022, 1:29:28 PM

On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:

> Concerning your question, if you need BCD numbers and need

> multiplication of that, the best way IMO is to have a fast BCD->binary

> conversion and fast binary->BCD conversion (maybe with hardware

> assist), and do the multiplication (and other computations) in binary.

Hmm! It seems to me that each of those conversions is about as expensive as a multiplication, so I'm not seeing the overall saving. Unless you're assuming there will be many calculations to be done before converting back to BCD? That would make sense, except that part of the point of BCD is being able to specify exact decimal rounding, which would require back conversion after each step.

An exception would be something like a transcendental function, where exact decimal rounding is moot, and there are indeed many calculations to be done in the middle. Then it could make sense to convert to binary for the calculation.

Dec 3, 2022, 1:58:03 PM

Russell Wallace <russell...@gmail.com> writes:

>On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
>> Concerning your question, if you need BCD numbers and need
>> multiplication of that, the best way IMO is to have a fast BCD->binary
>> conversion and fast binary->BCD conversion (maybe with hardware
>> assist), and do the multiplication (and other computations) in binary.
>
>Hmm! It seems to me that each of those conversions is about as expensive
>as a multiplication, so I'm not seeing the overall saving. Unless you're
>assuming there will be many calculations to be done before converting
>back to BCD? That would make sense, except that part of the point of BCD
>is being able to specify exact decimal rounding, which would require
>back conversion after each step.

The conversions aren't cheap. Here's a patent by one of my ex-colleagues

to do bcd to binary conversions:

https://patents.google.com/patent/US4331951

(The machine Laury was working on for this patent was the

third generation system, the B4900; the D2B and B2D instructions

weren't part of the first or second generations).

Dec 3, 2022, 3:21:03 PM

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> Russell Wallace <russell...@gmail.com> writes:
>> (Specifically, 1980s alternate history; it seems
>> to me that around the mid eighties, transistor count would be at the point
>> where a multiplier tree was possible but still expensive.)
>
> The MIPS R3000 (1988) has a 12-cycle non-pipelined multiply (and I
> think it inherited this from the R2000 (1986)), so I don't think they
> have a multiplier in the modern sense. Early SPARC and HPPA CPUs do
> not have an integer multiplier at all. On the 80386 (1985) a 32x32
> multiply takes 9-38 cycles, and on a 486 (1989) 13-42 cycles, so they
> do not have a modern multiplier, either.
>
> The time for that was slightly later. The 88100 (1988) has a
> four-stage multiplier pipeline (for integer and FP), where the integer
> part is done after two cycles.
>
> Concerning your question, if you need BCD numbers and need
> multiplication of that, the best way IMO is to have a fast BCD->binary
> conversion and fast binary->BCD conversion (maybe with hardware
> assist), and do the multiplication (and other computations) in binary.

Converting decimal to binary is straightforward, you need quite a few
carry-save adders (plus shifting).

Converting binary to decimal is much more time-consuming. I believe
the standard way is to use the "multiply by the inverse" method to
divide by 5, which sort of defeats the purpose of having it behind
a multiplier.

If somebody has invented a faster way, I'd like to know about it :-)

(And yes, I know the method of first calculating the remainder of
the division by 5 by shifting and adding, which could also be done
with carry-save adders, but it would still need one multiplication
per digit).
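The "multiply by the inverse" method looks like this in C (a sketch; the constant 0xCCCCCCCD is the well-known rounded-up reciprocal ceil(2^35/10), exact for every 32-bit input, and `div10`/`to_decimal` are illustrative names):

```c
#include <stdint.h>

/* Divide by 10 via a multiply-high with the rounded-up reciprocal
 * ceil(2^35/10) = 0xCCCCCCCD; exact for all 32-bit n. */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Binary -> decimal digit extraction: one multiply per digit, which is
 * exactly the per-digit cost being discussed above.  Writes the digits
 * of n into buf (NUL-terminated) and returns the digit count. */
static int to_decimal(uint32_t n, char *buf)
{
    char tmp[10];
    int len = 0;
    do {
        uint32_t q = div10(n);
        tmp[len++] = (char)('0' + (n - q * 10));  /* remainder digit */
        n = q;
    } while (n != 0);
    for (int i = 0; i < len; i++)   /* reverse into printing order */
        buf[i] = tmp[len - 1 - i];
    buf[len] = '\0';
    return len;
}
```

Each digit costs one widening multiply plus a small back-multiply for the remainder, which is why conversion of a many-digit number is in the same ballpark as a multiplication.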

Dec 4, 2022, 4:14:35 AM

Russell Wallace <russell...@gmail.com> writes:

>Hmm! It seems to me that each of those conversions is about as expensive
>as a multiplication

Possibly.

>Unless you're assuming there will be many calculations to be done before
>converting back to BCD?

Yes.

>That would make sense, except that part of the point of BCD is being
>able to specify exact decimal rounding, which would require back
>conversion after each step.

It's not clear to me what you mean with "exact decimal rounding", but
I am sure that one can also do it on the binary representation. The
typical rounding rules in commerce were designed for finite-precision
fixed point, which can be represented by binary integers just fine.

The only reason to convert between binary and decimal is I/O.
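The point that commercial rounding rules work fine on binary integers can be illustrated with fixed-point amounts in cents (a hedged sketch; `pct_half_up` is a made-up helper, and values are assumed non-negative and small enough not to overflow):

```c
#include <stdint.h>

/* "Round half up" on a binary integer amount held in cents:
 * computes cents * num/den with halves rounded upward, entirely in
 * binary integer arithmetic -- no decimal representation needed. */
static int64_t pct_half_up(int64_t cents, int64_t num, int64_t den)
{
    return (cents * num * 2 + den) / (2 * den);
}
```

For example, 5% of $123.45 is 617.25 cents, which rounds to 617; 5% of $123.50 is 617.50 cents, which rounds up to 618.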

Dec 4, 2022, 7:40:23 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> Russell Wallace <russell...@gmail.com> writes:
>> Hmm! It seems to me that each of those conversions is about as
>> expensive as a multiplication
>
> Possibly.

Much more so.

Like I wrote, decimal->binary is just a carry-save tree.

Binary->decimal is much more difficult. You almost have to do
several rounds of (quotient, remainder) operations, probably using
packages of more than one digit. Remainder is relatively cheap
and can be done with a carry-save adder, conditionally summing up the
residuals (2^n) mod (10^m) for bit n and m the number of decimal
digits you want to do at a time.

But you need a division by 10^m, and if anybody has found a better
way than using the high bits of long multiplication, I haven't
seen it described anywhere.

This makes me understand that people did decimal computers for a time
:-)

Dec 4, 2022, 7:56:51 AM

The answer must be yes, since even I can see at least one way to do it:
Using a binary multiplier with groups of 4x4 input bits generating 8-bit
products, those same groups can obviously also take BCD inputs. :-)

Terje

--

- <Terje.Mathisen at tmsw.no>

"almost all programming can be viewed as an exercise in caching"

Dec 4, 2022, 8:16:44 AM

On Friday, December 2, 2022 at 6:28:55 AM UTC-7, Russell Wallace wrote:

> Can the BCD and binary multipliers share circuitry? I know the answer to the

> corresponding question for adders is yes, because that was done in the 6502,

> but I have no idea whether the same is true for multipliers.

Since it is true for adders, it's also true for multipliers if
multiplication is only being done by addition.

If you are using a faster multiplier design, though, like a tree of carry-save adders,

it's harder to use shared circuitry.

John Savard

Dec 4, 2022, 11:22:13 AM

Decimal to binary conversion is simple since the naive algorithm is
fast (2-4 clock cycles/digit), but for long inputs you want SIMD which
can at least double the speed.

The reverse operation traditionally requires a DIV & MOD operation for

every digit, but I posted an algorithm here more than two decades ago

that does it far faster, at around 20-50 clock cycles in total depending

upon the maximum number of digits allowed.

For BCD with binary mantissa you maintain a decimal exponent which makes

it easy to detect when you need to perform any rescaling and/or decimal

rounding. This is the allowed BID format according to the DFP part of

the IEEE 754-2019 standard.

Dec 4, 2022, 11:41:59 AM

Scott Lurndal wrote:
> Russell Wallace <russell...@gmail.com> writes:
>> On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
>>> Concerning your question, if you need BCD numbers and need
>>> multiplication of that, the best way IMO is to have a fast BCD->binary
>>> conversion and fast binary->BCD conversion (maybe with hardware
>>> assist), and do the multiplication (and other computations) in binary.
>>
>> Hmm! It seems to me that each of those conversions is about as expensive
>> as a multiplication, so I'm not seeing the overall saving. Unless you're
>> assuming there will be many calculations to be done before converting
>> back to BCD? That would make sense, except that part of the point of BCD
>> is being able to specify exact decimal rounding, which would require
>> back conversion after each step.
>
> The conversions aren't cheap. Here's a patent by one of my ex-colleagues
> to do bcd to binary conversions:
>
> https://patents.google.com/patent/US4331951
>
> (The machine Laury was working on for this patent was the
> third generation system, the B4900; the D2B and B2D instructions
> weren't part of the first or second generations).

I did not and will not read that patent, even assuming it is long expired.

Today you can convert a 32-digit ascii (or BCD) value to binary using a
single AVX (32-byte) register to hold the input:

0) If BCD, expand the nybbles to bytes, will still fit inside that reg.

1) Expand the 32 bytes into a hi/lo pair of 16 u16 slots: the first will
be worth 1e16 times more than the second when combining them back
together at the end.

2) For each half, multiply successive u16 slots by (10,1,10,1...), then

add each pair together with a horizontal add. This gives us 32 bits

available for each 2-digit pair.

3) Multiply those u32 slots by (10000,1,10000,1...) and add

horizontally. The results will all be less than 1e8 so they fit nicely

4) Now we have 64 bits available so we can safely multiply by

(1e8,1,1e8,1...) and give us 16-digit intermediate result.

5) Extract the high u64 from the SIMD reg and multiply by 1e16 (using a
64x64->128 MUL), then ADD/ADC the low u64 from the SIMD reg.

To me this looks like 4 multiplications each taking about 5 cycles, plus

some overhead so maybe 30+ clock cycles?

BTW, if the input actually had 34 digits, as in a full DFP128 register,

then I would simply convert those top two digits using integer regs,

which would almost certainly run in parallel with/overlap the SIMD calcs

for no extra real cost.
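A scalar C model of the same combining tree (my sketch, not the SIMD code itself): fold adjacent values with weight 10, then 1e2, then 1e4, then 1e8, one multiply level per doubling of the chunk size, just as steps 2-5 do across the vector lanes.

```c
#include <stdint.h>

/* Pairwise combining tree: digits[] is MSD-first and n must be a
 * power of two (pad with leading zeros otherwise).  The result must
 * fit in 64 bits, i.e. at most 19 significant digits. */
static uint64_t digits_to_u64(const uint8_t *digits, int n)
{
    uint64_t v[32];
    for (int i = 0; i < n; i++)
        v[i] = digits[i];
    uint64_t w = 10;
    while (n > 1) {
        for (int i = 0; i < n / 2; i++)
            v[i] = v[2 * i] * w + v[2 * i + 1];  /* hi * weight + lo */
        n /= 2;
        w *= w;                                  /* 10 -> 1e2 -> 1e4 -> 1e8 */
    }
    return v[0];
}
```

The SIMD version does each `v[2*i]*w + v[2*i+1]` level as one multiply plus one horizontal add across all lanes at once, which is where the ~30-cycle total comes from.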

Dec 4, 2022, 11:52:46 AM

working modulo 1e5 for u32 and 1e10 for u64. For a full 34-digit DFP

value, the mantissa can be split into three or four parts.

The way it works is by scaling each binary chunk which effectively

converts it into fixed-point value with the top digit in the top 4 bits

of a register and the remaining bits containing the fraction. After

extracting the top digit you mask those bits, multiply the remainder by

5(! not 10) and move the extraction window one bit down. These bits now

contain the next digit. Mul-by-5 happens with LEA or shift+add, so one

or two cycles.

Since you have multiple chunks to work on, all the execution units can

be kept busy spitting out 2-4 digits every 2-3 cycles.
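A reconstruction of that scaled fixed-point extraction for one 5-digit chunk (a sketch under my own choice of constants; the function name and the 57-bit window are mine, not the originally posted code): position the chunk so its top decimal digit lands in the top 4 bits above the fraction, then repeatedly mask the digit off, multiply the fraction by 5, and move the window down one bit (x5 with a 1-bit narrower window is x10 overall).

```c
#include <stdint.h>

/* Extract the 5 decimal digits of n < 100000, MSD first. */
static void chunk5_to_digits(uint32_t n, uint8_t out[5])
{
    int b = 57;                              /* fraction bits below the digit */
    /* rounded-up scale 2^57/10^4: top digit ends up in bits b..b+3 */
    uint64_t x = (uint64_t)n * ((1ull << b) / 10000 + 1);
    for (int i = 0; i < 5; i++) {
        out[i] = (uint8_t)(x >> b);          /* current top digit */
        x &= (1ull << b) - 1;                /* keep only the fraction */
        x *= 5;                              /* mul-by-5: LEA or shift+add */
        b -= 1;                              /* window moves one bit down */
    }
}
```

The scaling error is small enough (below 1e-12 initially, growing by 10x per digit) that every digit comes out exact for all n < 1e5; several such chunks can run in parallel across the execution units, as described above.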

Dec 4, 2022, 6:21:25 PM

Thomas Koenig <tko...@netcologne.de> writes:

>Converting binary to decimal is much more time-consuming.

Apart from Terje Mathisen's approach, there is also the following
approach:

First, we need an operation for doubling a decimal number d, i.e.,

2*d. This can be implemented based on decimal addition and should

work in one cycle (right?). To convert a binary number to decimal, we

do:

/* input: binary */ uint64_t b=...

/* output: */ udecimal d=0;

for (int i=0; i<64; i++) {

d = 2*d + (udecimal)(b>>63); /* decimal * and + */

b <<= 1;

}

Once this loop is finished, d contains the decimal representation of

b. Ok, this takes at least 64 cycles if 2*d+[0..1] takes one cycle.

How can we make it faster?

The structure of 64 staggered additions is like in multiplication (but

the problem is probably simpler), so I would not be surprised if the

same techniques as for full multipliers could be applied here to get

something that runs in maybe 4 cycles.

Even if that is infeasible or if we don't want to pay the area price

for that, one may be able to transfer more than one bit per cycle from

b to d, reducing the number of cycles necessary.

Also, b could be split into b/10^8 and b%10^8, and the parts could be

converted in parallel, and combined in the end.
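The 2*d decimal doubling step can be modeled in software with the classic add-3 ("double dabble") adjustment (a sketch of the loop above; a hardware implementation would do the per-nibble correction in parallel in one cycle):

```c
#include <stdint.h>

/* One decimal doubling of a packed-BCD value: add 3 to every nibble
 * >= 5 so the binary left shift produces the correct decimal carries,
 * then shift in the next input bit. */
static uint64_t bcd_double(uint64_t d, unsigned bit)
{
    uint64_t adj = 0;
    for (int i = 0; i < 16; i++)
        if (((d >> (4 * i)) & 0xF) >= 5)
            adj |= (uint64_t)3 << (4 * i);
    return ((d + adj) << 1) | bit;
}

/* The loop above: feed the binary value in MSB first, doubling the
 * decimal accumulator each step.  b must be < 10^16 so the result
 * fits in 16 packed-BCD digits. */
static uint64_t bin_to_bcd(uint64_t b)
{
    uint64_t d = 0;
    for (int i = 63; i >= 0; i--)
        d = bcd_double(d, (unsigned)((b >> i) & 1));
    return d;
}
```

This is exactly the 64-iteration structure being discussed; the speedups mentioned (multiple bits per cycle, a multiplier-style tree, or splitting b and converting the parts in parallel) all attack this serial chain.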

Dec 4, 2022, 7:35:55 PM

https://groups.google.com/g/comp.arch/c/APbcPjoqs_g/m/pAjyaKjAsDIJ

28 clocks on a computer that was considered fast 10 years ago.

Several almost portable variants in GNU dialect of C from 5 years ago:

https://github.com/already5chosen/bin2ascii

The fastest variant runs in 52 clocks on the same computer that was considered

fast 10 years ago. It runs in 43 cycles on a computer that was considered fast

(in IPC, not in absolute speed) 5 years ago.

All variants assume full 64-bit binary input == 20-digit output.

I never bothered to upgrade SSE to AVX2, because I can't imagine

a situation where 28 clocks are not sufficiently fast.

Dec 5, 2022, 4:33:22 AM

>/* input: binary */ uint64_t b=...
>/* output: */ udecimal d=0;

int i = clz(b); /* skip the leading zero bits */
b <<= i;

for (; i<64; i++) {
  d = 2*d + (udecimal)(b>>63); /* decimal * and + */
  b <<= 1;
}

Dec 5, 2022, 4:58:03 PM

On 12/4/2022 6:56 AM, Terje Mathisen wrote:

> Russell Wallace wrote:

>> Suppose you are implementing a BCD multiplier. Google says there are

>> reasonably efficient known circuits for this. Say 8x8 -> 16 digits,

>> and trying for a reasonable compromise between speed and number of

>> logic gates.

>>

>> And suppose the CPU in which this unit is located, also wants

>> reasonably fast binary 32x32 -> 64 bit multiplication. Same input and

>> output registers, same width operands, as the BCD counterpart.

>>

>> Can the BCD and binary multipliers share circuitry? I know the answer

>> to the corresponding question for adders is yes, because that was

>> done in the 6502, but I have no idea whether the same is true for

>> multipliers.

>

> The answer must be yes, since even I can see at least one way to do it:

> Using a binary multiplier with groups of 4x4 input bits generating 8-bit

> products, those same groups can obviously also take BCD inputs. :-)

>

Though, adding the intermediate results together, and everything from
there on, would need to be different...

AFAIK, you can't just multiply two large BCD numbers together as binary

numbers and then fixup the results afterwards (though, this would be

convenient).

In this case, the BCD adder tree would likely be a bigger issue than the

digit-for-digit BCD multipliers (which, also annoyingly, probably

wouldn't fit all that efficiently in the available LUT sizes on a

typical FPGA).

Most efficient configuration I can currently think up would need ~ 11

LUTs per BCD digit pair. This would first mash the pattern from 8 to 7

bits, and could then use 8x LUT7->1, noting that { A[3:1], B[3:1] }

can be mashed from 6-bits to 5-bits (as they only cover the range 000..100).

Most other patterns I can think up would need significantly more LUTs

per digit.
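Whatever the LUT packing, the function being packed is small; a Python sketch of the 100-entry digit-pair product table (illustrative only, not the LUT mapping described above), with each product split into its high and low BCD digits:

```python
# Each BCD digit pair (a, b) maps to a two-digit BCD product: a*b is at
# most 81, so the high digit is 0..8 and the low digit 0..9.
bcd_digit_product = {
    (a, b): ((a * b) // 10, (a * b) % 10)
    for a in range(10) for b in range(10)
}

hi, lo = bcd_digit_product[(7, 9)]   # 7 * 9 = 63 -> high digit 6, low digit 3
```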

Though, I have noted that a "BCDADD" instruction could potentially be

used to implement a BCD multiplier Shift-ADD style.

Say:

R2=Output, R4=Value A, R5=Value B

MOV 0, R2 //init to zeroes

SHLD.Q R4, 0, R16 //first position

SHLD.Q R5, 0, R17 //first digit

TEST R17, R17

BT .done //no more digits remain

TEST 1, R17

BCDADD?F R16, R2

BCDADD R16, R16

TEST 2, R17

BCDADD?F R16, R2

BCDADD R16, R16

TEST 4, R17

BCDADD?F R16, R2

BCDADD R16, R16

TEST 8, R17

BCDADD?F R16, R2

BCDADD R16, R16

SHLD.Q R4, 4, R16 //second position

SHLD.Q R5, -4, R17 //second digit

TEST R17, R17

BT .done //no more digits remain

TEST 1, R17

BCDADD?F R16, R2

BCDADD R16, R16

TEST 2, R17

BCDADD?F R16, R2

BCDADD R16, R16

TEST 4, R17

BCDADD?F R16, R2

BCDADD R16, R16

TEST 8, R17

BCDADD?F R16, R2

BCDADD R16, R16

... //for each digit of the second operand

Could also be turned into a loop (for each BCD digit).

Here, R16 would need to be reset every digit mostly because 4x BCDADD

would effectively multiply the value by 16 (decimal), but this could be

skipped if multiplying a BCD number with a binary number.
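The listing above can be modeled behaviorally in Python (a sketch on packed-BCD integers; the digit-serial `bcd_add` helper is a hypothetical stand-in for the BCDADD instruction, and the comments map back to the registers in the listing):

```python
def bcd_add(a, b):
    # Digit-serial packed-BCD addition (stand-in for BCDADD).
    out = shift = carry = 0
    while a or b or carry:
        carry, d = divmod((a & 0xF) + (b & 0xF) + carry, 10)
        out |= d << shift
        shift += 4
        a >>= 4
        b >>= 4
    return out

def bcd_mul(a, b):
    # Shift-and-add BCD multiply: result = R2, partial = R16, b = R17.
    result, pos = 0, 0
    while b:
        partial = a << (4 * pos)   # R16 reset each digit (4 doublings = x16)
        digit = b & 0xF
        for bit in (1, 2, 4, 8):   # TEST 1/2/4/8 of the current digit
            if digit & bit:
                result = bcd_add(result, partial)   # BCDADD?F R16, R2
            partial = bcd_add(partial, partial)     # BCDADD R16, R16
        b >>= 4
        pos += 1
    return result
```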

This process could maybe also be done in hardware.

Would get more convoluted if one needs to deal with numbers larger than

16 decimal digits.

> Terje

>

Dec 6, 2022, 5:57:07 PM12/6/22

to

It is possible to convert a 64-bit number to 16-digit BCD in around 130

clock-cycles or so (using a sequence of ROTCL and BCDADC instructions).
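This conversion loop can be sketched in Python (a behavioral model: `bcd_adc` below is a hypothetical stand-in for a BCDADC-style packed-BCD add-with-carry, and the loop is just the classic double-dabble recurrence, one accumulator doubling per input bit):

```python
def bcd_adc(a, b, carry=0):
    # Digit-serial packed-BCD add-with-carry (stand-in for BCDADC).
    out = shift = 0
    while a or b or carry:
        carry, d = divmod((a & 0xF) + (b & 0xF) + carry, 10)
        out |= d << shift
        shift += 4
        a >>= 4
        b >>= 4
    return out

def bin_to_bcd(n, bits=64):
    # Shift each binary bit out (MSB first), doubling the BCD accumulator
    # and adding the bit as carry-in: bcd = 2*bcd + bit, in decimal.
    bcd = 0
    for i in reversed(range(bits)):
        bcd = bcd_adc(bcd, bcd, (n >> i) & 1)
    return bcd
```

At roughly two instructions per bit, 64 bits lands right around the 130-cycle figure quoted above.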

Seems like it could also be possible to implement a 2x2 or 4x4 digit BCD

multiplier in a "reasonable" cost if needed, which could allow for

"semi-fast" BCD multiply (as 4x4->8 digit steps).

More debatable if it is actually likely to be needed though.

It is likely to be hardly (if ever) used, so it would probably not be

worthwhile even if it were fairly cheap...

> - anton

Dec 6, 2022, 6:05:32 PM12/6/22

to

Anyone care to speculate on the size of the fixed.point BCD market ?

Dec 6, 2022, 7:37:19 PM12/6/22

to

MitchAlsup <Mitch...@aol.com> writes:

>Anyone care to speculate on the size of the fixed.point BCD market ?

"New research on the global scale of the COBOL programming language

suggests that there are upwards of 800 billion lines of COBOL code

being used by organizations and institutes worldwide, some three

times larger than previously estimated."

Feb 9, 2022

Dec 6, 2022, 9:10:16 PM12/6/22

to

<

That is:: are the COBOL users not locked into whatever they are currently using ?

<

How many of those lines are dependent on BCD ?

Dec 7, 2022, 12:20:05 AM12/7/22

to

>"New research on the global scale of the COBOL programming language

>suggests that there are upwards of 800 billion lines of COBOL code

>being used by organizations and institutes worldwide, some three

>times larger than previously estimated."

>Feb 9, 2022

Looks like an opportunity to design a COBOL oriented processing core.

It has been many years since I took a course in COBOL programming.

Unfortunately, I have lost my COBOL textbook. The PIC statement stands out

in my memory. Having worked on business apps (non COBOL), I have some

interest in a processor geared towards supporting COBOL. I think it may make

for an interesting design. Just looked up on the web the limit for numeric fields,

which is 18 digits. So, an 18.18 fixed-point BCD format may be enough. To be safe and

have a couple of extra digits for rounding a 20.20 fixed BCD format could be

used. That means processing using 160-plus-bit registers, assuming values can

fit entirely in registers. To keep things simple, the decimal point would be in a

fixed position all the time. Going away to the drawing board now.

Dec 7, 2022, 4:48:52 AM12/7/22

to

And even if they can switch, would they buy a specialty computer for

Cobol processing only, or a commodity computer that may or may not

process the COBOL stuff a little slower.

>How many of those lines are dependent on BCD ?

I think that COBOL is abstract enough that it can store the

fixed-point numbers internally as (scaled) binary integers, and only

convert to decimal representation on I/O. But my knowledge of COBOL

is very limited, so my confidence in this statement is low.

Anyway, the fact that Intel added DAA to 8086 (which works on a single

byte), but did not include any additional BCD support (or at least

better support for binary<->BCD conversion) despite adding a huge

number of instructions over the years, some for very specialized uses,

indicates that Intel thinks that the fixed.point BCD market outside of

traditional mainframes is too small for adding even something cheap

like DCOR and IDCOR, described below.

Similar for other microprocessor manufacturers since 1980: Most did

not add anything; HP's PA included DCOR and IDCOR (these seem to be

word-wide versions of DAA, except that the operand of the addition has

to be pre-biased; DCOR produces an unbiased result, IDCOR biased; page

204 (labeled page 5-113) of

http://www.bitsavers.org/pdf/hp/pa-risc/09740-90039_PA_RISC_1.1_Architecture_and_Instruction_Set_Reference_Manual_Ed3.pdf

and page 163 (7-37) of

https://parisc.wiki.kernel.org/images-parisc/7/73/Parisc2.0.pdf gives

an example, the same one). IIRC IA-64 did not get such instructions,

so apparently the need for BCD had evaporated in the decade between

the design of HPPA and IA-64.
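The bias-and-correct idea behind DCOR is easy to show in software: pre-bias every digit by 6, add in binary, then strip the bias back out of the digits that produced no carry. A Python sketch of this well-known branch-free packed-BCD add (the general technique, not HP's exact definition):

```python
def bcd_add(a, b, digits=16):
    # Pre-bias every BCD digit by 6 so that decimal carries become binary
    # nibble carries, add as plain binary, then subtract the bias from the
    # digits that produced no carry (the correction DCOR does in hardware).
    bias = int('6' * digits, 16)
    carry_bits = int('1' * digits, 16) << 4   # carry-out position per digit
    t = (a + bias) + b
    no_carry = ~(t ^ (a + bias) ^ b) & carry_bits
    # Each set carry-out bit, shifted down by 2 and by 3, becomes a 6 in
    # the digit below it.
    return t - ((no_carry >> 2) | (no_carry >> 3))
```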

Dec 7, 2022, 7:02:46 AM12/7/22

to

On Wednesday, December 7, 2022 at 11:48:52 AM UTC+2, Anton Ertl wrote:

> MitchAlsup <Mitch...@aol.com> writes:

> >On Tuesday, December 6, 2022 at 6:37:19 PM UTC-6, Scott Lurndal wrote:

> >> MitchAlsup <Mitch...@aol.com> writes:

> >> >Anyone care to speculate on the size of the fixed.point BCD market ?

> ><

> >> "New research on the global scale of the COBOL programming language

> >> suggests that there are upwards of 800 billion lines of COBOL code

> >> being used by organizations and institutes worldwide, some three

> >> times larger than previously estimated."

> >> Feb 9, 2022

> ><

> >What percent of these would be available to non-Mainframe competition ?

> ><

> >That is:: are the COBOL users not locked into whatever they are currently using ?

> And even if they can switch, would they buy a specialty computer for

> Cobol processing only, or a commodity computer that may or may not

> process the COBOL stuff a little slower.

> >How many of those lines are dependent on BCD ?

> I think that COBOL is abstract enough that it can store the

> fixed-point numbers internally as (scaled) binary integers, and only

> convert to decimal representation on I/O. But my knowledge of COBOL

> is very limited, so my confidence in this statement is low.

>

If recommended limits for scaled decimal type are indeed 1.18.18 I'd take it as

a strong hint toward the standards body's preference for binary representation.

Or, possibly, DPD, but certainly not BCD.

With a pair of binary/DPD 64-bit words one can do 1.18.19, but maybe they

were envisioning smaller machines, where 4x32 is more convenient than 2x64.

Or maybe they just don't like 1.18.19 for aesthetic reasons.

Dec 7, 2022, 8:04:58 AM12/7/22

to

Assuming that it takes at least 10 seconds to write & debug a line of

COBOL (likely too optimistic by at least one order of magnitude), those

800E9 lines would have taken at least 8E12 seconds or 254K man-years

(working 24/7). With a more sustainable 2000 hours of programming/year

we can multiply that by 4+ to reach at least 1M man-years.

I call bullshit on that number!

The only possible way to reach that figure is by counting the same code

lines many, many times, as in every installation have some local

modifications, so we count their entire code base as a unique set.
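Terje's estimate is easy to re-check (a trivial Python sanity check of the figures above):

```python
lines = 800e9            # claimed worldwide COBOL line count
secs_per_line = 10       # optimistic write-and-debug time per line

total_secs = lines * secs_per_line                 # 8e12 seconds
man_years_24_7 = total_secs / (3600 * 24 * 365)    # ~254K man-years, round the clock
man_years_2000h = total_secs / (3600 * 2000)       # ~1.1M man-years at 2000 h/year
```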

Dec 7, 2022, 9:17:42 AM12/7/22

to

>I call bullshit on that number!

programmers working for 10 years. That was a world-wide number.

Block-copy and paste approach generates a lot of LOC fast.

>If recommended limits for scaled decimal type are indeed 1.18.18

128-bit DPD floating-point would work. Gives about 34 digits.

Is there a significant difference in results between scaled integers and

decimal floating point?
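The question can be made concrete with Python's `decimal` module (a sketch; `decimal` is a software decimal implementation, not DPD hardware): at 34 digits of precision, decimal floating point and scaled integers agree exactly where binary floats do not.

```python
from decimal import Decimal, getcontext

getcontext().prec = 34   # matches the 34 significant digits of decimal128

# Decimal floating point: exact for decimal fractions.
dec_sum = Decimal("0.10") + Decimal("0.20")        # exactly 0.30

# Scaled integers: hold hundredths as ints, rescale only on output.
int_sum = 10 + 20                                  # 30 hundredths

# Binary floating point: neither representation above is ever off like this.
bin_sum = 0.10 + 0.20                              # not exactly 0.3
```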

Dec 7, 2022, 10:03:09 AM12/7/22

to

"robf...@gmail.com" <robf...@gmail.com> writes:

>>"New research on the global scale of the COBOL programming language

>>suggests that there are upwards of 800 billion lines of COBOL code

>>being used by organizations and institutes worldwide, some three

>>times larger than previously estimated."

>>Feb 9, 2022

>

>Looks like an opportunity to design a COBOL oriented processing core.

Like this one?

http://www.bitsavers.org/pdf/burroughs/B2500_B3500/1025475_B2500_B3500_RefMan_Oct69.pdf

This system was designed _specifically_ to run COBOL code.

It was easy to use, easy to debug and relatively efficient for the day.

As pointed out earlier, the final generation of these systems was running

until circa 2010-2012.

>

>It has been many years since I took a course in COBOL programming.

>Unfortunately, I have lost my COBOL textbook. The PIC statement stands out

>in my memory.

The Edit instruction, described below, provides direct support

for the PIC clause.

> Having worked on business apps (non COBOL), I have some

>interest in a processor geared towards supporting COBOL. I think it may make

>for an interesting design. Just looked up on the web limits for numeric fields

>which is 18 digits.

The decimal point, implied, could be anywhere within the field.

====== Edit (EDT)/OP=49 ======

==== Format ====

^ OP ^ AF ^ BF ^ A Syllable ^ B Syllable ^ C Syllable ^

''OP = 49''

**AF** Not used as //A Syllable// length. **AF** may be indirect or

may specify that the //A Syllable// is a literal.\\

**BF** Number of eight-bit edit-operators and in-line literals in the //B

Syllable//. A value of __00__ is equal to a length of 100 characters. **BF**

may be indirect.

The //A Syllable// is the address of the __source__ field to be edited. Address may be indexed, indirect or extended.

The final address controller data type may be **UN**, **SN** or **UA**.\\

The //B Syllable// is the address of the edit-operator field. Address may be

indexed, indirect or extended. The final address controller data type is

ignored and is treated as **UA**.\\

The //C Syllable// is the address of the __destination__ field. Address may be indexed, indirect or extended.

The final address controller data type must be **UA** or **UN**. Use of

**SN** data type will cause an //Invalid Instruction fault (**IEX = 03**)//. See

[[compatibility_notes:a.13.1|Compatibility Notes A.13.1]].

==== Function ====

The Edit instruction moves digits or characters (depending on the **A**

address controller) from the **A** field to the **C** field under control of

the edit-operators in the **B** field. Characters may be moved, inserted or

deleted according to the edit-operators. Data movement and editing are

stopped by the exhaustion of edit-operators in the **B** field.

At the start of the Edit operation, the comparison flags are set to **EQUAL** and the [[processor_state::overflow_flag|Overflow Flag]] is unconditionally reset. The comparison flags may be set to **HIGH** or **LOW** if any non-zero digits are moved from the source to the destination field.

The source or **A** field is considered positive for unsigned numeric

(**UN**) format. For unsigned alpha (**UA**), the most significant digit of

the most significant character is interpreted as the sign. For signed

numeric (**SN**), the most significant digit of the field is the sign (which

is otherwise ignored).

If the **C** address controller is other than **UA**, only insert the low

order digit of each character transferred into the destination field. Therefore, whenever a blank (**40**) is specified, a 0 will be inserted.

The edit instruction uses an edit table that is located in memory locations

**48**-**63** relative to Base #0. This table may be initialized to any

desired set of insertion characters. By convention all compilers build a default insertion table containing:

^ Entry # ^ Base 0 Address ^ Character ^ Description ^

^ 0 | 48 | **+** | Positive Sign |

^ 1 | 50 | **-** | Negative Sign |

^ 2 | 52 | ***** | Check Suppress Character |

^ 3 | 54 | **.** | Decimal Point |

^ 4 | 56 | **,** | Thousands Separator |

^ 5 | 58 | **$** | Currency Symbol |

^ 6 | 60 | **0** | Leading zero fill Character |

^ 7 | 62 | <blank> | Blank Fill Character |

The edit-operator field consists of a

string of two-digit instructions.

Each instruction is of the form **M**//Av//. The **M** digit is the operation

code portion of the edit-operator. The //Av//

digit is the variant portion of the edit-operator.
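As a toy illustration of the significance-trigger (**T**) mechanics shared by several of the micro-operators below (a hedged Python sketch of the Move Suppress behavior, not the full B3500 semantics):

```python
def move_suppress(digits, fill=' '):
    # "Move Suppress": while the significance trigger T is off, each source
    # zero is replaced by a fill character (blank when Q = 0, the check
    # suppress character '*' from table entry 2 when Q = 1); the first
    # nonzero digit sets T, after which digits are moved unchanged.
    out, T = [], False
    for d in digits:
        if d != '0':
            T = True
        out.append(d if T else fill)
    return ''.join(out)

move_suppress("001234")        # leading zeros become blanks
move_suppress("001234", '*')   # check-protect style suppression
```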

==== Micro Operators ====

^ Instruction ^^ Variant ^^

^ M ^ Name ^ Av ^ Action ^

| 0 | Move Digit | 0 thru 9 | T <= 1 (Significance)\\ MOVE Av + 1 DIGITS |

| 1 | Move Characters | 0 thru 9 | T <= 1 (Significance)\\ MOVE Av + 1 BYTES |

| 2 | Move Suppress | 0 thru 9 | If T = 1, M <= 0\\ If T = 0, READ each A-Digit then\\ If A-Digit NOT = 0, M <= 0\\ If A-Digit = 0, then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2 |

| 3 | Insert Unconditionally | 0 - 7 | Insert Table Entry 0 - 7 |

| ::: | ::: | 8 | If A = +, Insert table entry 0\\ If A = -, Insert Table Entry 1 |

| ::: | ::: | 9 | If A = +, Insert Blank\\ If A = -, Insert Table Entry 1 |

| ::: | ::: | A | If A = +, Insert Table Entry 0\\ If A = -, Insert blank |

| ::: | ::: | B | Insert Next B character |

| 4 | Insert on Plus | 0 thru B | If A = +, M <= 3\\ If A = - Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |

| 5 | Insert on Minus | 0 thru B | If A = -, M <= 3\\ If A = + Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |

| 6 | Insert Suppress | 0 thru B | If T = 1, M <= 3\\ If T = 0, Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |

| 7 | Insert Float | 0 thru B | If T = 1, Move one digit, If Av = B, Skip Next B\\ If T = 0, Read one A-digit, then\\ If A-digit NOT = 0, then T <= 1,\\ If Av = 0 thru 7, Insert Table Entry 0 - 7 and move one digit.\\ If Av = 8 AND A = + then, Insert Table Entry 0 and Move one digit.\\ If Av = 8 and A = - then, Insert Table Entry 1 and Move one digit.\\ If Av = 9 and A = + Then Insert Blank and Move one digit.\\ If Av = 9 and A = -, Then Insert Table Entry 1 and move one digit.\\ If Av = A and A = +, Then Insert Table Entry 0 and move one digit.\\ If Av = A and A = -, Then Insert Blank and move one digit.\\ If Av = B, then Insert Next B character and move one digit.\\ If A-digit = 0, then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, skip next B character. |

| 8 | End Float | 0 thru B | If T = 1, Then\\ If Av NOT = B, No Operation\\ If Av = B, Skip next B character.\\ If T = 0, M <= 3 |

| 9 | Control | 0 | T <= 0 |

| ::: | ::: |