Can BCD and binary multipliers share circuitry?


Russell Wallace

Dec 2, 2022, 8:28:55 AM
Suppose you are implementing a BCD multiplier. Google says there are reasonably efficient known circuits for this. Say 8x8 -> 16 digits, and trying for a reasonable compromise between speed and number of logic gates.

And suppose the CPU in which this unit is located, also wants reasonably fast binary 32x32 -> 64 bit multiplication. Same input and output registers, same width operands, as the BCD counterpart.

Can the BCD and binary multipliers share circuitry? I know the answer to the corresponding question for adders is yes, because that was done in the 6502, but I have no idea whether the same is true for multipliers.
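For reference, the 6502-style sharing runs the one binary adder and then
conditionally corrects each nibble. A minimal sketch in C, not the actual
6502 logic:

uint8_t add2digits(uint8_t a, uint8_t b, int decimal_mode) {
  unsigned s = a + b;                   /* the shared binary adder */
  if (decimal_mode) {
    if (((a & 0x0F) + (b & 0x0F)) > 9)  /* low digit passed 9 */
      s += 0x06;
    if (s > 0x99)                       /* high digit passed 9 */
      s += 0x60;
  }
  return (uint8_t)s;                    /* bit 8 of s is the carry out */
}

The question is whether an analogous share-then-correct trick exists for
a whole multiplier tree.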

MitchAlsup

Dec 2, 2022, 7:04:01 PM
Yes, and no.
<
Yes, because it is theoretically possible to make a handful of Carry Save Adders do both 1:15 and 1:9 arithmetic
<
No, because it makes the component from which you built the multiplier tree (array) bigger and slower.
<
Most designs that need both have a tree dedicated to binary multiplication and a tree dedicated to decimal multiplication.
<
Some of these decimal things are optimized for 128-bit decimal arithmetic, while few binary multipliers are optimized for anything more than IEEE 754 double precision multiplications.
<
And finally, we are in an era where we can put several handfuls of cores on a die, and each core can have just about as much calculation logic as the designers can figure out how to utilize. So, why not both--the size of the additional multiplication tree is hardly a rounding error in the die size of a GBOoO design.

Russell Wallace

Dec 3, 2022, 8:28:52 AM
On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:
> Yes, and no.
> <
> Yes, because it is theoretically possible to make a handful of Carry Save Adders do both 1:15 and 1:9 arithmetic
> <
> No, because it makes the component from which you built the multiplier tree (array) bigger and slower.

Can you put even approximate numbers on the costs?

Like, say a 32x32 -> 64 binary multiplier takes area 1.0.

And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.

And a combined multiplier takes area 1.6 (compared to the sum of the dedicated ones that would be 2.2) and is 3 gate delays slower.

I just guessed those figures as examples; what would more accurate ones look like?

> And finally, we are in an era where we can put several handfuls of of cores on a die, each core can have just about as much calculation logic as the designers can figure out how to utilize. So, why not both--the size of the additional multiplication tree is hardly a rounding error in the die size of a GBOoO design.

Right, that's definitely true. I'm just thinking about a hypothetical situation in which a design is heavily constrained by die area, so it's difficult to find enough for both. (Specifically, 1980s alternate history; it seems to me that around the mid eighties, transistor count would be at the point where a multiplier tree was possible but still expensive.)

MitchAlsup

Dec 3, 2022, 12:42:04 PM
On Saturday, December 3, 2022 at 7:28:52 AM UTC-6, Russell Wallace wrote:
> On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:
> > Yes, and no.
> > <
> > Yes, because it is theoretically possible to make a handful of Carry Save Adders do both 1:15 and 1:9 arithmetic
> > <
> > No, because it makes the component from which you built the multiplier tree (array) bigger and slower.
> Can you put even approximate numbers on the costs?
>
> Like, say a 32x32 -> 64 binary multiplier takes area 1.0.
>
> And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.
>
> And a combined multiplier takes area 1.6 (compared to the sum of the dedicated ones that would be 2.2) and is 3 gates delays slower.
<
The first two are "close enough"; good intuition.
<
The combined multiplier takes area 2.0 and is ½ as fast (as a starting point guess).

Scott Lurndal

Dec 3, 2022, 1:01:28 PM
Russell Wallace <russell...@gmail.com> writes:
>On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:
>> Yes, and no.
>> <
>> Yes, because it is theoretically possible to make a handful of Carry
>> Save Adders do both 1:15 and 1:9 arithmetic
>> <
>> No, because it makes the component from which you built the
>> multiplier tree (array) bigger and slower.
>
>Can you put even approximate numbers on the costs?
>
>Like, say a 32x32 -> 64 binary multiplier takes area 1.0.
>
>And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.
>
>And a combined multiplier takes area 1.6 (compared to the sum of the
>dedicated ones that would be 2.2) and is 3 gate delays slower.
>
>I just guessed those figures as examples; what would more accurate
>ones look like?
>
>> And finally, we are in an era where we can put several handfuls of
>> cores on a die, and each core can have just about as much calculation
>> logic as the designers can figure out how to utilize. So, why not
>> both--the size of the additional multiplication tree is hardly a
>> rounding error in the die size of a GBOoO design.
>
>Right, that's definitely true. I'm just thinking about a hypothetical
>situation in which a design is heavily constrained by die area, so it's
>difficult to find enough for both. (Specifically, 1980s alternate
>history; it seems to me that around the mid eighties, transistor count
>would be at the point where a multiplier tree was possible but still
>expensive.)

The Burroughs medium systems line could perform an arithmetic operation
on up to 100 digit/nibble (BCD) operands with a single instruction. This was
implemented originally in discrete (transistor) logic in the B3500 circa
1964-5, then in custom SSI logic in the second generation, in standard
TTL DIP logic in the third, and finally in gate arrays (1984-5) for the
fourth (and final) generation V530 system.

Addition was performed starting from the most significant digit which
allowed it to detect overflow before altering the receiving field. The
algorithm was basically to zero extend the shorter of the two operands
and then start adding each digit starting from the MSD. If the first
digits sum to greater than 9, overflow has occurred and the operation
terminates with the Overflow Flag set. Otherwise, if the value is
nine, a counter is incremented and the algorithm proceeds to the next
digit. If the value is less than 9, the counter is cleared. If
a subsequent digit position sums to greater than 9, and the 9's
counter is non-zero, an overflow has occurred and the operation
is terminated with OFL set. Otherwise the result digit is written
to the receiving field.
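In C, the scheme looks roughly like this (a sketch only: equal-length
operands, one digit per byte, dst[0] the most significant digit; the
real unit defers its stores until overflow is ruled out, while this
sketch writes as it goes):

int bcd_add_msd_first(const uint8_t *a, const uint8_t *b,
                      uint8_t *dst, int n) {
  int nines = 0;                   /* length of the current run of 9s */
  for (int i = 0; i < n; i++) {
    int s = a[i] + b[i];
    if (s > 9) {                   /* carry must ripple to the left */
      if (i - nines == 0)
        return 1;                  /* would ripple off the MSD: OFL */
      dst[i - nines - 1] += 1;     /* bump the digit left of the 9s */
      for (int j = i - nines; j < i; j++)
        dst[j] = 0;                /* the recorded 9s become 0s */
      s -= 10;
    }
    nines = (s == 9) ? nines + 1 : 0;
    dst[i] = (uint8_t)s;
  }
  return 0;                        /* no overflow */
}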

ADD AFBF A B C

Where AF is the A operand field length (2 BCD digits) and
BF is the B operand field length. The receiving field length
is MAX(AF, BF).

ADD/SUB/MUL/DIV were all supported for UN (unsigned numeric),
SN (signed number - MSD was sign digit) or UA (unsigned alphanumeric).

UA data was 8-bit (and aligned on even address boundaries); for UA
operands, the arithmetic logic ignored the zone digit and performed
the arithmetic on the numeric digits. The receiving field could be
designated differently from the operand fields. So a UA operand could be
multiplied by a UN operand yielding an SN result. The UA fields
(EBCDIC or ASCII encoding) could then be placed directly into the correct
position in a 132-byte receiving field. This was, after all,
designed specifically to efficiently support COBOL.

Note that addressing in this system was to the 4-bit digit (nibble)
(middle generations used a 16-bit (4-digit) memory access size, later generations
used a 40-bit (10 digit) memory access size to processor cache).

1025475_B2500_B3500_RefMan_Oct69.pdf is on bitsavers, with the
algorithm flowchart on page 51.

Anton Ertl

Dec 3, 2022, 1:21:09 PM
Russell Wallace <russell...@gmail.com> writes:
> (Specifically, 1980s alternate history; it seems=
> to me that around the mid eighties, transistor count would be at the point=
> where a multiplier tree was possible but still expensive.)

The MIPS R3000 (1988) has a 12-cycle non-pipelined multiply (and I
think it inherited this from the R2000 (1986)), so I don't think they
have a multiplier in the modern sense. Early SPARC and HPPA CPUs do
not have an integer multiplier at all. On the 80386 (1985) a 32x32
multiply takes 9-38 cycles, and on a 486 (1989) 13-42 cycles, so they
do not have a modern multiplier, either.

The time for that was slightly later. The 88100 (1988) has a
four-stage multiplier pipeline (for integer and FP), where the integer
part is done after two cycles.

Concerning your question, if you need BCD numbers and need
multiplication of that, the best way IMO is to have a fast BCD->binary
conversion and fast binary->BCD conversion (maybe with hardware
assist), and do the multiplication (and other computations) in binary.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Russell Wallace

Dec 3, 2022, 1:23:33 PM
On Saturday, December 3, 2022 at 5:42:04 PM UTC, MitchAlsup wrote:
> The combined multiplier takes area 2.0 and is ½ as fast (as a starting point guess).

Ah, then I see what you mean about it not being worthwhile to combine them. Thanks!

Russell Wallace

Dec 3, 2022, 1:29:28 PM
On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
> Concerning your question, if you need BCD numbers and need
> multiplication of that, the best way IMO is to have a fast BCD->binary
> conversion and fast binary->BCD conversion (maybe with hardware
> assist), and do the multiplication (and other computations) in binary.

Hmm! It seems to me that each of those conversions is about as expensive as a multiplication, so I'm not seeing the overall saving. Unless you're assuming there will be many calculations to be done before converting back to BCD? That would make sense, except that part of the point of BCD is being able to specify exact decimal rounding, which would require back conversion after each step.

An exception would be something like a transcendental function, where exact decimal rounding is moot, and there are indeed many calculations to be done in the middle. Then it could make sense to convert to binary for the calculation.

Scott Lurndal

Dec 3, 2022, 1:58:03 PM
Russell Wallace <russell...@gmail.com> writes:
>On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
>> Concerning your question, if you need BCD numbers and need
>> multiplication of that, the best way IMO is to have a fast BCD->binary
>> conversion and fast binary->BCD conversion (maybe with hardware
>> assist), and do the multiplication (and other computations) in binary.
>
>Hmm! It seems to me that each of those conversions is about as expensive
>as a multiplication, so I'm not seeing the overall saving. Unless you're
>assuming there will be many calculations to be done before converting
>back to BCD? That would make sense, except that part of the point of BCD
>is being able to specify exact decimal rounding, which would require
>back conversion after each step.

The conversions aren't cheap. Here's a patent by one of my ex-colleagues
for doing BCD to binary conversions:

https://patents.google.com/patent/US4331951

(The machine Laury was working on for this patent was the
third generation system, the B4900; the D2B and B2D instructions
weren't part of the first or second generations).

Thomas Koenig

Dec 3, 2022, 3:21:03 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Russell Wallace <russell...@gmail.com> writes:
>> (Specifically, 1980s alternate history; it seems to me that around
>> the mid eighties, transistor count would be at the point where a
>> multiplier tree was possible but still expensive.)
>
> The MIPS R3000 (1988) has a 12-cycle non-pipelined multiply (and I
> think it inherited this from the R2000 (1986), so I don't think they
> have a multiplier in the modern sense. Early SPARC and HPPA CPUs do
> not have an integer multiplier at all. On the 80386 (1985) a 32x32
> multiply takes 9-38 cycles, and on a 486 (1989) 13-42 cycles, so they
> do not have a modern multiplier, either.
>
> The time for that was slightly later. The 88100 (1988) has a
> four-stage multiplier pipeline (for integer and FP), where the integer
> part is done after two cycles.
>
> Concerning your question, if you need BCD numbers and need
> multiplication of that, the best way IMO is to have a fast BCD->binary
> conversion and fast binary->BCD conversion (maybe with hardware
> assist), and do the multiplication (and other computations) in binary.

Converting decimal to binary is straightforward: you need just
carry-save adders (plus shifting).

Converting binary to decimal is much more time-consuming. I believe
the standard way is to use the "multiply by the inverse" method to
divide by 5, which sort of defeats the purpose of having it behind
a multiplier.

If somebody has invented a faster way, I'd like to know about it :-)
(And yes, I know the method of first calculating the remainder of
the division by 5 by shifting and adding, which could also be done
with carry-save adders, but it would still need one multiplication
per digit).
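
For reference, the "multiply by the inverse" step for one decimal digit
in C (dividing by 5 is the same trick with a different constant and
shift): 0xCCCCCCCD is ceil(2^35/10), so the high bits of the product are
the exact quotient for any uint32_t x.

uint32_t div10(uint32_t x) {
  return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
uint32_t mod10(uint32_t x) { return x - 10 * div10(x); }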

Anton Ertl

Dec 4, 2022, 4:14:35 AM
Russell Wallace <russell...@gmail.com> writes:
>Hmm! It seems to me that each of those conversions is about as
>expensive as a multiplication

Possibly.

>Unless you're assuming there will be many calculations to be done
>before converting back to BCD?

Yes.

>That would make sense, except that part of the point of BCD is being
>able to specify exact decimal rounding, which would require back
>conversion after each step.

It's not clear to me what you mean by "exact decimal rounding", but
I am sure that one can also do it on the binary representation. The
typical rounding rules in commerce were designed for finite-precision
fixed point, which can be represented by binary integers just fine.
The only reason to convert between binary and decimal is I/O.
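
For example, the common commercial rounding to whole cents, with
amounts held as binary integers in tenths of a cent; the scale is
decimal but the arithmetic is plain binary. A sketch, rounding half
away from zero:

int64_t round_to_cents(int64_t tenths) {
  return (tenths + (tenths >= 0 ? 5 : -5)) / 10; /* C: truncates toward 0 */
}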

Thomas Koenig

Dec 4, 2022, 7:40:23 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Russell Wallace <russell...@gmail.com> writes:
>>Hmm! It seems to me that each of those conversions is about as
>>expensive as a multiplication
>
> Possibly.

Much more so.

Like I wrote, decimal->binary is just a carry-save tree.

Binary->decimal is much more difficult. You almost have to do
several rounds of (quotient, remainder) operations, probably using
packages of more than one digit. The remainder is relatively cheap
and can be done with a carry-save adder, conditionally summing up the
residuals (2^n) mod (10^m) for each bit n, with m the number of decimal
digits you want to do at a time.

But you need a division by 10^m, and if anybody has found a better
way than using the high bits of long multiplication, I haven't
seen it described anywhere.

This makes me understand that people did decimal computers for a time
:-)
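
To make the residual-summing remainder concrete, here it is in scalar C;
in hardware the 64 conditional terms would feed a carry-save tree rather
than a sequential loop:

uint32_t rem10000(uint64_t x) {       /* x mod 10^4 via 2^n mod 10^4 */
  uint32_t r = 0, pow2 = 1;
  for (int n = 0; n < 64; n++) {
    if ((x >> n) & 1)
      r += pow2;                      /* worst case 64*9999, fits easily */
    pow2 = (pow2 * 2) % 10000;
  }
  return r % 10000;                   /* one final small reduction */
}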

Terje Mathisen

Dec 4, 2022, 7:56:51 AM
The answer must be yes, since even I can see at least one way to do it:
Using a binary multiplier with groups of 4x4 input bits generating 8-bit
products, those same groups can obviously also take BCD inputs. :-)
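
For concreteness, one such 4x4 group: 9*9 = 81 still fits the 8-bit
product, it just needs a decimal correction back to two BCD digits
(a sketch only; real hardware would fold this into the reduction tree):

uint8_t bcd_digit_mul(uint8_t a, uint8_t b) { /* a, b in 0..9 */
  uint8_t p = a * b;                          /* binary 4x4 -> 8 bits */
  return (uint8_t)(((p / 10) << 4) | (p % 10));
}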

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Quadibloc

Dec 4, 2022, 8:16:44 AM
On Friday, December 2, 2022 at 6:28:55 AM UTC-7, Russell Wallace wrote:

> Can the BCD and binary multipliers share circuitry? I know the answer to the
> corresponding question for adders is yes, because that was done in the 6502,
> but I have no idea whether the same is true for multipliers.

Since it is true for adders, it's also true for multipliers if multiplication is only
being done by addition.

If you are using a faster multiplier design, though, like a tree of carry-save adders,
it's harder to use shared circuitry.

John Savard

Terje Mathisen

Dec 4, 2022, 11:22:13 AM
We've discussed fast ascii/bcd->binary and vice versa: The first is
relatively simple since the naive algorithm is fast (2-4 clock
cycles/digit), but for long inputs you want SIMD, which can at least
double the speed.

The reverse operation traditionally requires a DIV & MOD operation for
every digit, but I posted an algorithm here more than two decades ago
that does it far faster, at around 20-50 clock cycles in total depending
upon the maximum number of digits allowed.

For decimal floating-point with a binary mantissa you maintain a decimal
exponent, which makes it easy to detect when you need to perform any
rescaling and/or decimal rounding. This is the BID format allowed by the
DFP part of the IEEE 754-2019 standard.

Terje Mathisen

Dec 4, 2022, 11:41:59 AM
Scott Lurndal wrote:
> Russell Wallace <russell...@gmail.com> writes:
>> On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
>>> Concerning your question, if you need BCD numbers and need
>>> multiplication of that, the best way IMO is to have a fast BCD->binary
>>> conversion and fast binary->BCD conversion (maybe with hardware
>>> assist), and do the multiplication (and other computations) in binary.
>>
>> Hmm! It seems to me that each of those conversions is about as
>> expensive as a multiplication, so I'm not seeing the overall saving.
>> Unless you're assuming there will be many calculations to be done
>> before converting back to BCD? That would make sense, except that part
>> of the point of BCD is being able to specify exact decimal rounding,
>> which would require back conversion after each step.
>
> The conversions aren't cheap. Here's a patent by one of my ex-collegues
> to do bcd to binary conversions:
>
> https://patents.google.com/patent/US4331951

I did not and will not read that patent, even assuming it is long expired.
>
> (The machine Laury was working on for this patent was the
> third generation system, the B4900; the D2B and B2D instructions
> weren't part of the first or second generations).
>
Today you can convert a 32-digit ascii (or BCD) value to binary using a
single AVX (32-byte) register to hold the input:

0) If BCD, expand the nybbles to bytes; they will still fit inside that reg.
1) Expand the 32 bytes into a hi/lo pair of 16 u16 slots each: the first
half is worth 1e16 times the second when combining them back together at
the end.
2) For each half, multiply successive u16 slots by (10,1,10,1...), then
add each pair together with a horizontal add. This gives us a two-digit
value (0..99) in each slot, with room to grow.
3) Multiply those slots by (100,1,100,1...) and add horizontally into u32
slots. The results are four-digit values below 1e4.
4) Multiply the u32 slots by (10000,1,10000,1...) and add horizontally.
The results will all be less than 1e8 so they fit nicely.
5) Now we have 64 bits available so we can safely multiply by
(1e8,1,1e8,1...), giving a 16-digit intermediate result in each u64 half.
6) Extract the high u64 from the SIMD reg and multiply by 1e16 (using a
64x64->128 MUL), then ADD/ADC the low u64 from the SIMD reg.

To me this looks like 4 multiplications each taking about 5 cycles, plus
some overhead so maybe 30+ clock cycles?

BTW, if the input actually had 34 digits, as in a full DFP128 register,
then I would simply convert those top two digits using integer regs,
which would almost certainly run in parallel with/overlap the SIMD calcs
for no extra real cost.
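
A scalar mirror of those combining steps, for one 16-digit half
(hypothetical helper; one digit per byte, most significant first):

uint64_t digits16_to_u64(const uint8_t d[16]) {
  uint64_t v[16];
  for (int i = 0; i < 16; i++) v[i] = d[i];
  for (int i = 0; i < 8; i++) v[i] = v[2*i] * 10 + v[2*i+1];    /* pairs */
  for (int i = 0; i < 4; i++) v[i] = v[2*i] * 100 + v[2*i+1];   /* quads */
  for (int i = 0; i < 2; i++) v[i] = v[2*i] * 10000 + v[2*i+1]; /* octets */
  return v[0] * 100000000ull + v[1];            /* 16-digit result */
}

The SIMD version does each of these loops as one multiply-add across all
slots at once.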

Terje Mathisen

Dec 4, 2022, 11:52:46 AM
See my other message: Yes, you can do it far faster by effectively
working modulo 1e5 for u32 and 1e10 for u64. For a full 34-digit DFP
value, the mantissa can be split into three or four parts.

The way it works is by scaling each binary chunk, which effectively
converts it into a fixed-point value with the top digit in the top 4 bits
of a register and the remaining bits containing the fraction. After
extracting the top digit you mask those bits, multiply the remainder by
5(! not 10) and move the extraction window one bit down. These bits now
contain the next digit. Mul-by-5 happens with LEA or shift+add, so one
or two cycles.

Since you have multiple chunks to work on, all the execution units can
be kept busy, spitting out 2-4 digits every 2-3 cycles.
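
The same window trick in scalar C, for a chunk of up to four digits (a
sketch; the constant is ceil(2^48/1000), which leaves plenty of headroom):

void chunk4_to_ascii(uint32_t x, char out[4]) { /* x in 0..9999 */
  uint64_t y = (uint64_t)x * 281474976711u; /* x/1000 in 16.48 fixed point */
  unsigned shift = 48;
  for (int i = 0; i < 4; i++) {
    out[i] = '0' + (char)(y >> shift);      /* top digit in the window */
    y &= ((uint64_t)1 << shift) - 1;        /* keep the fraction */
    y *= 5;                                 /* *5, window down 1 == *10 */
    shift--;
  }
}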

Anton Ertl

Dec 4, 2022, 6:21:25 PM
Thomas Koenig <tko...@netcologne.de> writes:
>Converting binary to decimal is much more time-consuming.

Apart from Terje Mathisen's approach, there is also the following
approach:

First, we need an operation for doubling a decimal number d, i.e.,
2*d. This can be implemented based on decimal addition and should
work in one cycle (right?). To convert a binary number to decimal, we
do:

/* input: binary */ uint64_t b=...
/* output: */ udecimal d=0;
for (int i=0; i<64; i++) {
  d = 2*d + (udecimal)(b>>63); /* decimal * and + */
  b <<= 1;
}

Once this loop is finished, d contains the decimal representation of
b. Ok, this takes at least 64 cycles if 2*d+[0..1] takes one cycle.
How can we make it faster?

The structure of 64 staggered additions is like in multiplication (but
the problem is probably simpler), so I would not be surprised if the
same techniques as for full multipliers could be applied here to get
something that runs in maybe 4 cycles.

Even if that is infeasible or if we don't want to pay the area price
for that, one may be able to transfer more than one bit per cycle from
b to d, reducing the number of cycles necessary.

Also, b could be split into b/10^8 and b%10^8, and the parts could be
converted in parallel, and combined in the end.
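
One way to get the one-cycle 2*d: packed BCD can be doubled with a
double-dabble-style correction: add 3 to every nibble that is >= 5, then
do a plain binary left shift. A sketch, taking udecimal above to be 16
BCD digits packed in a uint64_t (carry out of the top digit is dropped):

uint64_t bcd_double(uint64_t d, unsigned carry_in) {
  uint64_t ge5 = ((d + 0x3333333333333333ull) >> 3)
               & 0x1111111111111111ull; /* 1 where a nibble is >= 5 */
  d += ge5 * 3;                         /* pre-correct those nibbles */
  return (d << 1) | carry_in;           /* now an ordinary shift */
}

This is all nibble-parallel logic plus one shift, which is why a
one-cycle implementation looks plausible.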

Michael S

Dec 4, 2022, 7:35:55 PM
SSEx variant from 10 years ago:
https://groups.google.com/g/comp.arch/c/APbcPjoqs_g/m/pAjyaKjAsDIJ
28 clocks on a computer that was considered fast 10 years ago.

Several almost-portable variants in the GNU dialect of C from 5 years ago:
https://github.com/already5chosen/bin2ascii
The fastest variant runs in 52 clocks on the same computer that was
considered fast 10 years ago. It runs in 43 cycles on a computer that was
considered fast (in IPC, not in absolute speed) 5 years ago.

All variants assume full 64-bit binary input == 20-digit output.

I never bothered to upgrade SSE to AVX2, because I can't imagine
a situation where 28 clocks are not sufficiently fast.

Anton Ertl

Dec 5, 2022, 4:33:22 AM
Plus, we can speed it up for small numbers:

>/* input: binary */ uint64_t b=...
>/* output: */ udecimal d=0;
int i=clz(b); /* count leading zeros, 64 for b==0 */
b<<=i;
for (; i<64; i++) {
  d = 2*d + (udecimal)(b>>63); /* decimal * and + */
  b <<= 1;
}

BGB

Dec 5, 2022, 4:58:03 PM
On 12/4/2022 6:56 AM, Terje Mathisen wrote:
> Russell Wallace wrote:
>> Suppose you are implementing a BCD multiplier. Google says there are
>> reasonably efficient known circuits for this. Say 8x8 -> 16 digits,
>> and trying for a reasonable compromise between speed and number of
>> logic gates.
>>
>> And suppose the CPU in which this unit is located, also wants
>> reasonably fast binary 32x32 -> 64 bit multiplication. Same input and
>> output registers, same width operands, as the BCD counterpart.
>>
>> Can the BCD and binary multipliers share circuitry? I know the answer
>> to the corresponding question for adders is yes, because that was
>> done in the 6502, but I have no idea whether the same is true for
>> multipliers.
>
> The answer must be yes, since even I can see at least one way to do it:
> Using a binary multiplier with groups of 4x4 input bits generating 8-bit
> products, those same groups can obviously also take BCD inputs. :-)
>

Though, adding the intermediate results together, and everything from
there on, would need to be different...

AFAIK, you can't just multiply two large BCD numbers together as binary
numbers and then fix up the results afterwards (though this would be
convenient).


In this case, the BCD adder tree would likely be a bigger issue than the
digit-for-digit BCD multipliers (which, also annoyingly, probably
wouldn't fit all that efficiently in the available LUT sizes on a
typical FPGA).

Most efficient configuration I can currently think up would need ~ 11
LUTs per BCD digit pair. This would first mash the pattern from 8 to 7
bits, and could then use 8x LUT7->1, noting that { A[3:1], B[3:1] }
can be mashed from 6-bits to 5-bits (as they only cover the range 000..100).

Most other patterns I can think up would need significantly more LUTs
per digit.




Though, it can be noted that a "BCDADD" instruction could potentially be
used to implement a BCD multiplier shift-and-add style.

Say:
R2=Output, R4=Value A, R5=Value B
MOV 0, R2 //init to zeroes
SHLD.Q R4, 0, R16 //first position
SHLD.Q R5, 0, R17 //first digit

TEST R17, R17
BT .done //no more digits remain

TEST 1, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 2, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 4, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 8, R17
BCDADD?F R16, R2
BCDADD R16, R16

SHLD.Q R4, 4, R16 //second position
SHLD.Q R5, -4, R17 //second digit

TEST R17, R17
BT .done //no more digits remain

TEST 1, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 2, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 4, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 8, R17
BCDADD?F R16, R2
BCDADD R16, R16

... //for each digit of the second operand

Could also be turned into a loop (for each BCD digit).
Here, R16 would need to be reset every digit mostly because 4x BCDADD
would effectively multiply the value by 16 (decimal), but this could be
skipped if multiplying a BCD number with a binary number.


This process could maybe also be done in hardware.

Would get more convoluted if one needs to deal with numbers larger than
16 decimal digits.
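
The same scheme in C, with bcd_add() standing in for BCDADD (packed BCD,
16 digits per uint64_t; the helper is digit-serial for clarity, where the
instruction would do the nibble corrections in parallel):

uint64_t bcd_add(uint64_t a, uint64_t b) {
  uint64_t r = 0;
  unsigned c = 0;
  for (int i = 0; i < 64; i += 4) {
    unsigned s = ((a >> i) & 0xF) + ((b >> i) & 0xF) + c;
    c = s > 9;
    if (c) s -= 10;
    r |= (uint64_t)s << i;
  }
  return r;                          /* carry out of digit 15 is dropped */
}

uint64_t bcd_mul(uint64_t a, uint64_t b) {
  uint64_t acc = 0;
  for (int pos = 0; b != 0 && pos < 16; pos++, b >>= 4) {
    uint64_t part = a << (4 * pos);  /* a * 10^pos: BCD shifts decimally */
    for (int bit = 0; bit < 4; bit++) {
      if ((b >> bit) & 1)
        acc = bcd_add(acc, part);    /* the conditional BCDADD?F */
      part = bcd_add(part, part);    /* the unconditional BCDADD doubling */
    }
  }
  return acc;                        /* low 16 digits of the product */
}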


> Terje
>

BGB

Dec 6, 2022, 5:57:07 PM
I have an instruction in BJX2 that can be used for BCD addition, so it
is possible to convert a 64-bit number to 16-digit BCD in around 130
clock cycles (using a sequence of ROTCL and BCDADC instructions).


Seems like it could also be possible to implement a 2x2 or 4x4 digit BCD
multiplier at a "reasonable" cost if needed, which could allow for a
"semi-fast" BCD multiply (as 4x4 -> 8 digit steps).

More debatable if it is actually likely to be needed though.

It is likely to be hardly (if ever) used, so it would probably not be
worthwhile even if it were fairly cheap...

> - anton

MitchAlsup

Dec 6, 2022, 6:05:32 PM
Anyone care to speculate on the size of the fixed.point BCD market ?

Scott Lurndal

Dec 6, 2022, 7:37:19 PM
MitchAlsup <Mitch...@aol.com> writes:
>Anyone care to speculate on the size of the fixed.point BCD market ?


"New research on the global scale of the COBOL programming language
suggests that there are upwards of 800 billion lines of COBOL code
being used by organizations and institutes worldwide, some three
times larger than previously estimated."
Feb 9, 2022

MitchAlsup

Dec 6, 2022, 9:10:16 PM
What percent of these would be available to non-Mainframe competition ?
<
That is:: are the COBOL users not locked into whatever they are currently using ?
<
How many of those lines are dependent on BCD ?

robf...@gmail.com

Dec 7, 2022, 12:20:05 AM
>"New research on the global scale of the COBOL programming language
>suggests that there are upwards of 800 billion lines of COBOL code
>being used by organizations and institutes worldwide, some three
>times larger than previously estimated."
>Feb 9, 2022

Looks like an opportunity to design a COBOL oriented processing core.

It has been many years since I took a course in COBOL programming.
Unfortunately, I have lost my COBOL textbook. The PIC statement stands out
in my memory. Having worked on business apps (non COBOL), I have some
interest in a processor geared towards supporting COBOL. I think it may make
for an interesting design. I just looked up the limits for numeric fields
on the web: 18 digits. So, an 18.18 fixed-point BCD format may be enough.
To be safe and have a couple of extra digits for rounding, a 20.20 fixed
BCD format could be used. That means processing using 160-plus-bit
registers, assuming values can
fit entirely in registers. To keep things simple, the decimal point would be in a
fixed position all the time. Going away to the drawing board now.

Anton Ertl

Dec 7, 2022, 4:48:52 AM
And even if they can switch, would they buy a specialty computer for
COBOL processing only, or a commodity computer that may or may not
process the COBOL stuff a little slower?

>How many of those lines are dependent on BCD ?

I think that COBOL is abstract enough that it can store the
fixed-point numbers internally as (scaled) binary integers, and only
convert to decimal representation on I/O. But my knowledge of COBOL
is very limited, so my confidence in this statement is low.

Anyway, the fact that Intel added DAA to 8086 (which works on a single
byte), but did not include any additional BCD support (or at least
better support for binary<->BCD conversion) despite adding a huge
number of instructions over the years, some for very specialized uses,
indicates that Intel thinks that the fixed.point BCD market outside of
traditional mainframes is too small for adding even something cheap
like DCOR and IDCOR below.

Similar for other microprocessor manufacturers since 1980: Most did
not add anything, HP's PA included DCOR and IDCOR (seem to be a
word-wide versions of DAA, except that the operand of the addition has
to be pre-biased; DCOR produces an unbiased result, IDCOR biased; page
204 (labeled page 5-113) of
http://www.bitsavers.org/pdf/hp/pa-risc/09740-90039_PA_RISC_1.1_Architecture_and_Instruction_Set_Reference_Manual_Ed3.pdf
and page 163 (7-37) of
https://parisc.wiki.kernel.org/images-parisc/7/73/Parisc2.0.pdf gives
an example, the same one). IIRC IA-64 did not get such instructions,
so apparently the need for BCD had evaporated in the decade between
the design of HPPA and IA-64.
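
For reference, a C model of the DAA correction mentioned above,
following the usual description (the carry out of the +6 step also
feeds CF in the real instruction; omitted here for brevity):

uint8_t daa(uint8_t al, int *af, int *cf) {
  uint8_t old_al = al;
  int old_cf = *cf;
  if ((al & 0x0F) > 9 || *af) { al += 0x06; *af = 1; } else { *af = 0; }
  if (old_al > 0x99 || old_cf) { al += 0x60; *cf = 1; } else { *cf = 0; }
  return al;
}

DCOR/IDCOR generalize this kind of per-nibble correction to a full word.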

Michael S

Dec 7, 2022, 7:02:46 AM
On Wednesday, December 7, 2022 at 11:48:52 AM UTC+2, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Tuesday, December 6, 2022 at 6:37:19 PM UTC-6, Scott Lurndal wrote:
> >> MitchAlsup <Mitch...@aol.com> writes:
> >> >Anyone care to speculate on the size of the fixed.point BCD market ?
> ><
> >> "New research on the global scale of the COBOL programming language
> >> suggests that there are upwards of 800 billion lines of COBOL code
> >> being used by organizations and institutes worldwide, some three
> >> times larger than previously estimated."
> >> Feb 9, 2022
> ><
> >What percent of these would be available to non-Mainframe competition ?
> ><
> >That is:: are the COBOL users not locked into whatever they are currently using ?
> And even if they can switch, would they buy a specialty computer for
> COBOL processing only, or a commodity computer that may or may not
> process the COBOL stuff a little slower?
> >How many of those lines are dependent on BCD ?
> I think that COBOL is abstract enough that it can store the
> fixed-point numbers internally as (scaled) binary integers, and only
> convert to decimal representation on I/O. But my knowledge of COBOL
> is very limited, so my confidence in this statement is low.
>

If the recommended limits for the scaled decimal type are indeed 1.18.18,
I'd take it as a strong hint toward the standard body's preference for a
binary representation. Or, possibly, DPD, but certainly not BCD.
With a pair of binary/DPD 64-bit words one can do 1.18.19, but maybe they
were envisioning smaller machines, where 4x32 is more convenient than 2x64.
Or maybe they just don't like 1.18.19 for aesthetic reasons.

Terje Mathisen

Dec 7, 2022, 8:04:58 AM
I call bullshit on that number!

Assuming that it takes at least 10 seconds to write & debug a line of
COBOL (likely too optimistic by at least one order of magnitude), those
800E9 lines would have taken at least 8E12 seconds or 254K man-years
(working 24/7). With a more sustainable 2000 hours of programming/year
we can multiply that by 4+ to reach at least 1M man-years.

The only possible way to reach that figure is by counting the same code
lines many, many times, as in every installation have some local
modifications, so we count their entire code base as a unique set.

robf...@gmail.com

Dec 7, 2022, 9:17:42 AM

>I call bullshit on that number!

IDK. Seems like it might be reasonable. 1M man years is only 100k
programmers working for 10 years. That was a world-wide number.
The block-copy-and-paste approach generates a lot of LOC fast.

>If the recommended limits for the scaled decimal type are indeed 1.18.18

I think that was 18 total digits including the fraction. 1.14.4? I think
128-bit DPD floating-point would work. Gives about 34 digits.

Is there a significant difference in results between scaled integers and
decimal floating point?

Scott Lurndal

unread,
Dec 7, 2022, 10:03:09 AM12/7/22
to
"robf...@gmail.com" <robf...@gmail.com> writes:
>>"New research on the global scale of the COBOL programming language
>>suggests that there are upwards of 800 billion lines of COBOL code
>>being used by organizations and institutes worldwide, some three
>>times larger than previously estimated."
>>Feb 9, 2022
>
>Looks like an opportunity to design a COBOL oriented processing core.

Like this one?

http://www.bitsavers.org/pdf/burroughs/B2500_B3500/1025475_B2500_B3500_RefMan_Oct69.pdf

This system was designed _specifically_ to run COBOL code.

It was easy to use, easy to debug and relatively efficient for the day.

As pointed out earlier, the final generation of these systems were running
until circa 2010-2012.

>
>It has been many years since I took a course in COBOL programming.
>Unfortunately, I have lost my COBOL textbook. The PIC statement stands out
>in my memory.

The system included an instruction, EDT, which handled formatting
for the PIC clause.



> Having worked on business apps (non COBOL), I have some
>interest in a processor geared towards supporting COBOL. I think it may make
>for an interesting design. Just looked up on the web limits for numeric fields
>which is 18 digits.

The B3500 and successors supported 1 to 100 digit fields. The decimal
point, implied, could be anywhere within the field.


====== Edit (EDT)/OP=49 ======

==== Format ====

^ OP ^ AF ^ BF ^ A Syllable ^ B Syllable ^ C Syllable ^

''OP = 49''

**AF** Not used as //A Syllable// length. **AF** may be indirect or
may specify that the //A Syllable// is a literal.\\
**BF** Number of eight-bit edit-operators and in-line literals in the //B
Syllable//. A value of __00__ is equal to a length of 100 characters. **BF**
may be indirect.

The //A Syllable// is the address of the __source__ field to be edited. Address may be indexed, indirect or extended.
The final address controller data type may be **UN**, **SN** or **UA**.\\
The //B Syllable// is the address of the edit-operator field. Address may be
indexed, indirect or extended. The final address controller data type is
ignored and is treated as **UA**.\\
The //C Syllable// is the address of the __destination__ field. Address may be indexed, indirect or extended.
The final address controller data type must be **UA** or **UN**. Use of
**SN** data type will cause an //Invalid Instruction fault (**IEX = 03**)//. See
[[compatibility_notes:a.13.1|Compatibility Notes A.13.1]].

==== Function ====

The Edit instruction moves digits or characters (depending on the **A**
address controller) from the **A** field to the **C** field under control of
the edit-operators in the **B** field. Characters may be moved, inserted or
deleted according to the edit-operators. Data movement and editing are
stopped by the exhaustion of edit-operators in the **B** field.

At the start of the Edit operation, the comparison flags are set to **EQUAL** and the [[processor_state::overflow_flag|Overflow Flag]] is unconditionally reset. The comparison flags may be set to **HIGH** or **LOW** if any non-zero digits are moved from the source to the destination field.

The source or **A** field is considered positive for unsigned numeric
(**UN**) format. For unsigned alpha (**UA**), the most significant digit of
the most significant character is interpreted as the sign. For signed
numeric (**SN**), the most significant digit of the field is the sign (which
is otherwise ignored).

If the **C** address controller is other than **UA**, only insert the low
order digit of each character transferred into the destination field. Therefore, whenever a blank (**40**) is specified, a 0 will be inserted.

The edit instruction uses an edit table that is located in memory locations
**48**-**63** relative to Base #0. This table may be initialized to any
desired set of insertion characters. By convention all compilers build a default insertion table containing:

^ Entry # ^ Base 0 Address ^ Character ^ Description ^
^ 0 | 48 | **+** | Positive Sign |
^ 1 | 50 | **-** | Negative Sign |
^ 2 | 52 | ***** | Check Suppress Character |
^ 3 | 54 | **.** | Decimal Point |
^ 4 | 56 | **,** | Thousands Separator |
^ 5 | 58 | **$** | Currency Symbol |
^ 6 | 60 | **0** | Leading zero fill Character |
^ 7 | 62 | <blank> | Blank Fill Character |

The edit-operator field consists of a
string of two-digit instructions.
Each instruction is of the form **M**//Av//. The **M** digit is the operation
code portion of the edit-operator. The //Av//
digit is the variant position of the edit-operator.

==== Micro Operators ====

^ Instruction ^^ Variant ^^
^ M ^ Name ^ Av ^ Action ^
| 0 | Move Digit | 0 thru 9 | T <= 1 (Significance)\\ MOVE Av + 1 DIGITS |
| 1 | Move Characters | 0 thru 9 | T <= 1 (Significance)\\ MOVE Av + 1 BYTES |
| 2 | Move Suppress | 0 thru 9 | If T = 1, M <= 0\\ If T = 0, READ each A-Digit then\\ If A-Digit NOT = 0, M <= 0\\ If A-Digit = 0, then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2 |
| 3 | Insert Unconditionally | 0 - 7 | Insert Table Entry 0 - 7 |
| ::: | ::: | 8 | If A = +, Insert table entry 0\\ If A = -, Insert Table Entry 1 |
| ::: | ::: | 9 | If A = +, Insert Blank\\ If A = -, Insert Table Entry 1 |
| ::: | ::: | A | If A = +, Insert Table Entry 0\\ If A = -, Insert blank |
| ::: | ::: | B | Insert Next B character |
| 4 | Insert on Plus | 0 thru B | If A = +, M <= 3\\ If A = - Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |
| 5 | Insert on Minus | 0 thru B | If A = -, M <= 3\\ If A = + Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |
| 6 | Insert Suppress | 0 thru B | If T = 1, M <= 3\\ If T = 0, Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |
| 7 | Insert Float | 0 thru B | If T = 1, Move one digit, If Av = B, Skip Next B\\ If T = 0, Read one A-digit, then\\ If A-digit NOT = 0, then T <= 1,\\ If Av = 0 thru 7, Insert Table Entry 0 - 7 and move one digit.\\ If Av = 8 AND A = + then, Insert Table Entry 0 and Move one digit.\\ If Av = 8 and A = - then, Insert Table Entry 1 and Move one digit.\\ If Av = 9 and A = + Then Insert Blank and Move one digit.\\ If Av = 9 and A = -, Then Insert Table Entry 1 and move one digit.\\ If Av = A and A = +, Then Insert Table Entry 0 and move one digit.\\ If Av = A and A = -, Then Insert Blank and move one digit.\\ If Av = B, then Insert Next B character and move one digit.\\ If A-digit = 0, then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, skip next B character. |
| 8 | End Float | 0 thru B | If T = 1, Then\\ If Av NOT = B, No Operation\\ If Av = B, Skip next B character.\\ If T = 0, M <= 3 |
| 9 | Control | 0 | T <= 0 |
| ::: | ::: |