
Can BCD and binary multipliers share circuitry?


Russell Wallace

Dec 2, 2022, 8:28:55 AM
Suppose you are implementing a BCD multiplier. Google says there are reasonably efficient known circuits for this. Say 8x8 -> 16 digits, and trying for a reasonable compromise between speed and number of logic gates.

And suppose the CPU in which this unit is located also wants reasonably fast binary 32x32 -> 64-bit multiplication: same input and output registers, same width operands, as the BCD counterpart.

Can the BCD and binary multipliers share circuitry? I know the answer to the corresponding question for adders is yes, because that was done in the 6502, but I have no idea whether the same is true for multipliers.

MitchAlsup

Dec 2, 2022, 7:04:01 PM
Yes, and no.
<
Yes, because it is theoretically possible to make a handful of Carry Save Adders do both 1:15 and 1:9 arithmetic
<
No, because it makes the component from which you built the multiplier tree (array) bigger and slower.
<
Most designs that need both have a tree dedicated to binary multiplication and a tree dedicated to decimal multiplication.
<
Some of these decimal things are optimized for 128-bit Decimal arithmetic, while few binary multipliers are optimized for anything more than IEEE 754 double precision multiplications.
<
And finally, we are in an era where we can put several handfuls of cores on a die, each core can have just about as much calculation logic as the designers can figure out how to utilize. So, why not both--the size of the additional multiplication tree is hardly a rounding error in the die size of a GBOoO design.

Russell Wallace

Dec 3, 2022, 8:28:52 AM
On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:
> Yes, and no.
> <
> Yes, because it is theoretically possible to make a handful of Carry Save Adders do both 1:15 and 1:9 arithmetic
> <
> No, because it makes the component from which you built the multiplier tree (array) bigger and slower.

Can you put even approximate numbers on the costs?

Like, say a 32x32 -> 64 binary multiplier takes area 1.0.

And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.

And a combined multiplier takes area 1.6 (compared to the sum of the dedicated ones that would be 2.2) and is 3 gate delays slower.

I just guessed those figures as examples; what would more accurate ones look like?

> And finally, we are in an era where we can put several handfuls of cores on a die, each core can have just about as much calculation logic as the designers can figure out how to utilize. So, why not both--the size of the additional multiplication tree is hardly a rounding error in the die size of a GBOoO design.

Right, that's definitely true. I'm just thinking about a hypothetical situation in which a design is heavily constrained by die area, so it's difficult to find enough for both. (Specifically, 1980s alternate history; it seems to me that around the mid eighties, transistor count would be at the point where a multiplier tree was possible but still expensive.)

MitchAlsup

Dec 3, 2022, 12:42:04 PM
On Saturday, December 3, 2022 at 7:28:52 AM UTC-6, Russell Wallace wrote:
> On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:
> > Yes, and no.
> > <
> > Yes, because it is theoretically possible to make a handful of Carry Save Adders do both 1:15 and 1:9 arithmetic
> > <
> > No, because it makes the component from which you built the multiplier tree (array) bigger and slower.
> Can you put even approximate numbers on the costs?
>
> Like, say a 32x32 -> 64 binary multiplier takes area 1.0.
>
> And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.
>
> And a combined multiplier takes area 1.6 (compared to the sum of the dedicated ones that would be 2.2) and is 3 gate delays slower.
<
The first two are "close enough"; good intuition.
<
The combined multiplier takes area 2.0 and is ½ as fast (as a starting point guess).

Scott Lurndal

Dec 3, 2022, 1:01:28 PM
Russell Wallace <russell...@gmail.com> writes:
>On Saturday, December 3, 2022 at 12:04:01 AM UTC, MitchAlsup wrote:
>> Yes, and no.
>> <
>> Yes, because it is theoretically possible to make a handful of Carry Save Adders do both 1:15 and 1:9 arithmetic
>> <
>> No, because it makes the component from which you built the multiplier tree (array) bigger and slower.
>
>Can you put even approximate numbers on the costs?
>
>Like, say a 32x32 -> 64 binary multiplier takes area 1.0.
>
>And its BCD counterpart (8x8 -> 16 digits) takes area 1.2.
>
>And a combined multiplier takes area 1.6 (compared to the sum of the dedicated ones that would be 2.2) and is 3 gate delays slower.
>
>I just guessed those figures as examples; what would more accurate ones look like?
>
>> And finally, we are in an era where we can put several handfuls of cores on a die, each core can have just about as much calculation logic as the designers can figure out how to utilize. So, why not both--the size of the additional multiplication tree is hardly a rounding error in the die size of a GBOoO design.
>
>Right, that's definitely true. I'm just thinking about a hypothetical situation in which a design is heavily constrained by die area, so it's difficult to find enough for both. (Specifically, 1980s alternate history; it seems to me that around the mid eighties, transistor count would be at the point where a multiplier tree was possible but still expensive.)

The Burroughs medium systems line could perform an arithmetic operation
on up to 100 digit/nibble (BCD) operands with a single instruction. This was
implemented originally in discrete (transistor) logic in the B3500 circa
1964-5, then in custom SSI logic in the second generation, standard TTL
DIP logic in the third, and finally in gate arrays (1984-5) for the
fourth (and final) generation V530 system.

Addition was performed starting from the most significant digit which
allowed it to detect overflow before altering the receiving field. The
algorithm was basically to zero extend the shorter of the two operands
and then start adding each digit starting from the MSD. If the first
digits sum to greater than 9, overflow has occurred and the operation
terminates with the Overflow Flag set. Otherwise, if the value is
nine, a counter is incremented and the algorithm proceeds to the next
digit. If the value is less than 9, the counter is cleared. If
a subsequent digit position sums to greater than 9, and the 9's
counter is non-zero, an overflow has occurred and the operation
is terminated with OFL set. Otherwise the result digit is written
to the receiving field.
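
As a rough C sketch of that MSD-first scheme (one possible reading of the
prose above, not the actual flowchart; digit layout and names are assumptions
for illustration): a carry can only ripple out of the field while every digit
to its left has summed to exactly 9, so the check can finish before the
receiving field is touched.

#include <stdbool.h>
#include <stddef.h>

/* Digits are one per byte, MSD first, both operands already
   zero-extended to the same length n. */
static bool bcd_add_overflows(const unsigned char *a,
                              const unsigned char *b, size_t n)
{
    bool all_nines_so_far = true;        /* plays the role of the 9's counter */
    for (size_t i = 0; i < n && all_nines_so_far; i++) {
        unsigned s = a[i] + b[i];
        if (s > 9)
            return true;                 /* a carry would ripple past the MSD */
        if (s < 9)
            all_nines_so_far = false;    /* any later carry is absorbed here */
    }
    return false;
}

/* Only after that check is the receiving field written (LSD-first here
   for simplicity; the hardware wrote MSD-first with carry fix-up). */
static void bcd_add(const unsigned char *a, const unsigned char *b,
                    unsigned char *dst, size_t n)
{
    unsigned carry = 0;
    for (size_t i = n; i-- > 0; ) {
        unsigned s = a[i] + b[i] + carry;
        carry = (s > 9);
        dst[i] = (unsigned char)(s - (carry ? 10 : 0));
    }
}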

ADD AFBF A B C

Where AF is the A operand field length (2 BCD digits) and
BF is the B operand field length. The receiving field length
is MAX(AF, BF).

ADD/SUB/MUL/DIV were all supported for UN (unsigned numeric),
SN (signed number - MSD was sign digit) or UA (unsigned alphanumeric).

UA data was 8-bit (and aligned on even address boundaries); for UA
operands, the arithmetic logic ignored the zone digit and performed
the arithmetic on the numeric digits. The receiving field could be
designated differently from the operand fields. So a UA operand could be
multiplied by a UN operand yielding a SN result. The UA fields
(EBCDIC or ASCII encoding) could then be placed directly to the correct
position in a 132-byte long receiving field. This was, after all,
designed specifically to efficiently support COBOL.

Note that addressing in this system was to the 4-bit digit (nibble);
middle generations used a 16-bit (4-digit) memory access size, while later
generations used a 40-bit (10-digit) access size to the processor cache.

1025475_B2500_B3500_RefMan_Oct69.pdf is on bitsavers, with the
algorithm flowchart on page 51.

Anton Ertl

Dec 3, 2022, 1:21:09 PM
Russell Wallace <russell...@gmail.com> writes:
> (Specifically, 1980s alternate history; it seems to me that around the mid eighties, transistor count would be at the point where a multiplier tree was possible but still expensive.)

The MIPS R3000 (1988) has a 12-cycle non-pipelined multiply (and I
think it inherited this from the R2000 (1986)), so I don't think they
have a multiplier in the modern sense. Early SPARC and HPPA CPUs do
not have an integer multiplier at all. On the 80386 (1985) a 32x32
multiply takes 9-38 cycles, and on a 486 (1989) 13-42 cycles, so they
do not have a modern multiplier, either.

The time for that was slightly later. The 88100 (1988) has a
four-stage multiplier pipeline (for integer and FP), where the integer
part is done after two cycles.

Concerning your question, if you need BCD numbers and need
multiplication of that, the best way IMO is to have a fast BCD->binary
conversion and fast binary->BCD conversion (maybe with hardware
assist), and do the multiplication (and other computations) in binary.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Russell Wallace

Dec 3, 2022, 1:23:33 PM
On Saturday, December 3, 2022 at 5:42:04 PM UTC, MitchAlsup wrote:
> The combined multiplier takes area 2.0 and is ½ as fast (as a starting point guess).

Ah, then I see what you mean about it being not worthwhile to combine them. Thanks!

Russell Wallace

Dec 3, 2022, 1:29:28 PM
On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
> Concerning your question, if you need BCD numbers and need
> multiplication of that, the best way IMO is to have a fast BCD->binary
> conversion and fast binary->BCD conversion (maybe with hardware
> assist), and do the multiplication (and other computations) in binary.

Hmm! It seems to me that each of those conversions is about as expensive as a multiplication, so I'm not seeing the overall saving. Unless you're assuming there will be many calculations to be done before converting back to BCD? That would make sense, except that part of the point of BCD is being able to specify exact decimal rounding, which would require back conversion after each step.

An exception would be something like a transcendental function, where exact decimal rounding is moot, and there are indeed many calculations to be done in the middle. Then it could make sense to convert to binary for the calculation.

Scott Lurndal

Dec 3, 2022, 1:58:03 PM
Russell Wallace <russell...@gmail.com> writes:
>On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
>> Concerning your question, if you need BCD numbers and need
>> multiplication of that, the best way IMO is to have a fast BCD->binary
>> conversion and fast binary->BCD conversion (maybe with hardware
>> assist), and do the multiplication (and other computations) in binary.
>
>Hmm! It seems to me that each of those conversions is about as expensive as a multiplication, so I'm not seeing the overall saving. Unless you're assuming there will be many calculations to be done before converting back to BCD? That would make sense, except that part of the point of BCD is being able to specify exact decimal rounding, which would require back conversion after each step.

The conversions aren't cheap. Here's a patent by one of my ex-colleagues
to do BCD to binary conversions:

https://patents.google.com/patent/US4331951

(The machine Laury was working on for this patent was the
third generation system, the B4900; the D2B and B2D instructions
weren't part of the first or second generations).

Thomas Koenig

Dec 3, 2022, 3:21:03 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Russell Wallace <russell...@gmail.com> writes:
>> (Specifically, 1980s alternate history; it seems to me that around the mid eighties, transistor count would be at the point where a multiplier tree was possible but still expensive.)
>
> The MIPS R3000 (1988) has a 12-cycle non-pipelined multiply (and I
> think it inherited this from the R2000 (1986), so I don't think they
> have a multiplier in the modern sense. Early SPARC and HPPA CPUs do
> not have an integer multiplier at all. On the 80386 (1985) a 32x32
> multiply takes 9-38 cycles, and on a 486 (1989) 13-42 cycles, so they
> do not have a modern multiplier, either.
>
> The time for that was slightly later. The 88100 (1988) has a
> four-stage multiplier pipeline (for integer and FP), where the integer
> part is done after two cycles.
>
> Concerning your question, if you need BCD numbers and need
> multiplication of that, the best way IMO is to have a fast BCD->binary
> conversion and fast binary->BCD conversion (maybe with hardware
> assist), and do the multiplication (and other computations) in binary.

Converting decimal to binary is straightforward; you need only
carry-save adders (plus shifting).

Converting binary to decimal is much more time-consuming. I believe
the standard way is to use the "multiply by the inverse" method to
divide by 5, which sort of defeats the purpose of having it behind
a multiplier.

If somebody has invented a faster way, I'd like to know about it :-)
(And yes, I know the method of first calculating the remainder of
the division by 5 by shifting and adding, which could also be done
with carry-save adders, but it would still need one multiplication
per digit).
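
For concreteness, here is the "multiply by the inverse" division by 10 in C
for a 32-bit value — a minimal sketch (the constant is ceil(2^35/10); the
function names are just for illustration):

#include <stdint.h>

/* Divide by 10 via multiplication by a precomputed inverse:
   0xCCCCCCCD = ceil(2^35 / 10), so the high bits of the 64-bit
   product give the exact quotient for every 32-bit input. */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* One binary->decimal step built on it: peel off the lowest digit. */
static unsigned next_digit(uint32_t *n)
{
    uint32_t q = div10(*n);
    unsigned d = (unsigned)(*n - 10 * q);   /* remainder 0..9 */
    *n = q;
    return d;
}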

Anton Ertl

Dec 4, 2022, 4:14:35 AM
Russell Wallace <russell...@gmail.com> writes:
>Hmm! It seems to me that each of those conversions is about as expensive as a multiplication

Possibly.

>Unless you're assuming there will be many calculations to be done before converting back to BCD?

Yes.

>That would make sense, except that part of the point of BCD is being able to specify exact decimal rounding, which would require back conversion after each step.

It's not clear to me what you mean with "exact decimal rounding", but
I am sure that one can also do it on the binary representation. The
typical rounding rules in commerce were designed for finite-precision
fixed point, which can be represented by binary integers just fine.
The only reason to convert between binary and decimal is I/O.

Thomas Koenig

Dec 4, 2022, 7:40:23 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Russell Wallace <russell...@gmail.com> writes:
>>Hmm! It seems to me that each of those conversions is about as expensive as a multiplication
>
> Possibly.

Much more so.

Like I wrote, decimal->binary is just a carry-save tree.

Binary->decimal is much more difficult. You almost have to do
several rounds of (quotient, remainder) operations, probably using
packages of more than one digit. Remainder is relatively cheap
and can be done with a carry-save adder, conditionally summing up the
residuals (2^n) mod (10^m) for bit n and m the number of decimal
digits you want to do at a time.

But you need a division by 10^m, and if anybody has found a better
way than using the high bits of long multiplication, I haven't
seen it described anywhere.

This makes me understand that people did decimal computers for a time
:-)

Terje Mathisen

Dec 4, 2022, 7:56:51 AM
The answer must be yes, since even I can see at least one way to do it:
Using a binary multiplier with groups of 4x4 input bits generating 8-bit
products, those same groups can obviously also take BCD inputs. :-)
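
A tiny C sketch of that building block — the 4x4 binary product of two digits
plus a fix-up back into two BCD digits (how the per-digit partial products are
then summed still needs decimal adders, as noted in the follow-ups):

#include <stdint.h>

/* A plain binary 4x4 multiplier yields the 8-bit product of two BCD
   digits (0..81); a small fix-up re-expresses it as two packed BCD
   digits, e.g. 7*9 = 63 -> 0x63. */
static uint8_t bcd_digit_mul(uint8_t a, uint8_t b)   /* a, b in 0..9 */
{
    uint8_t p = (uint8_t)(a * b);
    return (uint8_t)(((p / 10) << 4) | (p % 10));
}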

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Quadibloc

Dec 4, 2022, 8:16:44 AM
On Friday, December 2, 2022 at 6:28:55 AM UTC-7, Russell Wallace wrote:

> Can the BCD and binary multipliers share circuitry? I know the answer to the
> corresponding question for adders is yes, because that was done in the 6502,
> but I have no idea whether the same is true for multipliers.

Since it is true for adders, it's also true for multipliers if multiplication is only
being done by addition.

If you are using a faster multiplier design, though, like a tree of carry-save adders,
it's harder to use shared circuitry.

John Savard

Terje Mathisen

Dec 4, 2022, 11:22:13 AM
We've discussed fast ascii/bcd->binary and v.v.: The first is relatively
simple since the naive algorithm is fast (2-4 clock cycles/digit), but
for long inputs you want SIMD, which can at least double the speed.

The reverse operation traditionally requires a DIV & MOD operation for
every digit, but I posted an algorithm here more than two decades ago
that does it far faster, at around 20-50 clock cycles in total depending
upon the maximum number of digits allowed.

For decimal values with a binary mantissa you maintain a decimal exponent,
which makes it easy to detect when you need to perform any rescaling and/or
decimal rounding. This is the BID format allowed by the DFP part of the
IEEE 754-2019 standard.

Terje Mathisen

Dec 4, 2022, 11:41:59 AM
Scott Lurndal wrote:
> Russell Wallace <russell...@gmail.com> writes:
>> On Saturday, December 3, 2022 at 6:21:09 PM UTC, Anton Ertl wrote:
>>> Concerning your question, if you need BCD numbers and need
>>> multiplication of that, the best way IMO is to have a fast BCD->binary
>>> conversion and fast binary->BCD conversion (maybe with hardware
>>> assist), and do the multiplication (and other computations) in binary.
>>
>> Hmm! It seems to me that each of those conversions is about as expensive as a multiplication, so I'm not seeing the overall saving. Unless you're assuming there will be many calculations to be done before converting back to BCD? That would make sense, except that part of the point of BCD is being able to specify exact decimal rounding, which would require back conversion after each step.
>
> The conversions aren't cheap. Here's a patent by one of my ex-collegues
> to do bcd to binary conversions:
>
> https://patents.google.com/patent/US4331951

I did not and will not read that patent, even assuming it is long expired.
>
> (The machine Laury was working on for this patent was the
> third generation system, the B4900; the D2B and B2D instructions
> weren't part of the first or second generations).
>
Today you can convert a 32-digit ascii (or BCD) value to binary using a
single AVX (32-byte) register to hold the input:

0) If BCD, expand the nybbles to bytes, will still fit inside that reg.
1) Expand the 32 bytes into a hi/lo pair of 16 u16 slots: The first will
be worth 1e16 times more than the second when combining them back together at
the end.
2) For each half, multiply successive u16 slots by (10,1,10,1...), then
add each pair together with a horizontal add. This gives us 32 bits
available for each 2-digit pair.
3) Multiply those u32 slots by (10000,1,10000,1...) and add
horizontally. The results will all be less than 1e8 so they fit nicely
4) Now we have 64 bits available so we can safely multiply by
(1e8,1,1e8,1...), giving us a 16-digit intermediate result.
5) Extract the high u64 from the SIMD reg and multiply by 1e16 (using a
64x64->128 MUL), then ADD/ADC the low u64 from the SIMD reg.

To me this looks like 4 multiplications each taking about 5 cycles, plus
some overhead so maybe 30+ clock cycles?

BTW, if the input actually had 34 digits, as in a full DFP128 register,
then I would simply convert those top two digits using integer regs,
which would almost certainly run in parallel with/overlap the SIMD calcs
for no extra real cost.
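
A scalar sketch of the same pairwise combining (no intrinsics), with the
per-round scale factors written out explicitly as 10, 100, 1e4 and then
1e8/1e16 for the last combines; it assumes GCC/Clang's unsigned __int128 for
the final step:

#include <stdint.h>

/* digits[0] is the most significant of 32 ASCII or nibble-expanded BCD
   digits. Each round merges neighbouring groups, scaling the more
   significant one, just like the slot-wise multiplies above. */
static unsigned __int128 digits32_to_bin(const uint8_t digits[32])
{
    uint64_t v[32];
    int n = 32;
    for (int i = 0; i < 32; i++)
        v[i] = digits[i] & 0x0F;                 /* strip ASCII zone / BCD byte */

    for (uint64_t scale = 10; n > 2; scale *= scale) {   /* 10, 100, 1e4, 1e8 */
        for (int i = 0; i < n / 2; i++)
            v[i] = v[2*i] * scale + v[2*i + 1];
        n /= 2;
    }
    /* final combine needs 128 bits: the high half is worth 1e16 */
    return (unsigned __int128)v[0] * 10000000000000000ull + v[1];
}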

Terje Mathisen

Dec 4, 2022, 11:52:46 AM
See my other message: Yes, you can do it far faster by effectively
working modulo 1e5 for u32 and 1e10 for u64. For a full 34-digit DFP
value, the mantissa can be split into three or four parts.

The way it works is by scaling each binary chunk, which effectively
converts it into a fixed-point value with the top digit in the top 4 bits
of a register and the remaining bits containing the fraction. After
extracting the top digit you mask those bits, multiply the remainder by
5 (not 10!) and move the extraction window one bit down. These bits now
contain the next digit. Mul-by-5 happens with LEA or shift+add, so one
or two cycles.

Since you have multiple chunks to work on, all the execution units can
be kept busy spitting out 2-4 digits every 2-3 cycles.
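
A minimal C sketch of one such chunk (8 digits from a value below 1e8); the
scaling constant is ceil(2^57/1e7), chosen so the leading digit lands just
above bit 57, and the *5-and-shift-window step stands in for *10:

#include <stdint.h>

/* One 8-digit chunk (n < 1e8): scale n so its leading decimal digit sits
   just above bit 57, then repeatedly take the top digit, mask it off,
   multiply the fraction by 5 and move the window down one bit, which
   together amount to multiplying by 10. */
static void u8_to_digits(uint32_t n, char out[8])
{
    uint64_t x = (uint64_t)n * 14411518808ull;   /* ceil(2^57 / 1e7) */
    int k = 57;                                  /* current window position */
    for (int i = 0; i < 8; i++) {
        out[i] = (char)('0' + (x >> k));         /* current top digit */
        x &= ((uint64_t)1 << k) - 1;             /* keep only the fraction */
        x *= 5;                                  /* *5 ... */
        k -= 1;                                  /* ... and shift the window */
    }
}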

Anton Ertl

Dec 4, 2022, 6:21:25 PM
Thomas Koenig <tko...@netcologne.de> writes:
>Converting binary to decimal is much more time-consuming.

Apart from Terje Mathisen's approach, there is also the following
approach:

First, we need an operation for doubling a decimal number d, i.e.,
2*d. This can be implemented based on decimal addition and should
work in one cycle (right?). To convert a binary number to decimal, we
do:

/* input: binary */ uint64_t b=...
/* output: */ udecimal d=0;
for (int i=0; i<64; i++) {
  d = 2*d + (udecimal)(b>>63); /* decimal * and + */
  b <<= 1;
}

Once this loop is finished, d contains the decimal representation of
b. Ok, this takes at least 64 cycles if 2*d+[0..1] takes one cycle.
How can we make it faster?

The structure of 64 staggered additions is like in multiplication (but
the problem is probably simpler), so I would not be surprised if the
same techniques as for full multipliers could be applied here to get
something that runs in maybe 4 cycles.

Even if that is infeasible or if we don't want to pay the area price
for that, one may be able to transfer more than one bit per cycle from
b to d, reducing the number of cycles necessary.

Also, b could be split into b/10^8 and b%10^8, and the parts could be
converted in parallel, and combined in the end.
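
One packed-BCD rendering of that loop in C, done with the classic "add 3 to
any nibble >= 5, then shift" doubling (double dabble); a sketch assuming the
result fits in 16 digits:

#include <stdint.h>

/* d = 2*d + (next bit of b) done in decimal, with d held as 16 packed
   BCD digits in a uint64_t: adding 3 to every nibble >= 5 before the
   shift makes the shift carry decimally. Caller must ensure b < 10^16. */
static uint64_t bin_to_bcd16(uint64_t b)
{
    uint64_t d = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t ge5 = (d + 0x3333333333333333ull) & 0x8888888888888888ull;
        d += (ge5 >> 3) * 3;            /* +3 in every nibble that is >= 5 */
        d = (d << 1) | (b >> 63);       /* decimal d = 2*d + top bit of b */
        b <<= 1;
    }
    return d;
}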

Michael S

Dec 4, 2022, 7:35:55 PM
SSEx variant from 10 years ago:
https://groups.google.com/g/comp.arch/c/APbcPjoqs_g/m/pAjyaKjAsDIJ
28 clocks on computer that was considered fast 10 years ago.

Several almost portable variants in GNU dialect of C from 5 years ago:
https://github.com/already5chosen/bin2ascii
The fastest variant runs in 52 clocks on the same computer that was considered
fast 10 years ago. It runs in 43 cycles on computer that was considered fast
(in IPC, not in absolute speed) 5 years ago.

All variants assume full 64-bit binary input == 20-digit output.

I never bothered to upgrade SSE to AVX2, because I can't imagine
a situation where 28 clocks are not sufficiently fast.

Anton Ertl

Dec 5, 2022, 4:33:22 AM
Plus, we can speed it up for small numbers:

>/* input: binary */ uint64_t b=...
>/* output: */ udecimal d=0;
int i=clz(b); /* count leading zeros, 64 for b==0 */
b<<=i;
for (; i<64; i++) {
  d = 2*d + (udecimal)(b>>63); /* decimal * and + */
  b <<= 1;
}

BGB

Dec 5, 2022, 4:58:03 PM
On 12/4/2022 6:56 AM, Terje Mathisen wrote:
> Russell Wallace wrote:
>> Suppose you are implementing a BCD multiplier. Google says there are
>> reasonably efficient known circuits for this. Say 8x8 -> 16 digits,
>> and trying for a reasonable compromise between speed and number of
>> logic gates.
>>
>> And suppose the CPU in which this unit is located, also wants
>> reasonably fast binary 32x32 -> 64 bit multiplication. Same input and
>> output registers, same width operands, as the BCD counterpart.
>>
>> Can the BCD and binary multipliers share circuitry? I know the answer
>> to the corresponding question for adders is yes, because that was
>> done in the 6502, but I have no idea whether the same is true for
>> multipliers.
>
> The answer must be yes, since even I can see at least one way to do it:
> Using a binary multiplier with groups of 4x4 input bits generating 8-bit
> products, those same groups can obviously also take BCD inputs. :-)
>

Though, adding the intermediate results together, and everything from
there on, would need to be different...

AFAIK, you can't just multiply two large BCD numbers together as binary
numbers and then fixup the results afterwards (though, this would be
convenient).


In this case, the BCD adder tree would likely be a bigger issue than the
digit-for-digit BCD multipliers (which, also annoyingly, probably
wouldn't fit all that efficiently in the available LUT sizes on a
typical FPGA).

Most efficient configuration I can currently think up would need ~ 11
LUTs per BCD digit pair. This would first mash the pattern from 8 to 7
bits, and could then use 8x LUT7->1, noting that { A[3:1], B[3:1] }
can be mashed from 6 bits to 5 bits (as they only cover the range 000..100).

Most other patterns I can think up would need significantly more LUTs
per digit.




Though, I can note that a "BCDADD" instruction could potentially be used
to implement a BCD multiplier shift-and-add style.

Say:
R2=Output, R4=Value A, R5=Value B
MOV 0, R2 //init to zeroes
SHLD.Q R4, 0, R16 //first position
SHLD.Q R5, 0, R17 //first digit

TEST R17, R17
BT .done //no more digits remain

TEST 1, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 2, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 4, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 8, R17
BCDADD?F R16, R2
BCDADD R16, R16

SHLD.Q R4, 4, R16 //second position
SHLD.Q R5, -4, R17 //second digit

TEST R17, R17
BT .done //no more digits remain

TEST 1, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 2, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 4, R17
BCDADD?F R16, R2
BCDADD R16, R16
TEST 8, R17
BCDADD?F R16, R2
BCDADD R16, R16

... //for each digit of the second operand

Could also be turned into a loop (for each BCD digit).
Here, R16 would need to be reset every digit mostly because 4x BCDADD
would effectively multiply the value by 16 (decimal), but this could be
skipped if multiplying a BCD number with a binary number.


This process could maybe also be done in hardware.

Would get more convoluted if one needs to deal with numbers larger than
16 decimal digits.


> Terje
>

BGB

Dec 6, 2022, 5:57:07 PM
I have an instruction in BJX2 that can be used for BCD addition, so it
is possible to convert a 64-bit number to 16-digit BCD in around 130
clock-cycles or so (using a sequence of ROTCL and BCDADC instructions).


Seems like it could also be possible to implement a 2x2 or 4x4 digit BCD
multiplier at a "reasonable" cost if needed, which could allow for
"semi-fast" BCD multiply (as 4x4->8 digit steps).

More debatable if it is actually likely to be needed though.

It is likely to be hardly (if ever) used, so it would probably not be
worthwhile even if it were fairly cheap...

> - anton

MitchAlsup

Dec 6, 2022, 6:05:32 PM
Anyone care to speculate on the size of the fixed.point BCD market ?

Scott Lurndal

Dec 6, 2022, 7:37:19 PM
MitchAlsup <Mitch...@aol.com> writes:
>Anyone care to speculate on the size of the fixed.point BCD market ?


"New research on the global scale of the COBOL programming language
suggests that there are upwards of 800 billion lines of COBOL code
being used by organizations and institutes worldwide, some three
times larger than previously estimated."
Feb 9, 2022

MitchAlsup

Dec 6, 2022, 9:10:16 PM
What percent of these would be available to non-Mainframe competition ?
<
That is:: are the COBOL users not locked into whatever they are currently using ?
<
How many of those lines are dependent on BCD ?

robf...@gmail.com

Dec 7, 2022, 12:20:05 AM
>"New research on the global scale of the COBOL programming language
>suggests that there are upwards of 800 billion lines of COBOL code
>being used by organizations and institutes worldwide, some three
>times larger than previously estimated."
>Feb 9, 2022

Looks like an opportunity to design a COBOL oriented processing core.

It has been many years since I took a course in COBOL programming.
Unfortunately, I have lost my COBOL textbook. The PIC statement stands out
in my memory. Having worked on business apps (non COBOL), I have some
interest in a processor geared towards supporting COBOL. I think it may make
for an interesting design. I just looked up the limits for numeric fields on
the web, which are 18 digits. So, an 18.18 fixed point BCD may be enough. To be safe and
have a couple of extra digits for rounding a 20.20 fixed BCD format could be
used. That means processing using 160 plus bit registers assuming values can
fit entirely in registers. To keep things simple, the decimal point would be in a
fixed position all the time. Going away to the drawing board now.

Anton Ertl

Dec 7, 2022, 4:48:52 AM
And even if they can switch, would they buy a specialty computer for
Cobol processing only, or a commodity computer that may or may not
process the COBOL stuff a little slower?

>How many of those lines are dependent on BCD ?

I think that COBOL is abstract enough that it can store the
fixed-point numbers internally as (scaled) binary integers, and only
convert to decimal representation on I/O. But my knowledge of COBOL
is very limited, so my confidence in this statement is low.

Anyway, the fact that Intel added DAA to 8086 (which works on a single
byte), but did not include any additional BCD support (or at least
better support for binary<->BCD conversion) despite adding a huge
number of instructions over the years, some for very specialized uses,
indicates that Intel thinks that the fixed.point BCD market outside of
traditional mainframes is too small for adding even something cheap
like DCOR and IDCOR below.

Similar for other microprocessor manufacturers since 1980: Most did
not add anything; HP's PA included DCOR and IDCOR (these seem to be
word-wide versions of DAA, except that the operand of the addition has
to be pre-biased; DCOR produces an unbiased result, IDCOR biased; page
204 (labeled page 5-113) of
http://www.bitsavers.org/pdf/hp/pa-risc/09740-90039_PA_RISC_1.1_Architecture_and_Instruction_Set_Reference_Manual_Ed3.pdf
and page 163 (7-37) of
https://parisc.wiki.kernel.org/images-parisc/7/73/Parisc2.0.pdf gives
an example, the same one). IIRC IA-64 did not get such instructions,
so apparently the need for BCD had evaporated in the decade between
the design of HPPA and IA-64.

Michael S

Dec 7, 2022, 7:02:46 AM
On Wednesday, December 7, 2022 at 11:48:52 AM UTC+2, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Tuesday, December 6, 2022 at 6:37:19 PM UTC-6, Scott Lurndal wrote:
> >> MitchAlsup <Mitch...@aol.com> writes:
> >> >Anyone care to speculate on the size of the fixed.point BCD market ?
> ><
> >> "New research on the global scale of the COBOL programming language
> >> suggests that there are upwards of 800 billion lines of COBOL code
> >> being used by organizations and institutes worldwide, some three
> >> times larger than previously estimated."
> >> Feb 9, 2022
> ><
> >What percent of these would be available to non-Mainframe competition ?
> ><
> >That is:: are the COBOL users not locked into whatever they are currently using ?
> And even if they can switch, would they buy a specialty computer for
> Cobol processing only, or a commodity computer that may or may not
> process the COBOL stuff a little slower.
> >How many of those lines are dependent on BCD ?
> I think that COBOL is abstract enough that it can store the
> fixed-point numbers internally as (scaled) binary integers, and only
> convert to decimal representation on I/O. But my knowledge of COBOL
> is very limited, so my confidence in this statement is low.
>

If the recommended limits for the scaled decimal type are indeed 1.18.18, I'd take it as
a strong hint toward the standard body's preference for binary representation.
Or, possibly, DPD, but certainly not BCD.
With a pair of binary/DPD 64-bit words one can do 1.18.19, but maybe they
were envisioning smaller machines, where 4x32 is more convenient than 2x64.
Or maybe they just don't like 1.18.19 for aesthetic reasons.

Terje Mathisen

Dec 7, 2022, 8:04:58 AM
I call bullshit on that number!

Assuming that it takes at least 10 seconds to write & debug a line of
COBOL (likely too optimistic by at least one order of magnitude), those
800E9 lines would have taken at least 8E12 seconds or 254K man-years
(working 24/7). With a more sustainable 2000 hours of programming/year
we can multiply that by 4+ to reach at least 1M man-years.

The only possible way to reach that figure is by counting the same code
lines many, many times, as in every installation have some local
modifications, so we count their entire code base as a unique set.

robf...@gmail.com

Dec 7, 2022, 9:17:42 AM

>I call bullshit on that number!

IDK. Seems like it might be reasonable. 1M man years is only 100k
programmers working for 10 years. That was a world-wide number.
Block-copy and paste approach generates a lot of LOC fast.

>If recommended limits for scaled decimal type are indeed 1.18.18

I think that was 18 total digits including the fraction. 1.14.4? I think
128-bit DPD floating-point would work. Gives about 34 digits.

Is there a significant difference in results between scaled integers and
decimal floating point?

Scott Lurndal

Dec 7, 2022, 10:03:09 AM
"robf...@gmail.com" <robf...@gmail.com> writes:
>>"New research on the global scale of the COBOL programming language
>>suggests that there are upwards of 800 billion lines of COBOL code
>>being used by organizations and institutes worldwide, some three
>>times larger than previously estimated."
>>Feb 9, 2022
>
>Looks like an opportunity to design a COBOL oriented processing core.

Like this one?

http://www.bitsavers.org/pdf/burroughs/B2500_B3500/1025475_B2500_B3500_RefMan_Oct69.pdf

This system was designed _specifically_ to run COBOL code.

It was easy to use, easy to debug and relatively efficient for the day.

As pointed out earlier, the final generation of these systems were running
until circa 2010-2012.

>
>It has been many years since I took a course in COBOL programming.
>Unfortunately, I have lost my COBOL textbook. The PIC statement stands out
>in my memory.

The system included an instruction, EDT, which handled formatting
for the PIC clause.



> Having worked on business apps (non COBOL), I have some
>interest in a processor geared towards supporting COBOL. I think it may make
>for an interesting design. Just looked up on the web limits for numeric fields
>which is 18 digits.

The B3500 and successors supported 1 to 100 digit fields. The decimal
point, implied, could be anywhere within the field.


====== Edit (EDT)/OP=49 ======

==== Format ====

^ OP ^ AF ^ BF ^ A Syllable ^ B Syllable ^ C Syllable ^

''OP = 49''

**AF** Not used as //A Syllable// length. **AF** may be indirect or
may specify that the //A Syllable// is a literal.\\
**BF** Number of eight-bit edit-operators and in-line literals in the //B
Syllable//. A value of __00__ is equal to a length of 100 characters. **BF**
may be indirect.

The //A Syllable// is the address of the __source__ field to be edited. Address may be indexed, indirect or extended.
The final address controller data type may be **UN**, **SN** or **UA**.\\
The //B Syllable// is the address of the edit-operator field. Address may be
indexed, indirect or extended. The final address controller data type is
ignored and is treated as **UA**.\\
The //C Syllable// is the address of the __destination__ field. Address may be indexed, indirect or extended.
The final address controller data type must be **UA** or **UN**. Use of
**SN** data type will cause an //Invalid Instruction fault (**IEX = 03**)//. See
[[compatibility_notes:a.13.1|Compatibility Notes A.13.1]].

==== Function ====

The Edit instruction moves digits or characters (depending on the **A**
address controller) from the **A** field to the **C** field under control of
the edit-operators in the **B** field. Characters may be moved, inserted or
deleted according to the edit-operators. Data movement and editing are
stopped by the exhaustion of edit-operators in the **B** field.

At the start of the Edit operation, the comparison flags are set to **EQUAL** and the [[processor_state::overflow_flag|Overflow Flag]] is unconditionally reset. The comparison flags may be set to **HIGH** or **LOW** if any non-zero digits are moved from the source to the destination field.

The source or **A** field is considered positive for unsigned numeric
(**UN**) format. For unsigned alpha (**UA**), the most significant digit of
the most significant character is interpreted as the sign. For signed
numeric (**SN**), the most significant digit of the field is the sign (which
is otherwise ignored).

If the **C** address controller is other than **UA**, only insert the low
order digit of each character transferred into the destination field. Therefore, whenever a blank (**40**) is specified, a 0 will be inserted.

The edit instruction uses an edit table that is located in memory locations
**48**-**63** relative to Base #0. This table may be initialized to any
desired set of insertion characters. By convention all compilers build a default insertion table containing:

^ Entry # ^ Base 0 Address ^ Character ^ Description ^
^ 0 | 48 | **+** | Positive Sign |
^ 1 | 50 | **-** | Negative Sign |
^ 2 | 52 | ***** | Check Suppress Character |
^ 3 | 54 | **.** | Decimal Point |
^ 4 | 56 | **,** | Thousands Separator |
^ 5 | 58 | **$** | Currency Symbol |
^ 6 | 60 | **0** | Leading zero fill Character |
^ 7 | 62 | <blank> | Blank Fill Character |

The edit-operator field consists of a
string of two-digit instructions.
Each instruction is of the form **M**//Av//. The **M** digit is the operation
code portion of the edit-operator. The //Av//
digit is the variant position of the edit-operator.

==== Micro Operators ====

^ Instruction ^^ Variant ^^
^ M ^ Name ^ Av ^ Action ^
| 0 | Move Digit | 0 thru 9 | T <= 1 (Significance)\\ MOVE Av + 1 DIGITS |
| 1 | Move Characters | 0 thru 9 | T <= 1 (Significance)\\ MOVE Av + 1 BYTES |
| 2 | Move Suppress | 0 thru 9 | If T = 1, M <= 0\\ If T = 0, READ each A-Digit then\\ If A-Digit NOT = 0, M <= 0\\ If A-Digit = 0, then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2 |
| 3 | Insert Unconditionally | 0 - 7 | Insert Table Entry 0 - 7 |
| ::: | ::: | 8 | If A = +, Insert table entry 0\\ If A = -, Insert Table Entry 1 |
| ::: | ::: | 9 | If A = +, Insert Blank\\ If A = -, Insert Table Entry 1 |
| ::: | ::: | A | If A = +, Insert Table Entry 0\\ If A = -, Insert blank |
| ::: | ::: | B | Insert Next B character |
| 4 | Insert on Plus | 0 thru B | If A = +, M <= 3\\ If A = - Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |
| 5 | Insert on Minus | 0 thru B | If A = -, M <= 3\\ If A = + Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |
| 6 | Insert Suppress | 0 thru B | If T = 1, M <= 3\\ If T = 0, Then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, Skip next B character |
| 7 | Insert Float | 0 thru B | If T = 1, Move one digit, If Av = B, Skip Next B\\ If T = 0, Read one A-digit, then\\ If A-digit NOT = 0, then T <= 1,\\ If Av = 0 thru 7, Insert Table Entry 0 - 7 and move one digit.\\ If Av = 8 AND A = + then, Insert Table Entry 0 and Move one digit.\\ If Av = 8 and A = - then, Insert Table Entry 1 and Move one digit.\\ If Av = 9 and A = + Then Insert Blank and Move one digit.\\ If Av = 9 and A = -, Then Insert Table Entry 1 and move one digit.\\ If Av = A and A = +, Then Insert Table Entry 0 and move one digit.\\ If Av = A and A = -, Then Insert Blank and move one digit.\\ If Av = B, then Insert Next B character and move one digit.\\ If A-digit = 0, then\\ If Q = 0, Insert Blank\\ If Q = 1, Insert Table Entry 2\\ If Av = B, skip next B character. |
| 8 | End Float | 0 thru B | If T = 1, Then\\ If Av NOT = B, No Operation\\ If Av = B, Skip next B character.\\ If T = 0, M <= 3 |
| 9 | Control | 0 | T <= 0 |
| ::: | ::: | 1 | T <= 1 |
| ::: | ::: | 2 | Q <= NOT Q |
| ::: | ::: | 3 | Skip A digit or character |

T denotes a flag that is set to zero initially and is set to one
(significance) if a digit or character is moved from the source
data field to the destination data field or if the control
edit-op (MAv = 91) is executed. If T is 1, zero suppression is
inhibited.

Q denotes a flag that is set to zero initially. It is set to one
with the control edit-op (MAv = 92) if a check protect or other
character is to be repeated.

=== M = 0, Move Digit (Av = 0-9) ===

**T** is set to 1 (significance).

When both the A- and C-address controllers specify 4-bit format (UN
or SN), Av+1 digits are moved from the source data field to the
destination data field.

When the A- and C-address controllers both specify 8-bit format
(UA), the numeric portion of Av+1 characters in the source data
field are moved to the destination data field with the zone digit
set to the EBCDIC numeric subset code (F).

When the A- and C-address controllers specify UA and UN
respectively, then only the numeric portion of Av +1 characters in
the source data field are moved to the destination data field.

When the A- and C-address controllers specify either UN or SN and
UA, respectively, then Av+1 digits in the source data field are
moved to the destination data field and the zone (high order digit)
of each character to be written is set to the EBCDIC numeric subset
code (F).

=== M = 1, Move character (Av = 0-9) ===

**T** is set to 1 (significance).

When the A- and C-address controllers both specify 4-bit format (UN
or SN), Av+1 digits are moved from the source data field to the
destination data field.

When the A- and C-address controllers both specify 8-bit format
(UA), Av+1 characters are moved from the source data field
unchanged to the destination data field.

When the A- and C-address controllers specify UA and UN
respectively, only the numeric portion of Av +1 characters in the
source data field is moved to the destination data field.

When the A-address and C-address controllers specify either UN or
SN and UA, respectively, Av+1 digits in the source data field are
moved to the destination data field and the zone (high order digit)
of each character to be written is set to the EBCDIC numeric subset
code (F).

=== M = 2, Move suppress (Av = 0-9) ===

If **T** is 1 (significance), the operation move digit (M = 0) is
performed.

If **T** is zero and the first source digit or the low order digit of
the first character has a value of zero, and **Q** is zero, a blank
(40) is inserted into the destination data field.

If **Q** is 1, the edit table value at base+52 is inserted into the
destination data field. Av+1 indicates the number of
digits/characters to be examined.

If **T** is zero and the first source digit, and the low-order digit of
the first character has a value other than zero (significance),
then the operation move digit (M = 0) is performed.

=== M = 3, Insert unconditionally (Av = 0-9, A, or B) ===

If Av equals 0-7, insert a character from the edit table at
base 0+48+2Av into the destination data field.

If Av equals 8 and the sign of the source data field is positive
(+), the edit table entry at base 0+48 is inserted into the
destination data field.

If Av equals 8 and the sign of the source data field is minus (-),
the edit table entry at base 0+50 is inserted into the destination
data field.

If Av equals 9 and the sign of the source data field is positive
(+), a blank (40) is inserted into the destination data field.

If Av equals 9 and the sign of the source data field is minus (-),
the edit table entry at base 0+50 is inserted into the destination
data field.

If Av equals A and the sign of the source data field is positive
(+), the edit table entry at base 0+48 is inserted into the
destination data field.

If Av equals A and the sign of the source data field is minus (-),
a blank (40) is inserted into the destination data field.

If Av equals B, the next character in the edit-op field is inserted
into the destination data field.

=== M = 4, Insert on plus (Av = 0-9, A, or B) ===

If the sign of the source data field is positive (+) the operation
Insert Unconditionally (M = 3) is performed.

If the sign of the source data field is minus (-) and **Q** is zero, a
blank (40) is inserted into the destination data field, and if
Av=B, the next character in the edit-op field is skipped. However,
if there are no characters left to skip in the edit-op field, an
invalid instruction fault (IEX=07) occurs.

If the sign of the source data field is minus (-) and **Q** is one, the
edit table entry at base 0+52 is inserted into the destination data
field, and if Av=B, the next character in the edit-op field is
skipped. However, if there are no characters left to skip in the
edit-op field, an invalid instruction fault (IEX=07) occurs.

=== M = 5, Insert on minus (Av = 0-9, A, or B) ===

If the sign of the source data field is minus (-) the insert
unconditionally (M=3) operation is performed.

If the sign of the source data field is positive (+) and **Q** is zero,
a blank (40) is inserted into the destination data field, and if
Av=B, the next character in the edit-op field is skipped. However,
if there are no characters left to skip in the edit-op field, then
an invalid instruction fault (IEX=07) occurs.

If the sign of the source data field is positive (+) and **Q** is one,
the edit table entry at base 0+52 is inserted into the destination
data field, and if Av=B, the next character in the edit-op field is
skipped. However, if there are no characters left to skip in the
edit-op field, an invalid instruction fault (IEX=07) occurs.

=== M = 6, Insert suppress (Av = 0-9 A, or B) ===

If **T** is one (significance), the insert unconditionally (M=3)
operation is performed.

If **T** is zero and **Q** is zero, a blank (40) is inserted into the
destination data field, and if Av=B, the next character in the
edit-op field is skipped. However, if there are no characters left
to skip in the edit-op field, then an invalid instruction fault
(IEX=07) occurs.

If **T** is zero and **Q** is one, a character from the edit table at
base 0+52 is inserted into the destination data field, and if Av=B,
the next character in the edit-op field is skipped. However, if
there are no characters left to skip in the edit-op field, then an
invalid instruction fault (IEX=07) occurs.

If **T** is zero and Av equals B, the next character in the edit-op
field is skipped. However, if there are no characters left to skip
in the edit-op field, an invalid instruction fault (IEX=07) occurs.

=== M = 7, Insert float (Av = 0-9, A, or B) ===

If **T** is one (significance), the operation move digit (M=0, Av=0) is
performed.

If **T** is one (significance) and Av is a B, the next character in the
edit-op field is skipped. However, if there are no characters left
to skip in the edit-op field, an invalid instruction fault (IEX=07)
occurs.

If **T** is zero and the source digit (AC=SN or UN) or the low order
digit of the source character (AC=UA) has a value of zero and **Q** is
zero, a blank (40) is inserted into the destination data field, and
if Av=B, the next character in the edit-op field is skipped.
However, if there are no characters left to skip in the edit-op
field, then an invalid instruction fault (IEX=07) occurs.

If **Q** is one, the edit table entry at base 0+52 is inserted into the
destination data field, and if Av=B, the next character in the
edit-op field is skipped. However, if there are no characters left
to skip in the edit-op field, then an invalid instruction fault
(IEX=07) occurs.

If **T** is zero and the source digit (AC=SN or UN) or the low order
digit of the source character (AC=UA) has a value of other than
zero, the insert unconditionally (M=3) operation is performed, **T** is
set to one and then the move digit (M=0, Av=0) operation is
performed.

=== M = 8, End float (Av=0-9, A, or B) ===

If **T** is one (significance) and Av is not equal to a B, no operation
is performed.

If **T** is one (significance) and Av equals a B, the next character in
the edit-op field is skipped.

If **T** is zero, the insert unconditionally (M=3) operation is
performed.

=== M = 9, Control (Av=0-3) ===

This edit-op performs a control function based on the variant (Av).

^ Variant ^ Action ^
| 0 | Set T to 0 |
| 1 | Set T to 1 |
| 2 | Complement Q |
| 3 | Skip the Source Data Field Digit or Character |

^ Note: Use of undigits A through F for M or values for Av not specified above will cause an Invalid Instruction fault (IEX=07) (see [[compatibility_notes:A.13.2|Compatibility Notes A.13.2]]). ^


==== Overflow/Comparison Flags ====

Set the [[processor_state:comparison_flags|Comparison Flags]] to **HIGH** if
any numeric digits moved from the source data field are non-zero and the sign
of the source field is interpreted as positive.

Set the [[processor_state:comparison_flags|Comparison Flags]] to **LOW** if
any numeric digits moved from the source data field are non-zero and the sign
of the source field is interpreted as negative.

Set the [[processor_state:comparison_flags|Comparison Flags]] to **EQUAL** if
all numeric digits moved from the source data field are equal to zero or if
no character or digit was moved from the source data field.

Insertion character values do not affect the comparison flags.

Reset the [[processor_state::overflow_flag|Overflow Flag]].

==== Overlap ====

Overlap of the **A**, **B** or **C** fields in any manner
may produce incompatible results. See
[[compatibility_notes:a.13.3|Compatibility Notes A.13.3]].



==== Examples ====

=== Example (1) Edit ===

^ OP ^ AF ^ BF ^ A Syllable ^ B Syllable ^ C Syllable ^
| 49 | 00 ^ 01 | A Field (UA) | B Field (UA) | C Field (UA) |

A Field 010203
B Field 02
C Field (After) F1F2F3
Comparison (After) HIGH
Overflow (After) NO

=== Example (2) Edit ===

^ OP ^ AF ^ BF ^ A Syllable ^ B Syllable ^ C Syllable ^
| 49 | 00 ^ 22 | A Field (SN) | B Field (UA) | C Field (UA) |

A Field C0 01 30 59
B Field 4B D7 4B C1 4B D8 37 92 75 64 75
75 75 85 93 33 01 92 5B C3 5B D9
TABLE(48-62) 4E 60 5C 4B 6B 5B F0 40
+ - * . , $ 0 b
C Field (After) D7 C1 E8 40 5C 5C 5C 5B F1 F3 4B F5 F9 40 40
P A Y b * * * $ 1 3 . 5 9 b b
Comparison (After) HIGH
Overflow (After) NO

Note b = Blank

Scott Lurndal

Dec 7, 2022, 10:04:57 AM
It was apparently not interesting for Intel's market, but that
doesn't mean that a business-oriented processor isn't interesting.

Note that AMD removed DAA et al. when they designed the 64-bit
extensions that Intel subsequently adopted.

Scott Lurndal

Dec 7, 2022, 10:07:12 AM
"robf...@gmail.com" <robf...@gmail.com> writes:
>
>>I call bullshit on that number!
>
>IDK. Seems like it might be reasonable. 1M man years is only 100k
>programmers working for 10 years. That was a world-wide number.
>Block-copy and paste approach generates a lot of LOC fast.

COBOL has been written for over 60 years now, and some of that
older code is still in production (with updates over the decades).

A fair fraction is generated by 4GL generators.

Stephen Fuld

Dec 7, 2022, 11:59:25 AM
On 12/7/2022 1:01 AM, Anton Ertl wrote:

Snip

> I think that COBOL is abstract enough that it can store the
> fixed-point numbers internally as (scaled) binary integers, and only
> convert to decimal representation on I/O.

Sort of. The format of all variables is specified in the PIC clause,
specifically the USAGE sub clause. The usage can be DISPLAY (e.g. ASCII
or EBCDIC), or COMPx, where x specifies the specifics e.g. COMP1 is
binary fixed point, COMP2 is single precision floating point, COMP3 is
packed decimal or BCD, etc. While the internal mechanism of
calculations is not directly specified, whenever a value is stored into
a variable it must be in the format of that variable, whether or not any
I/O will be done on that variable. Of course, the compiler must assure
that the internal calculations don't lose any needed precision.

e.g. If you write COMPUTE A EQUALS B + C, the compiler is free to
convert B and C from whatever they were (even DISPLAY), to whatever it
wants internally for the addition, but the end result must be converted
to the type and size (again, even DISPLAY) of A before being stored into
the variable. Of course, efficiency concerns suggest to the programmer
to specify the usage "correctly".

> But my knowledge of COBOL
> is very limited, so my confidence in this statement is low.

Mine didn't use to be so limited, but it is decades old, so may be
obsolete. :-)


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Stephen Fuld

Dec 7, 2022, 1:06:45 PM
On 12/6/2022 9:20 PM, robf...@gmail.com wrote:
>> "New research on the global scale of the COBOL programming language
>> suggests that there are upwards of 800 billion lines of COBOL code
>> being used by organizations and institutes worldwide, some three
>> times larger than previously estimated."
>> Feb 9, 2022
>
> Looks like an opportunity to design a COBOL oriented processing core.

Perhaps. While it might be an interesting intellectual problem, I don't
think it is much of a commercial opportunity. Remember, these programs
are running under an operating system that is far from any Unix variant
(e.g. MVS, MCP or even OS2200), and are probably intimately tied to some
sort of transaction processing middleware (e.g. CICS, TIP, etc.), and
may use a database not at all compatible with any relational/SQL model
(e.g. IMS, DMS, DMS 2200). So for someone to convert to your new
hypothetical processor, the conversion costs would be so high as to
overwhelm any potential performance advantage on the COBOL part.

MitchAlsup

Dec 7, 2022, 1:22:34 PM
On Wednesday, December 7, 2022 at 9:03:09 AM UTC-6, Scott Lurndal wrote:
This was my first job as a professional engineer (coder). I was given 256 bytes
of memory footprint, and all of the symbols above. My job was to create the
code which did that EDIT function (above). Turns out in 8085 ASM you can code
that EDIT in 200 (±10: it's been a long time) bytes. {And the thousands ',' and
decimal point '.' could be interchanged for European markets.}

Anton Ertl

Dec 8, 2022, 3:26:35 AM
"robf...@gmail.com" <robf...@gmail.com> writes:
[800G lines of COBOL]
>>I call bullshit on that number!
>
>IDK. Seems like it might be reasonable. 1M man years

assumes 800k lines per man-year, about 3k lines per day. I remember a
number by Siemens of 1500 debugged and documented lines per man-year.
With that number, the 800G lines would require 533M man-years.

>is only 100k
>programmers working for 10 years.

Starting with 533M man-years, we would then have 53M COBOL
programmers.

>I think that was 18 total digits including the fraction. 1.14.4? I think
>128-bit DPD floating-point would work. Gives about 34 digits.
>
>Is there a significant difference in results between scaled integers and
>decimal floating point?

I think that DFP loses precision on overflow, while scaled integers
should trap. But of course the idea of fixed point is that the
programmer has provided enough bits that there is no overflow (did not
work so well for the Ariane 5).

Anton Ertl

Dec 8, 2022, 3:38:05 AM
sc...@slp53.sl.home (Scott Lurndal) writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>Anyway, the fact that Intel added DAA to 8086 (which works on a single
>>byte), but did not include any additional BCD support (or at least
>>better support for binary<->BCD conversion) despite adding a huge
>>number of instructions over the years, some for very specialized uses,
>>indicates that Intel thinks that the fixed.point BCD market outside of
>>traditional mainframes is to small for adding even something cheap
>>like DCOR and IDCOR below.
>
>It was apparently not interesting for Intel's market, but that
>doesn't mean that a business-oriented processor isn't interesting.

Intel's CPUs are general-purpose, and that includes businesses. And
businesses buy a lot of Intel CPUs and do a lot of financial
processing with them.

There is a market for CPUs with decimal instructions: The market of
backwards-compatible mainframes.

As a thought experiment, if there was a world-wide ban of these legacy
mainframe instruction sets (including emulators), effective, say 2030,
would anyone design a new CPU with BCD support for the ports of the
mainframe software, and would they be a business success?

Maybe for cheap ports of mainframe software by transliterating the
assembly language of the mainframe applications. So maybe Intel would
add BCD support to their SIMD instructions.

If the applications were rewritten starting from the requirements, I
doubt that BCD support would provide a significant benefit.

>Note that AMD removed DAA et. al. when they designed the 64-bit
>extensions that Intel subsequently adopted.

Even if you have BCD data, I doubt that you would use the single-byte
DAA instruction to deal with it.

Terje Mathisen

unread,
Dec 8, 2022, 4:01:24 AM12/8/22
to
That really does not count, or it should not?

That's like counting all the binaries ever generated by compilers,
instead of the unique source codes used to generate them.

Terje Mathisen

unread,
Dec 8, 2022, 4:13:35 AM12/8/22
to
Anton Ertl wrote:
> sc...@slp53.sl.home (Scott Lurndal) writes:
> There is a market for CPUs with decimal instructions: The market of
> backwards-compatible mainframes.

No, there is a market for cpus which can do decimal arithmetic, which is
100% of them.
>
> As a thought experiment, if there was a world-wide ban of these legacy
> mainframe instruction sets (including emulators), effective, say 2030,
> would anyone design a new CPU with BCD support for the ports of the
> mainframe software, and would they be a business success?
>
> Maybe for cheap ports of mainframe software by transliterating the
> assembly language of the mainframe applications. So maybe Intel would
> add BCD support to their SIMD instructions.
>
> If the applications were rewritten starting from the requirements, I
> doubt that BCD support would provide a significant benefit.
>
>> Note that AMD removed DAA et. al. when they designed the 64-bit
>> extensions that Intel subsequently adopted.
>
> Even if you have BCD data, I doubt that you would use the single-byte
> DAA instruction to deal with it.

Right!

As soon as we had wider registers, we started to write in-register SIMD
code to process 8 nybbles in a 32-bit register. With 16/32/64-byte
vector regs, it becomes trivial. This has the additional benefit that
with 8 (or even 16) bits per digit position you can do a lot of
additions (or even mul-acc) before you have to care about potential
overflow.
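
For illustration, a minimal C sketch of that in-register (SWAR) approach (not
Terje's actual code; the function name is made up), adding two 8-digit
packed-BCD values held in 32-bit registers:

#include <stdint.h>

/* Add two 8-digit packed-BCD numbers held in 32-bit registers.
   A carry out of the top digit is discarded. */
static uint32_t bcd_add8(uint32_t a, uint32_t b)
{
    uint64_t t1 = (uint64_t)a + 0x66666666u;            /* pre-bias every digit by 6 */
    uint64_t t2 = t1 + b;                               /* plain binary add */
    uint64_t carries  = (t1 ^ b ^ t2) & 0x111111110ull; /* carry out of each nibble */
    uint64_t no_carry = ~carries & 0x111111110ull;      /* digits that kept their bias */
    uint64_t fix = (no_carry >> 2) | (no_carry >> 3);   /* 6 in every such digit */
    return (uint32_t)(t2 - fix);
}

With one digit per byte or per 16-bit lane, as described above, the correction
step can be postponed across many additions, which is where the "do a lot of
additions before worrying about overflow" observation comes from.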

John Dallman

unread,
Dec 8, 2022, 6:21:25 AM12/8/22
to
In article <tms950$18u3$1...@gioia.aioe.org>, terje.m...@tmsw.no (Terje
Mathisen) wrote:

> Scott Lurndal wrote:
> > A fair fraction is generated by 4GL generators.
> >
> That really does not count, or it should not?
>
> That's like counting all the binaries ever generated by compilers,
> instead of the unique source codes used to generate them.

You have to consider the organisations that will take the Cobol output
from the 4GL, and start customising it. Yes, there are people dumb enough
to do this, with managers fool enough to support them.

I've seen someone try it, with a system that generated five million lines
of C, essentially being used as a high-level assembler, from four million
lines of a domain-specific language. He did this because he "really
wanted to work in Visual Studio" and that doesn't understand the
domain-specific language. He claimed he was going to put his changes in
the C back into the DSL, but found this very hard. Management had decided
to let him find out for himself that his ideas made no sense.

Since the source-management for the system in question will not let you
check in C files, he eventually complied with local working practice, but
got a job in a different department ASAP. He was not missed.

John

Scott Lurndal

unread,
Dec 8, 2022, 9:08:23 AM12/8/22
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>"robf...@gmail.com" <robf...@gmail.com> writes:
>[800G lines of COBOL]
>>>I call bullshit on that number!
>>
>>IDK. Seems like it might be reasonable. 1M man years
>
>assumes 800k lines per man-year, about 3k lines per day. I remember a
>number by Siemens of 1500 debugged and documented lines per man-year.
>With that number, the 800G lines would require 533M man-years.

A significant fraction of COBOL code is generated (by various 4GLs);
not written by individual programmers.

And COBOL programmers have been programming for more than sixty
years now.

Thomas Koenig

unread,
Dec 8, 2022, 3:04:27 PM12/8/22
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> sc...@slp53.sl.home (Scott Lurndal) writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>>Anyway, the fact that Intel added DAA to 8086 (which works on a single
>>>byte), but did not include any additional BCD support (or at least
>>>better support for binary<->BCD conversion) despite adding a huge
>>>number of instructions over the years, some for very specialized uses,
>>>indicates that Intel thinks that the fixed-point BCD market outside of
>>>traditional mainframes is too small for adding even something cheap
>>>like DCOR and IDCOR below.
>>
>>It was apparently not interesting for Intel's market, but that
>>doesn't mean that a business-oriented processor isn't interesting.
>
> Intel's CPUs are general-purpose, and that includes businesses. And
> businesses buy a lot of Intel CPUs and do a lot of financial
> processing with them.
>
> There is a market for CPUs with decimal instructions: The market of
> backwards-compatible mainframes.

Not only that. POWER is not a mainframe architecture, and it
supports DFP and BCD, plus conversion between the two - probably
to support System i or whatever it is called this week.

John Levine

unread,
Dec 8, 2022, 9:15:15 PM12/8/22
to
According to Thomas Koenig <tko...@netcologne.de>:
>> There is a market for CPUs with decimal instructions: The market of
>> backwards-compatible mainframes.
>
>Not only that. POWER is not a mainframe architecture, and it has
>supports DFP and BCD, plus conversion between the two - probably
>to support System i or whatever it is called this week.

I would argue that IBM i (which I think is its current name) is a
small mainframe which means POWER indeed is a mainframe architecture.
We can have endless amusing arguments about how you define a mainframe
(I claim that a PDP-10 was a very large mini, not a mainframe) but if
you agree a mainframe has a lot of I/O capacity and reliability
features, IBM i on POWER qualifies.


--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Anton Ertl

unread,
Dec 9, 2022, 4:29:29 AM12/9/22
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> There is a market for CPUs with decimal instructions: The market of
>> backwards-compatible mainframes.
>
>Not only that. POWER is not a mainframe architecture, and it has
>supports DFP and BCD, plus conversion between the two - probably
>to support System i or whatever it is called this week.

Power is the architecture of System i or whatever it is called this
week. How is it not in the market of backwards-compatible mainframes?

Would they have added these features if Power was only used for System
p or whatever it is called this week? I doubt it.

BTW, I did not notice BCD support when I last looked at the PowerPC
instruction set. I found
https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf
which tells me:

|Summary of Changes in Power ISA Version 3.0 B
|
|Decimal Integer Support Operations: Adds new BCD support
|instructions, including variable-length load/store instructions for
|bcd values, new format conversion instructions between BCD and
|National decimal, zoned decimal, and 128-bit signed integer formats.
|new BCD truncate, round, and shift instructions, new BCD sign digit
|manipulation instructions. Also adds multiply-by-10 instructions to
|facilitate binary-to-decimal conversion for printf. Corrected
|functionality of Decimal Shift and Round (bcdsr.) instruction.

Power 3.0 was released in 2015, Power 3.0B in 2017. It's interesting
that Power(PC)-based CPUs have been used for AS/400 and its successors
since the 1990s (and actually identical hardware since 2008), and
apparently worked satisfactorily without BCD support for several
decades, and they decided to add BCD support eventually. The reasons
for that would be interesting.

Michael S

unread,
Dec 9, 2022, 4:53:51 AM12/9/22
to
On Friday, December 9, 2022 at 11:29:29 AM UTC+2, Anton Ertl wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
> >Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >> There is a market for CPUs with decimal instructions: The market of
> >> backwards-compatible mainframes.
> >
> >Not only that. POWER is not a mainframe architecture, and it has
> >supports DFP and BCD, plus conversion between the two - probably
> >to support System i or whatever it is called this week.
> Power is the architecture of System i or whatever it is called this
> week. How is it not in the market of backwards-compatible mainframes?
>

IBM does not call system i a mainframe.
According to what I was told (~30 years ago) the primary business-oriented
programming language of system i (still AS/400 back then) is not COBOL,
but IBM RPG.
https://en.wikipedia.org/wiki/IBM_RPG
RPG supports packed/unpacked fixed-point decimal types, but the only supported
floating-point type is binary.

Michael S

unread,
Dec 9, 2022, 5:13:18 AM12/9/22
to
On Friday, December 9, 2022 at 4:15:15 AM UTC+2, John Levine wrote:
> According to Thomas Koenig <tko...@netcologne.de>:
> >> There is a market for CPUs with decimal instructions: The market of
> >> backwards-compatible mainframes.
> >
> >Not only that. POWER is not a mainframe architecture, and it has
> >supports DFP and BCD, plus conversion between the two - probably
> >to support System i or whatever it is called this week.
> I would argue that IBM i (which I think is its current name) is a
> small mainframe which means POWER indeed is a mainframe architecture.
> We can have endless amusing arguments about how you define a mainframe
> (I claim that a PDP-10 was a very large mini. not a mainframe) but if
> you agree a mainframe has a lot of I/O capacity and reliability
> features, IBM i on POWER qualifies.
>

Before we agree about that we should agree on definition of I/O capacity.

Michael S

unread,
Dec 9, 2022, 6:03:35 AM12/9/22
to
On Friday, December 9, 2022 at 12:13:18 PM UTC+2, Michael S wrote:
> On Friday, December 9, 2022 at 4:15:15 AM UTC+2, John Levine wrote:
> > According to Thomas Koenig <tko...@netcologne.de>:
> > >> There is a market for CPUs with decimal instructions: The market of
> > >> backwards-compatible mainframes.
> > >
> > >Not only that. POWER is not a mainframe architecture, and it has
> > >supports DFP and BCD, plus conversion between the two - probably
> > >to support System i or whatever it is called this week.
> > I would argue that IBM i (which I think is its current name) is a
> > small mainframe which means POWER indeed is a mainframe architecture.
> > We can have endless amusing arguments about how you define a mainframe
> > (I claim that a PDP-10 was a very large mini. not a mainframe) but if
> > you agree a mainframe has a lot of I/O capacity and reliability
> > features, IBM i on POWER qualifies.
> >
> Before we agree about that we should agree on definition of I/O capacity.

To expand on the above point, 14 years ago I counted the aggregate bandwidth of
the biggest SMP machines from various vendors/series. They were all very
close, but still IBM z came last.
https://www.realworldtech.com/forum/?threadid=86863&curpostid=86872

14 years later I'd guess the outcome would be different and more favorable
for IBM z, but only because all the others, including IBM's own POWER group,
lost interest in this sort of huge SMP computer.

Thomas Koenig

unread,
Dec 9, 2022, 7:30:03 AM12/9/22
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>> There is a market for CPUs with decimal instructions: The market of
>>> backwards-compatible mainframes.
>>
>>Not only that. POWER is not a mainframe architecture, and it has
>>supports DFP and BCD, plus conversion between the two - probably
>>to support System i or whatever it is called this week.
>
> Power is the architecture of System i or whatever it is called this
> week. How is it not in the market of backwards-compatible mainframes?

It is a successor in the https://en.wikipedia.org/wiki/IBM_System/38 ,
which was a midrange computer (which others would call a mini).
These systems are not compatible with /360 and successors, and run
different operating systems. About the only similarities are EBCDIC
and commercial applications :-)

POWER also runs AIX and Linux, of course.

MitchAlsup

unread,
Dec 9, 2022, 11:08:12 AM12/9/22
to
On Friday, December 9, 2022 at 4:13:18 AM UTC-6, Michael S wrote:
> On Friday, December 9, 2022 at 4:15:15 AM UTC+2, John Levine wrote:
> > According to Thomas Koenig <tko...@netcologne.de>:
> > >> There is a market for CPUs with decimal instructions: The market of
> > >> backwards-compatible mainframes.
> > >
> > >Not only that. POWER is not a mainframe architecture, and it has
> > >supports DFP and BCD, plus conversion between the two - probably
> > >to support System i or whatever it is called this week.
> > I would argue that IBM i (which I think is its current name) is a
> > small mainframe which means POWER indeed is a mainframe architecture.
> > We can have endless amusing arguments about how you define a mainframe
> > (I claim that a PDP-10 was a very large mini. not a mainframe) but if
> > you agree a mainframe has a lot of I/O capacity and reliability
> > features, IBM i on POWER qualifies.
> >
> Before we agree about that we should agree on definition of I/O capacity.
>
A mainframe can support I/O capacity of::
a) at least 50 SATA drives
b) at least 50 I/O commands in flight at once
c) I/O can consume at least 50% of total available memory (DRAM) bandwidth
d) hot plug of peripherals and other devices
e) device recovery and use simultaneously
<
And nothing prevents a well designed PCIe from being able to do all this.

Anton Ertl

unread,
Dec 9, 2022, 1:10:02 PM12/9/22
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> Thomas Koenig <tko...@netcologne.de> writes:
>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>> There is a market for CPUs with decimal instructions: The market of
>>>> backwards-compatible mainframes.
>>>
>>>Not only that. POWER is not a mainframe architecture, and it has
>>>supports DFP and BCD, plus conversion between the two - probably
>>>to support System i or whatever it is called this week.
>>
>> Power is the architecture of System i or whatever it is called this
>> week. How is it not in the market of backwards-compatible mainframes?
>
>It is a successor in the https://en.wikipedia.org/wiki/IBM_System/38 ,
>which was a midrange computer (which others would call a mini).

System/38 has a 48-bit address space, while the contemporary S/370s
have 31-bit address space.

And there was also https://en.wikipedia.org/wiki/IBM_4300 which are
described as "mid-range systems", but they were compatible with S/370.

Anyway, the relevant part is not whether marketing calls it a
mainframe or not, but whether it has a lot of commercial legacy
software.

>POWER also runs AIX and Linux, of course.

Do you think they added the BCD-supporting instructions for AIX and
Linux?

John Levine

unread,
Dec 9, 2022, 2:15:03 PM12/9/22
to
According to Michael S <already...@yahoo.com>:
>> >> There is a market for CPUs with decimal instructions: The market of
>> >> backwards-compatible mainframes.
>> >
>> >Not only that. POWER is not a mainframe architecture, and it has
>> >supports DFP and BCD, ...

>> I would argue that IBM i (which I think is its current name) is a
>> small mainframe which means POWER indeed is a mainframe architecture.
>> We can have endless amusing arguments about how you define a mainframe
>> (I claim that a PDP-10 was a very large mini. not a mainframe) but if

>Before we agree about that we should agree on definition of I/O capacity.

Hey, I said we can have endless amusing arguments. Even in the 1970s you
could attach the same kinds of disks to a mini that you could to a mainframe
so it's not raw bandwidth. It's more things like I/O channels that can do
significant work without interrupting the CPU and have provisions for
diagnosing and recovering from errors rather than just crashing.

John Levine

unread,
Dec 9, 2022, 2:17:40 PM12/9/22
to
According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>Power 3.0 was released in 2015, Power 3.0B in 2017. It's interesting
>that Power(PC)-based CPUs have been used for AS/400 and its successors
>since the 1990s (and actually identical hardware since 2008), and
>apparently worked satisfyingly without BCD support for several
>decades, and they decided to add BCD support eventually. The reasons
>for that would be interesting.

Probably the same reason they add instructions to zSeries: they found
hot spots and moved them into microcode. Remember that IBM i uses a
virtual machine language that is compiled into native code the first
time a program is run, so the new instructions get used automagically
when the virtual code is recompiled on the new machine.

John Levine

unread,
Dec 9, 2022, 2:23:03 PM12/9/22
to
According to Michael S <already...@yahoo.com>:
>IBM does not call system i a mainframe.
>According to what I was told (~30 years ago) the primary business-oriented
>programming language of system i (still AS/400 back then) is not COBOL,
>but IBM RPG.

RPG is a fine little language if you want to do the kind of stuff it
does well, tabular reports with summaries. It was very popular on
small IBM 360s which were definitely mainframes.

Since it is now over 60 years old, it has grown past all recognition and
I hear you can write web applications in it.

Thomas Koenig

unread,
Dec 9, 2022, 3:42:58 PM12/9/22
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>> There is a market for CPUs with decimal instructions: The market of
>>>>> backwards-compatible mainframes.
>>>>
>>>>Not only that. POWER is not a mainframe architecture, and it has
>>>>supports DFP and BCD, plus conversion between the two - probably
>>>>to support System i or whatever it is called this week.
>>>
>>> Power is the architecture of System i or whatever it is called this
>>> week. How is it not in the market of backwards-compatible mainframes?
>>
>>It is a successor in the https://en.wikipedia.org/wiki/IBM_System/38 ,
>>which was a midrange computer (which others would call a mini).
>
> System/38 has a 48-bit address space, while the contemporary S/370s
> have 31-bit address space.
>
> And there was also https://en.wikipedia.org/wiki/IBM_4300 which are
> described as "mid-range systems", but they were compatible with S/370.
>
> Anyway, the relevant part is not whether marketing calls it a
> mainframe or not, but whether it has a lot of commercial legacy
> software.

You could do that using GNU Cobol :-)

>
>>POWER also runs AIX and Linux, of course.
>
> Do you think they added the BCD-supporting instructions for AIX and
> Linux?

They certainly did not add them for the operating systems; they
added them for applications, which people could run.

Hm. Looking at https://en.wikipedia.org/wiki/IBM_RS64oha ,
it has a link to a nice article which details that these
processors could (hugely simplified) run in different modes,
selected by mode bits, which apart from 32- and 64-bit
versions of POWER and PowerPC also included AS/400.

IBM probably just made these instructions available for the
POWER modes, as well. Dark silicon, indeed.

Thomas Koenig

unread,
Dec 9, 2022, 3:49:37 PM12/9/22
to
John Levine <jo...@taugh.com> schrieb:
> According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>>Power 3.0 was released in 2015, Power 3.0B in 2017. It's interesting
>>that Power(PC)-based CPUs have been used for AS/400 and its successors
>>since the 1990s (and actually identical hardware since 2008), and
>>apparently worked satisfyingly without BCD support for several
>>decades, and they decided to add BCD support eventually. The reasons
>>for that would be interesting.
>
> Probably the same reason they add instructions to zSeries, they found
> hot spots and moved it into microcode. Remember that IBM i uses a
> virtual machine language that is compiled into native code the first
> time a program is run, so the new instructions get used automagically
> with the virtual code is recompiled on the new machine.

Hm, looking into the POWER9 handbook, it actually has instructions like
"Decimal Convert From Signed Qword and Record", with a throughput of
1/26 and a minimum latency of 37 cycles. It's not cracked, so
technically it is probably not microcode :-)

What I don't understand is that "Decimal Convert to Signed Qword
and Record" has a 23-cycle latency. This is something that could
be done much faster in an adder tree, but I guess they simply did
not want to spend the silicon area for it.

Scott Lurndal

unread,
Dec 9, 2022, 4:24:45 PM12/9/22
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> Thomas Koenig <tko...@netcologne.de> writes:
>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>>> There is a market for CPUs with decimal instructions: The market of
>>>>>> backwards-compatible mainframes.
>>>>>
>>>>>Not only that. POWER is not a mainframe architecture, and it has
>>>>>supports DFP and BCD, plus conversion between the two - probably
>>>>>to support System i or whatever it is called this week.
>>>>
>>>> Power is the architecture of System i or whatever it is called this
>>>> week. How is it not in the market of backwards-compatible mainframes?
>>>
>>>It is a successor in the https://en.wikipedia.org/wiki/IBM_System/38 ,
>>>which was a midrange computer (which others would call a mini).
>>
>> System/38 has a 48-bit address space, while the contemporary S/370s
>> have 31-bit address space.
>>
>> And there was also https://en.wikipedia.org/wiki/IBM_4300 which are
>> described as "mid-range systems", but they were compatible with S/370.
>>
>> Anyway, the relevant part is not whether marketing calls it a
>> mainframe or not, but whether it has a lot of commercial legacy
>> software.
>
>You could do that using GNU Cobol :-)

Or even Nevada COBOL (I still have a copy) for the Commodore 64.

https://commodore.software/downloads/download/211-application-manuals/13865-nevada-cobol-for-the-commodore-64

MitchAlsup

unread,
Dec 9, 2022, 4:54:39 PM12/9/22
to
Decimal convert To uses ×10, which is converted into a 3-input add as
(x<<3)+(x<<1)+digit.
Whereas Decimal convert From uses DIV10 which is mimicked by a
big multiply using an ugly constant to get the %10 pickoff.
<
So, the former uses a 3-input adder and some shifting, while the latter
needs the big multiplier tree. It is pretty obvious that one should be
about 2× as long as the other.
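
To make the two tricks concrete, a small C sketch (an illustration of the idea
only, not POWER's actual implementation; the function names are made up):

#include <stdint.h>

/* BCD -> binary: one acc*10 + digit step per digit, where acc*10 + digit
   is exactly the 3-input add (acc<<3) + (acc<<1) + digit. */
static uint64_t bcd_to_bin(uint64_t bcd, int ndigits)   /* ndigits <= 16 */
{
    uint64_t acc = 0;
    for (int i = ndigits - 1; i >= 0; i--) {
        uint64_t digit = (bcd >> (4 * i)) & 0xF;
        acc = (acc << 3) + (acc << 1) + digit;
    }
    return acc;
}

/* binary -> BCD: repeated division by 10, with the divide mimicked by a
   multiply with a reciprocal constant; the "%10 pickoff" is x - q*10. */
static uint64_t bin_to_bcd(uint32_t x)
{
    uint64_t bcd = 0;
    int i = 0;
    do {
        uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35); /* x/10, exact for 32-bit x */
        uint32_t r = x - q * 10;
        bcd |= (uint64_t)r << (4 * i++);
        x = q;
    } while (x != 0);
    return bcd;
}

The second direction is the one that needs the wide multiply, which is why it
comes out roughly twice as long.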

Tim Rentsch

unread,
Dec 10, 2022, 12:51:33 PM12/10/22
to
Terje Mathisen <terje.m...@tmsw.no> writes:

> Scott Lurndal wrote:
>
>> "robf...@gmail.com" <robf...@gmail.com> writes:
>>
>>>> I call bullshit on that number!
>>>
>>> IDK. Seems like it might be reasonable. 1M man years is only 100k
>>> programmers working for 10 years. That was a world-wide number.
>>> Block-copy and paste approach generates a lot of LOC fast.
>>
>> COBOL has been written for over 60 years now, and some of that
>> older code is still in production (with updates over the decades).
>>
>> A fair fraction is generated by 4GL generators.
>
> That really does not count, or it should not?
>
> That's like counting all the binaries ever generated by compilers,
> instead of the unique source codes used to generate them.

The two cases are not exactly analogous. People do write code in
COBOL, and a significant amount of it. For the most part people
do not work at the level of generated binaries, or even machine
or assembly language. Considering that difference I would say
the analogy doesn't hold up.

I'm not saying that generated lines of COBOL should necessarily
be counted as "lines of COBOL", only that the comparison to
generated binaries doesn't really shed light on the question.

Tim Rentsch

unread,
Dec 10, 2022, 1:01:02 PM12/10/22
to
John Levine <jo...@taugh.com> writes:

> According to Thomas Koenig <tko...@netcologne.de>:
>
>>> There is a market for CPUs with decimal instructions: The market of
>>> backwards-compatible mainframes.
>>
>> Not only that. POWER is not a mainframe architecture, and it has
>> supports DFP and BCD, plus conversion between the two - probably
>> to support System i or whatever it is called this week.
>
> I would argue that IBM i (which I think is its current name) is a
> small mainframe which means POWER indeed is a mainframe architecture.
> We can have endless amusing arguments about how you define a mainframe
> (I claim that a PDP-10 was a very large mini. not a mainframe) but if
> you agree a mainframe has a lot of I/O capacity and reliability
> features, IBM i on POWER qualifies.

Back in the day when PDP-10s were being used I think it is fair to
label them as mainframes (with the understanding that batch mode is
very different from timesharing, but that is a separate issue).
After all, if small IBM 360s count as mainframes, then surely the
PDP-10 should also count as a mainframe. My recollection is that
even early model PDP-10s had performance (and also other metrics)
comparable to a 360 model 50. (Disclaimer: the foregoing comments
are based on memories that are roughly 50 years old.)

George Neuner

unread,
Dec 10, 2022, 3:15:09 PM12/10/22
to
On Thu, 08 Dec 2022 08:11:38 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>"robf...@gmail.com" <robf...@gmail.com> writes:
>[800G lines of COBOL]
>>>I call bullshit on that number!
>>
>>IDK. Seems like it might be reasonable. 1M man years
>
>assumes 800k lines per man-year, about 3k lines per day. I remember a
>number by Siemens of 1500 debugged and documented lines per man-year.
>With that number, the 800G lines would require 533M man-years.

1500 lines per year is only 6 lines per day (assuming 50 weeks of 5
days). A programmer who actually wrote that little code wouldn't have
a job for very long.

YMMV,
George

John Levine

unread,
Dec 10, 2022, 4:10:47 PM12/10/22
to
According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
>Back in the day when PDP-10s were being used I think it is fair to
>label them as mainframes (with the understanding that batch mode is
>very different from timesharing, but that is a separate issue).
>After all, if small IBM 360s count as mainframes, then surely the
>PDP-10 should also count as a mainframe.

Not at all. As I've said a few times, it's the architecture, not the
CPU speed. The CPU of a PDP-8 was much faster than a 360/30, but
nobody would call a PDP-8 a mainframe, even when it was running TSS/8
and supporting a dozen interactive users. (A 360/30 did a 16 bit add
in 27us, a PDP-8 did a 12 bit add in 3us.)

The I/O architecture of the KA-10 was basically a wider version of the
I/O of the PDP-9 or PDP-8. It normally took an interrupt for every word
of data, including every tty character, and had only a rudimentary way
to move a block of data between memory and an I/O device, known as
three cycle data break on the -8 and -9 and BLKI or BLKO in an interrupt
location on the -10. The disk controllers were pretty rudimentary and
if there were I/O errors, the usual response was to crash.

The KL version of the -10 was somewhat more mainframe-ish, with a PDP-11/40
doing some of the front end work and Massbus for disk and tape I/O, although
of course they hooked the Massbus to PDP-11 and -15 minis, too.

Michael S

unread,
Dec 10, 2022, 5:06:04 PM12/10/22
to
On Friday, December 9, 2022 at 10:49:37 PM UTC+2, Thomas Koenig wrote:
> John Levine <jo...@taugh.com> schrieb:
> > According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
> >>Power 3.0 was released in 2015, Power 3.0B in 2017. It's interesting
> >>that Power(PC)-based CPUs have been used for AS/400 and its successors
> >>since the 1990s (and actually identical hardware since 2008), and
> >>apparently worked satisfyingly without BCD support for several
> >>decades, and they decided to add BCD support eventually. The reasons
> >>for that would be interesting.
> >
> > Probably the same reason they add instructions to zSeries, they found
> > hot spots and moved it into microcode. Remember that IBM i uses a
> > virtual machine language that is compiled into native code the first
> > time a program is run, so the new instructions get used automagically
> > with the virtual code is recompiled on the new machine.
> Hm, looking into the POWER9 handbook, it actually has instructions like
> "Decimal Convert From Signed Qword and Record", with a throughput of
> 1/26 and a minimum latency of 37 cycles. It's not cracked, so
> technically it is probably not microcode :-)

That's not bad.
For comparison, my software conversion from 10 years ago that uses 128-bit SIMD:
Ivy Bridge, i5-3450 (~3450 MHz) - 9.03/22.22 ns = 31/77 cycles
Haswell, E3-1271 v3 (4000 Mhz) - 7.05/18.79 ns = 28/75 cycles
Skylake, E-2176G (4250 MHz) - 6.14/18.45 ns = 26/78 cycles
The first number is reciprocal throughput, the 2nd is latency.
The latency measurement is likely a bit pessimistic, because my artificial dependency
chain contains memory store followed by load.

The numbers above are not directly comparable to POWER9, because my routine does
not convert as many digits as POWER's bcdcfsq does.
bcdcfsq - input in range [-(10^31-1):+(10^31-1)], output 31 BCD digits + sign
mine - input in range [0:2^64-1], output ~19.3 BCD digits

If I tried to convert longer numbers it would be significantly slower, because
for the first split, I would have to use either long division (slow on all processors up to
and including Skylake, but not so slow on Icelake or later or on AMD Zen3/4) or
something like 4 64x64=>128bit multiplications. Neither operation can be done
on the SIMD side of the machine.

Michael S

unread,
Dec 11, 2022, 4:12:02 AM12/11/22
to
I tested my old code on a more modern uArch:
Zen3, EPYC 7543P (3600 MHz) - 4.99/16.68 ns = 18/60 cycles

That's much closer to POWER9, at least on the throughput side.
Digit-for-digit Zen3 SW is only 13% behind POWER9 HW.

Anton Ertl

unread,
Dec 11, 2022, 5:16:42 AM12/11/22
to
Michael S <already...@yahoo.com> writes:
>The numbers above are not directly comparable to POWER9, because my routine does
>not convert so many digits as POWER's bcdcfsq does.
>bcdcfsq - input in range [-10^31-1:+10^31-1], output 31 BCD digits + sign
>mine - input in range [0:2^64-1], output ~19.3 BCD digits
>
>If I would try to convert longer numbers it would be significantly slower, because
>for the first split, I will have to use either long division (slow on all processors up to
>and including Skylake, but not so slow on Icelake or later or on AMD Zen3/4) or
>something like 4 64x64=>128bit multiplications. Both operations can not be done
>on the SIMD side of machine.

64x64->128 multiplications can be done with a throughput of 1/cycle
and a latency of 3 cycles on a Skylake; for splitting the original
number into 18:13 or 16:15 digits, you also need the remainder (two
more multiplies and a subtract). The rest can then be done in
parallel, using 256-bit SIMD. I guess you need <20 cycles of
additional latency overall. Maybe you can save something by going for
a 16:15 split.
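
A hedged sketch of the split idea on a smaller, 19-digit case (the function
name and the 10:9 split are made up for illustration; the 31-digit case needs
wider arithmetic):

#include <stdint.h>

/* Split x (up to 19 decimal digits) into hi = x / 10^9 and lo = x % 10^9.
   Compilers lower the constant division to a 64x64->128 multiply-high plus a
   shift; the remainder then costs one more multiply and a subtract, and the
   two halves can be converted to decimal independently. */
static void split_10_9(uint64_t x, uint64_t *hi, uint64_t *lo)
{
    const uint64_t p9 = 1000000000u;   /* 10^9 */
    uint64_t q = x / p9;
    *hi = q;
    *lo = x - q * p9;
}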

Michael S

unread,
Dec 11, 2022, 5:35:40 AM12/11/22
to

Michael S

unread,
Dec 11, 2022, 5:51:28 AM12/11/22
to
On Sunday, December 11, 2022 at 12:16:42 PM UTC+2, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >The numbers above are not directly comparable to POWER9, because my routine does
> >not convert so many digits as POWER's bcdcfsq does.
> >bcdcfsq - input in range [-10^31-1:+10^31-1], output 31 BCD digits + sign
> >mine - input in range [0:2^64-1], output ~19.3 BCD digits
> >
> >If I would try to convert longer numbers it would be significantly slower, because
> >for the first split, I will have to use either long division (slow on all processors up to
> >and including Skylake, but not so slow on Icelake or later or on AMD Zen3/4) or
> >something like 4 64x64=>128bit multiplications. Both operations can not be done
> >on the SIMD side of machine.
> 64x64->128 multiplications can be done with a throughput of 1/cycle
> and a latency of 3 cycles on a Skylake; for splitting the original
> number into 18:13 or 16:15 digits, you also need the remainder (two
> more multiplies and a subtract). The rest can then be done in
> parallel, using 256-bit SIMD. I guess you need <20 cycles of
> additional latency overall. Maybe you can save something by going for
> a 16:15 split.
> - anton

At first glance, for an exact equivalent of bcdcfsq, 256-bit SIMD buys nothing.
It could be different at a deeper look, but I am not going to take a deeper look.

What could be interesting is a translation to Neon and testing on high-end
ARM64, like Apple M* or ARM Cortex-X2.
But that's a job for somebody else.

Anton Ertl

unread,
Dec 11, 2022, 6:21:41 AM12/11/22
to
Michael S <already...@yahoo.com> writes:
>> 64x64->128 multiplications can be done with a throughput of 1/cycle
>> and a latency of 3 cycles on a Skylake; for splitting the original
>> number into 18:13 or 16:15 digits, you also need the remainder (two
>> more multiplies and a subtract). The rest can then be done in
>> parallel, using 256-bit SIMD. I guess you need <20 cycles of
>> additional latency overall. Maybe you can save something by going for
>> a 16:15 split.
>> - anton
>
>At the first glance, for exact equivalent of bcdcfsq,256-bit SIMD buys nothing.

My thinking was that, after splitting, you would put the two parts in
two 128-bit halves of a 256-bit SIMD register, and then do a 256-bit
SIMD equivalent of your 128-bit SIMD implementation. Where is my
thinking wrong?

Michael S

unread,
Dec 11, 2022, 6:59:36 AM12/11/22
to
On Sunday, December 11, 2022 at 1:21:41 PM UTC+2, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >> 64x64->128 multiplications can be done with a throughput of 1/cycle
> >> and a latency of 3 cycles on a Skylake; for splitting the original
> >> number into 18:13 or 16:15 digits, you also need the remainder (two
> >> more multiplies and a subtract). The rest can then be done in
> >> parallel, using 256-bit SIMD. I guess you need <20 cycles of
> >> additional latency overall. Maybe you can save something by going for
> >> a 16:15 split.
> >> - anton
> >
> >At the first glance, for exact equivalent of bcdcfsq,256-bit SIMD buys nothing.
> My thinking was that, after splitting, you would put the two parts in
> two 128-bit halves of a 256-bit SIMD register, and then do a 256-bit
> SIMD equivalent of your 128-bit SIMD implementation. Where is my
> thinking wrong?

For 19+ BCD digits, there is only a small section in the middle of the conversion
process, marked as "// scale by 1E4", that does the same operations
(multiplication and shift) on a pair of 128-bit registers and can potentially
profit from 256-bit SIMD. In practice, on AVX256, shuffling that crosses
128-bit lanes is more expensive than shuffling within 128-bit lanes, so there
is a good chance that the expense of shuffling will eat up all the profit.

For 31 digits, it looks much the same at first glance. 31-digit processing
appears very similar to 19+ digits, except for the expensive first split and
the need for a pair of 2nd-level splits (by 1E8) instead of one split by 1E10.
All the differences appear to belong in the GPR domain; the SIMD part appears
the same. At first glance.

BGB

unread,
Dec 11, 2022, 4:52:42 PM12/11/22
to
I think my "peak output" is along the lines of several kLOC/day, but I
can't sustain this (this gets pretty tiring pretty quickly). Usually
this happens when implementing some new piece of code (and, seemingly
the rate at which I can write code is well in excess of the rate at
which I can write natural-language text, but natural language tends to
be more mentally demanding in my case).

Often, some days might have more significant amounts of new code
written, other days less.


My past measurements of my "average rate" codebase expansion seems to be
around 200 kLOC/year, so an average of around 548 lines/day.


This seems to be inversely related to the amount of time spent
"debugging stuff", where bug-hunting results in significantly less code
than writing new code.

Not sure about more recent rates, might have dropped due to spending a
fair bit of time debugging stuff in recent months.


Still have yet to make much "actually relevant" though.


Cheching "sloccount" on my BJX2 project:
SLOC: ~ 2.85 million (~ 1.7 million for C)
Man-years estimate: 845
It thinks this project took ~ 7 years with ~ 123 devs, ...


Granted, this does include some amount of code I did not write (such as
the stuff I ported over as test programs), and sloccount does not count
Verilog (but does count the generated output from Verilator), ...

Getting a more accurate stat would be a little more effort.


If this were *all* written clean, this would have been around 430k
lines/year. But, a more reasonable estimate would need to exclude any
"imported" code (such as Doom, Quake, Hexen, etc...).

At present (stuff to be subtracted):
Doom: ~ 67k lines (expanded slightly vs original)
(37k lines for a more baseline version of Linuxdoom)
Quake: ~ 200k lines
Hexen: ~ 50k lines
Heretic: ~ 37k lines
ROTT: ~ 90k lines
Quake3: ~ 282k lines
Wolf3D: ~ 26k lines (incomplete, unreleased; license reasons, *1)

*1: For whatever reason, ID never did a proper GPL release of Wolf3D.

For something I did write, eg:
BtMini2: 39k lines
So, a simplistic Minecraft like engine, with Doom like code size.


Looks like my code here is ~ 1M lines for C, so ~ 60% of the total (if
subtracting out the imported stuff).

For this project, this would be ~ 143k lines/year (~ 391 lines/day, for C).


More narrowly:
BGBCC: ~ 225k lines;
BJX2 VM: ~ 55k lines;
Misc/scratchpad tools: ~ 112k lines;
(mostly small one-off tools to test ideas);
TestKern: ~ 153k lines;
...

If I count the current version of my CPU core, ~ 155k lines of Verilog.

So, only ~ 60 lines/day average for Verilog.


My own line counter gives slightly higher line counts, but it can be
noted that my own tool also counts blank lines and comments and similar,
which "sloccount" seems to ignore (for BGBCC, this is ~ 225k lines vs ~
270k lines).

...


Not sure what is typical or reasonable for single-person projects.

Likely output is dropping in recent years.


> YMMV,
> George

Anton Ertl

unread,
Dec 11, 2022, 5:58:01 PM12/11/22
to
George Neuner <gneu...@comcast.net> writes:
>1500 lines per year is only 6 lines per day (assuming 50 weeks of 5
>days). A programmer who actually wrote that little code wouldn't have
>a job for very long.

Sure, if you work for Elon Musk and are evaluated on the lines you
write rather than on the functionality your code provides, you can
produce more.

OTOH, if you work for someone who considers lines of code to be a
liability, they will be glad if you deliver the functionality in fewer
lines.

I read somewhere (IIRC in a paper by some Smalltalk programmers): "On
a good day, the number of lines of code produced is often negative"

More to the point of the Siemens number, it includes all the other
jobs you have to do in addition to coding: requirements enginieering,
design, testing, debugging, documentation. In a similar vein,
<https://www.cs.usfca.edu/~parrt/course/601/lectures/man.month.html>
paraphrases The Mythical Man-Month:

|All programmers say "Oh, I can easily beat the 10 lines / day cited by
|industrial programmers." They are talking about just coding something,
|not building a product.

and then explains what the difference is.

MitchAlsup

unread,
Dec 11, 2022, 6:16:37 PM12/11/22
to
On Sunday, December 11, 2022 at 4:58:01 PM UTC-6, Anton Ertl wrote:
> George Neuner <gneu...@comcast.net> writes:
> >1500 lines per year is only 6 lines per day (assuming 50 weeks of 5
> >days). A programmer who actually wrote that little code wouldn't have
> >a job for very long.
> Sure, if you work for Elon Musk and are evaluated on the lines you
> write rather than on the functionality your code provides, you can
> produce more.
>
> OTOH, if you work for someone who considers lines of code to be a
> liability, they will be glad if you deliver the functionality in fewer
> lines.
<
Does snipping 18 lines of code, and pasting them somewhere else
and changing a type count as lines of code ??
>
> I read somewhere (IIRC in a paper by some Smalltalk programmers): "On
> a good day, the number of lines of code produced is often negative"
<
Does sitting on your hands, knowing a specification is going to change,
before writing code count as good or bad use of coding time ??
>
> More to the point of the Siemens number, it includes all the other
> jobs you have to do in addition to coding: requirements enginieering,
> design, testing, debugging, documentation. In a similar vein,
> <https://www.cs.usfca.edu/~parrt/course/601/lectures/man.month.html>
> paraphrases The Mythical Man-Month:
>
> |All programmers say "Oh, I can easily beat the 10 lines / day cited by
> |industrial programmers." They are talking about just coding something,
> |not building a product.
<
At one company I worked for, they only counted debugged & integrated lines
of code !! And their metric for debug and integration was that it took just
as long as the coding itself.
>
> and then expains what the difference is.
<
Coding while specifications are changing is often "appearance work" that
delivers nothing to the bottom line.

Terje Mathisen

unread,
Dec 12, 2022, 8:36:33 AM12/12/22
to
Except that if they were smart enough (or it actually mattered?), they
could use my algorithm, which uses modulo 1e5, 1e9 or some other power of
ten for the reciprocal multiplication, extracting that many digits at a time.

After just one such scaling mul you can pick up all the digits in the
chunk with just shift/mask/mul-by-5 (i.e. LEA) operations. I.e. hardware
should have absolutely no issue extracting 1-4 digits every cycle by
working on multiple chunks at the same time.
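
A tiny C sketch of that digit-extraction idea, scaled down to a 4-digit chunk
(a simplified illustration, not Terje's actual routine; note that each *10 can
be done as *5, i.e. one LEA, plus a shift):

#include <stdint.h>

/* Convert x in [0, 9999] to four ASCII digits, most significant first. */
static void chunk4_to_ascii(uint32_t x, char out[4])
{
    /* ceil(2^32 / 1000); the rounding error stays far below one digit slot */
    uint64_t f = (uint64_t)x * 4294968u;
    out[0] = (char)('0' + (f >> 32));          /* thousands digit */
    for (int i = 1; i < 4; i++) {
        f = (f & 0xFFFFFFFFu) * 10;            /* keep the fraction, scale by 10 */
        out[i] = (char)('0' + (f >> 32));
    }
}

After the one scaling multiply, each further digit is just a mask, a cheap
multiply and a pick of the high half, so several chunks can indeed be worked
on in parallel.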

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
Dec 12, 2022, 9:09:00 AM12/12/22
to
BGB wrote:
> On 12/10/2022 2:15 PM, George Neuner wrote:
>> On Thu, 08 Dec 2022 08:11:38 GMT, an...@mips.complang.tuwien.ac.at
>> (Anton Ertl) wrote:
>>
>>> "robf...@gmail.com" <robf...@gmail.com> writes:
>>> [800G lines of COBOL]
>>>>> I call bullshit on that number!
>>>>
>>>> IDK. Seems like it might be reasonable. 1M man years
>>>
>>> assumes 800k lines per man-year, about 3k lines per day.  I remember a
>>> number by Siemens of 1500 debugged and documented lines per man-year.
>>> With that number, the 800G lines would require 533M man-years.
>>
>> 1500 lines per year is only 6 lines per day (assuming 50 weeks of 5
>> days).  A programmer who actually wrote that little code wouldn't have
>> a job for very long.
>>
>
> I think my "peak output" is along the lines of several kLOC/day, but I
> can't sustain this (this gets pretty tiring pretty quickly). Usually
> this happens when implementing some new piece of code (and, seemingly
> the rate at which I can write code is well in excess of the rate at
> which I can write natural-language text, but natural language tends to
> be more mentally demanding in my case).
>
> Often, some days might have more significant amounts of new code
> written, other days less.
>
>
> My past measurements of my "average rate" codebase expansion seems to be
> around 200 kLOC/year, so an average of around 548 lines/day.
>

I was probably close to that for a decade or two, but much less now.

I normally spend 15 min to (worst case so far) 2 hours to solve each
AdventOfCode day, typically needing 100-300 code lines to do so.

My most productive day ever happened 30+ years ago, in the MSDOS days:

I started in the morning with an empty editor screen and wrote 1700
lines of relatively tricky asm (a TSR driver which hooked timer,
network, serial and parallel port interrupts, relocating itself to use
as little space as possible at runtime, depending upon which features
were needed). On the first run through masm/tasm I found 3 syntax
errors, after fixing those it ran and was pretty much perfect, with no
serious updates needed over the next 10 years.

At the same time I also wrote a server end driver (in Object-oriented
Turbo Pascal) which would feed print data to my TSR.

Total time was 5 hours from just a mental idea to working product.

>
> This seems to be inversely related to the amount of time spent
> "debugging stuff", where bug-hunting results in significantly less code
> than writing new code.
>
> Not sure about more recent rates, might have dropped due to spending a
> fair bit of time debugging stuff in recent months.
>
>
> Not sure what is typical or reasonable for single-person projects.
>
> Likely output is dropping in recent years.

See above: Ditto. :-(

Michael S

unread,
Dec 12, 2022, 10:27:30 AM12/12/22
to
It sounds like you are talking about a method of implementation of
"Decimal Convert From Signed Qword and Record" == bcdcfsq, a.k.a. binary-to-BCD.
But bcdcfsq on POWER9 is already not bad for an implementation that does not
pretend to be in the style of "performance above all".
A latency of 37 clocks and a reciprocal throughput of 26 both sound reasonable for
an input range of [-(1e31-1):+(1e31-1)], i.e. almost 103 bits + sign.
It is especially reasonable considering that POWER9 is a throughput-oriented
design. The latency of complex instructions on POWER9 tends to be longer than on
modern high-end Intel/AMD/ARM/Apple. For example, 64/64 division has a latency of
26 clocks. Even FP (double) division is 27 to 33 clocks.

Thomas Koenig was wondering about the slowness of the reverse operation: "Decimal
Convert to Signed Qword and Record" == bcdctsq. For this operation, a latency of
23 clocks indeed looks excessive.

Anton Ertl

unread,
Dec 12, 2022, 1:07:55 PM12/12/22
to
MitchAlsup <Mitch...@aol.com> writes:
>On Sunday, December 11, 2022 at 4:58:01 PM UTC-6, Anton Ertl wrote:
>> George Neuner <gneu...@comcast.net> writes:
>> >1500 lines per year is only 6 lines per day (assuming 50 weeks of 5
>> >days). A programmer who actually wrote that little code wouldn't have
>> >a job for very long.
>> Sure, if you work for Elon Musk and are evaluated on the lines you
>> write rather than on the functionality your code provides, you can
>> produce more.
>>
>> OTOH, if you work for someone who considers lines of code to be a
>> liability, they will be glad if you deliver the functionality in fewer
>> lines.
><
>Does snipping 18 lines of code, and pasting them somewhere else
>and changing a type count as lines of code ??

If you delete 18 lines and insert 18 lines, the program has not grown.
If you mean copying 18 lines without deleting them, the usual metrics
will say that 18 lines were added.

But note that "lines of code" is not necessarily a good metric.

>> I read somewhere (IIRC in a paper by some Smalltalk programmers): "On
>> a good day, the number of lines of code produced is often negative"
><
>Does sitting on your hands, knowing a specification is going to change,
>before writing code count as good or bad use of coding time ??

I would consider it a bad use. Better use the time to code something
else, or make yourself familiar with the topic by coding something
that you consider likely to stick (but don't fall into the trap of becoming
attached to the code when the specification turns out to make it
unnecessary).

>Coding while specifications are changing is often "appearance work" that
>delivers nothing to the bottom line.

It can be useful to hone your coding skills.

George Neuner

unread,
Dec 12, 2022, 5:31:34 PM12/12/22
to
On Sun, 11 Dec 2022 15:16:34 -0800 (PST), MitchAlsup
<Mitch...@aol.com> wrote:


>Coding while specifications are changing is often "appearance work" that
>delivers nothing to the bottom line.

Depends on the domain.

I used to do industrial QA/QC systems, and IME customers often had to
see an implementation of proposed functionality before they could
decide whether that was indeed what they wanted. Sometimes after
seeing it work they just wanted some tweaks, but other times they
realized it just wasn't going to work for them.

Any change to HRT inspection operations was a BIG DEAL, often taking
many days to implement even a "demonstrator" of a new or changed
inspection that didn't necessarily have to fit into already
established timing constraints. Then sometimes much more work to make
it fit existing timing if the customer preferred not to slow their
production lines or buy faster CPUs.



So one night, about 6:45pm, I'm preparing to leave work when my boss
walks in and tells me that - completely unplanned - some executives
from one of our existing clients were coming for the demonstration of
a new version of one of our QC vision systems the next day. Somehow
they were expecting to see a new feature sorely sought by their
engineers: a new more flexible AOI designation method that would move
from a small number of possible AOI regions to (potentially) quite a
large number of regions - each with their own quality settings.

Of course, this "enhancement" had only ever been talked about - it was
slated for the NEXT version of this system and zero work on it had
been done. It would have a huge impact on the application: requiring
changes to the UI, inspection preparation [though not the inspection
itself], result evaluation, record keeping, etc.

But we were still quite a small company at that point and could not
afford to lose potential sales, so come hell or high water, we had to
show off [some semblance of] that feature the next day.

By ~3am I had reworked the internal data structures, changed file
formats to save/restore the settings, created new UI screens for
defining AOI regions and adjusting their quality settings, changed the
pre-inspection setup and the post-inspection defect evaluation, added
new statistics to correlate defects with AOI regions, created new UI
screens for displaying statistics, and changed result log entry
formats to record them.

About ~3 KLOC new and/or modified. I tested it as best I could: our
vision lab could not [yet] replicate parts in motion, but we did have
PLCs for simulating line signals, and I was able to verify that the
expanded result evaluation would not break timing. So I built
installation media, left it on the boss's desk, and around 5am I went
home to shower and change [to go right back to work again].

Knock whatever, that midnight implementation worked perfectly - and
there were a number of sales of the system that used it before higher
camera resolutions and the need for even more AOIs required a [better]
faster re-implementation.



So much for "cowboy" development.

I also have stories of working on FDA validated QA/QC systems for
pharmaceutical production, and of clinical imaging systems to support
surgical interventions - systems where every day spent on development
meant additional days spent doing paperwork. And still averaging quite
a bit more than 6 LOC per day ;-)

George

robf...@gmail.com

unread,
Dec 12, 2022, 7:13:26 PM12/12/22
to
I would say I am well under 100 LOC per day and likely over 10. Although
I have written as much as 4,000 LOC in a day, that’s rare.
At home I go with kind of average quality code because I am interested
in experimenting with things and getting things done fast. So, there are a
lot of kludges, dead code, and commented out blocks that would not be
in a serious high quality production code. There are also bugs which might
not be acceptable in production. For the longest time the display screen on
my test system was off by one character and there was a line of garbage
text down the left-hand side. I left it that way for ages because I did not
want to nit-pick display code. I calculated I generated about 9 LOC per
day on my compiler project. But that’s just one project out of dozens. I am
also learning new things and there is a constant learning curve involved.
Writing code to do with a newly learned topic is more time consuming
than spitting out code about something one already knows.

I would say programmers are likely slightly more productive today in terms
of LOC than they were many years ago before things like color-syntax
highlighting, language grammar assist, and auto-generated code were
available.

BGB

unread,
Dec 13, 2022, 12:53:48 AM12/13/22
to
I suspect I am gradually declining in terms of coding speed, but do seem
to be getting better at debugging stuff at least.


When I first wrote BGBCC (back in my 20s), I had generally failed at
debugging it. When I dusted it off again (roughly a decade later,
originally for my BJX1 project but now used for BJX2), was able to get
everything working a bit better.

OTOH, now some years have gone by, still not entirely stable.

However, something like my BJX2 ISA project is probably not something I
could have implemented in my 20s (at the time, I was struggling enough
with trying to debug my 3D engine).


And, then there were some of my early attempts (mostly during
high-school) of trying to hack features onto the Quake Engine, but then
having it quickly decay into a state of being basically unusable (my
coding skills at the time being bad enough that something like Quake
would turn into mush; and I had found I was unable to debug it enough to
fix any of the things I had broken).

To my younger self, the Doom source was a bit of an arcane mystery
artifact, code I could see, but I had little idea how it worked.


Following high-school, I first started trying to write a 3D
modeling/mapping program, which then mutated into a 3D engine (initially
trying to imitate Doom 3, but reusing some file-formats from Half-Life
and similar). Then switched to trying to imitate Minecraft (which showed
up roughly during the time when I was taking college classes).

By the time my first 3D engine project ended, I had roughly twice as
much code as Doom 3, but comparatively it was far less polished (nor
particularly stable or high performance).

Like, say:
Would tend to eat multiple GBs of RAM;
Was often running at not much over 10 fps;
Trying to venture out all that far from the origin was likely to cause
stuff to break and/or corrupt the world (so I had a "safety wall" at the
1km mark from the origin);
...

I got a little sidetracked at this point developing video codecs for
streaming video into texture-maps (and was also using AVI videos for
animated textures).

My second 3D engine attempt fared a little better here, but I think by
this point, I was kinda burnt out on working on a 3D engine (IIRC, my
concept at the time was to try to do something like a mix of Minecraft
and Undertale, maybe with some server script-based minigame stuff sorta
like Roblox). Partly, this engine was written during the time when
Undertale was still fairly popular.

This project coexisted for a while with my BJX1 project, but when I
moved over to my BJX2 project, development effort on that 3D engine had
effectively died off.



> I normally spend 15 min to (worst case so far) 2 hours to solve each
> AdventOfCode day, typically needing 100-300 code lines to do so.
>
> My most productive day ever happened 30+ years ago, in the MSDOS days:
>
> I started in the morning with an empty editor screen and wrote 1700
> lines of relatively tricky asm (a TSR driver which hooked timer,
> network, serial and parallel port interrupts, relocating itself to use
> as little space as possible at runtime, depending upon which features
> were needed). On the first run through masm/tasm I found 3 syntax
> errors, after fixing those it ran and was pretty much perfect, with no
> serious updates needed over the next 10 years.
>

My high point, I think, was managing to write a combination x86 assembler
and COFF linker (with PE/COFF output) in a single day. This was part of
my first version of BGBCC, but this was also used some as part of my
BGBScript VM (it used a JIT which IIRC spit out textual blobs of ASM and
then assembled them, dynamically linking the resulting COFF objects into
a running program image).

Though, I remember being fairly tired out after writing this thing.


Had a few times implemented experimental video codecs and similar in
short bursts, but with a lot of them being "kind of a fail" (with a few
that were more successful seeing a little more use).


> At the same time I also wrote a server end driver (in Object-oriented
> Turbo Pascal) which would feed print data to my TSR.
>
> Total time was 5 hours from just a mental idea to working product.
>

OK.

I remember DOS, but by the time I started using computers, DOS was
already being displaced by the rise of Windows. One might still need to
do the "Restart in MS-DOS Mode" thing a lot though, because IIRC a lot
of DOS era games on PCs at the time would effectively get Win95 into a
broken state (handing off control of the display hardware between
Windows and DOS games seemed a bit unstable, though it worked a lot
better by Win98).

But, at this time, Windows PCs typically still had QBasic.
Then ended up jumping from QBasic to Borland C++, then from there to
Cygwin and MinGW, then over to Visual Studio (once it became effectively
freeware).


>>
>> This seems to be inversely related to the amount of time spent
>> "debugging stuff", where bug-hunting results in significantly less
>> code than writing new code.
>>
>> Not sure about more recent rates, might have dropped due to spending a
>> fair bit of time debugging stuff in recent months.
>>
>>
>> Not sure what is typical or reasonable for single-person projects.
>>
>> Likely output is dropping in recent years.
>
> See above: Ditto. :-(
>

Yeah.


Seems like a lot of people are getting a lot more popularity and success
from often much less advanced projects.


Like, I do the stuff I do and pretty much no one cares.

Then, other people do stuff, regardless of whether or not it is
"actually good", and they get lots of praise...

But, alas...


> Terje
>

Terje Mathisen

unread,
Dec 13, 2022, 6:21:59 AM12/13/22
to
Yeah, I realized while writing that: the bin-to-dec was pretty good, and
the opposite direction can use a divide & conquer approach to multiply
several digits by powers of ten simultaneously.

If we have 31 digits, then we splash them across an AVX register (32 x
byte), widen to two regs (16 x short), multiply pairs of digits by
(1,10), then do horizontal adds to merge this into 32-bit slots.
Next we multiply pairs of groups by (1,100,1E4,1E6), combine into 64-bit
slots, then mul by (1,1e8) before we do the final aggregation in integer
regs.

This would require 4 serially dependent multiplications, so 23+ clocks
seems hard to beat in software unless you have a better approach?
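
For concreteness, roughly what that looks like in intrinsics -- a hedged
SSE sketch of the same divide & conquer idea, shown for only 16 digits
(one digit value per byte, most significant digit first) and using the
PMADDUBSW/PMADDWD pair-merges rather than an explicit mullo/mulhi
pairing; the 31-digit AVX2 version adds one more combining level:

#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1 (PACKUSDW, PEXTRD) */

/* 16 unpacked decimal digits (values 0..9, most significant first) -> u64. */
uint64_t digits16_to_u64(const uint8_t d[16])
{
    __m128i x = _mm_loadu_si128((const __m128i *)d);

    /* pairs of digits -> 0..99 in 16-bit lanes: d0*10 + d1, ... */
    const __m128i mul_10 = _mm_setr_epi8(10, 1, 10, 1, 10, 1, 10, 1,
                                         10, 1, 10, 1, 10, 1, 10, 1);
    __m128i pairs = _mm_maddubs_epi16(x, mul_10);

    /* pairs of pairs -> 0..9999 in 32-bit lanes: p0*100 + p1, ... */
    const __m128i mul_100 = _mm_setr_epi16(100, 1, 100, 1, 100, 1, 100, 1);
    __m128i quads = _mm_madd_epi16(pairs, mul_100);

    /* 0..9999 fits in 16 bits, so pack down and combine once more. */
    __m128i q16 = _mm_packus_epi32(quads, quads);
    const __m128i mul_1e4 = _mm_setr_epi16(10000, 1, 10000, 1,
                                           10000, 1, 10000, 1);
    __m128i octs = _mm_madd_epi16(q16, mul_1e4);  /* two 0..99999999 values */

    uint64_t hi = (uint32_t)_mm_cvtsi128_si32(octs);    /* digits 0..7  */
    uint64_t lo = (uint32_t)_mm_extract_epi32(octs, 1); /* digits 8..15 */
    return hi * 100000000ULL + lo;                      /* hi*1e8 + lo  */
}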

Michael S

unread,
Dec 13, 2022, 10:32:07 AM12/13/22
to
If we deeply care about performance of this conversion then, I think,
a throughput of 23 clocks can be easily matched by scalar table-driven
code on any modern core capable of 2 loads/clock.
Eight 256-entry tables, each responsible for a different pair of digits.
15 upper digits and 16 lower digits converted simultaneously and
then combined by one 64x64=>128 multiplication followed by add+adc.
The initial conversion process consists of 16 mask operations, 14 shifts,
16 loads and 14 additions. Half of the tables are 32-bit, the other half
is 64-bit. At the cost of 4 more shifts, 6 out of 8 tables can be made 32-bit.
All in all, it should take 20 to 25 clocks on a relatively old core,
like Skylake or Zen2, and a little more on Haswell and Zen1.
On really modern Intel/AMD/Arm Inc. cores - 16-18 clocks, on Apple M1
even less. On POWER9 - hopefully closer to modern cores than to Skylake.

All that not including call overhead that could add 2-3 clocks.
Handling negative sign is another cost.

The question is, why should anybody care that much?!
Rather straight-forward conversion with 256-byte table and 14 independent
multiplications by powers of 100 (the rest is the same as above) is very
simple to code and will take only 30-35 clocks. Not fast enough to make
the point by beating IBM's HW, but other than that it's the most reasonable.
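
For illustration, a minimal serial C sketch of that pair-table idea (my
code, not tuned for latency: the description above uses 14 independent
multiplications, whereas this just runs a Horner-style loop; digits are
assumed packed BCD, most significant pair first):

#include <stdint.h>

/* 256-entry table mapping a packed BCD byte 0xHL to the value H*10+L. */
static uint8_t bcd_pair[256];

static void init_bcd_pair(void)
{
    for (int hi = 0; hi < 10; hi++)
        for (int lo = 0; lo < 10; lo++)
            bcd_pair[(hi << 4) | lo] = (uint8_t)(hi * 10 + lo);
}

/* 16 packed BCD digits (8 bytes, most significant first) -> binary. */
static uint64_t bcd16_to_bin(const uint8_t d[8])
{
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc = acc * 100 + bcd_pair[d[i]];  /* one load + one mul per pair */
    return acc;
}

/* 32 digits: convert the two 16-digit halves, then combine with one
   64x64->128 multiply by 1E16 (GCC/Clang __int128 stands in for the
   MUL + ADD/ADC sequence described above). */
unsigned __int128 bcd32_to_bin(const uint8_t d[16])
{
    unsigned __int128 hi = bcd16_to_bin(d);
    uint64_t          lo = bcd16_to_bin(d + 8);
    return hi * 10000000000000000ULL + lo;
}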

It is likely that AVX2 SIMD can do a little better than 2nd variant
although unlikely to match the 1st variant. May be, even SSEx SIMD or
ARM64 NEON SIMD can do a little better than the 2nd variant.
But why bother?

Terje Mathisen

unread,
Dec 13, 2022, 11:41:41 AM12/13/22
to
Nice! I've been trying to avoid table lookups as much as possible,
because until gather was implemented, it was very hard to SIMD such
lookups, unless they could fit in a 16-entry permute() table, with byte
outputs. In this case gather would not really help that much since we
want separate tables for each digit position.
>
> All that not including call overhead that could add 2-3 clocks.
> Handling negative sign is another cost.
>
> The question is, why should anybody care that much?!
> Rather straight-forward conversion with 256-byte table and 14 independent
> multiplications by powers of 100 (the rest is the same as above) is very
> simple to code and will take only 30-35 clocks. Not fast enough to make
> the point by beating IBM's HW, but other than that it's the most reasonable.

I really liked my idea of doing it all inside the registers, with
mullo/mulhi to combine 16-bit slots into 32-bit results, even if that
requires two muls. Working directly with 32-bit inputs doesn't really
help that much since we need twice as many. :-(

>
> It is likely that AVX2 SIMD can do a little better than 2nd variant
> although unlikely to match the 1st variant. May be, even SSEx SIMD or
> ARM64 NEON SIMD can do a little better than the 2nd variant.
> But why bother?

Right. :-)

Michael S

unread,
Dec 14, 2022, 7:55:39 AM12/14/22
to
I tried to sketch 128-bit SIMD variant of BCD to binary conversion.
The spec is not quite the same as POWER's bcdctsq, but similar.
bcdctsq converts signed 31-digit numbers.
The steps below convert unsigned 32-digit numbers.
Also, for the sake of brevity, I omitted input validation. It adds to the
line count, but hopefully does not affect latency on modern wide
cores.
3rd column is mnemonic of SSE4.1 instruction(s).
4th column is latency on Intel Skylake
5th column is latency on AMD Zen3
1: y = u8(x) >> 4 ; PSRLW+PAND ; 2 ; 2
2: y = u8(y) * 6 ; PADDB x3 ; 3 ; 3
3: x = u8(x) - u8(y) ; PSUBB ; 1 ; 1
4: y = u16(x) >> 8 ; PSRLW ; 1 ; 1
5: y = u16(y) * 156 ; PMULLW ; 5 ; 3
6: x = u16(x) - u16(y) ; PSUBW ; 1 ; 1
7: y = u32(x) >> 16 ; PSRLD ; 1 ; 1
8: y = u32(y) * 55536 ; PMULLD ; 10 ; 4
9: x = u32(x) - u32(y) ; PSUBD ; 1 ; 1
10: y = u64(x) >> 32 ; PSRLQ ; 1 ; 1
11: y = u64(y) * 4194967296 ; PMULUDQ ; 5 ; 3
12: x = u64(x) - u64(y) ; PSUBQ ; 1 ; 1
13: lo(x) = x[63:0] ; MOVQ ; 2 ; 2
14: hi(x) = x[127:64] ; PEXTRQ ; 3 ; 2
15: xx = hi(x) * 1E16 ; MUL ; 4 ; 3
16: lo(res) = lo(xx)+lo(x) ; ADD ; 1 ; 1
17: hi(res) = hi(xx)+carry ; ADC ; 1 ; 1
--------
Total latency: 41 (Skylake), 29 (Zen3)
So, latency is non-competitive, as expected.
On the other hand, throughput is likely rather good,
but I wouldn't even try to predict it theoretically,
because it depends on so many uArch details many of them
not fully documented.

In a less generic, more x86-oriented variant, lines 7,8,9 can be replaced
by PMADDWD, with even words multiplied by 1 and odd words multiplied
by 10000. PMADDWD has latency=5 on SKL, 3 on Zen3. Such a change improves
the total latency to 34/26 clocks. Which still does not sound competitive
against table-driven non-SIMD conversion.
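
For what it's worth, the first reduction stage (lines 1-3 above) maps
directly onto intrinsics; a small sketch of just that step (my code,
SSE2 only):

#include <emmintrin.h>

/* Each byte of x holds two packed BCD digits (hi*16 + lo); the result
   holds the binary byte value hi*10 + lo.  Only lines 1-3 of the
   listing above, not the whole conversion. */
static __m128i bcd_bytes_to_binary(__m128i x)
{
    __m128i hi = _mm_and_si128(_mm_srli_epi16(x, 4),
                               _mm_set1_epi8(0x0f));  /* PSRLW + PAND */
    __m128i y2 = _mm_add_epi8(hi, hi);
    __m128i y4 = _mm_add_epi8(y2, y2);
    __m128i y6 = _mm_add_epi8(y4, y2);                /* hi * 6, three PADDB */
    return _mm_sub_epi8(x, y6);                       /* hi*16+lo - hi*6 */
}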

Tim Rentsch

unread,
Dec 17, 2022, 9:35:08 AM12/17/22
to
John Levine <jo...@taugh.com> writes:

> According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
>> Back in the day when PDP-10s were being used I think it is fair to
>> label them as mainframes (with the understanding that batch mode is
>> very different from timesharing, but that is a separate issue).
>> After all, if small IBM 360s count as mainframes, then surely the
>> PDP-10 should also count as a mainframe.
>
> Not at all. As I've said a few times, it's the architecture, not
> the CPU speed. The CPU of a PDP-8 was much faster than a 360/30,
> but nobody would call a PDP-8 a mainframe, even when it was
> running TSS/8 and supporting a dozen interactive users. (A 360/30
> did a 16 bit add in 27us, a PDP-8 did a 12 bit add in 3us.)
>
> The I/O architecture of the KA-10 was basically a wider version of
> the I/O of the PDP-9 or PDP-8. It normally took an interrupt for
> every word of data, including every tty character, and had only a
> rudimentary way to move a block of data between memory and an I/O
> device, known as three cycle data break on the -8 and -9 and BLKI
> or BLKO in an interrupt location on the -10. The disk controllers
> were pretty rudimentary and if there were I/O errors, the usual
> response was to crash.
>
The KL version of the -10 was somewhat more mainframe-ish, with a
> PDP-11/40 doing some of the front end work and Massbus for disk
> and tape I/O, although of course they hooked the Massbus to PDP-11
> and -15 minis, too.

With all due respect, architecture is a red herring. DEC
positioned the PDP-10 as a mainframe, and the marketplace treated
it as one. I don't think anyone used a PDP-10 in the way that
minicomputers were used; a PDP-10 was a major investment for a
large organization, never a small-scale investment used only by a
local group. A large PDP-10 installation could be used, and used
effectively, in a setting where smaller IBM 360 "mainframes" were
being used. The PDP-10 is not a supercomputer; it couldn't
compete against the faster CDC machines or a 360/91 (and probably
not even a 360/75), but supercomputers are a separate class above
mainframes.

MitchAlsup

unread,
Dec 17, 2022, 10:42:58 AM12/17/22
to
At Carnegie-Mellon U, circa 1973, we had a 360/67 serving up to 40
KSR terminals simultaneously, and any number of small card batch
jobs. We also had a KI 10 and a KL 10 which were assigned to various
comp.arch teams for various purposes and occasional classes with
other comp.arch activities. I would say the -10s were used much more
like a minicomputer than the /67 as a single group could have access
to it for several hours at a time.

Stephen Fuld

unread,
Dec 17, 2022, 11:24:46 AM12/17/22
to
As we have discussed before, I was at CMU just before you (class of
'72), but at that time, my recollection is that the /67 used 2741
terminals, not KSR (Teletypes). I mention this only because, as
discussed above, and IIRC, the Teletype terminals did require an
interrupt per character, whereas the 2741s had a controller that
accepted the individual characters and only interrupted the mainframe on
a "new line" character. It was a cost/performance tradeoff which to use.


> and any number of small card batch
> jobs. We also had a KI 10 and a KL 10 which were assigned to various
> comp.arch teams for various purposes and occasional classes with
> other comp.arch activities. I would say the -10s were used much more
> like a minicomputer than the /67 as a single group could have access
> to it for several hours at a time.

Yes. As a non-Comp Sci major (there wasn't an undergraduate one), I saw
the PDP 10s in the computer center, but never used them. There was also
a Univac 1108, which we used for a while, but was later dedicated to the
physics department for some special project.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
Dec 17, 2022, 11:58:54 AM12/17/22
to
There was a room up on 4th floor (towards 5th Ave) with ~40 KSR terminals.
Room was 17-ish feet wide and 35-ish feet long. This room was at a
reasonable temperature. While I was there this was known as Science Hall,
it is now Wein Hall (IIRC)
<
> > and any number of small card batch
> > jobs. We also had a KI 10 and a KL 10 which were assigned to various
> > comp.arch teams for various purposes and occasional classes with
> > other comp.arch activities. I would say the -10s were used much more
> > like a minicomputer than the /67 as a single group could have access
> > to it for several hours at a time.
<
> Yes. As a non-Comp Sci major (there wasn't an undergraduate one), I saw
> the PDP 10s in the computer center, but never used them. There was also
> a Univac 1108, which we used for a while, but was later dedicated to the
> physics department for some special project.
<
The -10s were not in the "igloo" cold room with the -67, 1108, PDP-9, and
the PDP-8. Nor were they in the same room with C.mmp or CM*. On hot
summer afternoons, I would go into the igloo and sleep for a few hours,
cold soaking, which made August in Pittsburgh so much more bearable.
The igloo was kept near 63 degrees F and was 25-30 feet wide and about 90
feet deep with a counter (later with glass) separating the key punch area
and card readers from the machines themselves.
<
A small group of us got access to the PDP-9 for a semester. One night we
put a hack in the PDP-8 Time Sharing System, and put it on 3 backup tapes
so that if it got found out, they would not eliminate it even if they went
several backups back. But that is a story for another day.

Stephen Fuld

unread,
Dec 17, 2022, 12:38:13 PM12/17/22
to
OK, I was there when Science Hall was being built. IIRC, people started
moving in late during my Junior year. All the computers were still in
Scaife Hall.

Scott Lurndal

unread,
Dec 17, 2022, 12:59:25 PM12/17/22
to
TSS 8.24 was very easy to hack, particularly with ASR-33. Just punch a tape
with login sequences for every possible password[*] for the privileged account
and run the tape every time the admin changed the password.

There were a number of "peek" (and poke) PAL-D applications available to spy
on other users, and inject ^BS, etc.


[*] TSS8 supported up to 4 character alphanumeric passwords (A-Z,0-9).

MitchAlsup

unread,
Dec 17, 2022, 1:42:33 PM12/17/22
to
We used this so we could keep files on the disk with directory names which
were not printable with the KSR terminals. Thus, when someone looked, the
files were not there, but that space was not on the free list either. A quick
poke and we could change the directory name so it was still there, and so
were our files, but we did not get charged (funny money) for storing them either.

John Levine

unread,
Dec 17, 2022, 2:26:32 PM12/17/22
to
According to Tim Rentsch <tr.1...@z991.linuxsc.com>:
>With all due respect, architecture is a red herring. DEC
>positioned the PDP-10 as a mainframe, and the marketplace treated
>it as one. I don't think anyone used a PDP-10 in the way that
>minicomputers were used; a PDP-10 was a major investment for a
>large organization, never a small-scale investment used only by a
>local group. ...

You would be surprised. I briefly used a PDP-10 at the Princeton Plasma
Physics lab, which was used to do real-time monitoring of experiments.
The KA-10 had a seven level priority interrupt and user I/O mode which
let a user-mode program issue I/O instructions.

The small versions of the 10's monitor 10/10 through 10/40 didn't need
a disk and could run off DECtape. It had a bunch of real-time features
to lock jobs in core and manage device interrupts and handle (some)
interrupts in user programs.

Certainly once the PDP-11 came out, nobody was going to buy a PDP-10 to
do realtime work, and as I said later 10's got more mainframish.

> A large PDP-10 installation could be used, and used
>effectively, in a setting where smaller IBM 360 "mainframes" were
>being used.

Sure, and I used PDP-10s that way, but those PDP-10s were not very good
mainframes. The GE 635 came out about the same time as the KA-10 and
had similar CPU speed and memory, but its I/O was much more mainframe
style. A timesharing PDP-10 topped out at maybe 20 users, but DTSS on
a GE 635 provided snappy performance to 100. They did that with the
usual mainframe techniques, doing I/O in large blocks (TTYs were
managed by a front end Datanet 30) so the 635 spent its time doing
work for users, not handling I/O details.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Stephen Fuld

unread,
Dec 21, 2022, 1:29:32 PM12/21/22
to
On 12/12/2022 9:53 PM, BGB wrote:

big snip

> Following high-school, I first started trying to write a 3D
> modeling/mapping program, which then mutated into a 3D engine (initially
> trying to imitate Doom 3, but reusing some file-formats from Half-Life
> and similar). Then switched to trying to imitate Minecraft (which showed
> up roughly during the time when I was taking college classes).
>
> By the time my first 3D engine project ended, I had roughly twice as
> much code as Doom 3, but comparatively it was far less polished (nor
> particularly stable or high performance).
>
> Like, say:
> Would tend to eat multiple GBs of RAM;
> Was often running at not much over 10 fps;
> Trying to venture out all that far from the origin was likely to cause
> stuff to break and/or corrupt the world (so I had a "safety wall" at the
> 1km mark from the origin);
> ...
>
> I got a little sidetracked at this point developing video codecs for
> streaming video into texture-maps (and was also using AVI videos for
> animated textures).


From this post, and from many of your previous posts, you clearly know
a lot about the "guts" of 3D graphics. This is an area I know
essentially nothing about, but have been curious about for a long time.
So I am asking you (and, of course, by posting this, everyone else), how
did you learn about this stuff?

I have looked for books, but the ones I find seem to either be "How to
use a particular graphics library" (which I don't care about), or very
quickly get so deeply into the mathematics of some aspect of the problem
that I can't see the forest for the trees. I have also tried reading
Wikipedia articles, but again, they seem to be detailed about a
particular aspect of the problem, or a very high level overview, so I
got frustrated.

Again, I am looking for sort of a middle ground between a glossy
overview, and the real nitty gritty details, and not of any particular
graphics library or game engine.

Any suggestions?

Josh Vanderhoof

unread,
Dec 21, 2022, 4:54:42 PM12/21/22
to
Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:

> From this post, and from many of your previous posts, you clearly know
> a lot about the "guts" of 3D graphics. This is an area I know
> essentially nothing about, but have been curious about for a long
> time. So I am asking you (and, of course, by posting this, everyone
> else), how did you learn about this stuff?
>
> I have looked for books, but the ones I find seem to either be "How to
> use a particular graphics library" (which I don't care about), or very
> quickly get so deeply into the mathematics of some aspect of the
> problem that I can't see the forest for the trees. I have also tried
> reading Wikipedia articles, but again, they seem to be detailed about
> a particular aspect of the problem, or a very high level overview, so
> I got frustrated.
>
> Again, I am looking for sort of a middle ground between a glossy
> overview, and the real nitty gritty details, and not of any particular
> graphics library or game engine.
>
> Any suggestions?

Try Michael Abrash's Black Book.

https://www.jagregory.com/abrash-black-book/#chapter-50-adding-a-dimension

Stephen Fuld

unread,
Dec 21, 2022, 6:10:17 PM12/21/22
to
I just started looking at it and it looks quite interesting. Thanks!

Terje Mathisen

unread,
Dec 22, 2022, 9:50:36 AM12/22/22
to
Abrash is great, I have signed copies of his books. :-)

Josh Vanderhoof

unread,
Dec 22, 2022, 4:14:03 PM12/22/22
to
Terje being modest here not mentioning that he's *in* the book in
multiple chapters!

BGB

unread,
Dec 22, 2022, 11:14:22 PM12/22/22
to
At first?

Well, in elementary school I had the Wolf3D and Doom source, and then by
high-school I had the Quake source.


So, mostly hacking around with the GLQuake source and figuring out how
3D renderers worked from there.

I had previously tried to do something sort of like Wolf3D, but with
little real understanding of how Wolf3D's renderer actually worked, my
first attempt was far too slow to be usable (and also predated the
"great hard-drive crash" which happened when I was in middle school).

But, by HS, we had GPUs and software rendering was old hat, so at the
time I skipped it.


Didn't actually start messing around with software rendering again until
around the mid 2010s, with my first attempt at a software rasterized
OpenGL implementation (which I then used for running Quake3).

But, I didn't read any real books on it, and my starting point was
actually based on taking inspiration from the Quake 1 and 2 software
renderers, and figuring out how to do something similar (just throwing
the OpenGL API on top, and disregarding the use of perspective-correct
texture mapping, typically using dynamic tessellation instead).


My implementation of the OpenGL API was based on my own understanding
from using the API, and the online help descriptions of the various
functions. Sorta also helpful to be familiar with vector and matrix math
and similar for this part (there is a lot of this; and fun like working
in terms of 4D homogeneous coordinates).

The API design is kind of bulky and kind of a pain to implement though
(though, a fair chunk of the functions are mostly for setting bits of
internal state and twiddling flags and similar).

A rendering API with significantly less "chatter" could be possible,
say, by having user code manage state flags and similar itself (and
straight up drop the whole glBegin/glEnd/glVertex3fv/... interface).



Then, did another smaller-scale version for my BGBTech2 engine, mostly
intended as a possible workaround for WebGL+WASM sucking (didn't end up
using this, as after doing a prototype of the BGBTech2 engine on WASM,
my motivation to continue messing with it was "basically dead").

And, was also an alternative for the "very terrible" Intel GMA chip in
my laptop. My software renderer was almost as fast as the Intel GMA, and
a little less prone to breaking. The GMA was "basically sufficient" to
run Quake 3 and Half-Life, but pretty much anything much beyond this,
one was looking at single-digit frame-rates at best (and so help you if
you turn on fragment shaders; good solid 0 fps).

Otherwise, the laptop has a 2.1 GHz Celeron, IIRC 4GB of RAM (2GB
original), and came with Windows Vista.

Not gotten a new laptop since then...



My later attempt (TKRA-GL) was based on trying to do a design more
optimized for the ISA quirks in my BJX2 ISA design.


All 3 attempts had a semi-common feature of using a hacked version of
RGB555 (RGB555A):
0rrrrrgggggbbbbb (RGB555)
1rrrraggggabbbba (RGB444A3, same layout as RGB555)
1rrrr0gggg0bbbb0: Fully transparent
1rrrr1gggg1bbbb1: 7/8 Opaque (A ~= 0.88)
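
A rough decode of that layout might look like the following (an
illustrative sketch, not the engine's actual code; the handling of the
intermediate alpha codes is a guess here, scaling the three alpha bits
linearly, which at least matches the two documented endpoints):

#include <stdint.h>

void rgb555a_decode(uint16_t px, uint8_t *r, uint8_t *g, uint8_t *b,
                    uint8_t *a)
{
    if (!(px & 0x8000)) {            /* 0rrrrrgggggbbbbb: opaque RGB555 */
        *r = (px >> 10) & 0x1f;  *r = (*r << 3) | (*r >> 2);
        *g = (px >>  5) & 0x1f;  *g = (*g << 3) | (*g >> 2);
        *b =  px        & 0x1f;  *b = (*b << 3) | (*b >> 2);
        *a = 255;
    } else {                         /* 1rrrraggggabbbba: RGB444 + A3    */
        *r = (px >> 11) & 0x0f;  *r = (*r << 4) | *r;
        *g = (px >>  6) & 0x0f;  *g = (*g << 4) | *g;
        *b = (px >>  1) & 0x0f;  *b = (*b << 4) | *b;
        uint8_t a3 = (uint8_t)((((px >> 10) & 1) << 2) |
                               (((px >>  5) & 1) << 1) |
                               ( (px      ) & 1));
        *a = (uint8_t)((a3 * 255) / 8);  /* 0 -> transparent, 7 -> ~7/8 */
    }
}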


Though, my first two attempts used RGBA32 for the output framebuffer,
with Depth24+Stencil8.

TKRA-GL uses RGB555 for the framebuffer, with Depth16 or Depth12+Stencil4.
This leads to worse quality (and Depth12 is prone to obvious Z-fighting
issues), but uses less memory and is faster in this case (cache misses
being a big factor).


Note 3 or 4 stencil bits is basically the minimum for Doom3 style
stencil shadows to work, though it is (at present) very unlikely I will
attempt a Doom3 style renderer on BJX2 (the cost to 3D render stuff
would be infeasible, *).

*: It basically requires projecting and redrawing all of the geometry
within a certain radius of each light source, for every potentially
visible light source.

A much more modest, but "much less impressive", strategy is to
calculate the lighting contributions from each light source to each
vertex, and then periodically recalculate the vertex colors.

So, say:
A static light RGB;
A light contribution per light-style;
A sum of any dynamic lights in range
(possibly with some ray-cast checks).
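
In sketch form (a generic illustration of that scheme, not actual engine
code; the linear falloff and all the names are placeholders):

#include <math.h>

typedef struct { float x, y, z; } vec3;

/* Simple linear attenuation: 1 at the light, 0 at the radius. */
static float atten(vec3 v, vec3 lpos, float radius)
{
    float dx = v.x - lpos.x, dy = v.y - lpos.y, dz = v.z - lpos.z;
    float d  = sqrtf(dx*dx + dy*dy + dz*dz);
    return d >= radius ? 0.0f : 1.0f - d / radius;
}

/* vertex color = static RGB + sum(style_rgb[i]*style_val[i]) + dynamic lights */
vec3 light_vertex(vec3 v, vec3 base,
                  const vec3 *style_rgb, const float *style_val, int nstyles,
                  const vec3 *dl_pos, const vec3 *dl_rgb,
                  const float *dl_rad, int ndl)
{
    vec3 c = base;
    for (int i = 0; i < nstyles; i++) {
        c.x += style_rgb[i].x * style_val[i];
        c.y += style_rgb[i].y * style_val[i];
        c.z += style_rgb[i].z * style_val[i];
    }
    for (int i = 0; i < ndl; i++) {
        float a = atten(v, dl_pos[i], dl_rad[i]); /* optionally gated by a ray cast */
        c.x += dl_rgb[i].x * a;
        c.y += dl_rgb[i].y * a;
        c.z += dl_rgb[i].z * a;
    }
    return c;
}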


I haven't kept up with more recent developments.


My first two rasterizers used uncompressed RGB555A textures.

TKRA-GL also uses compressed textures (on BJX2, via compressed texture
helper ops).

Internal format resembles the original S3TC/DXT1 format, but differs
slightly in that it uses RGB555 vs RGB565, and encodes a few different
modes (so can also do alpha blended blocks). The API dynamically
translates from DXT1 or DXT5.


> I have looked for books, but the ones I find seem to either be "How to
> use a particular graphics library" (which I don't care about), or very
> quickly get so deeply into the mathematics of some aspect of the problem
> that I can't see the forest for the trees.  I have also tried reading
> Wikipedia articles, but again, they seem to be detailed about a
> particular aspect of the problem, or a very high level overview, so I
> got frustrated.
>

I have some old books of this sort, mostly from my childhood.
"Teach Yourself whatever in 21 days."
Etc.

Some games books, but at the time they were written, the "hot new thing"
was apparently to use the Windows 95 GDI to draw bitmap images into the
window, and invoke Video For Windows to get a video overlay to render on
some part of the screen. All using Windows built-in API calls for the
DIB loading and AVI playback.

Though, IIRC, the books did at least talk about the BMP and PCX file
formats and similar.


IIRC, I also have another book ("Flights of Fantasy") which talked
mostly about how to draw wireframe and flat shaded 3D graphics (mostly
in the context of 3D flight simulators).

IIRC, its algorithm was something like "Draw lines with Bresenham's
Algorithm" and then "Modify things slightly to do a per-scanline
flood-fill of the pixel color for the polygon".

Something like (very vague memory):
Walk vertices in clockwise order (via modified Bresenham's);
Record the ending X position for each scanline;
When walking upwards, record the starting X values;
For the marked scan-line range, flood-fill from starting to ending X.

Or, might have been Min/Max X per-line, and Min/Max Y, I forget.
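
A minimal reconstruction of that kind of scanline filler might look like
this (my guess at the idea, not the book's code; convex polygons only,
with plain integer DDA stepping standing in for the Bresenham-style edge
walk, and made-up names):

#include <limits.h>

#define SCREEN_H 200
static int span_min[SCREEN_H], span_max[SCREEN_H];

/* Walk one edge and record the leftmost/rightmost X hit per scanline.
   Edges are treated half-open in Y so shared vertices are not counted twice. */
static void mark_edge(int x0, int y0, int x1, int y1)
{
    if (y0 == y1) return;                 /* horizontal edges add nothing */
    if (y0 > y1) { int t; t = x0; x0 = x1; x1 = t; t = y0; y0 = y1; y1 = t; }
    for (int y = y0; y < y1; y++) {
        if (y < 0 || y >= SCREEN_H) continue;
        int x = x0 + (x1 - x0) * (y - y0) / (y1 - y0);
        if (x < span_min[y]) span_min[y] = x;
        if (x > span_max[y]) span_max[y] = x;
    }
}

void fill_poly(const int *xs, const int *ys, int n, unsigned char color,
               void (*hline)(int x0, int x1, int y, unsigned char c))
{
    for (int y = 0; y < SCREEN_H; y++) { span_min[y] = INT_MAX; span_max[y] = INT_MIN; }
    for (int i = 0; i < n; i++)
        mark_edge(xs[i], ys[i], xs[(i + 1) % n], ys[(i + 1) % n]);
    for (int y = 0; y < SCREEN_H; y++)
        if (span_min[y] <= span_max[y])
            hline(span_min[y], span_max[y], y, color);  /* flood the span */
}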


Also went a fair bit into "Painter's Algorithm" and such, IIRC.

IIRC, didn't go into texture-mapping or lighting.

IIRC, the algo also didn't bother splitting things into triangles or
quads, but would handle full polygons directly.

Or, at least, this is what little I can remember (haven't really looked
at the book in many years).


I didn't use the same approach, and had instead worked by walking
triangle edges and stepping vectors for each Y position.

The approaches used in TKRA-GL would likely have been impractical on
early PC's though, as it would use far more state than the CPU has
registers.


That approach could maybe make sense if one wants to draw a bunch of
flat-shaded geometry (such as if trying to make something that looks like
Star Fox or Virtua Fighter or similar).

Well, as opposed to texture-mapped color-modulated geometry (my current
default case).


> Again, I am looking for sort of a middle ground between a glossy
> overview, and the real nitty gritty details, and not of any particular
> graphics library or game engine.
>
> Any suggestions?
>

I don't know personally; I didn't really learn as much from books.
Still, a few old books are helpful references though.

Mostly it was a lot of fiddling and trial and error in some areas; and a
lot of gathering information via random Google searches (or, when I was
much younger, AltaVista).

Also, many years of hanging around on Usenet...


Terje Mathisen

unread,
Dec 23, 2022, 10:35:51 AM12/23/22
to
Thank You, and Merry Christmas!

I've been celebrating Advent the same way as in all recent years, i.e.
getting up at 0550 every day to be awake for the 0600 release of this
day's puzzles: I still know how to program and I know how to convert a
specification text into working algorithms, but I'm definitely getting
slower than I was 30-40 years ago. :-(

https://adventofcode.com/2022

Thomas Koenig

unread,
Dec 28, 2022, 4:18:25 AM12/28/22
to
Terje Mathisen <terje.m...@tmsw.no> schrieb:
Just browsing through the first chapter right now. I like the
"cycle eater" :-)

The first chapters also show how severely hobbled the 8088 was and how
far we have progressed in the meantime, both in CPU design and
compiler technology.

Terje Mathisen

unread,
Dec 29, 2022, 5:08:30 AM12/29/22
to
The worst "feature" by far on the 8088 was the 8-bit bus which meant
that every byte touched (load/store/code) needed 4 clock cycles for the
bus transfer.

OTOH, the silver lining was that you could pre-calculate the speed of
almost all code by simply adding together all those bytes (i.e. mostly
code) and state that it would take that many microseconds: with a 4.77
MHz CPU frequency and some overhead due to the cycles lost to DRAM
refresh, the effective bus rate came out to just a smidgen over 1 byte/us.
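
(To make the arithmetic explicit, with my own rounding: 4.77 MHz divided
by 4 clock cycles per byte gives roughly 1.19 bytes/us of peak bus
bandwidth, and the DRAM refresh cycles plus other overheads are what pull
the sustained figure down toward 1 byte/us.)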