Packed decimals


mar...@hotmail.com

Dec 4, 2000, 3:00:00 AM
Hello,

I recently had to write a program to parse some packed decimal output
from an AS/400 machine. Not having an AS/400 background, I am curious
why this format is used at all? It's certainly not compact - it takes
more bytes to store the same number than in binary format (2's
complement). Or is it for readability? But then aren't these packed
decimal numbers meant for machine rather than human consumption so why
bother making them readable?

Perplexed and regards,

mark.


Sent via Deja.com http://www.deja.com/
Before you buy.

Not Me

Dec 4, 2000, 3:00:00 AM

Because for the business applications that dominate the AS/400 world,
packed decimal completely eliminates round off errors that you get
with floating point numbers.

Scott Lindstrom

Dec 4, 2000, 3:00:00 AM
In article <7c6o2tsabi7d4b3tn...@4ax.com>,

Also, when viewing numbers in hex (through file utilities, or in the
old days, dumps), packed decimal is infinitely more readable than
binary.

Scott Lindstrom
Zenith Electronics

Ugo Gagliardelli

Dec 4, 2000, 8:31:07 PM
mar...@hotmail.com wrote:
>
> Hello,
>
> I recently have to write a program to parse some packed decimal output
> from an AS/400 machine. Not having an AS/400 background, I am curious
> why this format is used at all? It's certainly not compact - it takes
> more bytes to store the same number than in binary format (2's
> complement). Or is it for readability? But then aren't these packed
> decimal numbers meant for machine rather than human consumption so why
> bother making them readable?
Binary numbers hold only integers; packed fields hold decimal fractions
too, because they preserve the position of every digit just as if the
number were written out as a string, yet they need only about half as
many bytes as a string.
--
Dr. Ugo Gagliardelli, Modena, Italy
Spaccamaroni andate a cagare/Spammers not welcome

Dave McKenzie

Dec 4, 2000, 9:09:29 PM
Actually, you CAN define decimal positions for binary fields (in DDS
and ILE RPG), but I've never seen it done in practice.

I think packed decimal exists mainly for historical reasons. I think
it was invented by IBM in the early 60's (or maybe even 50's) for
speed. Printing or displaying a binary number would involve
binary-to-decimal conversion; so printing a page of figures would put
a significant load on the trembling little processors of the day. Of
course, today even the processor in a wrist watch could do it in a
trice.

Also, 4-byte binary allows only numbers up to 4 billion (2 billion if
signed)--not enough to calculate the US national debt; whereas packed
decimal allows up to 31 digits. (8-byte binary on AS/400 (Oops -
iSeries) was introduced only within the last year or two.)

--Dave

Thomas

Dec 4, 2000, 10:04:59 PM
Yes, it's possible to have fractional positions in binary numbers, just
as with decimal numbers. And it's the compilers (or assemblers or
whatever), of course, that enforce them through the code that's
generated. I'm not sure they're technically referred to as "decimal"
positions, but that's irrelevant.

Overall, I think it evolved out of the BCD (binary-coded decimal) forms
from earlier processor architectures. We now have EBCDIC which refers to
extended BCD. A lot of processors handle forms of BCD even today. ASCII
really only came into serious widespread use with the advent of the
personal computer and the rise of UNIX; in the early 70's, ASCII was a
more marginal code.

I had understood that a major issue was that binary fractional positions
simply could not be accurately converted to/from decimal. Something like
a packed(5 2) field that accurately held a value of $.03 might instead
be specified as whole cents in binary rather than as fractional dollars
as in packed-decimal.

I think it's nowadays simply a matter that packed-decimal has an ease of
use in business computing that binary doesn't give. Besides, I don't
think IBM could afford to switch its mainframes away from EBCDIC.

Tom Liotta

In article <e7io2t8gqh3oi75i6...@4ax.com>,

--
Tom Liotta, AS/400 Systems Programmer
The PowerTech Group, Inc.; http://www.400security.com
...and for you automated email spammers out there:
jqu...@fcc.gov sn...@fcc.gov rch...@fcc.gov mpo...@fcc.gov

Terrence Enger

Dec 4, 2000, 9:09:18 PM
My C$.02 worth is below. Meanwhile, I have copied this message to
comp.arch in the hope that someone there will have another answer.

Terry.

Scott Lindstrom wrote:
>
> In article <7c6o2tsabi7d4b3tn...@4ax.com>,
> Not Me <not...@notme.com> wrote:

> > On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:
> >
> > >Hello,
> > >
> > >I recently have to write a program to parse some packed decimal
> output
> > >from an AS/400 machine. Not having an AS/400 background, I am curious
> > >why this format is used at all?

People are more used to decimal arithmetic. Now, obviously, anything is
possible in any representation, but some representations are much more
convenient for some purposes: please pick up your hammer and chisel and
calculate for us MCMXLV times MCDLXI divided by IV (just kidding). This
is a big issue at both ends of computed numbers. If the low order digit
of a result is to be rounded off--note that I am contradicting marklau's
assertion that decimal arithmetic avoids roundoff completely--most people
are more comfortable with a result rounded to the nearest dollar or
penny or hundredth of a penny; a sixteenth of a dollar or a sixty-fourth
of a penny takes a lot of decimal places on a printout or a screen.
After the loss of high-order significant digits, it is arguably better
to be "out" some multiple of $10,000,000.00 than by some multiple of
2^32 cents.

Having programmed commercial applications in assembly language (I have
*earned* these gray hairs!) I am quite happy to avoid the tedium of
extended precision binary arithmetic and decimal scaling of binary
numbers.

> > > It's certainly not compact - it takes
> > >more bytes to store the same number than in binary format (2's
> > >complement).

The issue of compactness is clouded by the fact that decimal processors
like the 360 and its descendants will handle more different sizes of
decimal numbers than binary processors will handle binary numbers. A
five-digit packed number takes three bytes; in binary it only requires
17 and a fraction bits, but how much will you probably allocate?

The redundancy in packed numbers is occasionally useful in that your bug
(okay, *my* bug; don't be so touchy!) can manifest itself sooner as a
decimal data exception rather than later as incorrect output.

> > > Or is it for readability? But then aren't these packed
> > >decimal numbers meant for machine rather than human consumption so
> why
> > >bother making them readable?
> > >

> > >Perplexed and regards,
> > >
> > >mark.


> > >
> > >
> > >Sent via Deja.com http://www.deja.com/
> > >Before you buy.

> > Because for the business applications that dominate the AS/400 world,
> > packed decimal completely eliminates round off errors that you get
> > with floating point numbers.
> >
>
> Also, when viewing numbers in hex (through file utilities, or in the
> old days, dumps), packed decimal is infinitely more readable than
> binary.
>
> Scott Lindstrom
> Zenith Electronics
>

Dave McKenzie

Dec 5, 2000, 1:04:38 AM
I know this is beating a dead horse, but I can't resist trying to nail
down the details...

I think for binary fields (in DDS & RPG) the fractional positions
really are "decimal" positions, not "binary" positions. It's not as
if they set aside a certain number of bits for the fraction, which
WOULD be impossible to convert exactly. For a binary field of 9 digits
and 2 fractional digits, what is stored for 1.23 (i.e. a dollar and 23
cents) is the binary equivalent of 123, x0000007B. In other words,
the decimal point is ignored and the number is stored as an integer.
For printing, the binary integer is converted to decimal and the
decimal point is inserted; and for calculations, the operands are
shifted appropriately to align the decimal points.

But this is the same way packed decimal is stored: a packed (9,2)
value of 1.23 is stored as x000000123F, that is, as an integer.
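
To make that concrete for anyone who, like mark, has to parse such a
field on another platform, here is a minimal C sketch of a decoder. It
assumes the usual S/360-style layout described above (digit nibbles left
to right, sign in the low nibble of the last byte, with A/C/E/F positive
and B/D negative); it is an illustration only, not actual AS/400 code.

#include <stdio.h>

/* Decode a packed decimal field of 'len' bytes into a scaled integer.
   The result is a count of the smallest unit (cents for a (9,2) field);
   the caller supplies the implied decimal point, just as with the
   stored form.  Returns 0 on success, -1 on a bad digit or sign. */
int unpack_decimal(const unsigned char *buf, int len, long long *out)
{
    long long value = 0;
    int i;

    for (i = 0; i < len; i++) {
        unsigned high = buf[i] >> 4;
        unsigned low  = buf[i] & 0x0F;

        if (high > 9)
            return -1;                     /* "decimal data" error */
        value = value * 10 + high;

        if (i < len - 1) {                 /* low nibble is another digit */
            if (low > 9)
                return -1;
            value = value * 10 + low;
        } else {                           /* last byte: low nibble is the sign */
            if (low == 0xB || low == 0xD)
                value = -value;            /* negative sign codes */
            else if (low < 0xA)
                return -1;                 /* A, C, E, F are positive */
        }
    }
    *out = value;
    return 0;
}

int main(void)
{
    /* A packed (9,2) field holding 1.23, i.e. x'000000123F' */
    unsigned char field[5] = { 0x00, 0x00, 0x00, 0x12, 0x3F };
    long long cents;

    if (unpack_decimal(field, sizeof field, &cents) == 0)
        printf("%lld cents, i.e. %lld.%02lld\n",
               cents, cents / 100, cents % 100);
    return 0;
}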

--Dave

On Tue, 05 Dec 2000 03:04:59 GMT, Thomas <tho...@inorbit.com> wrote:
<snip>


>I had understood that a major issue was that binary fractional positions
>simply could not be accurately converted to/from decimal. Something like
>a packed(5 2) field that accurately held a value of $.03 would be maybe
>specified as whole cents in binary rather than as fractional dollars as
>in packed-decimal.

<snip>

Del Cecchi

Dec 5, 2000, 3:00:00 AM
In article <3A2C4E08...@idirect.com>,

Terrence Enger <ten...@idirect.com> writes:
|> My C$.02 worth is below. Meanwhile, I have copied this message to
|> comp.arch in the hope that someone there will have another answer.
|>
|> Terry.
|>
|> Scott Lindstrom wrote:
|> >
|> > In article <7c6o2tsabi7d4b3tn...@4ax.com>,
|> > Not Me <not...@notme.com> wrote:
|> > > On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:
|> > >
|> > > >Hello,
|> > > >
|> > > >I recently have to write a program to parse some packed decimal
|> > output
|> > > >from an AS/400 machine. Not having an AS/400 background, I am curious
(snipping)

|> > > Because for the business applications that dominate the AS/400 world,
|> > > packed decimal completely eliminates round off errors that you get
|> > > with floating point numbers.
|> > >
|> >
|> > Also, when viewing numbers in hex (through file utilities, or in the
|> > old days, dumps), packed decimal is infinitely more readable than
|> > binary.
|> >
|> > Scott Lindstrom
|> > Zenith Electronics
|> >

I believe there are several reasons, and there are really two parts to the
question.

"why decimal instead of binary?" and "why packed decimal?".

I think the answers are at least partially historical in that many early
computers used decimal arithmetic. Packed decimal was invented in a time of very
expensive and limited storage. Languages such as COBOL have decimal or string
arithmetic which fits decimal operations.

And last, there is at least some merit to the notion that decimal arithmetic is a
better fit to fixed point operations such as financial calculations.

del cecchi

--

Del Cecchi
cecchi@rchland

Worley Barry

Dec 5, 2000, 3:00:00 AM, to mar...@hotmail.com
On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:

>Hello,
>
>I recently have to write a program to parse some packed decimal output
>from an AS/400 machine. Not having an AS/400 background, I am curious

>why this format is used at all? It's certainly not compact - it takes


>more bytes to store the same number than in binary format (2's

>complement). Or is it for readability? But then aren't these packed


>decimal numbers meant for machine rather than human consumption so why
>bother making them readable?
>
>Perplexed and regards,

You may find the answer to your question in the packed
decimal section of this web site.


http://cs.senecac.on.ca/~ward/ops334/ops_datarep.html#DataRep

Regards, Worley


Alberto Moreira

Dec 5, 2000, 3:00:00 AM
Del Cecchi wrote:

> I believe there are several reasons, and there are really two parts to the
> question.
>
> "why decimal instead of binary?" and "why packed decimal?".
>
> I think the answers are at least partially historical in that many early
> computers used decimal arithmetic. Packed decimal was invented in a time of very
> expensive and limited storage. Languages such as COBOL have decimal or string
> arithmetic which fits decimal operations.
>
> And last, there is at least some merit to the notion that decimal arithmetic is a
> better fit to fixed point operations such as financial calculations.

The issue was also exactness. Commercial programs had to do
precise decimal computations, and floating point did not look
good from that perspective; we couldn't even reliably compare
two floating point numbers for equality! With packed decimal
arithmetic, the result of operating on two numbers was
predictable and repeatable. It mapped well onto COBOL's
ability to define decimal fields of limited size, and that was
important in the days of 20Mb hard drives and half-a-gig mag
tapes. Moreover, it allowed for variable length fields and for
large numbers.
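
A tiny C illustration of the exactness point (nothing AS/400-specific,
just the classic binary-fraction drift that the scaled-integer model
behind packed decimal avoids):

#include <stdio.h>

int main(void)
{
    double d = 0.0;
    long cents = 0;
    int i;

    /* Binary floating point cannot represent 0.10 exactly, so ten
       additions of "ten cents" drift away from 1.0 and an equality
       test fails. */
    for (i = 0; i < 10; i++)
        d += 0.10;
    printf("double: %.17f  == 1.0? %s\n", d, d == 1.0 ? "yes" : "no");

    /* The scaled-integer model behind packed decimal: work in whole
       cents.  Every value and every sum is exact and repeatable. */
    for (i = 0; i < 10; i++)
        cents += 10;
    printf("cents:  %ld  == 100? %s\n", cents, cents == 100 ? "yes" : "no");
    return 0;
}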


Alberto.

Stephen Fuld

Dec 5, 2000, 3:00:00 AM

"Del Cecchi" <cec...@signa.rchland.ibm.com> wrote in message
news:90iqcb$1378$1...@news.rchland.ibm.com...

> In article <3A2C4E08...@idirect.com>,
> Terrence Enger <ten...@idirect.com> writes:
> |> My C$.02 worth is below. Meanwhile, I have copied this message to
> |> comp.arch in the hope that someone there will have another answer.
> |>
> |> Terry.
> |>
> |> Scott Lindstrom wrote:
> |> >
> |> > In article <7c6o2tsabi7d4b3tn...@4ax.com>,
> |> > Not Me <not...@notme.com> wrote:
> |> > > On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:
> |> > >
> |> > > >Hello,
> |> > > >
> |> > > >I recently have to write a program to parse some packed decimal
> |> > output
> |> > > >from an AS/400 machine. Not having an AS/400 background, I am
curious
> (snipping)

> |> > > Because for the business applications that dominate the AS/400
world,
> |> > > packed decimal completely eliminates round off errors that you get
> |> > > with floating point numbers.
> |> > >
> |> >
> |> > Also, when viewing numbers in hex (through file utilities, or in the
> |> > old days, dumps), packed decimal is infinitely more readable than
> |> > binary.
> |> >
> |> > Scott Lindstrom
> |> > Zenith Electronics
> |> >
>
> I believe there are several reasons, and there are really two parts to the
> question.
>
> "why decimal instead of binary?" and "why packed decimal?".
>
> I think the answers are at least partially historical in that many early
> computers used decimal arithmetic. Packed decimal was invented in a time
of very
> expensive and limited storage. Languages such as COBOL have decimal or
string
> arithmetic which fits decimal operations.
>
> And last, there is at least some merit to the notion that decimal
arithmetic is a
> better fit to fixed point operations such as financial calculations.
>
> del cecchi
>
> --


I agree with both of these reasons. However, the Power PC does not appear
to have any hardware support for packed decimal. Does the AS/400 version of
the chip add some support or is the whole thing emulated in software?

You said the AS/400 was not renamed something else. OK, I'll not try to
second guess marketing. Have all the lines been renamed? Can you give us a
"cheat sheet" to convert from old to new names?


--
- Stephen Fuld


>
> Del Cecchi
> cecchi@rchland

Thomas

Dec 5, 2000, 3:00:00 AM
Dave:

No disagreement. For this, it's simply a matter of whether the
discussion is about how a compiler (DDS & RPG in this case) implements a
source code specification or how a binary field might have fractional
positions in terms of bits. I think we were each emphasizing a different
aspect.

Your example of 1.23 being stored as x0000007B is precisely where the
two aspects converge. It easily represents the storage of whole 'cents'
in a monetary field rather than fractional dollars. This is, of course,
different from, say, a 2-bit unsigned binary field with a single
fractional position. That field could take the values 0, 1/2, 1, and
1 1/2.

And personally I don't mind your nailing down the details. For me, the
details of binary vs packed vs whatever are almost dim memories of
history. And the subject of "what's packed decimal?" comes up often
enough in this newsgroup that it's worth going over once in a while. It
indicates new people coming in, either new to computing altogether or
simply new to the IBM world. The AS/4... uh... iSeries 400 market can
always stand to welcome new people and your voice is a credible one.

Tom Liotta

In article <rd0p2tkecn9ehkjc1...@4ax.com>,

--

Thomas

Dec 5, 2000, 3:00:00 AM
In article <3A2CF3B5...@moreira.mv.com>,
alb...@moreira.mv.com wrote:

> Del Cecchi wrote:
>
> and that was
> important in the days of 20Mb hard drives


Heh. I recall upgrading our 2311(? no longer absolutely certain of its
model identifier) disk unit attached to a 1401 from 5MB to 10MB.
Essentially, our IBM CE removed a screw from the drive unit and the arm
could then travel twice as far. 20MB never made it to that machine.
Sheesh, it was only an 8K base memory (with a 4K expansion box about the
size of a washing machine to bring it up to a massive 12K.)

But then, it didn't exactly use packed-decimal nor exactly binary
either. There are times when I still miss being able to set "word
marks".

--
Tom Liotta, AS/400 Systems Programmer
The PowerTech Group, Inc.; http://www.400security.com
...and for you automated email spammers out there:
jqu...@fcc.gov sn...@fcc.gov rch...@fcc.gov mpo...@fcc.gov

Terje Mathisen

Dec 5, 2000, 3:00:00 AM
Stephen Fuld wrote:
> I agree with both of these reasons. However, the Power PC does not appear
> to have any hardware support for packed decimal. Does the AS/400 version of
> the chip add some support or is the whole thing emulated in software?

SW emulation of BCD is easy, and with the wide registers we have now, it
is also quite fast (15 or 16 digits in a 64-bit register).

However, if you're doing a lot of fixed-point BCD math, it is probably
much faster to do all the work in binary and only convert on
input/output. :-)
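
As a rough sketch of that convert-on-input/output idea in C (field width
and sign conventions here are assumptions for illustration, not taken
from any particular system), the output side might look like this:

#include <stdio.h>
#include <string.h>

/* Turn a scaled binary integer (e.g. cents) back into an S/360-style
   packed decimal field of 'len' bytes: digit nibbles high to low, sign
   (C positive, D negative) in the low nibble of the last byte.
   High-order digits that do not fit are silently lost in this sketch. */
void pack_decimal(long long value, unsigned char *buf, int len)
{
    unsigned char sign = 0xC;
    int i;

    if (value < 0) {
        sign = 0xD;
        value = -value;
    }
    memset(buf, 0, len);
    buf[len - 1] = (unsigned char)(((value % 10) << 4) | sign);
    value /= 10;
    for (i = len - 2; i >= 0 && value != 0; i--) {
        buf[i]  = (unsigned char)(value % 10);          /* low nibble  */
        value  /= 10;
        buf[i] |= (unsigned char)((value % 10) << 4);   /* high nibble */
        value  /= 10;
    }
}

int main(void)
{
    unsigned char field[5];                   /* a (9,2) field */
    int i;

    pack_decimal(123, field, sizeof field);   /* 1.23 held as 123 cents */
    for (i = 0; i < 5; i++)
        printf("%02X", field[i]);
    printf("\n");                             /* prints 000000123C */
    return 0;
}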

Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Del Cecchi

Dec 5, 2000, 3:00:00 AM
In article <UwbX5.7800$Ei1.5...@bgtnsc05-news.ops.worldnet.att.net>,

"Stephen Fuld" <s.f...@worldnet.att.net> writes:
|>
|> "Del Cecchi" <cec...@signa.rchland.ibm.com> wrote in message
|> news:90iqcb$1378$1...@news.rchland.ibm.com...
|> > In article <3A2C4E08...@idirect.com>,
|> > Terrence Enger <ten...@idirect.com> writes:
|> > |> My C$.02 worth is below. Meanwhile, I have copied this message to
note snippage

|> > |> >
|> > |> > Also, when viewing numbers in hex (through file utilities, or in the
|> > |> > old days, dumps), packed decimal is infinitely more readable than
|> > |> > binary.
|> > |> >
|> > |> > Scott Lindstrom
|> > |> > Zenith Electronics
|> > |> >
|> >
|> > I believe there are several reasons, and there are really two parts to the
|> > question.
|> >
|> > "why decimal instead of binary?" and "why packed decimal?".
|> >
|> > I think the answers are at least partially historical in that many early
|> > computers used decimal arithmetic. Packed decimal was invented in a time
|> of very
|> > expensive and limited storage. Languages such as COBOL have decimal or
|> string
|> > arithmetic which fits decimal operations.
|> >
|> > And last, there is at least some merit to the notion that decimal
|> arithmetic is a
|> > better fit to fixed point operations such as financial calculations.
|> >
|> > del cecchi
|> >
|> > --
|>
|>
|> I agree with both of these reasons. However, the Power PC does not appear
|> to have any hardware support for packed decimal. Does the AS/400 version of
|> the chip add some support or is the whole thing emulated in software?
|>
|> You said the AS/400 was not renamed something else. OK, I'll not try to
|> second guess marketing. Have all the lines been renamed? Can you give us a
|> "cheat sheet" to convert from old to new names?
|>
|>
|> --
|> - Stephen Fuld
|>
|>
As part of the eServer announcement, from now on systems that would have been
called AS/400 will be "i-series eServers". IBM still sells AS/400s announced
before the recent changes. So there is no conversion of names, per se. But the
RS/6000 will be P series and Intel based servers will be X series.

My understanding is that decimal arithmetic (zoned, packed, something) was one of
the things that was added to the PowerPC to make a PowerPC/AS. There is a bunch
of background hidden on IBM's AS400/i-series web pages.

--

Del Cecchi
cecchi@rchland

Stephen Fuld

Dec 5, 2000, 3:00:00 AM

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:3A2D54E3...@hda.hydro.com...

> Stephen Fuld wrote:
> > I agree with both of these reasons. However, the Power PC does not
appear
> > to have any hardware support for packed decimal. Does the AS/400
version of
> > the chip add some support or is the whole thing emulated in software?
>
> SW emulation of BCD is easy, and with the wide registers we have now, it
> is also quite fast (15 or 16 digits in a 64-bit register).
>
> However, if you're doing a lot of fixed-point BCD math, it is probably
> much faster to do all the work in binary and only convert on
> input/output. :-)

I agree, but a lot of data processing is a relatively modest amount of math
on a modest number of fields on a lot of records. For example, adding
the interest amount to each account requires only a multiply and an add.
It may still be faster to convert, though.

BTW, a matter of terminology. I thought BCD was a six bit code used on some
machines prior to S/360. It had only upper case letters, digits and a few
punctuation marks. It was what IBM "extended" into EBCD (IC for interchange
code). That is different from packed decimal (at least S/360 style) which
uses four bits for each digit with an optional sign "overpunch" in the last
digit. Is my recollection wrong here?


--
- Stephen Fuld

David Gay

Dec 5, 2000, 3:00:00 AM

"Stephen Fuld" <s.f...@worldnet.att.net> writes:
> "Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
> news:3A2D54E3...@hda.hydro.com...
> > Stephen Fuld wrote:
> > > I agree with both of these reasons. However, the Power PC does not
> appear
> > > to have any hardware support for packed decimal. Does the AS/400
> version of
> > > the chip add some support or is the whole thing emulated in software?
> >
> > SW emulation of BCD is easy, and with the wide registers we have now, it
> > is also quite fast (15 or 16 digits in a 64-bit register).
> >
> > However, if you're doing a lot of fixed-point BCD math, it is probably
> > much faster to do all the work in binary and only convert on
> > input/output. :-)
>
> I agree, but a lot of data processing is a relatively modest amount of math
> on a modest number of fields on a lot of records. For example, add the
> interest amount to each account requires only a multiply and an add. It
> still may be faster to convert though.
>
> BTW, a matter of terminology. I thought BCD was a six bit code used on some
> machines prior to S/360. It had only upper case letters, digits and a few
> punctuation marks. It was what IBM "extended" into EBCD (IC for interchange
> code). That is different from packed decimal (at least S/360 style) which
> uses four bits for each digit with an optional sign "overpunch" in the last
> digit. Is my recollection wrong here?

I don't know about the distant past, but for the past fifteen years at
least, BCD has been "binary coded decimal" (4 bits to a decimal digit) in
the integrated circuit world...

--
David Gay - Yet Another Starving Grad Student
dg...@cs.berkeley.edu

John R. Mashey

Dec 5, 2000, 11:48:37 PM
In article <3A2D54E3...@hda.hydro.com>, Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Stephen Fuld wrote:
|> > I agree with both of these reasons. However, the Power PC does not appear
|> > to have any hardware support for packed decimal. Does the AS/400 version of
|> > the chip add some support or is the whole thing emulated in software?
|>
|> SW emulation of BCD is easy, and with the wide registers we have now, it
|> is also quite fast (15 or 16 digits in a 64-bit register).

Actually, even before that....
1) The HP PA folks included a few instructions to help packed decimal
arithmetic.

2) In 1985, MIPS had a potential customer who had a strong interest in COBOL,
and had extensive statistics on the usage patterns of various decimal arithmetic
operations, and looked at the MIPS instruction sequences to do them,
and concluded that performance was perfectly adequate, somewhat to their
surprise. As it happens, the {load, store}word{left,right} instructions,
somewhat accidentally, turned out to be fairly useful.

In general, RISC people have tended in recent years to either:
(a) Add a modest amount of hardware to help decimal arithmetic,
but usually not a full decimal instruction set.
(b) Or just use the instructions they've got, for COBOL & PL/I,
having decided they were "good enough".

[Note, I mention this because it is a common misconception that RISC
instruction sets were designed only with C, and maybe FORTRAN in mind.
This simply wasn't true: different design teams had different priorities,
but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
and ADA, and there are specific features still around that catered to these
various things, even if {C, C++}, {FORTRAN} were high priorities.]


--
-John Mashey EMAIL: ma...@sgi.com DDD: 650-933-3090 FAX: 650-851-4620
USPS: SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time; cell phone = 650-575-6347.
PERMANENT EMAIL ADDRESS: ma...@heymash.com

Dave McKenzie

Dec 6, 2000, 2:23:19 AM
The AS/400 chips added a few simple instructions to PowerPC to assist
packed decimal arithmetic, but most of the logic is done in software
(after all, it's RISC).

Are you sitting down? The following is machine code generated for a
*single* decimal add instruction. It's from a test MI pgm and adds
two 9-digit (5-byte) packed fields, putting the result in a third
9-digit packed field.

0A0 805FFFF5 LWZ 2,0XFFF5(31)
0A4 887FFFF9 LBZ 3,0XFFF9(31)
0A8 80FFFFF0 LWZ 7,0XFFF0(31)
0AC 7843460C RLDIMI 3,2,8,24
0B0 881FFFF4 LBZ 0,0XFFF4(31)
0B4 6866000F XORI 6,3,15
0B8 E8408148 LD 2,0X8148(0)
0BC 78E0460C RLDIMI 0,7,8,24
0C0 7CC61014 ADDC 6,6,2
0C4 6806000F XORI 6,0,15
0C8 7C200448 TXER 1,0,40
0CC E8E08140 LD 7,0X8140(0)
0D0 7C461014 ADDC 2,6,2
0D4 7C200448 TXER 1,0,40
0D8 7C0000BB DTCS. 0,0
0DC 79A27A18 SELII 2,13,15,36
0E0 7C6300BB DTCS. 3,3
0E4 79A67A18 SELII 6,13,15,36
0E8 7C223040 CMPL 0,1,2,6
0EC 41820018 BC 12,2,0X18
0F0 7CE30011 SUBFC. 7,3,0
0F4 40800018 BC 4,0,0X18
0F8 7CE01810 SUBFC 7,0,3
0FC 60C20000 ORI 2,6,0
100 4800000C B 0XC
104 7CE03A14 ADD 7,0,7
108 7CE71814 ADDC 7,7,3
10C 7C03007A DSIXES 3
110 7C633850 SUBF 3,3,7
114 7C671378 OR 7,3,2
118 78E00600 RLDICL 0,7,0,24
11C 7F003888 TD 24,0,7
120 7C0600BB DTCS. 6,0
124 79A07A18 SELII 0,13,15,36
128 78C70601 RLDICL. 7,6,0,24
12C 79E2031A SELIR 2,15,0,38
130 38070000 ADDI 0,7,0
134 7847072C RLDIMI 7,2,0,60
138 98FFFFFE STB 7,0XFFFE(31)
13C 7806C202 RLDICL 6,0,56,8
140 90DFFFFA STW 6,0XFFFA(31)
144 419C8043 BCLA 12,28,0X8040

These instructions were added to PowerPC for AS/400:

TXER Trap on XER reg
DTCS Decimal test and clear sign
SELII Select immed, immed
SELIR Select immed, reg
DSIXES Decimal (or decrement? or doubleword-of?) sixes

The actual add is done by the instruction at offset 108.
Instruction B8 loads x666666666666666A, which is added to the operands
to check for decimal data errors at C8 and D4.
Instruction CC loads x6666666666666666, which is added into the sum at
104 to force carries out of nibbles at 10 instead of 16.
Instruction 10C (DSIXES) generates a doubleword having 6 in each
nibble where there was no carry, and 0 in each nibble that generated a
carry. It's then subtracted from the sum at 110 to back out the
sixes.
The rest of the instructions deal with loading and storing the
operands, and handling the various combinations of signs.
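
For readers without the manual, the same add-six-to-every-nibble trick
can be sketched portably in C. This is a rough equivalent only: it
handles 15 unsigned BCD digits per 64-bit word and ignores the sign
handling and operand loading that most of the real sequence above is
busy with.

#include <stdio.h>
#include <stdint.h>

/* Add two unsigned BCD numbers of up to 15 digits each, packed one
   digit per nibble in the low 60 bits of a 64-bit word (top nibble 0).
   Same idea as the generated code above: bias every digit by 6 so that
   a digit sum of 10 or more carries out of its nibble, then subtract
   the 6 back out of the nibbles whose digit sum stayed below 10. */
uint64_t bcd_add15(uint64_t a, uint64_t b)
{
    const uint64_t sixes = 0x0666666666666666ULL;  /* 6 in digit nibbles 0..14      */
    const uint64_t cmask = 0x1111111111111110ULL;  /* carry-in bit of nibbles 1..15 */

    uint64_t t1 = a + sixes;                    /* pre-bias every digit by 6        */
    uint64_t t2 = t1 + b;                       /* plain binary add                 */
    uint64_t carries  = (t1 ^ b ^ t2) & cmask;  /* bit 4(i+1) set if digit i carried */
    uint64_t no_carry = carries ^ cmask;        /* bit 4(i+1) set if it did not      */
    uint64_t fixup = (no_carry >> 2) | (no_carry >> 3); /* a 6 in each such nibble   */

    return t2 - fixup;                          /* remove the bias where unused     */
}

int main(void)
{
    uint64_t a = 0x0123456789012345ULL;
    uint64_t b = 0x0987654321098765ULL;

    /* 123456789012345 + 987654321098765 = 1111111110111110; the leading
       1 is the carry that lands in the top nibble. */
    printf("%016llX\n", (unsigned long long)bcd_add15(a, b));
    return 0;
}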

--Dave


On 6 Dec 2000 04:48:37 GMT, ma...@mash.engr.sgi.com (John R. Mashey)
wrote:

Jonathan Thornburg

Dec 6, 2000, 3:00:00 AM
> I agree with both of these reasons. However, the Power PC does not appear
> to have any hardware support for packed decimal. Does the AS/400 version of
> the chip add some support or is the whole thing emulated in software?

| SW emulation of BCD is easy, and with the wide registers we have now, it
| is also quite fast (15 or 16 digits in a 64-bit register).

John R. Mashey <ma...@mash.engr.sgi.com>:


>Actually, even before that....
>1) The HP PA folks included a few instructions to help packed decimal
>arithmetic.
>
>2) In 1985, MIPS had a potential customer who had a strong interest in COBOL,
>and had extensive statistics on the usage patterns of various decimal arithmetic
>operations, and looked at the MIPS instruction sequences to do them,
>and concluded that performance was perfectly adequate, somewhat to their
>surprise. As it happens, the {load, store}word{left,right} instructions,
>somewhat accidentally, turned out to be fairly useful.

A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
come to mind with "decimal adjust accumulator", a 1-instruction fixup
for BCD add (and also subtract?). I've forgotten if the x86 series
kept this. The VAX had a full set of +-*/ arithmetic instructions
for arbitrary-length BCD strings, there was even support for two
different conventions for how the sign of a negative decimal number
was encoded.

How did/do the IBM S/3[679]0 handle decimal arithmetic?

--
-- Jonathan Thornburg <jth...@thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik
[[about astronomical objects of large size]]
"it is very difficult to see how such objects could show significant
variations on astronomically relevant timescales such as the duration
of a graduate student stipend or the tenure trial period of a faculty
member." -- Spangler and Cook, Astronomical Journal 85, 659 (1980)

Del Cecchi

Dec 6, 2000, 3:00:00 AM
In article <3A2D54E3...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Stephen Fuld wrote:
|> > I agree with both of these reasons. However, the Power PC does not appear
|> > to have any hardware support for packed decimal. Does the AS/400 version of
|> > the chip add some support or is the whole thing emulated in software?
|>
|> SW emulation of BCD is easy, and with the wide registers we have now, it
|> is also quite fast (15 or 16 digits in a 64-bit register).
|>
|> However, if you're doing a lot of fixed-point BCD math, it is probably
|> much faster to do all the work in binary and only convert on
|> input/output. :-)
|>
|> Terje
|> --
|> - <Terje.M...@hda.hydro.com>
|> Using self-discipline, see http://www.eiffel.com/discipline
|> "almost all programming can be viewed as an exercise in caching"


I found this at
http://www.iseries.ibm.com/beyondtech/arch_nstar_perf.htm#Extensions

The PowerPC AS architecture is a superset of the 64-bit version of the PowerPC
architecture. Specific enhancements are included for AS/400 unique internal
architecture and for business processing. AS/400 "tag" bits are included to
ensure the integrity of MI pointers. Each 16-byte MI address pointer has at least
one tag bit in main storage that identifies the location as holding a valid pointer.
Load and store instructions are also included to support MI pointers. There is a
unique translation scheme suitable for AS/400 single-level storage (see Clark
and Corrigan (1989)). The business computing enhancements include
instructions for the following:

- Decimal assist operations for decimal data, which is common in RPG
and COBOL applications
- Double-word (64-bit) move-assist instructions for faster movement of
data
- Multiple double word load and store for faster call/return and task
switching
- Vectored call/return for more direct calls to system functions

A total of 22 instructions were added as extensions to the PowerPC architecture.
--

Del Cecchi
cecchi@rchland

Terje Mathisen

Dec 6, 2000, 3:00:00 AM
Jonathan Thornburg wrote:
> A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
> come to mind with "decimal adjust accumulator", a 1-instruction fixup
> for BCD add (and also subtract?). I've forgotten if the x86 series
> kept this.

It did:

It has Ascii Adjust for Addition (AAA) and similar helper opcodes to be
used either before or after a binary operation on BCD data.

However, using these opcodes is now _much_ slower than working with a
bunch of digits in parallel in a 64-bit MMX register. :-)

Jonathan Thornburg

Dec 6, 2000, 3:00:00 AM
I wrote:
| A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
| come to mind with "decimal adjust accumulator", a 1-instruction fixup
| for BCD add (and also subtract?). I've forgotten if the x86 series
| kept this.

In article <3A2E6113...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> replied:


>It did:
>
>It has Ascii Adjust for Addition (AAA) and similar helper opcodes to be
>used either before or after a binary operation on BCD data.
>
>However, using these opcodes is now _much_ slower than to work with a
>bunch of digits in parallel in a 64-bit MMX register. :-)

You mean AAA doesn't work on multiple digits in parallel? Funny, I
thought I remembered the Z-80 instruction did both hex digits in the
(8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
per add/sub, not one per BCD digit. So I'd think that the speedup from
doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

--
-- Jonathan Thornburg <jth...@thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik

"IRAS galaxies are all chocolate chip flavored rather than vanilla
flavored as heretofore supposed. This no doubt accounts for their
diversity and appeal." -- Vader and Simon, Astronomical Journal 94, 865

John R. Mashey

Dec 6, 2000, 3:00:00 AM
In article <90l9rt$od7$1...@mach.thp.univie.ac.at>, jth...@mach.thp.univie.ac.at (Jonathan Thornburg) writes:

|> A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
|> come to mind with "decimal adjust accumulator", a 1-instruction fixup
|> for BCD add (and also subtract?). I've forgotten if the x86 series

|> kept this. The VAX had a full set of +-*/ arithmetic instructions
|> for arbitrary-length BCD strings, there was even support for two
|> different conventions for how the sign of a negative decimal number
|> was encoded.
|>
|> How did/do the IBM S/3[679]0 handle decimal arithmetic?

Almost all models have hardware (i.e., microcode) support for a fairly
complete set of decimal arithmetic (SS instructions - memory-to-memory),
plus truly-amazing instructions like Edit-and-mark.
Two notable exceptions were the 360/44 and 360/91, which omitted these
instructions as they were targeted to technical apps.


Let me try again:
in the 1950s-1970s, if a computer was being designed to target
commercial applications, COBOL was important, and the machine would most likely
have substantial hardware and microcode dedicated to decimal arithmetic,
including memory-to-memory operations, conversions, edits, scans, etc.

Instruction sets designed later, in general, stopped doing this,
although some designs did incorporate modest amounts of hardware: decimal
arithmetic would be done with the base instruction set, with a few extra
instructions added to help the code sequences.

Put another way, all-out support of decimal arithmetic in hardware
used to be at least a marketing requirement for certain classes of systems,
but this requirement has diminished "recently", i.e., in last 20 years.
It made a lot more sense in ISAs that expected microcode anyway.

Stephen Fuld

Dec 6, 2000, 3:00:00 AM

"John R. Mashey" <ma...@mash.engr.sgi.com> wrote in message
news:90kgf5$nln$1...@murrow.corp.sgi.com...

> In article <3A2D54E3...@hda.hydro.com>, Terje Mathisen
<terje.m...@hda.hydro.com> writes:
> |> Stephen Fuld wrote:
> |> > I agree with both of these reasons. However, the Power PC does not
appear
> |> > to have any hardware support for packed decimal. Does the AS/400
version of
> |> > the chip add some support or is the whole thing emulated in software?
> |>
> |> SW emulation of BCD is easy, and with the wide registers we have now,
it
> |> is also quite fast (15 or 16 digits in a 64-bit register).
>
> Actually, even before that....
> 1) The HP PA folks included a few instructions to help packed decimal
> arithmetic.
>
> 2) In 1985, MIPS had a potential customer who had a strong interest in
COBOL,
> and had extensive statistics on the usage patterns of various decimal
arithmetic
> operations, and looked at the MIPS instruction sequences to do them,
> and concluded that performance was perfectly adequate, somewhat to their
> surprise. As it happens, the {load, store}word{left,right} instructions,
> somewhat accidentally, turned out to be fairly useful.

Interesting. Was their strategy to convert the decimal operations to binary
and do binary arithmetic or to work on them directly in decimal form? Did
they use packed decimal or add ASCII numbers directly?


>
> In general, RISC people have tended in recent years to either:
> (a) Add a modest amount of hardware to help decimal arithmetic,
> but usually not a full decimal instruction set.
> (b) Or just use the instructions they've got, for COBOL & PL/I,
> having decided they were "good enough".
>
> [Note, I mention this because it is a common misconception that RISC
> instruction sets were designed only with C, and maybe FORTRAN in mind.
> This simply wasn't true: different design teams had different priorities,
> but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
> and ADA, and there are specific features still around that catered to
these
> various things, even if {C, C++}, {FORTAN} were high priorities.]


That is a good point to reiterate. Thanks.

--
- Stephen Fuld

Stephen Fuld

Dec 6, 2000, 3:00:00 AM
"Dave McKenzie" <dav...@galois.com> wrote in message
news:fopr2tg596q0s28ib...@4ax.com...


Wow! (Catching my breath) Thanks for the information. I don't have the
relevant S/390 manuals handy. I wonder how many cycles the equivalent add
packed instruction takes? It boggles the mind!

--
- Stephen Fuld

Kevin Strietzel

Dec 6, 2000, 3:00:00 AM
Jonathan Thornburg wrote:
...

> In article <3A2E6113...@hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> replied:

> >It has Ascii Adjust for Addition (AAA) and similar helper opcodes to be


> >used either before or after a binary operation on BCD data.

...


> You mean AAA doesn't work on multiple digits in parallel? Funny, I
> thought I remembered the Z-80 instruction did both hex digits in the
> (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
> per add/sub, not one per BCD digit. So I'd think that the speedup from
> doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

[I hope I got the attributions right!]

The 8080/8085/Z80 instruction DAA was for BCD, packed decimal
with two digits/byte. According to
http://www.ntic.qc.ca/~wolf/punder/asm/0000004e.htm, the
8086/8088/etc instructions are:

AAA - ASCII Adjust for Addition
AAD - ASCII Adjust for Division
AAM - ASCII Adjust for Multiplication
AAS - ASCII Adjust for Subtraction

DAA - Decimal Adjust for Addition
DAS - Decimal Adjust for Subtraction

The AAx instructions do unpacked decimal (not ASCII). The DAx
instructions do packed decimal (BCD).

--Kevin Strietzel
Not speaking for Stratus.

Terje Mathisen

Dec 6, 2000, 3:00:00 AM
Jonathan Thornburg wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> replied:
> >It did:

> >
> >It has Ascii Adjust for Addition (AAA) and similar helper opcodes to be
> >used either before or after a binary operation on BCD data.
> >
> >However, using these opcodes is now _much_ slower than to work with a
> >bunch of digits in parallel in a 64-bit MMX register. :-)
>
> You mean AAA doesn't work on multiple digits in parallel? Funny, I
> thought I remembered the Z-80 instruction did both hex digits in the
> (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
> per add/sub, not one per BCD digit. So I'd think that the speedup from
> doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

Nope, 64 vs 8 bits. I.e. 8x speedup to swallow the extra instructions
needed.

All the AA* opcodes work on just one or two BCD digits.

John F Carr

Dec 6, 2000, 3:00:00 AM
In article <90kgf5$nln$1...@murrow.corp.sgi.com>,

John R. Mashey <ma...@mash.engr.sgi.com> wrote:

>[Note, I mention this because it is a common misconception that RISC
>instruction sets were designed only with C, and maybe FORTRAN in mind.
>This simply wasn't true: different design teams had different priorities,
>but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
>and ADA, and there are specific features still around that catered to these
>various things, even if {C, C++}, {FORTAN} were high priorities.]

What's an example of a feature in a mainstream processor to help
support LISP or Smalltalk?

I recently read _Garbage Collection_ by Jones and Lins. It discusses
some hardware techniques that might help garbage collected languages,
but the only part that seemed relevant to real, modern systems was
about cache organization and write buffering. Cache is so important to
everything that I expect the core market requirements to prevent any
tradeoffs to improve fringe languages.

(Some older IBM systems have ~32 byte granularity "lock bits" for
fine-grained access control; these might be usable to implement some of
the techniques discussed in the book but I've never heard of any such
actual use. I don't know if this feature is still present in Power PC.)

--
John Carr (j...@mit.edu)

John R. Mashey

Dec 6, 2000, 10:57:31 PM
In article <O5xX5.5113$T43.4...@bgtnsc04-news.ops.worldnet.att.net>, "Stephen Fuld" <s.f...@worldnet.att.net> writes:

|>
|> Interesting. Was their strategy to convert the decimal operations to binary
|> and do binary arithmetic or to work on them directly in decimal form? Did
|> they use packed decimal or add ASCII numbers directly?

I don't recall for sure; I think it was packed decimal, but the few
sequences I ever saw were reminiscent of that IBM example posted here,
and there may have even been some mixed strategy, as there was a large
set of special cases.

While all of this sounds awful, it is worth noting that in the early 1980s,
even a simple FPU was a hefty chunk of silicon, and many people provided
FP emulation libraries, which were also nontrivial to get right,
and a lot of code.

Note, of course, that if the decimal operations just need to work, but aren't actually used very often, it is an easier implementation to have the
compiler just generate calls to intrinsic functions, passing the arguments
and lengths, and letting the intrinsic figure things out at run-time.

John R. Mashey

Dec 6, 2000, 11:38:17 PM
In article <3a2ecd1a$0$29...@senator-bedfellow.mit.edu>, j...@mit.edu (John F Carr) writes:
|>
|> In article <90kgf5$nln$1...@murrow.corp.sgi.com>,
|> John R. Mashey <ma...@mash.engr.sgi.com> wrote:
|
|> >but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
|> >and ADA, and there are specific features still around that catered to these
|> >various things, even if {C, C++}, {FORTAN} were high priorities.]
|>
|> What's an example of a feature in a mainstream processor to help
|> support LISP or Smalltalk?

|> John Carr (j...@mit.edu)

1) In the early 1980s, Dave Patterson & co were looking especially at Smalltalk architectures, including the work on SOAR, i.e., start with a vanilla RISC and add tweaks to help Smalltalk.

2) SPARC's Tagged Add and Tagged Subtract (TADD, TSUB) were included for
the help of these languages, and possibly some of the trap-comparisons.

3) In MIPS-II, the trap-comparisons were partially justified for these,
although more for ADA.

In practice, I'm not sure how much the special features got used;
maybe somebody from SPARC-land can comment. At one point, I was told by
a Smalltalk implementor that they cared more about portability, and tended to eschew features only found on one or two CPU families, i.e., this was after the
on-the-fly code generation got pretty good. From some LISP implementors,
I heard the following viewpoint:
(a) Either give me a LISP Machine
OR (b) Give me something generic and fast
BUT don't expect a lot from just a few extra features to help.

Sometime in the late 1980s, Alan Kay came by MIPS, and we had a good
discussion about the sorts of features that fit into a normal RISC design,
fit the software directions, and give useful speedups. We couldn't
come up with anything really compelling, so we didn't implement anything.

Of the various ideas we threw around, the most interesting was the idea
that, rather than building in specific tag-check hardware, maybe there was some
way to have a programmable mask+compare operation applied to memory references,
i.e., so you could do pointer-checking with parallel hardware, rather than
in-line instruction sequences. Also, one would want some really lean user-level
to user-level trapping sequences.

Regarding the other RISCs, I'm not sure if there is anything as explicit as
SPARC TADD, but perhaps some features were partially justified by arguments for
LISP and SMALLTALK: here's a typical discussion:

(a) We have instruction set 1, and lots of statistics.
(b) We propose that the following list of instructions be added,
in descending order of expected performance improvement,
based on analysis of our standard benchmark suites.
4% A
3% B
1% C
.8% D
.2% E

(c) Most people would add A & B, not add E, and argue about
C & D. An argument that might come up would be "D would help
LISP, SMALLTALK, not in our standard benchmark group", and that
might cause D to be included.

I couldn't find it quickly, but Earl Killian put together exactly this
sort of list for doing MIPS-I => MIPS-II, and I know other groups
work through the same sort of process.

Duane Rettig

Dec 7, 2000, 3:00:00 AM
ma...@mash.engr.sgi.com (John R. Mashey) writes:

> In article <3a2ecd1a$0$29...@senator-bedfellow.mit.edu>, j...@mit.edu (John F Carr) writes:
> |>
> |> What's an example of a feature in a mainstream processor to help
> |> support LISP or Smalltalk?
>
> |> John Carr (j...@mit.edu)
>
> 1) In the early 1980s, Dave Paterson & co were looking at Smalltalk
> architectures, especially, including the work on SOAR, i.e., start
> with a vanilla RISC and add tweaks to help Smalltalk.
>
> 2) SPARC's Tagged Add and Tagged Subtract (TADD, TSUB) were included for
> the help of these languages, and possibly some of the trap-comparisons.

Our own Common Lisp implementation uses these instructions on Sparcs.
I will include an example and an explanation below the MIPS example, below.

> 3) In MIPS-II, the trap-comparisons were partially justified for these,
> although more for ADA.

We use conditional traps for as many things as possible that are not
very likely to occur, like wrong-number-of-arguments situations,
timeout synchronizations, etc.

> In practice, I'm not sure how much the special features got used;
> maybe somebody from SPARC-land can comment. At one point, I was told by
> a Smalltalk implementor that they cared more about portability, and tended
> to ewschew features only found on one or two CPU families, i.e., this was
> after the on-the-fly code generation got pretty good.

This is not true for our lisp implementation; we care very much to use
every feature that is available from the hardware. See the examples
below. If we had more features that were useful, we would use them.

> From some LISP implementors,
> I heard the following viewpoint:
> (a) Either give me a LISP Machine
> OR (b) Give me something generic and fast
> BUT don't expect a lot from just a few extra features to help.

Perhaps this is true. But one feature I would _love_ to see is the
combination hardware/software for a truly fast user-level trap handler
(on the order of tens of cycles rather than hundreds). Such
capabilities would make it possible for us to implement real-time and
truly incremental garbage collectors using read-barriers to instigate
pointer forwarding.

> Sometime in the late 1980s, Alan Kay came by MIPS, and we had a good
> discussion about the sorts of features that fit into a normal RISC design,
> fit the software directions, and give useful speedups. We couldn't
> come up with anything really compelling, so we didn't implement anything.
>
> Of the various ideas we threw around, the most interesting was the idea
> that, rather than building in specific tag-check hardware, maybe there was some
> way to have a programmable mask+compare operation applied to memory references,
> i.e., so you could do pointer-checking with parallel hardware, rather than
> in-line instruction sequences. Also, one would want some really lean user-level
> to user-level trapping sequences.

Yes, a fast trap-handling capability would be very nice. But we already
_do_ use a trick that effects tag-checking on any architectures that
implement alignment traps (note that this does not include Power, which
insists on fixing up misaligned data rather than allowing user traps to
do so, and it also excludes IA-32, which does have an alignment-trap
enable bit on 486 and newer, but no operating system I know of allows
the setting of this bit).

Example 1:

To set up this example of tag-checking by alignment traps, I will stay with
32-bit lisps, and will use the term "LispVal" to mean a 32-bit word whose
least significant 3 bits are Tag bits, and the other 29 are either data
or pointers to 8-byte Allocation Units (AUs) whose addresses end in binary 000.
The smallest object in a lisp heap is 1 AU, usually a "cons cell". The
cons cell has a Tag of 001, and "nil" which looks like a cons cell, has
Tag 101 (5). Note that these are 4 bytes apart. To access the first element
of a cons cell (i.e. the "car"), the load instruction must be offset by -1
from the register holding the cons or nil, and to access the second element
(the "cdr"), the offset is +3. This takes the Tag into consideration, and
accesses the word on a properly aligned word boundary. However, it has the
added benefit that if the object in the register is not a cons cell or nil,
then an alignment trap occurs, and interrupt handlers can interpret the
result.

Note in the example below:

1,2. A function is defined to take the "car" of the argument and is
compiled at high speed on an SGI MIPS architecture.
3. The disassembler output shows four instructions: the one at 0
does the "car" access, the one at offset 4 reloads the caller's
environment, the one at 8 returns, and the one at 12 sets the lisp's
return-value-count register.
4. The function is called with a non-cons (i.e. the value 10). This
causes an alignment trap, but the signal handler is able to determine
the exact error from the context. A break level is entered.
5. The :zoom command at high verbosity shows that foo was
"suspended" (in reality, a trap occurred) at instruction offset 0, which
was the lw instruction.

cl-user(1): (defun foo (x)
(declare (optimize speed (safety 0) (debug 0)))
(car x))
foo
cl-user(2): (compile 'foo)
foo
nil
nil
cl-user(3): (disassemble 'foo)
;; disassembly of #<Function foo>
;; formals: x

;; code start: #x30fe2724:
0: 8c84ffff lw r4,-1(r4)
4: 8fb40008 lw r20,8(r29)
8: 03e00008 jr r31
12: 34030001 [ori] move r3,1
cl-user(4): (foo 10)
Error: Attempt to take the car of 10 which is not listp.
[condition type: simple-error]

Restart actions (select using :continue):
0: Return to Top Level (an "abort" restart)
1: Abort #<process Initial Lisp Listener>
[1] cl-user(5): :zo :all t :verbose t :count 2
Evaluation stack:

... 3 more newer frames ...

call to sys::..context-saving-runtime-operation with 0 arguments.
function suspended at address #x5ffa769c (handle_saving_context+716)

----------------------------
->call to foo
required arg: x = 10
function suspended at relative address 0

----------------------------
(ghost call to excl::%eval)

----------------------------
call to eval
required arg: exp = (foo 10)
function suspended at relative address 156

----------------------------

... more older frames ...
[1] cl-user(6):

Example 2:

We use the sparc taddcc instruction to do additions of numbers in
the range between -(2^29) and 2^29-1, and to trap automatically to
overflow into "bignums" (arbitrarily large number objects) or to
generate an error for addition of non-numbers.

To set this example up, we note that two of the three-bit tags are
used for small integers (called "fixnums") and these tags are 0
(for even fixnums) and 4 (for odd fixnums). The other 29 bits
of the LispVal are the rest of the actual value for the fixnum.
Thus, in effect, a fixnum is a sign bit followed by 29 significant
bits, followed by two bits of 0.

Note here that the function has more instructions than in the
first example, so I'll only comment on the significant instructions.

1, 2. The function is of course adding two arguments x and y, compiled
at high speed. Note that there are no declarations on x or y; if they
had been declared to be fixnums, the compiler might have been able
to generate better code.
3. The disassembler output shows instructions at:
4: The actual taddcc instruction; if an overflow occurs, or if the
least two bits of either operand are nonzero, the overflow bit is
set, meaning that the operation was not a fixnum+fixnum->fixnum
operation.
8: The overflow status is tested.
16-32: The operation is successful, so the result of the add is
moved to the result register and the function returns.
36-56: An internal function called +_2op is called to handle the
overflow case (to generate a bignum) or the case where one or
both operands are not numbers at all (an error).

cl-user(1): (defun foo (x y)
(declare (optimize speed (safety 0)))
(+ x y))
foo
cl-user(2): (compile 'foo)
foo
nil
nil
cl-user(3): (disassemble 'foo)
;; disassembly of #<Function foo>
;; formals: x y

;; code start: #x4a8678c:
0: 9de3bf98 save %o6, #x-68, %o6
4: 99060019 taddcc %i0, %i1, %o4
8: 2e800007 bvs,a 36
12: 90100018 mov %i0, %o0
16: 9010000c mov %o4, %o0
20: 86102001 mov #x1, %g3
lb1:
24: 81c7e008 jmp %i7 + 8
28: 91ea0000 restore %o0, %g0, %o0
32: 90100018 mov %i0, %o0
lb2:
36: c4013f87 ld [%g4 + -121], %g2 ; excl::+_2op
40: 92100019 mov %i1, %o1
44: 9fc1200b jmpl %g4 + 11, %o7
48: 86182002 xor %g0, #x2, %g3
52: 10bffff9 ba 24
56: 01000000 nop
cl-user(4):


--
Duane Rettig Franz Inc. http://www.franz.com/ (www)
1995 University Ave Suite 275 Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)

Rick Price

Dec 7, 2000, 3:00:00 AM

Not Me <not...@notme.com> wrote in message
news:7c6o2tsabi7d4b3tn...@4ax.com...

> On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:
>
> >Hello,
> >
> >I recently have to write a program to parse some packed decimal output
> >from an AS/400 machine. Not having an AS/400 background, I am curious
> >why this format is used at all? It's certainly not compact - it takes
> >more bytes to store the same number than in binary format (2's
> >complement). Or is it for readability? But then aren't these packed
> >decimal numbers meant for machine rather than human consumption so why
> >bother making them readable?
> >
> >Perplexed and regards,
> >
> >mark.

> >
> >
> >Sent via Deja.com http://www.deja.com/
> >Before you buy.
> Because for the business applications that dominate the AS/400 world,
> packed decimal completely eliminates round off errors that you get
> with floating point numbers.

Round-off errors are a real problem in commercial applications, so the
only choice is ordinary binary or decimal.

In the past, non-floating-point binary (with an implied decimal place) could
not hold large enough values for some currencies or businesses. It's even
iffy with 64-bit binary in the international business world. 64-bit binary
can hold 19 digits, which sounds like a lot, but COBOL has been able to hold
18 decimal digits since forever. However, COBOL can now hold 31 digits. This
increase must have been driven by requirements in the real business world.
Boy, if you've ever wanted to convert pounds or dollars to/from Turkish lira
you'd know why this increase was a good idea.

Similarly SQL had to have a DECIMAL (NUMERIC) class added to be able to
handle large decimal amounts without using floating point.

So now the question is why PACKED DECIMAL instead of holding a character
string. Basically, packed decimal takes up half the length of character
decimal, and certainly on some hardware it cuts out a step in doing
arithmetic.
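
As a rough illustration (my own sketch in C, not anything lifted from the
AS/400), packing an ASCII digit string two digits to a byte with the sign
in the last nibble shows where the factor of two comes from:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: pack an ASCII digit string into packed decimal,
 * two digits per byte, sign in the low nibble of the last byte
 * (0xC for +, 0xD for -).  An n-digit value needs n/2 + 1 bytes instead
 * of n bytes of character (zoned) decimal. */
static size_t pack_decimal(const char *digits, int negative, uint8_t *out)
{
    size_t n = strlen(digits);
    size_t bytes = n / 2 + 1;
    memset(out, 0, bytes);

    size_t nibble = 1;                      /* nibble 0 is the sign      */
    for (size_t i = n; i-- > 0; nibble++) { /* walk digits right to left */
        uint8_t d = (uint8_t)(digits[i] - '0');
        size_t byte = bytes - 1 - nibble / 2;
        out[byte] |= (nibble % 2) ? (uint8_t)(d << 4) : d;
    }
    out[bytes - 1] |= negative ? 0x0D : 0x0C;
    return bytes;
}

int main(void)
{
    uint8_t buf[8];
    size_t len = pack_decimal("12345", 0, buf);
    for (size_t i = 0; i < len; i++)
        printf("%02X ", (unsigned)buf[i]);   /* prints: 12 34 5C */
    printf("\n");
    return 0;
}

Five characters of digits become three bytes, and the digits are still
directly readable in a hex dump, which is the other point made earlier in
the thread.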

I think a standard decimal class has been added to Java (or at least
formally proposed). Again this shows a requirement in the real world. What amazes
me is that a decimal data type has never been added to C or C++.

You may find this FAQ interesting
http://www2.hursley.ibm.com/decimal/decifaq.html

Rick

Anton Ertl

unread,
Dec 7, 2000, 3:00:00 AM12/7/00
to
In article <4zoi8r...@beta.franz.com>,

Duane Rettig <du...@franz.com> writes:
>Yes, a fast trap-handling capability would be very nice. But we already
>_do_ use a trick that effects tag-checking on any architectures that
>implement alignment traps (note that this does not include Power, which
>insists on fixing up misaligned data rather than allowing user traps to
>do so, and it also excludes IA-32, which does have an alignment-trap
>enable bit on 486 and newer, but no operating system I know of allows
>the setting of this bit).

When I tried it on Linux (in 1994, probably with Linux-1.0), it
worked; the code I used was

__asm__("pushfl; popl %eax; orl $0x40000, %eax; pushl %eax; popfl;");

However, the problem with this was that the C library caused alignment
faults, for two reasons:

1) For routines like memmove the programmer decided that misaligned
accesses were faster than other ways of implementing them (e.g.,
bytewise access). There were only a few such routines, and they would
have been easy to replace.

2) Many floating-point routines did misaligned accesses to floats,
because in their infinite wisdom Intel had required 4-byte alignment
for doubles in their ABI (to which the library conformed), whereas the
486 reported a misaligned access unless the double was 8-byte aligned.
That's why I did not use that feature.
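
For reference, the same trick written as a self-contained routine for
32-bit x86 with GCC-style inline asm (a minimal sketch of what is described
above, not code from any particular runtime or libc):

/* Set the Alignment Check (AC) flag, bit 18 (0x40000) of EFLAGS, so that
 * misaligned user-mode accesses fault.  32-bit x86 only. */
static void enable_alignment_check(void)
{
    __asm__ __volatile__(
        "pushfl\n\t"
        "popl  %%eax\n\t"
        "orl   $0x40000, %%eax\n\t"
        "pushl %%eax\n\t"
        "popfl"
        : /* no outputs */
        : /* no inputs  */
        : "eax", "cc", "memory");
}

Even with AC set, the faults are actually delivered only if the kernel has
also set the AM bit in CR0, which is the OS-dependent part discussed in the
follow-ups below.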

Followups to comp.arch.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Karl Hanson

unread,
Dec 7, 2000, 3:00:00 AM12/7/00
to
Rick Price wrote:
>
<snip>

>
> Similarly SQL had to have a DECIMAL (NUMERIC) class added to be able to
> handle large decimal amounts without using floating point.
>

Perhaps a nit (but somewhat on topic ;) ... for SQL on AS/400, DECIMAL
and NUMERIC may be synonymous in function, but for CREATE TABLE (column
definition), DECIMAL means packed decimal and NUMERIC means zoned
decimal.
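
A tiny sketch (the bytes are my own, just to make the distinction visible)
of how the value +12345 sits in the two column formats:

#include <stdint.h>
#include <stdio.h>

/* NUMERIC (zoned): one digit per byte, EBCDIC zone nibble F, sign carried
 * in the zone of the last byte (C = +, D = -).
 * DECIMAL (packed): two digits per byte, sign in the final nibble. */
int main(void)
{
    const uint8_t zoned[]  = { 0xF1, 0xF2, 0xF3, 0xF4, 0xC5 }; /* 5 bytes */
    const uint8_t packed[] = { 0x12, 0x34, 0x5C };              /* 3 bytes */

    printf("zoned uses %zu bytes, packed uses %zu bytes\n",
           sizeof zoned, sizeof packed);
    return 0;
}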

<snip>

--

Karl Hanson

Del Cecchi

unread,
Dec 7, 2000, 3:00:00 AM12/7/00
to
In article <zxPX5.53969$nh5.4...@newsread1.prod.itd.earthlink.net>,
hay...@alumni.uark.edu (Jim Haynes) writes:
|> As for why packed decimal, not addressing the question of why decimal, it
|> seems to me to be just an artifact of having chosen an 8-bit byte; or maybe
|> an argument in favor of choosing an 8-bit byte. So we could ask which
|> came first - the 8-bit byte or packed decimal?
|>
|> (I think the answer is 8-bit byte but I won't insist on it.)

I would disagree, but have no proof.

|>
|> Another example of solving the previous generation's problems is seen in the
|> System/360 disk usage, where you were allowed to choose any record size and
|> blocking you wished so as to maximize the number of bits you could get on a
|> track. Which means when you change to a newer model disk you have to
|> recalculate how you are going to pack things and then re-pack all your
|> data before you can get rid of the old disks. It took IBM a long time to
|> get around to "inventing" "fixed block architecture" which makes disk space
|> allocation so much easier and avoids repacking at the expense of wasting a
|> little space on the disk.
FBA was invented pretty early. It just took a while to get all the
count-key-data dependencies out of MVS.
--

Del Cecchi
cecchi@rchland

Allen J. Baum

unread,
Dec 7, 2000, 3:00:00 AM12/7/00
to
In article <90kgf5$nln$1...@murrow.corp.sgi.com>, ma...@mash.engr.sgi.com
(John R. Mashey) wrote:

>In article <3A2D54E3...@hda.hydro.com>, Terje Mathisen <terje.m...@hda.hydro.com> writes:
>|> Stephen Fuld wrote:
>|> > I agree with both of these reasons. However, the Power PC does not appear
>|> > to have any hardware support for packed decimal. Does the AS/400 version of
>|> > the chip add some support or is the whole thing emulated in software?
>|>
>|> SW emulation of BCD is easy, and with the wide registers we have now, it
>|> is also quite fast (15 or 16 digits in a 64-bit register).
>
>Actually, even before that....
>1) The HP PA folks included a few instructions to help packed decimal
>arithmetic.

I seem to recall hearing at a conference that they decided it wasn't worth it;
it was pretty easy to do it unpacked, and quicker as well.
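
For what it's worth, here is a sketch of the software approach alluded to
above: adding two packed-BCD numbers held in 64-bit integers with ordinary
instructions, using the well-known carry-detection trick (the one given in
Hacker's Delight).  It assumes both operands are valid packed BCD and that
the sum fits in 15 digits; it is an illustration, not code from any product.

#include <stdint.h>
#include <stdio.h>

static uint64_t bcd_add(uint64_t a, uint64_t b)
{
    uint64_t t1 = a + 0x0666666666666666ULL;  /* pre-bias each nibble by 6  */
    uint64_t t2 = t1 + b;                      /* plain binary add           */
    uint64_t t3 = t1 ^ b;
    uint64_t t4 = t2 ^ t3;                     /* carries into each bit      */
    uint64_t t5 = ~t4 & 0x1111111111111110ULL; /* nibbles with no carry-out  */
    uint64_t t6 = (t5 >> 2) | (t5 >> 3);       /* a 6 in each such nibble    */
    return t2 - t6;                            /* undo the bias where needed */
}

int main(void)
{
    /* 199999999999999 + 1 = 200000000000000, all in packed BCD */
    printf("%llx\n",
           (unsigned long long)bcd_add(0x0199999999999999ULL, 1ULL));
    return 0;
}

Roughly a dozen single-cycle operations handle 15 digits at a time, which is
the sense in which software BCD on a wide register is "quite fast".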

--
**********************************************
* Allen J. Baum tel. (650)853-6626 *
* Compaq Computer Corp. fax (650)853-6513 *
* 181 Lytton Ave. *
* Palo Alto, CA 95306 ab...@pa.dec.com *
**********************************************

Dave Harris

unread,
Dec 7, 2000, 10:42:00 AM12/7/00
to
ma...@mash.engr.sgi.com (John R. Mashey) wrote (abridged):

> I heard the following viewpoint:
> (a) Either give me a LISP Machine
> OR (b) Give me something generic and fast
> BUT don't expect a lot from just a few extra features to help.

That was roughly the view of the Self project, at:
http://self.sunlabs.com/

Some of the papers there look at the cost of polymorphic dispatching.
With their compiler, they found the costs were low enough that they'd
rather have faster generic operations than zero-cost specific support.

Dave Harris, Nottingham, UK | "Weave a circle round him thrice,
bran...@cix.co.uk | And close your eyes with holy dread,
| For he on honey dew hath fed
http://www.bhresearch.co.uk/ | And drunk the milk of Paradise."

Jim Haynes

unread,
Dec 7, 2000, 12:11:59 PM12/7/00
to
As for why packed decimal, not addressing the question of why decimal, it
seems to me to be just an artifact of having chosen an 8-bit byte; or maybe
an argument in favor of choosing an 8-bit byte. So we could ask which
came first - the 8-bit byte or packed decimal?

(I think the answer is 8-bit byte but I won't insist on it.)

It seems to me - big dose of personal opinion here - that IBM had a tendency
to solve the previous generation's problems. Hence packed decimal would save
storage space in small memories and disks; but by the time it was adopted
storage space was not so serious an issue as it had been in the previous
generation of machines. And I rather doubt that the saving in storage space
was worth the time it took to pack and unpack data.

Andi Kleen

unread,
Dec 7, 2000, 12:50:43 PM12/7/00
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>
> When I tried it on Linux (in 1994, probably with Linux-1.0), it
> worked; the code I used was
>
> __asm__("pushfl; popl %eax; orl $0x40000, %eax; pushl %eax; popfl;");

It depends on your BIOS whether the Alignment flag is set or not -- Linux never
rewrites CR0 completely, just clears and sets bits as needed.
ftp://ftp.firstfloor.org/pub/ak/smallsrc/alignment.c shows how to set it
always.


-Andi

Leif Svalgaard

unread,
Dec 8, 2000, 3:00:00 AM12/8/00
to

Jim Haynes <hay...@alumni.uark.edu> wrote in message
news:zxPX5.53969$nh5.4...@newsread1.prod.itd.earthlink.net...

> As for why packed decimal, not addressing the question of why decimal, it
> seems to me to be just an artifact of having chosen an 8-bit byte; or
maybe
> an argument in favor of choosing an 8-bit byte. So we could ask which
> came first - the 8-bit byte or packed decimal?
>
> (I think the answer is 8-bit byte but I won't insist on it.)

General Electric 435 of early 1960's vintage had (has actually, as they
are still around - called Bull 9000 now) 9-bit bytes and packed decimal
arithmetic. The sign handily fitted in the 9th bit. No half bytes
wasted as in the IBM-360 (and AS/400) architecture.

Stephen Fuld

unread,
Dec 8, 2000, 3:00:00 AM12/8/00
to
"Del Cecchi" <cec...@signa.rchland.ibm.com> wrote in message
news:90olni$1hg0$1...@news.rchland.ibm.com...

> In article <zxPX5.53969$nh5.4...@newsread1.prod.itd.earthlink.net>,
> hay...@alumni.uark.edu (Jim Haynes) writes:

snip

>
> |>
> |> Another example of solving the previous generation's problems is seen
in the
> |> System/360 disk usage, where you were allowed to choose any record size
and
> |> blocking you wished so as to maximize the number of bits you could get
on a
> |> track. Which means when you change to a newer model disk you have to
> |> recalculate how you are going to pack things and then re-pack all your
> |> data before you can get rid of the old disks. It took IBM a long time
to
> |> get around to "inventing" "fixed block architecture" which makes disk
space
> |> allocation so much easier and avoids repacking at the expense of
wasting a
> |> little space on the disk.

> FBA was invented pretty early. It just took a while to get all the
> count-key-data dependencies out of MVS.
> --

Unless it happened within the last year or so, there are still LOTS of CKD
dependencies in MVS. Besides the variable length record requirement,
partitioned data sets still use search key commands to locate members. The
extended PDSs supposedly don't require this, but are not yet in wide use.
But system-oriented things like VTOCs still require short records, etc.
Compatibility from 25 years ago!

--
- Stephen Fuld

>
> Del Cecchi
> cecchi@rchland

John R. Mashey

unread,
Dec 8, 2000, 3:00:00 AM12/8/00
to
In article <4zoi8r...@beta.franz.com>, Duane Rettig <du...@franz.com> writes:

<good stuff>

|> This is not true for our lisp implementation; we care very much to use
|> every feature that is available from the hardware. See the examples

Now that you've jogged my memory, I do recall Franz usually went
all-out.

|>
|> Perhaps this is true. But one feature I would _love_ to see is the
|> combination hardware/software for a truly fast user-level trap handler
|> (on the order of tens of cycles rather than hundreds).

Yes, as noted later in my post, this has been a feature desired by many.
It's definitely one of the things that I wish we'd had a bit more time to
design in at the beginning.


|> > Also, one would want some really lean user-level
|> > to user-level trapping sequences.

Note: the garbage-collection discussion reminds me of a closely-related
issue for some implementations of such languages. This is the issue of
doing hardware-efficient and software-efficient implementations of
instruction-cache-flushing for small on-the-fly code-generation.
People have done some support for this, but I haven't seen many
truly elegant solutions that were both lean and simple for
uniprocessor CPUs, and still worked cleanly for multiprocessors.
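
For concreteness, the user-level half of that operation typically looks
something like the following sketch (GCC/Clang C; the names and buffer
handling are mine, and the cost of the cache-sync step is exactly the part
that depends on the hardware support being discussed):

#include <string.h>

typedef void (*generated_fn)(void);

/* Copy freshly generated machine code into an executable buffer and make
 * it visible to instruction fetch before jumping to it.  Assumes exec_buf
 * was already mapped executable (e.g. with mmap and PROT_EXEC). */
static generated_fn install_code(void *exec_buf, const void *code, size_t len)
{
    memcpy(exec_buf, code, len);
    __builtin___clear_cache((char *)exec_buf, (char *)exec_buf + len);
    return (generated_fn)exec_buf;    /* cast is fine in practice, if not
                                         strictly portable C */
}

How cheap __builtin___clear_cache can be made -- a no-op on coherent
uniprocessors, something much more involved on multiprocessors -- is the
lean-versus-clean trade-off mentioned above.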

David Hoyt

unread,
Dec 8, 2000, 10:23:11 AM12/8/00
to
Dave Harris wrote:

> ma...@mash.engr.sgi.com (John R. Mashey) wrote (abridged):
> > I heard the following viewpoint:
> > (a) Either give me a LISP Machine
> > OR (b) Give me something generic and fast
> > BUT don't expect a lot from just a few extra features to help.
>
> That was roughly the view of the Self project, at:
> http://self.sunlabs.com/
>
> Some of the papers there look at the cost of polymorphic dispatching.
> With their compiler, they found the costs were low enough that they'd
> rather have faster generic operations than zero-cost specific support.

To quote David Ungar, "Modern CPU's are like drag racers, they go in a
straight line very fast, but they can't turn worth crap." While compilers
are getting pretty good at removing many of the turns, modern languages
(OO, logical, ML-like) and large systems like DBMSs still need to turn
often. The single thing that would help the most is to reduce latency.
Seeing new CPUs like the P4 with god-awful pipeline lengths is a bit
depressing. Perhaps the trace cache and large rename stuff will help, but
I'll hold back until someone proves it.

david

Jonathan Thornburg

unread,
Dec 8, 2000, 11:35:32 AM12/8/00
to
Terje Mathisen <terje.m...@hda.hydro.com> pointed out
[[about the x86 BCD-fixup opcodes]]

> However, using these opcodes is now _much_ slower than to work with a
> bunch of digits in parallel in a 64-bit MMX register. :-)

I asked:
% You mean AAA doesn't work on multiple digits in parallel? Funny, I
% thought I remembered the Z-80 instruction did both hex digits in the
% (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
% per add/sub, not one per BCD digit. So I'd think that the speedup from
% doing BCD in an MMX register
[[on an x86 with MMX]]
% would "just" be 64 vs 32 bits. No?

In article <3A2EB011...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> replied:


>Nope, 64 vs 8 bits. I.e. 8x speedup to swallow the extra instructions
>needed.
>
>All the AA* opcodes work on just one or two BCD digits.

Yes, that would certainly explain the factor of 8 in performance.

But is there a good reason _why_ the nominally 32-bit x86 models
didn't extend the semantics of the AA* opcodes to work on the 8 BCD
digits one would naturally store in a 32-bit register? I would think
the same schemes used to fudge backwards compatibility for old 8-bit
and 16-bit code could be used here...

--
-- Jonathan Thornburg <jth...@thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik

"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
- quote by Freire / poster by Oxfam

Jim Haynes

unread,
Dec 8, 2000, 1:08:59 PM12/8/00
to
In article <SUXX5.850$wo2....@typhoon.austin.rr.com>,

Leif Svalgaard <le...@leif.org> wrote:
>
>General Electric 435 of early 1960's vintage had (has actually, as they
>are still around - called Bull 9000 now) 9-bit bytes and packed decimal
>arithmetic. The sign handily fitted in the 9th bit. No half bytes
>wasted as in the IBM-360 (and AS/400) architecture.
>
Well I haven't kept up with the ex-GE machines since I worked on the 635
family over 30 years ago. That family of machines also had 9-bit bytes
as an artifact of the 36-bit word. And it was pretty awful, because IBM
had decreed an 8-bit byte for magnetic tape, so you couldn't write 9-bit
bytes to tape without either dropping one bit for character data or
writing whole 36 bit words to keep the extra bits.

Bruce Hoult

unread,
Dec 8, 2000, 4:22:26 PM12/8/00
to
In article <3A30FCDF...@cognent.com>, David Hoyt
<david...@cognent.com> wrote:

> > Some of the papers there look at the cost of polymorphic dispatching.
> > With their compiler, they found the costs were low enough that they'd
> > rather have faster generic operations than zero-cost specific support.
>
> To quote David Ungar, "Modern CPU's are like drag racers, they go in a
> straight line very fast, but they can't turn worth crap." While compilers
> are getting pretty good at removing many of the turns, modern languages
> (OO, logical, ML like) and large systems like DBMS's still need to turn
> often. The single thing that would help the most is to reduce latency.
> Seeing new CPU's like the P4 with god-awful pipeline lengths are a bit
> depressing. Purhaps the trace cache and large rename stuff will help, but
> I'll hold back until someone proves it.

Well there's always the PowerPC G3 and G4 with teeny-tiny little
pipelines. They corner pretty well.

The masses don't seem to like the low MHz numbers they have even though
the 500 MHz G3 in a current iMac performs about the same as a 700 - 800
MHz Pentium III -- which is more than *most* people have.

-- Bruce

Tim McCaffrey

unread,
Dec 8, 2000, 6:12:23 PM12/8/00
to
In article <90r2kk$uaa$1...@mach.thp.univie.ac.at>, Jonathan Thornburg says...

>
>But is there a good reason _why_ the nominally 32-bit x86 models
>didn't extend the semantics of the AA* opcodes to work on the 8 BCD
>digits one would naturally store in a 32-bit register? I would think
>the same schemes used to fudge backwards compatability for old 8-bit
>and 16-bit code could be used here...
>

I heard in the past that a "big" decimal adder is expensive both in gates and
in its impact on the execution pipeline.

I would like to note here that since the 8087, the x86 line has had two
instructions that helped tremendously in converting BCD to integer and back
again: FBLD (BCD load) and FBSTP (BCD store and pop). The execution times are
as follows:

CPU/FPU        FBLD     FBSTP
88/87          300      530
286/287        300      530
386/387        45-97    112-190
486DX          75       175
Pentium        48-58    148-154
Pentium II     76*      182*

* The PII numbers were done by a small program that repeatedly loaded/stored a
10-digit number. Therefore, I would consider the numbers approximations at
best; they are probably a bit high.

(Note: the original 8087 ran at 5 MHz, so an FBSTP took ~106 microseconds to
execute; assuming a Pentium III needs about the same cycle count, a
933 MHz PIII would take ~0.2 microseconds to do the same task.)
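
For anyone who has never met these instructions, the conversion FBSTP
performs is roughly the following (a plain-C sketch of my own, following
the 10-byte x87 packed-BCD layout as I understand it: 18 digits packed two
per byte in the low 9 bytes, sign bit in the top byte):

#include <stdint.h>

/* Convert a binary integer to 18 packed-BCD digits plus a sign byte.
 * Assumes the magnitude fits in 18 digits; purely illustrative. */
static void int_to_packed_bcd(int64_t value, uint8_t out[10])
{
    uint64_t mag = (value < 0) ? -(uint64_t)value : (uint64_t)value;
    for (int i = 0; i < 9; i++) {          /* 9 bytes = 18 BCD digits  */
        uint8_t lo = (uint8_t)(mag % 10);  mag /= 10;
        uint8_t hi = (uint8_t)(mag % 10);  mag /= 10;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    out[9] = (value < 0) ? 0x80 : 0x00;    /* sign in the top byte     */
}

The hardware presumably does all of the divides above in microcode, which
would explain why the cycle counts in the table are in the hundreds rather
than the tens.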

I wonder if any COBOL compilers ever used these instructions?

Tim McCaffrey

Paul DeMone

unread,
Dec 8, 2000, 8:39:09 PM12/8/00
to

David Hoyt wrote:

> Seeing new CPU's like the P4 with god-awful pipeline lengths are a bit
> depressing. Purhaps the trace cache and large rename stuff will help, but
> I'll hold back until someone proves it.

Have you seen how P4 does on gcc in their SPEC2k submission?

My understanding is that gcc is one of the hardest programs in
SPECint to speed up because of all the data and control dependencies.
High clock rate is a powerful tool to attack stubborn scalar code that
resists the brainiac approach.

--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.

Del Cecchi

unread,
Dec 8, 2000, 11:56:43 PM12/8/00
to

Ah yes, the transition from 7 track to 9 track tape was probably
involved in there somewhere. My guess would be that the old 7094/1401
vintage tapes wrote 6 bit+parity characters to the tape. The 360
vintage tapes wrote 8b+P on 9 track tape. Rochester machines started
with system/3 which vaguely resembled a 360 and had 8 bit bytes.

del cecchi

David Hoyt

unread,
Dec 9, 2000, 11:42:01 AM12/9/00
to
Paul DeMone wrote:

> Have you seen how P4 does on gcc in their SPEC2k submission?
>
> My understanding is that gcc is one of the hardest programs in
> SPECint to speed up because of all the data and control dependencies.
> High clock rate is a powerful tool to attack stubborn scalar code that
> resists the brainiac approach.

I don't know for sure, but I doubt the gcc spec test is as twisty as a Smalltalk app, or
even a Java simulation code. It might be as twisty as complex, optimized query/update
code in a DBMS; but it's small enough not to have the same kinds of i-cache pressure that a
DBMS puts on a processor. I wouldn't be astounded if it's a good match for the codes
I run, but I suspect it's not that close a match, either.

But for all that, let's look at the GCC results for the 1.0 GHz P3, the 1.5 GHz P4 and the
0.833 GHz alpha.

           GHz     base   result   base/GHz   result/GHz
p3         1.0      401      408        401          408
p4         1.5      588      592        392          395
264        0.833    617      687        699          778

Let's assume that gcc is representative of my codes. What do I learn? That it still isn't
as fast as the alpha; and not only that, even with the trace cache and double-pumping the
integer unit, it's no faster on a clock-for-clock basis than the p3! Now the p3 has had lots
of refinement time, and the p4 is just out of the chute, so it might not be a fair
comparison. But I would have expected that with that many more transistors, they would have been
able to speed up the performance per clock.

Now a few disclaimers. I'm doing most of my development (not production) on a P2/300, so
I bet I'd be happy with the performance of any of those processors. I didn't look at
the other processors, so they might be better than anything other than the alpha. Also, for
multi-media and single-precision (graphics) floating-point codes, which are very like
simple vector codes with few twisty paths, the p4 is likely a big win. So if intel's
target for the p4 is low-end gaming machines (that don't have an Emotion Engine in them),
they might have made the right choices.

I'm still doubtful as to whether they helped my codes out very much.

david

Paul DeMone

unread,
Dec 9, 2000, 4:02:54 PM12/9/00
to

David Hoyt wrote:
>
> Paul DeMone wrote:
>
> > Have you seen how P4 does on gcc in their SPEC2k submission?
> >
> > My understanding is that gcc is one of the hardest programs in
> > SPECint to speed up because of all the data and control dependencies.
> > High clock rate is a powerful tool to attack stubborn scalar code that
> > resists the brainiac approach.
>
> I don't know for sure, but I doubt the gcc spec test is as twisty as a Smalltalk app, or
> even a Java simulation code. It might be as twisty as a complex, optimized query/update
> code in a DBMS; but its small enough to not have the same kinds of i-cache pressure that a
> DBMS puts on a processor. While I wouldn't be astounded if its a good match for the codes
> I run, but I suspect its not that close of a match, either.
>
> But for all that, lets look at the GCC results for the 1.0 GHz P3, the 1.5 P4 and the
> 0.833GHz alpha.
>
> GHz base result base/GHz result/GHz
> p3 1.0 401 408 401 408
> p4 1.5 588 592 392 395
> 264 0.833 617 687 699 778
>
> Lets assume that gcc is representative as my codes. What do I learn? That it still isn't
> as fast as the alpha; and not only that, even with the trace cache and double-pumping the
> integer unit, its no faster on a clock for clock basis than the p3! Now the p3 has lots
> of refinement time, and the p4 is just out of the shoot, so it might not be a fair
> comparison. But I would have expected with that many more transistors, they would have be
> able to speed up the performance per clock.

Careful, you have fallen into the trap of trying to compare
architectural efficiency on the basis of clock normalized
performance. Here is an example of what I mean. Here is the
gcc results for two processors A and B:

Processor    GHz    gcc.base    gcc.base/GHz

A            1.0      377            377
B            0.8      336            420

Which processor has the most efficient microarchitecture? B right?
After all it achieves 11% higher clock normalized behaviour.

Wrong!

In my example A and B are identical other than clock rate. A and
B are coppermine PIII on Intel's VC820 board using IRC 5.0 compiled
SPEC2k binaries. Can the exact same computer be 11% more efficient
than itself? No, obviously this methodology is useless (aside from
enhancing the spirits of Mac enthusiasts ;-). The reason is the
computer from the MPU's bus interface outwards doesn't scale with
the processor clock rate so at higher frequencies the external
latencies grow larger in terms of clock cycles.

To compare the microarchitectural efficiency of P4 vs PIII you
either have to estimate how the PIII's performance scales up to
the P4's clock rate or slow down a P4 to P3 clock rates and rerun
SPEC. Please note that the two approaches aren't equivalent because
the two designs don't scale the same way. Running a P4 at 1.0 GHz
reduces its relative architectural efficiency vs PIII.

The P4 was designed to run at high clock rates so lets see its
strength with home field advantage. The PIII/VC820/IRC 5.0 combo
at 1.133 GHz gives 409 on gcc.base. So a 13.3% increase in clock
rate gives 8.4% more performance. Using the same scaling factor
applied linearly from 1.133 GHz up to 1.4 GHz, the PIII would be
estimated to get 14.9% higher gcc.base performance or 470. In
reality the PIII would fall a bit short of that due to diminishing
returns as the memory component of CPI increases as a percentage
of overall CPI but I don't want to go into a more accurate model
because linear scaling is sufficient to show my point.
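
Spelling that estimate out (nothing here beyond the arithmetic in the
paragraph above):

#include <stdio.h>

int main(void)
{
    double score_1000 = 377.0, score_1133 = 409.0;

    double clock_gain = (1.133 - 1.000) / 1.000;                /* +13.3%  */
    double perf_gain  = (score_1133 - score_1000) / score_1000; /* ~ +8.4% */
    double factor     = perf_gain / clock_gain;                 /* ~ 0.64  */

    double clock_gain_to_14 = (1.400 - 1.133) / 1.133;          /* +23.6%  */
    double est_1400 = score_1133 * (1.0 + factor * clock_gain_to_14);

    printf("estimated PIII gcc.base at 1.4 GHz: %.0f\n", est_1400); /* ~470 */
    return 0;
}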

So a hypothetical PIII running at 1.4 GHz would score under 470
on gcc.base while the P4 at 1.4 GHz yields 573 for more than a
22% IPC advantage. Not quite the same result as you get from
clock normalization, is it? A 22% IPC advantage on gcc isn't
bad in light of how the P4 has been attacked on the basis that
its long pipeline would suffer badly on code with a lot of
control dependencies.

Paul DeMone

unread,
Dec 9, 2000, 4:10:54 PM12/9/00
to

Paul DeMone wrote:

> So a hypothetical PIII running at 1.4 GHz would score under 470
> on gcc.base while the P4 at 1.4 GHz yields 573 for more than a
> 22% IPC advantage.

Safer to say 22% performance advantage at the same clock rate. The
IPC advantage could be different from 22% if the gcc binaries run
on the PIII and P4 were different.

Terje Mathisen

unread,
Dec 10, 2000, 3:10:44 PM12/10/00
to
Jonathan Thornburg wrote:
>
> Terje Mathisen <terje.m...@hda.hydro.com> pointed out
> Terje Mathisen <terje.m...@hda.hydro.com> replied:
> >Nope, 64 vs 8 bits. I.e. 8x speedup to swallow the extra instructions
> >needed.
> >
> >All the AA* opcodes work on just one or two BCD digits.
>
> Yes, that would certainly explain the factor of 8 in performance.
>
> But is there a good reason _why_ the nominally 32-bit x86 models
> didn't extend the semantics of the AA* opcodes to work on the 8 BCD
> digits one would naturally store in a 32-bit register? I would think
> the same schemes used to fudge backwards compatability for old 8-bit
> and 16-bit code could be used here...

One possible/probable reason is that Intel have traces of lots of
'important' programs (for some definition of 'important'), and these
show that the DAx/AAx opcodes are almost totally unused.

This made it almost certain that they would not use more opcode space
for 16/32-bit version of these functions.

Terje

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Bernd Paysan

unread,
Dec 10, 2000, 5:41:56 PM12/10/00
to
Paul DeMone wrote:
> So a hypothetical PIII running at 1.4 GHz would score under 470
> on gcc.base while the P4 at 1.4 GHz yields 573 for more than a
> 22% IPC advantage. Not quite the same result as you get from
> clock normalization is it? A 22% IPC advantage on gcc isn't
> bad in light of how the P4 has been attacked for on the basis
> its long pipeline would suffer badly on code with a lot of
> control dependencies.

One of the most noticeable differences between the P3 and the P4 is the
improved bus speed. A good comparison can be found with the Athlon, which is
available with P3-comparable bus speed (the limitation here being the
PC133 SDRAM and the VIA boards, which have the same DRAM interface for
both CPUs), and has also been tested with P4-comparable bus speed
(DDR SDRAM with a double-pumped FSB at 133 MHz; the P4 is theoretically a
bit better). The Athlon already gains significant performance at 1 GHz,
and scales much better at higher frequencies.

We all know that an underpowered memory interface can dry up the clock
frequency improvements of any CPU; keeping up with the memory interface
is important (enlarge the caches and/or make the bus faster/lower
latency). The GTL+ bus speed went up by about a factor of two over its
entire lifetime, while the core speed went up by a factor of five or
six.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

David Hoyt

unread,
Dec 13, 2000, 10:24:44 AM12/13/00
to
Paul DeMone wrote:

> David Hoyt wrote:
> > Lets assume that gcc is representative as my codes. What do I learn? That it still isn't
> > as fast as the alpha; and not only that, even with the trace cache and double-pumping the
> > integer unit, its no faster on a clock for clock basis than the p3!
>

> Careful, you have fallen into the trap of trying to compare
> architectural efficiency on the basis of clock normalized
> performance. Here is an example of what I mean. Here is the
> gcc results for two processors A and B:

> ...

I wasn't trying to look at anything related to IPC. I do realize one of the reasons that
intel developed the p4 is that it couldn't figure out a way to speed up the p3 line much more.
It's just that most micro-architecture jumps lately have given better than clock-for-clock improvements;
I've gotten jaded.

Perhaps I cut too much from my post, and the context of my response got lost. My initial
comment was that the single most important thing to speed up languages like CLOS, Smalltalk,
Java, Self and the like isn't special instructions like tagged add or cons cell or cdr coding
hardware, but rather the ability to "turn fast." And while I wouldn't be astounded if the
p4's very long pipeline proved to be a win, I didn't expect it to be of much help for my
codes.

Someone else told me to look at the p4's gcc/spec2000 scores. The 1.5 GHz system submitted
wasn't any faster on a clock-for-clock basis than a p3 system, and was still slower
than a 264.

It's possible that the p4 system isn't looking that good from a system perspective over a p3
system, because intel doesn't have the deep experience in it that they have in the p3. New
usually means not optimized. It might also be that gcc is hitting system walls, but that
seems unlikely because the 264 system is that much faster.

My conclusion is still that intel didn't spend the time of very smart engineers and a very large
transistor budget to make my code run like a bat out of hell. At best (which isn't bad) I get
to float up with the rest of the flotsam. Long pipelines will be much more of a help in
multi-media, graphics and other vectorizable codes than they will be for twisty codes like OO
programs and complex simulations. But if I get back the extra cost of a p4 system over a p3,
I'd still be happy to buy one.

Back to what David Ungar said: "Modern microprocessors are like drag racers, they go in a
straight line very fast, but they can't turn very well." Long pipelines are like the long noses
on drag racers, great for drag races but they won't help in a Le Mans Grand Prix. If intel cared
about making my codes fast, they would be doing something else.

I still believe that the ability to turn quickly is the single most important thing a cpu
designer can add to a microprocessor to help speed up OO codes; don't waste time on tagged adds
and cons cell support. Maybe spend time on fine-grained pages, but really please make them turn
fast.

david
