
Packed decimals


Terrence Enger

Dec 4, 2000, 9:09:18 PM
My C$.02 worth is below. Meanwhile, I have copied this message to
comp.arch in the hope that someone there will have another answer.

Terry.

Scott Lindstrom wrote:
>
> In article <7c6o2tsabi7d4b3tn...@4ax.com>,
> Not Me <not...@notme.com> wrote:
> > On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:
> >
> > >Hello,
> > >
> > >I recently have to write a program to parse some packed decimal
> output
> > >from an AS/400 machine. Not having an AS/400 background, I am curious
> > >why this format is used at all?

People are more used to decimal arithmetic. Now, obviously, anything is
possible in any representation, but some representations are much more
convenient for some purposes: please pick up your hammer and chisel and
calculate for us MCMXLV times MCDLXI divided by IV (just kidding). This
is a big issue at both ends of computed numbers. If the low order digit
of a result is to be rounded off--note that I am contradicting marklau's
assertion that decimal arithmetic avoids roundoff completely--most people
are more comfortable with a result rounded to the nearest dollar or
penny or hundredth of a penny; a sixteenth of a dollar or a sixty-fourth
of a penny takes a lot of decimal places on a printout or a screen.
After the loss of high-order significant digits, it is arguably better
to be "out" some multiple of $10,000,000.00 than by some multiple of
2^32 cents.

Having programmed commercial applications in assembly language (I have
*earned* these gray hairs!) I am quite happy to avoid the tedium of
extended precision binary arithmetic and decimal scaling of binary
numbers.

> > > It's certainly not compact - it takes
> > >more bytes to store the same number than in binary format (2's
> > >complement).

The issue of compactness is clouded by the fact that decimal processors
like the 360 and its descendants will handle more different sizes of
decimal numbers than binary processors will handle binary numbers. A
five-digit packed number takes three bytes; in binary it only requires
17 and a fraction bits, but how much will you probably allocate?

The redundancy in packed numbers is occasionally useful in that your bug
(okay, *my* bug; don't be so touchy!) can manifest itself sooner as a
decimal data exception rather than later as incorrect output.
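
For anyone who has never seen the layout, here is a rough C sketch of the
S/360-style format (two digits per byte, sign nibble in the low half of
the last byte; 0xC for plus and 0xD for minus are just the preferred
codes), plus the sort of nibble check that the decimal data exception
performs for you in hardware. Names and details are purely illustrative,
not any particular vendor's library:

#include <stdint.h>

/* Sketch only: pack an unsigned 5-digit value into 3 bytes, two digits
   per byte, sign in the low nibble of the last byte. */
void pack5(uint32_t value, int negative, uint8_t out[3])
{
    uint8_t d[5];
    for (int i = 4; i >= 0; i--) {          /* peel off decimal digits */
        d[i] = value % 10;
        value /= 10;
    }
    out[0] = (uint8_t)((d[0] << 4) | d[1]);
    out[1] = (uint8_t)((d[2] << 4) | d[3]);
    out[2] = (uint8_t)((d[4] << 4) | (negative ? 0xD : 0xC));
}

/* The redundancy mentioned above: a digit nibble above 9, or a sign
   nibble below 0xA, is invalid and can be caught early. */
int valid_packed5(const uint8_t in[3])
{
    for (int i = 0; i < 3; i++) {
        if ((in[i] >> 4) > 9) return 0;
        if (i < 2 && (in[i] & 0x0F) > 9) return 0;
    }
    return (in[2] & 0x0F) >= 0x0A;
}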

> > > Or is it for readability? But then aren't these packed
> > >decimal numbers meant for machine rather than human consumption so
> why
> > >bother making them readable?
> > >
> > >Perplexed and regards,
> > >
> > >mark.
> > >
> > >
> > >Sent via Deja.com http://www.deja.com/
> > >Before you buy.
> > Because for the business applications that dominate the AS/400 world,
> > packed decimal completely eliminates round off errors that you get
> > with floating point numbers.
> >
>
> Also, when viewing numbers in hex (through file utilities, or in the
> old days, dumps), packed decimal is infinitely more readable than
> binary.
>
> Scott Lindstrom
> Zenith Electronics
>
> Sent via Deja.com http://www.deja.com/
> Before you buy.

Del Cecchi

Dec 5, 2000, 3:00:00 AM
In article <3A2C4E08...@idirect.com>,

Terrence Enger <ten...@idirect.com> writes:
|> My C$.02 worth is below. Meanwhile, I have copied this message to
|> comp.arch in the hope that someone there will have another answer.
|>
|> Terry.
|>
|> Scott Lindstrom wrote:
|> >
|> > In article <7c6o2tsabi7d4b3tn...@4ax.com>,
|> > Not Me <not...@notme.com> wrote:
|> > > On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:
|> > >
|> > > >Hello,
|> > > >
|> > > >I recently have to write a program to parse some packed decimal
|> > output
|> > > >from an AS/400 machine. Not having an AS/400 background, I am curious
(snipping)

|> > > Because for the business applications that dominate the AS/400 world,
|> > > packed decimal completely eliminates round off errors that you get
|> > > with floating point numbers.
|> > >
|> >
|> > Also, when viewing numbers in hex (through file utilities, or in the
|> > old days, dumps), packed decimal is infinitely more readable than
|> > binary.
|> >
|> > Scott Lindstrom
|> > Zenith Electronics
|> >

I believe there are several reasons, and there are really two parts to the
question.

"why decimal instead of binary?" and "why packed decimal?".

I think the answers are at least partially historical in that many early
computers used decimal arithmetic. Packed decimal was invented in a time of very
expensive and limited storage. Languages such as COBOL have decimal or string
arithmetic which fits decimal operations.

And last, there is at least some merit to the notion that decimal arithmetic is a
better fit to fixed point operations such as financial calculations.

del cecchi

--

Del Cecchi
cecchi@rchland

Alberto Moreira

Dec 5, 2000, 3:00:00 AM
Del Cecchi wrote:

> I believe there are several reasons, and there are really two parts to the
> question.
>
> "why decimal instead of binary?" and "why packed decimal?".
>
> I think the answers are at least partially historical in that many early
> computers used decimal arithmetic. Packed decimal was invented in a time of very
> expensive and limited storage. Languages such as COBOL have decimal or string
> arithmetic which fits decimal operations.
>
> And last, there is at least some merit to the notion that decimal arithmetic is a
> better fit to fixed point operations such as financial calculations.

The issue was also exactness. Commercial programs had to do
precise decimal computations, and floating point did not look
good from that perspective; we couldn't even reliably compare
two floating point numbers for equality! With packed decimal
arithmetic, the result of operating on two numbers was
predictable and repeatable. It mapped well onto Cobol's
ability to create limited-size decimal numbers, and that was
important in the days of 20Mb hard drives and half-a-gig mag
tapes. Moreover, it allowed for variable-length fields and for
large numbers.
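
A quick C illustration of the equality problem (the constants are
arbitrary; the point holds for any binary floating-point format, and the
integer-cents line stands in for what exact decimal arithmetic gives you):

#include <stdio.h>

int main(void)
{
    double a = 0.10 + 0.20;        /* neither constant is exact in binary */
    printf("0.10 + 0.20 = %.17f; equal to 0.30? %s\n",
           a, a == 0.30 ? "yes" : "no");     /* prints "no" */

    long cents = 10 + 20;          /* fixed point in cents: exact */
    printf("in cents: %ld\n", cents);
    return 0;
}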


Alberto.

Stephen Fuld

Dec 5, 2000, 3:00:00 AM

"Del Cecchi" <cec...@signa.rchland.ibm.com> wrote in message
news:90iqcb$1378$1...@news.rchland.ibm.com...

> In article <3A2C4E08...@idirect.com>,
> Terrence Enger <ten...@idirect.com> writes:
> |> My C$.02 worth is below. Meanwhile, I have copied this message to
> |> comp.arch in the hope that someone there will have another answer.
> |>
> |> Terry.
> |>
> |> Scott Lindstrom wrote:
> |> >
> |> > In article <7c6o2tsabi7d4b3tn...@4ax.com>,
> |> > Not Me <not...@notme.com> wrote:
> |> > > On Mon, 04 Dec 2000 22:01:11 GMT, mar...@hotmail.com wrote:
> |> > >
> |> > > >Hello,
> |> > > >
> |> > > >I recently have to write a program to parse some packed decimal
> |> > output
> |> > > >from an AS/400 machine. Not having an AS/400 background, I am
curious
> (snipping)

> |> > > Because for the business applications that dominate the AS/400
world,
> |> > > packed decimal completely eliminates round off errors that you get
> |> > > with floating point numbers.
> |> > >
> |> >
> |> > Also, when viewing numbers in hex (through file utilities, or in the
> |> > old days, dumps), packed decimal is infinitely more readable than
> |> > binary.
> |> >
> |> > Scott Lindstrom
> |> > Zenith Electronics
> |> >
>
> I believe there are several reasons, and there are really two parts to the
> question.
>
> "why decimal instead of binary?" and "why packed decimal?".
>
> I think the answers are at least partially historical in that many early
> computers used decimal arithmetic. Packed decimal was invented in a time
of very
> expensive and limited storage. Languages such as COBOL have decimal or
string
> arithmetic which fits decimal operations.
>
> And last, there is at least some merit to the notion that decimal
arithmetic is a
> better fit to fixed point operations such as financial calculations.
>
> del cecchi
>
> --


I agree with both of these reasons. However, the Power PC does not appear
to have any hardware support for packed decimal. Does the AS/400 version of
the chip add some support or is the whole thing emulated in software?

You said the AS/400 was not renamed something else. OK, I'll not try to
second guess marketing. Have all the lines been renamed? Can you give us a
"cheat sheet" to convert from old to new names?


--
- Stephen Fuld


>
> Del Cecchi
> cecchi@rchland

Thomas

Dec 5, 2000, 3:00:00 AM
In article <3A2CF3B5...@moreira.mv.com>,
alb...@moreira.mv.com wrote:

> Del Cecchi wrote:
>
> and that was
> important in the days of 20Mb hard drives


Heh. I recall upgrading our 2311(? no longer absolutely certain of its
model identifier) disk unit attached to a 1401 from 5MB to 10MB.
Essentially, our IBM CE removed a screw from the drive unit and the arm
could then travel twice as far. 20MB never made it to that machine.
Sheesh, it was only an 8K base memory (with a 4K expansion box about the
size of a washing machine to bring it up to a massive 12K.)

But then, it didn't exactly use packed-decimal nor exactly binary
neither. There are times when I still miss being able to set "word
marks".

--
Tom Liotta, AS/400 Systems Programmer
The PowerTech Group, Inc.; http://www.400security.com
...and for you automated email spammers out there:
jqu...@fcc.gov sn...@fcc.gov rch...@fcc.gov mpo...@fcc.gov

Terje Mathisen

Dec 5, 2000, 3:00:00 AM
Stephen Fuld wrote:
> I agree with both of these reasons. However, the Power PC does not appear
> to have any hardware support for packed decimal. Does the AS/400 version of
> the chip add some support or is the whole thing emulated in software?

SW emulation of BCD is easy, and with the wide registers we have now, it
is also quite fast (15 or 16 digits in a 64-bit register).

However, if you're doing a lot of fixed-point BCD math, it is probably
much faster to do all the work in binary and only convert on
input/output. :-)

Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Del Cecchi

Dec 5, 2000, 3:00:00 AM
In article <UwbX5.7800$Ei1.5...@bgtnsc05-news.ops.worldnet.att.net>,

"Stephen Fuld" <s.f...@worldnet.att.net> writes:
|>
|> "Del Cecchi" <cec...@signa.rchland.ibm.com> wrote in message
|> news:90iqcb$1378$1...@news.rchland.ibm.com...
|> > In article <3A2C4E08...@idirect.com>,
|> > Terrence Enger <ten...@idirect.com> writes:
|> > |> My C$.02 worth is below. Meanwhile, I have copied this message to
note snippage

|> > |> >
|> > |> > Also, when viewing numbers in hex (through file utilities, or in the
|> > |> > old days, dumps), packed decimal is infinitely more readable than
|> > |> > binary.
|> > |> >
|> > |> > Scott Lindstrom
|> > |> > Zenith Electronics
|> > |> >
|> >
|> > I believe there are several reasons, and there are really two parts to the
|> > question.
|> >
|> > "why decimal instead of binary?" and "why packed decimal?".
|> >
|> > I think the answers are at least partially historical in that many early
|> > computers used decimal arithmetic. Packed decimal was invented in a time
|> of very
|> > expensive and limited storage. Languages such as COBOL have decimal or
|> string
|> > arithmetic which fits decimal operations.
|> >
|> > And last, there is at least some merit to the notion that decimal
|> arithmetic is a
|> > better fit to fixed point operations such as financial calculations.
|> >
|> > del cecchi
|> >
|> > --
|>
|>
|> I agree with both of these reasons. However, the Power PC does not appear
|> to have any hardware support for packed decimal. Does the AS/400 version of
|> the chip add some support or is the whole thing emulated in software?
|>
|> You said the AS/400 was not renamed something else. OK, I'll not try to
|> second guess marketing. Have all the lines been renamed? Can you give us a
|> "cheat sheet" to convert from old to new names?
|>
|>
|> --
|> - Stephen Fuld
|>
|>
As part of the eServer announcement, from now on systems that would have been
called AS/400 will be "i-series eServers". IBM still sells AS/400s announced
before the recent changes. So there is no conversion of names, per se. But the
RS/6000 will be P series and Intel based servers will be X series.

My understanding is that decimal arithmetic (zoned, packed, something) was one of
the things that was added to the PowerPC to make a PowerPC/AS. There is a bunch
of background hidden on IBM's AS400/i-series web pages.

--

Del Cecchi
cecchi@rchland

Stephen Fuld

Dec 5, 2000, 3:00:00 AM

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:3A2D54E3...@hda.hydro.com...

> Stephen Fuld wrote:
> > I agree with both of these reasons. However, the Power PC does not
appear
> > to have any hardware support for packed decimal. Does the AS/400
version of
> > the chip add some support or is the whole thing emulated in software?
>
> SW emulation of BCD is easy, and with the wide registers we have now, it
> is also quite fast (15 or 16 digits in a 64-bit register).
>
> However, if you're doing a lot of fixed-point BCD math, it is probably
> much faster to do all the work in binary and only convert on
> input/output. :-)

I agree, but a lot of data processing is a relatively modest amount of math
on a modest number of fields on a lot of records. For example, adding the
interest amount to each account requires only a multiply and an add. It
still may be faster to convert though.
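
Something like the following, say, with amounts carried as integer cents
and the rate in basis points (field names made up, rounding glossed over):

struct account { long long balance_cents; };

/* one multiply and one add per record */
void post_interest(struct account *accts, long n, long long rate_bp)
{
    for (long i = 0; i < n; i++)
        accts[i].balance_cents += accts[i].balance_cents * rate_bp / 10000;
}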

BTW, a matter of terminology. I thought BCD was a six bit code used on some
machines prior to S/360. It had only upper case letters, digits and a few
punctuation marks. It was what IBM "extended" into EBCD (IC for interchange
code). That is different from packed decimal (at least S/360 style) which
uses four bits for each digit with an optional sign "overpunch" in the last
digit. Is my recollection wrong here?


--
- Stephen Fuld

David Gay

Dec 5, 2000, 3:00:00 AM

"Stephen Fuld" <s.f...@worldnet.att.net> writes:
> "Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
> news:3A2D54E3...@hda.hydro.com...
> > Stephen Fuld wrote:
> > > I agree with both of these reasons. However, the Power PC does not
> appear
> > > to have any hardware support for packed decimal. Does the AS/400
> version of
> > > the chip add some support or is the whole thing emulated in software?
> >
> > SW emulation of BCD is easy, and with the wide registers we have now, it
> > is also quite fast (15 or 16 digits in a 64-bit register).
> >
> > However, if you're doing a lot of fixed-point BCD math, it is probably
> > much faster to do all the work in binary and only convert on
> > input/output. :-)
>
> I agree, but a lot of data processing is a relatively modest amount of math
> on a modest number of fields on a lot of records. For example, add the
> interest amount to each account requires only a multiply and an add. It
> still may be faster to convert though.
>
> BTW, a matter of terminology. I thought BCD was a six bit code used on some
> machines prior to S/360. It had only upper case letters, digits and a few
> punctuation marks. It was what IBM "extended" into EBCD (IC for interchange
> code). That is different from packed decimal (at least S/360 style) which
> uses four bits for each digit with an optional sign "overpunch" in the last
> digit. Is my recollection wrong here?

I don't know about the distant past, but for the past fifteen years at
least, BCD has been "binary coded decimal" (4 bits to a decimal digit) in
the integrated circuit world...

--
David Gay - Yet Another Starving Grad Student
dg...@cs.berkeley.edu

John R. Mashey

Dec 5, 2000, 11:48:37 PM
In article <3A2D54E3...@hda.hydro.com>, Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Stephen Fuld wrote:
|> > I agree with both of these reasons. However, the Power PC does not appear
|> > to have any hardware support for packed decimal. Does the AS/400 version of
|> > the chip add some support or is the whole thing emulated in software?
|>
|> SW emulation of BCD is easy, and with the wide registers we have now, it
|> is also quite fast (15 or 16 digits in a 64-bit register).

Actually, even before that....
1) The HP PA folks included a few instructions to help packed decimal
arithmetic.

2) In 1985, MIPS had a potential customer who had a strong interest in COBOL,
and had extensive statistics on the usage patterns of various decimal arithmetic
operations, and looked at the MIPS instruction sequences to do them,
and concluded that performance was perfectly adequate, somewhat to their
surprise. As it happens, the {load, store}word{left,right} instructions,
somewhat accidentally, turned out to be fairly useful.

In general, RISC people have tended in recent years to either:
(a) Add a modest amount of hardware to help decimal arithmetic,
but usually not a full decimal instruction set.
(b) Or just use the instructions they've got, for COBOL & PL/I,
having decided they were "good enough".

[Note, I mention this because it is a common misconception that RISC
instruction sets were designed only with C, and maybe FORTRAN in mind.
This simply wasn't true: different design teams had different priorities,
but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
and ADA, and there are specific features still around that catered to these
various things, even if {C, C++}, {FORTRAN} were high priorities.]


--
-John Mashey EMAIL: ma...@sgi.com DDD: 650-933-3090 FAX: 650-851-4620
USPS: SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time; cell phone = 650-575-6347.
PERMANENT EMAIL ADDRESS: ma...@heymash.com

Dave McKenzie

Dec 6, 2000, 2:23:19 AM
The AS/400 chips added a few simple instructions to PowerPC to assist
packed decimal arithmetic, but most of the logic is done in software
(after all, it's RISC).

Are you sitting down? The following is machine code generated for a
*single* decimal add instruction. It's from a test MI pgm and adds
two 9-digit (5-byte) packed fields, putting the result in a third
9-digit packed field.

0A0 805FFFF5 LWZ 2,0XFFF5(31)
0A4 887FFFF9 LBZ 3,0XFFF9(31)
0A8 80FFFFF0 LWZ 7,0XFFF0(31)
0AC 7843460C RLDIMI 3,2,8,24
0B0 881FFFF4 LBZ 0,0XFFF4(31)
0B4 6866000F XORI 6,3,15
0B8 E8408148 LD 2,0X8148(0)
0BC 78E0460C RLDIMI 0,7,8,24
0C0 7CC61014 ADDC 6,6,2
0C4 6806000F XORI 6,0,15
0C8 7C200448 TXER 1,0,40
0CC E8E08140 LD 7,0X8140(0)
0D0 7C461014 ADDC 2,6,2
0D4 7C200448 TXER 1,0,40
0D8 7C0000BB DTCS. 0,0
0DC 79A27A18 SELII 2,13,15,36
0E0 7C6300BB DTCS. 3,3
0E4 79A67A18 SELII 6,13,15,36
0E8 7C223040 CMPL 0,1,2,6
0EC 41820018 BC 12,2,0X18
0F0 7CE30011 SUBFC. 7,3,0
0F4 40800018 BC 4,0,0X18
0F8 7CE01810 SUBFC 7,0,3
0FC 60C20000 ORI 2,6,0
100 4800000C B 0XC
104 7CE03A14 ADD 7,0,7
108 7CE71814 ADDC 7,7,3
10C 7C03007A DSIXES 3
110 7C633850 SUBF 3,3,7
114 7C671378 OR 7,3,2
118 78E00600 RLDICL 0,7,0,24
11C 7F003888 TD 24,0,7
120 7C0600BB DTCS. 6,0
124 79A07A18 SELII 0,13,15,36
128 78C70601 RLDICL. 7,6,0,24
12C 79E2031A SELIR 2,15,0,38
130 38070000 ADDI 0,7,0
134 7847072C RLDIMI 7,2,0,60
138 98FFFFFE STB 7,0XFFFE(31)
13C 7806C202 RLDICL 6,0,56,8
140 90DFFFFA STW 6,0XFFFA(31)
144 419C8043 BCLA 12,28,0X8040

These instructions were added to PowerPC for AS/400:

TXER Trap on XER reg
DTCS Decimal test and clear sign
SELII Select immed, immed
SELIR Select immed, reg
DSIXES Decimal (or decrement? or doubleword-of?) sixes

The actual add is done by the instruction at offset 108.
Instruction B8 loads x666666666666666A, which is added to the operands
to check for decimal data errors at C8 and D4.
Instruction CC loads x6666666666666666, which is added into the sum at
104 to force carries out of nibbles at 10 instead of 16.
Instruction 10C (DSIXES) generates a doubleword having 6 in each
nibble where there was no carry, and 0 in each nibble that generated a
carry. It's then subtracted from the sum at 110 to back out the
sixes.
The rest of the instructions deal with loading and storing the
operands, and handling the various combinations of signs.
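
In C, the same trick on a whole 64-bit register looks roughly like this.
It is only a sketch for unsigned operands of up to 15 digits with the top
nibble zero; the sign handling and validity checks that take up most of
the generated code above are omitted:

#include <stdint.h>

uint64_t bcd_add64(uint64_t a, uint64_t b)
{
    const uint64_t sixes = 0x6666666666666666ULL;
    uint64_t t   = a + sixes;               /* force nibble carries at 10 */
    uint64_t sum = t + b;
    /* bit 4(j+1) of (t ^ b ^ sum) is set iff nibble j produced a carry */
    uint64_t carried = (t ^ b ^ sum) & 0x1111111111111110ULL;
    /* nibbles that did not carry still hold the extra 6; back it out */
    uint64_t nocarry = ~(carried >> 4) & 0x1111111111111111ULL;
    return sum - nocarry * 6;
}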

--Dave


On 6 Dec 2000 04:48:37 GMT, ma...@mash.engr.sgi.com (John R. Mashey)
wrote:

Jonathan Thornburg

Dec 6, 2000, 3:00:00 AM
> I agree with both of these reasons. However, the Power PC does not appear
> to have any hardware support for packed decimal. Does the AS/400 version of
> the chip add some support or is the whole thing emulated in software?

| SW emulation of BCD is easy, and with the wide registers we have now, it
| is also quite fast (15 or 16 digits in a 64-bit register).

John R. Mashey <ma...@mash.engr.sgi.com>:


>Actually, even before that....
>1) The HP PA folks included a few instructions to help packed decimal
>arithmetic.
>
>2) In 1985, MIPS had a potential customer who had a strong interest in COBOL,
>and had extensive statistics on the usage patterns of various decimal arithmetic
>operations, and looked at the MIPS instruction sequences to do them,
>and concluded that performance was perfectly adequate, somewhat to their
>surprise. As it happens, the {load, store}word{left,right} instructions,
>somewhat accidentally, turned out to be fairly useful.

A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
come to mind with "decimal adjust accumulator", a 1-instruction fixup
for BCD add (and also subtract?). I've forgotten if the x86 series
kept this. The VAX had a full set of +-*/ arithmetic instructions
for arbitrary-length BCD strings; there was even support for two
different conventions for how the sign of a negative decimal number
was encoded.

How did/do the IBM S/3[679]0 handle decimal arithmetic?

--
-- Jonathan Thornburg <jth...@thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik
[[about astronomical objects of large size]]
"it is very difficult to see how such objects could show significant
variations on astronomically relevant timescales such as the duration
of a graduate student stipend or the tenure trial period of a faculty
member." -- Spangler and Cook, Astronomical Journal 85, 659 (1980)

Del Cecchi

Dec 6, 2000, 3:00:00 AM
In article <3A2D54E3...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Stephen Fuld wrote:
|> > I agree with both of these reasons. However, the Power PC does not appear
|> > to have any hardware support for packed decimal. Does the AS/400 version of
|> > the chip add some support or is the whole thing emulated in software?
|>
|> SW emulation of BCD is easy, and with the wide registers we have now, it
|> is also quite fast (15 or 16 digits in a 64-bit register).
|>
|> However, if you're doing a lot of fixed-point BCD math, it is probably
|> much faster to do all the work in binary and only convert on
|> input/output. :-)
|>
|> Terje
|> --
|> - <Terje.M...@hda.hydro.com>
|> Using self-discipline, see http://www.eiffel.com/discipline
|> "almost all programming can be viewed as an exercise in caching"


I found this at
http://www.iseries.ibm.com/beyondtech/arch_nstar_perf.htm#Extensions

The PowerPC AS architecture is a superset of the 64-bit version of the PowerPC
architecture. Specific enhancements are included for AS/400 unique internal
architecture and for business processing. AS/400 "tag" bits are included to
ensure the integrity of MI pointers. Each 16-byte MI address pointer has at least
one tag bit in main storage that identifies the location as holding a valid pointer.
Load and store instructions are also included to support MI pointers. There is a
unique translation scheme suitable for AS/400 single-level storage (see Clark
and Corrigan (1989)). The business computing enhancements include
instructions for the following:

- Decimal assist operations for decimal data, which is common in RPG
and COBOL applications
- Double-word (64-bit) move-assist instructions for faster movement of
data
- Multiple double word load and store for faster call/return and task
switching
- Vectored call/return for more direct calls to system functions

A total of 22 instructions were added as extensions to the PowerPC architecture.
--

Del Cecchi
cecchi@rchland

Terje Mathisen

Dec 6, 2000, 3:00:00 AM
Jonathan Thornburg wrote:
> A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
> come to mind with "decimal adjust accumulator", a 1-instruction fixup
> for BCD add (and also subtract?). I've forgotten if the x86 series
> kept this.

It did:

It has ASCII Adjust for Addition (AAA) and similar helper opcodes to be
used either before or after a binary operation on BCD data.

However, using these opcodes is now _much_ slower than working with a
bunch of digits in parallel in a 64-bit MMX register. :-)

Jonathan Thornburg

Dec 6, 2000, 3:00:00 AM
I wrote:
| A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
| come to mind with "decimal adjust accumulator", a 1-instruction fixup
| for BCD add (and also subtract?). I've forgotten if the x86 series
| kept this.

In article <3A2E6113...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> replied:


>It did:
>
>It has Ascii Adjust for Addition (AAA) and similar helper opcodes to be
>used either before or after a binary operation on BCD data.
>
>However, using these opcodes is now _much_ slower than to work with a
>bunch of digits in parallel in a 64-bit MMX register. :-)

You mean AAA doesn't work on multiple digits in parallel? Funny, I
thought I remembered the Z-80 instruction did both hex digits in the
(8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
per add/sub, not one per BCD digit. So I'd think that the speedup from
doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

--
-- Jonathan Thornburg <jth...@thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik

"IRAS galaxies are all chocolate chip flavored rather than vanilla
flavored as heretofore supposed. This no doubt accounts for their
diversity and appeal." -- Vader and Simon, Astronomical Journal 94, 865

John R. Mashey

Dec 6, 2000, 3:00:00 AM
In article <90l9rt$od7$1...@mach.thp.univie.ac.at>, jth...@mach.thp.univie.ac.at (Jonathan Thornburg) writes:

|> A lot of machines have decimal instructions -- the Z-80 (and 8080 too?)
|> come to mind with "decimal adjust accumulator", a 1-instruction fixup
|> for BCD add (and also subtract?). I've forgotten if the x86 series

|> kept this. The VAX had a full set of +-*/ arithmetic instructions
|> for arbitrary-length BCD strings, there was even support for two
|> different conventions for how the sign of a negative decimal number
|> was encoded.
|>
|> How did/do the IBM S/3[679]0 handle decimal arithmetic?

Almost all models have hardware (i.e., microcode) support for a fairly
complete set of decimal arithmetic (SS instructions - memory-to-memory),
plus truly-amazing instructions like Edit-and-mark.
Two notable exceptions were the 360/44 and 360/91, which omitted these
instructions as they were targeted to technical apps.


Let me try again:
in the 1950s-1970s, if a computer was being designed to target
commercial applications, COBOL was important, and the machine would most likely
have substantial hardware and microcode dedicated to decimal arithmetic,
including memory-to-memory operations, conversions, edits, scans, etc.

Instruction sets designed later, in general, stopped doing this,
although people did incorporate modest amounts of hardware where they'd
do decimal arithmetic using the base instruction set, but a few instructions
would help the code sequences.

Put another way, all-out support of decimal arithmetic in hardware
used to be at least a marketing requirement for certain classes of systems,
but this requirement has diminished "recently", i.e., in last 20 years.
It made a lot more sense in ISAs that expected microcode anyway.

Stephen Fuld

Dec 6, 2000, 3:00:00 AM

"John R. Mashey" <ma...@mash.engr.sgi.com> wrote in message
news:90kgf5$nln$1...@murrow.corp.sgi.com...

> In article <3A2D54E3...@hda.hydro.com>, Terje Mathisen
<terje.m...@hda.hydro.com> writes:
> |> Stephen Fuld wrote:
> |> > I agree with both of these reasons. However, the Power PC does not
appear
> |> > to have any hardware support for packed decimal. Does the AS/400
version of
> |> > the chip add some support or is the whole thing emulated in software?
> |>
> |> SW emulation of BCD is easy, and with the wide registers we have now,
it
> |> is also quite fast (15 or 16 digits in a 64-bit register).
>
> Actually, even before that....
> 1) The HP PA folks included a few instructions to help packed decimal
> arithmetic.
>
> 2) In 1985, MIPS had a potential customer who had a strong interest in
COBOL,
> and had extensive statistics on the usage patterns of various decimal
arithmetic
> operations, and looked at the MIPS instruction sequences to do them,
> and concluded that performance was perfectly adequate, somewhat to their
> surprise. As it happens, the {load, store}word{left,right} instructions,
> somewhat accidentally, turned out to be fairly useful.

Interesting. Was their strategy to convert the decimal operations to binary
and do binary arithmetic or to work on them directly in decimal form? Did
they use packed decimal or add ASCII numbers directly?


>
> In general, RISC people have tended in recent years to either:
> (a) Add a modest amount of hardware to help decimal arithmetic,
> but usually not a full decimal instruction set.
> (b) Or just use the instructions they've got, for COBOL & PL/I,
> having decided they were "good enough".
>
> [Note, I mention this because it is a common misconception that RISC
> instruction sets were designed only with C, and maybe FORTRAN in mind.
> This simply wasn't true: different design teams had different priorities,
> but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
> and ADA, and there are specific features still around that catered to
these
> various things, even if {C, C++}, {FORTAN} were high priorities.]


That is a good point to reiterate. Thanks.

--
- Stephen Fuld

Stephen Fuld

Dec 6, 2000, 3:00:00 AM
"Dave McKenzie" <dav...@galois.com> wrote in message
news:fopr2tg596q0s28ib...@4ax.com...


Wow! (Catching my breath) Thanks for the information. I don't have the
relevant S/390 manuals handy. I wonder how many cycles the equivalent add
packed instruction takes? It boggles the mind!

--
- Stephen Fuld

Kevin Strietzel

Dec 6, 2000, 3:00:00 AM
Jonathan Thornburg wrote:
...

> In article <3A2E6113...@hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> replied:

> >It has Ascii Adjust for Addition (AAA) and similar helper opcodes to be
> >used either before or after a binary operation on BCD data.

...


> You mean AAA doesn't work on multiple digits in parallel? Funny, I
> thought I remembered the Z-80 instruction did both hex digits in the
> (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
> per add/sub, not one per BCD digit. So I'd think that the speedup from
> doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

[I hope I got the attributions right!]

The 8080/8085/Z80 instruction DAA was for BCD, packed decimal
with two digits/byte. According to
http://www.ntic.qc.ca/~wolf/punder/asm/0000004e.htm, the
8086/8088/etc instructions are:

AAA - ASCII Adjust for Addition
AAD - ASCII Adjust for Division
AAM - ASCII Adjust for Multiplication
AAS - ASCII Adjust for Subtraction

DAA - Decimal Adjust for Addition
DAS - Decimal Adjust for Subtraction

The AAx instructions do unpacked decimal (not ASCII). The DAx
instructions do packed decimal (BCD).
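
Roughly, the fixup DAA applies after a plain binary add of two packed
bytes looks like this in C. This is a sketch for valid two-digit operands
only; the real instruction also folds in the incoming carry and
auxiliary-carry flags:

#include <stdint.h>

/* add two packed-BCD bytes the way an ADD + DAA pair would */
uint8_t bcd_add_byte(uint8_t a, uint8_t b, int *carry_out)
{
    unsigned sum = a + b;
    if (((a & 0x0F) + (b & 0x0F)) > 9)      /* low digit overflowed (AF) */
        sum += 0x06;
    if (sum > 0x99) {                       /* high digit overflowed (CF) */
        sum += 0x60;
        *carry_out = 1;
    } else {
        *carry_out = 0;
    }
    return (uint8_t)sum;
}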

--Kevin Strietzel
Not speaking for Stratus.

Terje Mathisen

Dec 6, 2000, 3:00:00 AM
Jonathan Thornburg wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> replied:
> >It did:

> >
> >It has Ascii Adjust for Addition (AAA) and similar helper opcodes to be
> >used either before or after a binary operation on BCD data.
> >
> >However, using these opcodes is now _much_ slower than to work with a
> >bunch of digits in parallel in a 64-bit MMX register. :-)
>
> You mean AAA doesn't work on multiple digits in parallel? Funny, I
> thought I remembered the Z-80 instruction did both hex digits in the
> (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
> per add/sub, not one per BCD digit. So I'd think that the speedup from
> doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

Nope, 64 vs 8 bits. I.e. 8x speedup to swallow the extra instructions
needed.

All the AA* opcodes work on just one or two BCD digits.

John F Carr

Dec 6, 2000, 3:00:00 AM
In article <90kgf5$nln$1...@murrow.corp.sgi.com>,

John R. Mashey <ma...@mash.engr.sgi.com> wrote:

>[Note, I mention this because it is a common misconception that RISC
>instruction sets were designed only with C, and maybe FORTRAN in mind.
>This simply wasn't true: different design teams had different priorities,
>but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
>and ADA, and there are specific features still around that catered to these
>various things, even if {C, C++}, {FORTAN} were high priorities.]

What's an example of a feature in a mainstream processor to help
support LISP or Smalltalk?

I recently read _Garbage Collection_ by Jones and Lins. It discusses
some hardware techniques that might help garbage collected languages,
but the only part that seemed relevant to real, modern systems was
about cache organization and write buffering. Cache is so important to
everything that I expect the core market requirements to prevent any
tradeoffs to improve fringe languages.

(Some older IBM systems have ~32 byte granularity "lock bits" for
fine-grained access control; these might be usable to implement some of
the techniques discussed in the book but I've never heard of any such
actual use. I don't know if this feature is still present in Power PC.)

--
John Carr (j...@mit.edu)

Thomas Womack

Dec 6, 2000, 3:00:00 AM
"Jonathan Thornburg" <jth...@mach.thp.univie.ac.at> wrote

> You mean AAA doesn't work on multiple digits in parallel? Funny, I
> thought I remembered the Z-80 instruction did both hex digits in the
> (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
> per add/sub, not one per BCD digit. So I'd think that the speedup from
> doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

No, AAA works only on the pair of digits stored in the AL register;
presumably it was introduced to help in emulating the Z80 [IIRC it was
supposedly possible to do a machine-translation of Z80 to 8086 code], but
it's never been updated since.

AAA and its friend AAS are 1uop instructions on the P3, but this really
doesn't make up for having to process things one or two digits at a time
rather than 15 or 16.

Tom

John R. Mashey

Dec 6, 2000, 10:57:31 PM
In article <O5xX5.5113$T43.4...@bgtnsc04-news.ops.worldnet.att.net>, "Stephen Fuld" <s.f...@worldnet.att.net> writes:

|>
|> Interesting. Was their strategy to convert the decimal operations to binary
|> and do binary arithmetic or to work on them directly in decimal form? Did
|> they use packed decimal or add ASCII numbers directly?

I don't recall for sure; I think it was packed decimal, but the few
sequences I ever saw were reminiscent of that IBM example posted here,
and there may have even been some mixed strategy, as there was a large
set of special cases.

While all of this sounds awful, it is worth noting that in the early 1980s,
even a simple FPU was a hefty chunk of silicon, and many people provided
FP emulation libraries, which were also nontrivial to get right,
and a lot of code.

Note, of course, that if the decimal operations just need to work, but
aren't actually used very often, it is an easier implementation to have the
compiler just generate calls to intrinsic functions, passing the arguments
and lengths, and letting the intrinsic figure things out at run-time.

John R. Mashey

Dec 6, 2000, 11:38:17 PM
In article <3a2ecd1a$0$29...@senator-bedfellow.mit.edu>, j...@mit.edu (John F Carr) writes:
|>
|> In article <90kgf5$nln$1...@murrow.corp.sgi.com>,
|> John R. Mashey <ma...@mash.engr.sgi.com> wrote:
|
|> >but many of them gave consideration to {COBOL, PL/I}, {LISP, Smalltalk},
|> >and ADA, and there are specific features still around that catered to these
|> >various things, even if {C, C++}, {FORTAN} were high priorities.]
|>
|> What's an example of a feature in a mainstream processor to help
|> support LISP or Smalltalk?

|> John Carr (j...@mit.edu)

1) In the early 1980s, Dave Patterson & co were looking at Smalltalk
architectures, especially including the work on SOAR, i.e., start with a
vanilla RISC and add tweaks to help Smalltalk.

2) SPARC's Tagged Add and Tagged Subtract (TADD, TSUB) were included for
the help of these languages, and possibly some of the trap-comparisons.

3) In MIPS-II, the trap-comparisons were partially justified for these,
although more for ADA.

In practice, I'm not sure how much the special features got used;
maybe somebody from SPARC-land can comment. At one point, I was told by
a Smalltalk implementor that they cared more about portability, and tended
to eschew features only found on one or two CPU families, i.e., this was
after the on-the-fly code generation got pretty good. From some LISP
implementors,
I heard the following viewpoint:
(a) Either give me a LISP Machine
OR (b) Give me something generic and fast
BUT don't expect a lot from just a few extra features to help.

Sometime in the late 1980s, Alan Kay came by MIPS, and we had a good
discussion about the sorts of features that fit into a normal RISC design,
fit the software directions, and give useful speedups. We couldn't
come up with anything really compelling, so we didn't implement anything.

Of the various ideas we threw around, the most interesting was the idea
that, rather than building in specific tag-check hardware, maybe there was some
way to have a programmable mask+compare operation applied to memory references,
i.e., so you could do pointer-checking with parallel hardware, rather than
in-line instruction sequences. Also, one would want some really lean user-level
to user-level trapping sequences.

Regarding the other RISCs, I'm not sure if there is anything as explicit as
SPARC TADD, but perhaps some features were partially justified by arguments for
LISP and SMALLTALK: here's a typical discussion:

(a) We have instruction set 1, and lots of statistics.
(b) We propose that the following list of instructions be added,
in descending order of expected performance improvement,
based on analysis of our standard benchmark suites.
4% A
3% B
1% C
.8% D
.2% E

(c) Most people would add A & B, not add E, and argue about
C & D. An argument that might come up would be "D would help
LISP, SMALLTALK, not in our standard benchmark group", and that
might cause D to be included.

I couldn't find it quickly, but Earl Killian put together exactly this
sort of list for doing MIPS-I => MIPS-II, and I know other groups
work through the same sort of process.

Duane Rettig

Dec 7, 2000, 3:00:00 AM
ma...@mash.engr.sgi.com (John R. Mashey) writes:

> In article <3a2ecd1a$0$29...@senator-bedfellow.mit.edu>, j...@mit.edu (John F Carr) writes:
> |>
> |> What's an example of a feature in a mainstream processor to help
> |> support LISP or Smalltalk?
>
> |> John Carr (j...@mit.edu)
>
> 1) In the early 1980s, Dave Paterson & co were looking at Smalltalk
> architectures, especially, including the work on SOAR, i.e., start
> with a vanilla RISC and add tweaks to help Smalltalk.
>
> 2) SPARC's Tagged Add and Tagged Subtract (TADD, TSUB) were included for
> the help of these languages, and possibly some of the trap-comparisons.

Our own Common Lisp implementation uses these instructions on Sparcs.
I will include an example and an explanation below, after the MIPS example.

> 3) In MIPS-II, the trap-comparisons were partially justified for these,
> although more for ADA.

We use conditional traps for as many things as possible that are not
very likely to occur, like wrong-number-of-arguments situations,
timeout synchronizations, etc.

> In practice, I'm not sure how much the special features got used;
> maybe somebody from SPARC-land can comment. At one point, I was told by
> a Smalltalk implementor that they cared more about portability, and tended
> to ewschew features only found on one or two CPU families, i.e., this was
> after the on-the-fly code generation got pretty good.

This is not true for our lisp implementation; we care very much to use
every feature that is available from the hardware. See the examples
below. If we had more features that were useful, we would use them.

> From some LISP implementors,
> I heard the following viewpoint:
> (a) Either give me a LISP Machine
> OR (b) Give me something generic and fast
> BUT don't expect a lot from just a few extra features to help.

Perhaps this is true. But one feature I would _love_ to see is the
combination hardware/software for a truly fast user-level trap handler
(on the order of tens of cycles rather than hundreds). Such
capabilities would make it possible for us to implement real-time and
truly incremental garbage collectors using read-barriers to instigate
pointer forwarding.

> Sometime in the late 1980s, Alan Kay came by MIPS, and we had a good
> discussion about the sorts of features that fit into a normal RISC design,
> fit the software directions, and give useful speedups. We couldn't
> come up with anything really compelling, so we didn't implement anything.
>
> Of the various ideas we threw around, the most interesting was the idea
> that, rather than building in specific tag-check hardware, maybe there was some
> way to have a programmable mask+compare operation applied to memory references,
> i.e., so you could do pointer-checking with parallel hardware, rather than
> in-line instruction sequences. Also, one would want some really lean user-level
> to user-level trapping sequences.

Yes, a fast trap-handling capability would be very nice. But we already
_do_ use a trick that effects tag-checking on any architectures that
implement alignment traps (note that this does not include Power, which
insists on fixing up misaligned data rather than allowing user traps to
do so, and it also excludes IA-32, which does have an alignment-trap
enable bit on 486 and newer, but no operating system I know of allows
the setting of this bit).

Example 1:

To set up this example of tag-checking by alignment traps, I will stay with
32-bit lisps, and will use the term "LispVal" to mean a 32-bit word whose
least significant 3 bits are Tag bits, and the other 29 are either data
or pointers to 8-byte Allocation Units (AUs) whose addresses end in binary 000.
The smallest object in a lisp heap is 1 AU, usually a "cons cell". The
cons cell has a Tag of 001, and "nil" which looks like a cons cell, has
Tag 101 (5). Note that these are 4 bytes apart. To access the first element
of a cons cell (i.e. the "car"), the load instruction must be offset by -1
from the register holding the cons or nil, and to access the second element
(the "cdr"), the offset is +3. This takes the Tag into consideration, and
accesses the word on a properly aligned word boundary. However, it has the
added benefit that if the object in the register is not a cons cell or nil,
then an alignment trap occurs, and interrupt handlers can interpret the
result.

Note in the example below:

1,2. A function is defined to take the "car" of the argument and is
compiled at high speed on an SGI MIPS architecture.
3. The disassembler output shows four instructions: the one at 0
does the "car" access, the one at offset 4 reloads the caller's
environment, the one at 8 returns, and the one at 12 sets the lisp's
return-value-count register.
4. The function is called with a non-cons (i.e. the value 10). This
causes an alignment trap, but the signal handler is able to determine
the exact error from the context. A break level is entered.
5. The :zoom command at high verbosity shows that foo was
"suspended" (in reality, a trap occurred) at instruction offset 0, which
was the lw instruction.

cl-user(1): (defun foo (x)
(declare (optimize speed (safety 0) (debug 0)))
(car x))
foo
cl-user(2): (compile 'foo)
foo
nil
nil
cl-user(3): (disassemble 'foo)
;; disassembly of #<Function foo>
;; formals: x

;; code start: #x30fe2724:
0: 8c84ffff lw r4,-1(r4)
4: 8fb40008 lw r20,8(r29)
8: 03e00008 jr r31
12: 34030001 [ori] move r3,1
cl-user(4): (foo 10)
Error: Attempt to take the car of 10 which is not listp.
[condition type: simple-error]

Restart actions (select using :continue):
0: Return to Top Level (an "abort" restart)
1: Abort #<process Initial Lisp Listener>
[1] cl-user(5): :zo :all t :verbose t :count 2
Evaluation stack:

... 3 more newer frames ...

call to sys::..context-saving-runtime-operation with 0 arguments.
function suspended at address #x5ffa769c (handle_saving_context+716)

----------------------------
->call to foo
required arg: x = 10
function suspended at relative address 0

----------------------------
(ghost call to excl::%eval)

----------------------------
call to eval
required arg: exp = (foo 10)
function suspended at relative address 156

----------------------------

... more older frames ...
[1] cl-user(6):

Example 2:

We use the sparc taddcc instruction to do additions of numbers in
the range between -(2^29) and 2^29-1, and to trap automatically to
overflow into "bignums" (arbitrarily large number objects) or to
generate an error for addition of non-numbers.

To set this example up, we note that two of the three-bit tags are
used for small integers (called "fixnums") and these tags are 0
(for even fixnums) and 4 (for odd fixnums). The other 29 bits
of the LispVal are the rest of the actual value for the fixnum.
Thus, in effect, a fixnum is a sign bit followed by 29 significant
bits, followed by two bits of 0.

Note here that the function has more instructions than in the
first example, so I'll only comment on the significant instructions.

1, 2. The function is of course adding two arguments x and y, compiled
at high speed. Note that there are no declarations on x or y; if they
had been declared to be fixnums, the compiler might have been able
to generate better code.
3. The disassembler output shows instructions at:
4: The actual taddcc instruction; if either an overflow occurs or
the least two bits of either operand are nonzero, the overflow bit is
set, meaning that the operation was not a fixnum+fixnum->fixnum
operation.
8: The overflow status is tested.
16-32: The operation is successful, so the result of the add is
moved to the result register and the function returns.
36-56: An internal function called +_2op is called to handle the
overflow case (to generate a bignum) or the case where one or
both operands are not numbers at all (an error).

cl-user(1): (defun foo (x y)
(declare (optimize speed (safety 0)))
(+ x y))
foo
cl-user(2): (compile 'foo)
foo
nil
nil
cl-user(3): (disassemble 'foo)
;; disassembly of #<Function foo>
;; formals: x y

;; code start: #x4a8678c:
0: 9de3bf98 save %o6, #x-68, %o6
4: 99060019 taddcc %i0, %i1, %o4
8: 2e800007 bvs,a 36
12: 90100018 mov %i0, %o0
16: 9010000c mov %o4, %o0
20: 86102001 mov #x1, %g3
lb1:
24: 81c7e008 jmp %i7 + 8
28: 91ea0000 restore %o0, %g0, %o0
32: 90100018 mov %i0, %o0
lb2:
36: c4013f87 ld [%g4 + -121], %g2 ; excl::+_2op
40: 92100019 mov %i1, %o1
44: 9fc1200b jmpl %g4 + 11, %o7
48: 86182002 xor %g0, #x2, %g3
52: 10bffff9 ba 24
56: 01000000 nop
cl-user(4):


--
Duane Rettig Franz Inc. http://www.franz.com/ (www)
1995 University Ave Suite 275 Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)

Anton Ertl

Dec 7, 2000, 3:00:00 AM
In article <4zoi8r...@beta.franz.com>,

Duane Rettig <du...@franz.com> writes:
>Yes, a fast trap-handling capability would be very nice. But we already
>_do_ use a trick that effects tag-checking on any architectures that
>implement alignment traps (note that this does not include Power, which
>insists on fixing up misaligned data rather than allowing user traps to
>do so, and it also excludes IA-32, which does have an alignment-trap
>enable bit on 486 and newer, but no operating system I know of allows
>the setting of this bit).

When I tried it on Linux (in 1994, probably with Linux-1.0), it
worked; the code I used was

__asm__("pushfl; popl %eax; orl $0x40000, %eax; pushl %eax; popfl;");

However, the problem with this was that the C library caused alignment
faults, for two reasons:

1) For routines like memmove the programmer decided that misaligned
accesses were faster than other ways of implementing them (e.g.,
bytewise access). There were only a few such routines, and they would
have been easy to replace.

2) Many floating-point routines did misaligned accesses to floats,
because in their infinite wisdom Intel had required 4-byte alignment
for doubles in their ABI (to which the library conformed), whereas the
486 reported a misaligned access unless the double was 8-byte aligned.
That's why I did not use that feature.

Followups to comp.arch.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Del Cecchi

Dec 7, 2000, 3:00:00 AM
In article <zxPX5.53969$nh5.4...@newsread1.prod.itd.earthlink.net>,
hay...@alumni.uark.edu (Jim Haynes) writes:
|> As for why packed decimal, not addressing the question of why decimal, it
|> seems to me to be just an artifact of having chosen an 8-bit byte; or maybe
|> an argument in favor of choosing an 8-bit byte. So we could ask which
|> came first - the 8-bit byte or packed decimal?
|>
|> (I think the answer is 8-bit byte but I won't insist on it.)

I would disagree but have no proof

|>
|> Another example of solving the previous generation's problems is seen in the
|> System/360 disk usage, where you were allowed to choose any record size and
|> blocking you wished so as to maximize the number of bits you could get on a
|> track. Which means when you change to a newer model disk you have to
|> recalculate how you are going to pack things and then re-pack all your
|> data before you can get rid of the old disks. It took IBM a long time to
|> get around to "inventing" "fixed block architecture" which makes disk space
|> allocation so much easier and avoids repacking at the expense of wasting a
|> little space on the disk.
FBA was invented pretty early. It just took a while to get all the
count-key-data dependencies out of MVS.
--

Del Cecchi
cecchi@rchland

Allen J. Baum

Dec 7, 2000, 3:00:00 AM
In article <90kgf5$nln$1...@murrow.corp.sgi.com>, ma...@mash.engr.sgi.com
(John R. Mashey) wrote:

>In article <3A2D54E3...@hda.hydro.com>, Terje Mathisen <terje.m...@hda.hydro.com> writes:
>|> Stephen Fuld wrote:
>|> > I agree with both of these reasons. However, the Power PC does not appear
>|> > to have any hardware support for packed decimal. Does the AS/400 version of
>|> > the chip add some support or is the whole thing emulated in software?
>|>
>|> SW emulation of BCD is easy, and with the wide registers we have now, it
>|> is also quite fast (15 or 16 digits in a 64-bit register).
>
>Actually, even before that....
>1) The HP PA folks included a few instructions to help packed decimal
>arithmetic.

I seem to recall at a conference that they decided it wasn't worth it;
it was pretty easy to do it unpacked, and quicker as well.

--
**********************************************
* Allen J. Baum tel. (650)853-6626 *
* Compaq Computer Corp. fax (650)853-6513 *
* 181 Lytton Ave. *
* Palo Alto, CA 95306 ab...@pa.dec.com *
**********************************************

John R Levine

Dec 7, 2000, 3:00:00 AM
> It seems to me - big dose of personal opinion here - that IBM had a
> tendency to solve the previous generation's problems. Hence packed
> decimal would save storage space in small memories and disks; but by
> the time it was adopted storage space was not so serious an issue as
> it had been in the previous generation of machines. And I rather
> doubt that the saving in storage space was worth the time it took to
> pack and unpack data.

The most popular member of the 360 series was the 360/30, which came
with 8K bytes standard and could be expanded to a maximum of 64K. The
disks were usually 2311s, which stored 5MB per disk pack. Bytes still
mattered.

The other reason that packed decimal was useful was that these
computers were slow, and the memory was slow, too, 1.5 or 2us per byte
fetched or stored. If you're just doing a few arithmetic operations
per value, which is typical in commercial programs, it's faster to do
them in packed decimal than to convert to binary.

I happen to have a 360/30 Functional Characteristics manual (I found
it on IBM's web site a few weeks ago and I couldn't resist), so here are
the instruction times:

A decimal add or subtract instruction takes 45+4N us, where N is the
length in bytes of the longer operand. A memory-register binary add
or subtract takes 29us, but add in the load and store for the other
operand and it's 24+29+25 or 78us. Converting between binary and
decimal was very slow, convert to binary is 89+.75H+3H^2 (that's H
squared) where H is the number of hex significant digits in the value,
convert back to decimal is 46+18H+1.5H^2. Pack and unpack to convert
between packed decimal and character code are about 50us each. If the
arithmetic were on unpacked decimal, the 4N in add and subtract would
be 8N for twice as many memory references.

With performance like that, even if you didn't need to save the bytes
on the disk (which you did), there was a broad area where packed
arithmetic beat unpacked because of fewer memory references, and also
beat binary because converting was so slow.
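
Working those formulas through for, say, a 9-digit (5-byte) packed field
whose value needs about 8 hex digits: the decimal add is 45+4*5 = 65us,
while the binary route costs roughly 287us to convert in (89 + .75*8 +
3*8^2), 78us for the add sequence, and 286us to convert back (46 + 18*8 +
1.5*8^2). Converting only pays off if the value gets reused many times.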

Re CKD disks, the 360/30's channels were run by CPU microcode, and the
CPU pretty much stopped while the channel was active. ISAM channel
code could scan disk tracks looking for the desired key without
missing rotations looking for the desired record, something I don't
think would have been possible with FBA on computers that slow.


--
John R. Levine, IECC, POB 727, Trumansburg NY 14886 +1 607 387 6869
jo...@iecc.com, Village Trustee and Sewer Commissioner, http://iecc.com/johnl,
Member, Provisional board, Coalition Against Unsolicited Commercial E-mail

Terje Mathisen

Dec 7, 2000, 3:00:00 AM
Duane Rettig wrote:
> > 1) For routines like memmove the programmer decided that misaligned
> > accesses were faster than other ways of implementing them (e.g.,
> > bytewise access). There were only a few such routines, and they would
> > have been easy to replace.
>
> I suspect that this would have been considered a bug if the alignment
> traps had been enabled, because misaligned moves tend to actually be
> slower than aligned moves, and possibly even bytewise moves, depending
> on the size of the transfer, due to potential breakup of cache lines.

The important bit about allowing potentially misaligned moves is that it
simplifies the _code_ a lot, getting rid of a series of special-cased
jumps to handle the different alignment permutations.

The way to do it on x86 is to make sure that the destination is aligned,
at this point most sources will be OK as well, if not only those reads
that straddle cache line boundaries will actually suffer.
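
A minimal C sketch of that approach, for the non-overlapping (memcpy)
case; it is illustrative only, and the word-at-a-time loads are written
with memcpy so the compiler emits plain, possibly misaligned, loads:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *copy_align_dst(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* copy bytes until the destination is word aligned */
    while (n && ((uintptr_t)d & (sizeof(unsigned long) - 1))) {
        *d++ = *s++;
        n--;
    }
    /* aligned stores; only loads that straddle a cache line really cost */
    while (n >= sizeof(unsigned long)) {
        unsigned long w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    while (n--)
        *d++ = *s++;
    return dst;
}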

Duane Rettig

Dec 7, 2000, 3:00:00 AM
Terje Mathisen <terje.m...@hda.hydro.com> writes:

> Duane Rettig wrote:
> > > 1) For routines like memmove the programmer decided that misaligned
> > > accesses were faster than other ways of implementing them (e.g.,
> > > bytewise access). There were only a few such routines, and they would
> > > have been easy to replace.
> >
> > I suspect that this would have been considered a bug if the alignment
> > traps had been enabled, because misaligned moves tend to actually be
> > slower than aligned moves, and possibly even bytewise moves, depending
> > on the size of the transfer, due to potential breakup of cache lines.
>
> The important bit about allowing potentially misaligned moves is that it
> simplifies the _code_ a lot, getting rid of a series of special-cased
> jumps to handle the different alignment permutations.

Many of the architectures which enforce alignment traps also support
word-sized access instructions that do not enforce alignment. This
takes care of these situations without forcing these routines to do
bytewise accesses, instead allowing word-sized operations to be
performed in parallel on series of bytes.

Dave Harris

unread,
Dec 7, 2000, 10:42:00 AM12/7/00
to
ma...@mash.engr.sgi.com (John R. Mashey) wrote (abridged):

> I heard the following viewpoint:
> (a) Either give me a LISP Machine
> OR (b) Give me something generic and fast
> BUT don't expect a lot from just a few extra features to help.

That was roughly the view of the Self project, at:
http://self.sunlabs.com/

Some of the papers there look at the cost of polymorphic dispatching.
With their compiler, they found the costs were low enough that they'd
rather have faster generic operations than zero-cost specific support.

Dave Harris, Nottingham, UK | "Weave a circle round him thrice,
bran...@cix.co.uk | And close your eyes with holy dread,
| For he on honey dew hath fed
http://www.bhresearch.co.uk/ | And drunk the milk of Paradise."

Jim Haynes

unread,
Dec 7, 2000, 12:11:59 PM12/7/00
to
As for why packed decimal, not addressing the question of why decimal, it
seems to me to be just an artifact of having chosen an 8-bit byte; or maybe
an argument in favor of choosing an 8-bit byte. So we could ask which
came first - the 8-bit byte or packed decimal?

(I think the answer is 8-bit byte but I won't insist on it.)

It seems to me - big dose of personal opinion here - that IBM had a tendency
to solve the previous generation's problems. Hence packed decimal would save
storage space in small memories and disks; but by the time it was adopted
storage space was not so serious an issue as it had been in the previous
generation of machines. And I rather doubt that the saving in storage space
was worth the time it took to pack and unpack data.

Another example of solving the previous generation's problems is seen in the

glen herrmannsfeldt

unread,
Dec 7, 2000, 12:42:12 PM12/7/00
to
"Thomas Womack" <t...@womack.net> writes:
>"Jonathan Thornburg" <jth...@mach.thp.univie.ac.at> wrote

>> You mean AAA doesn't work on multiple digits in parallel? Funny, I
>> thought I remembered the Z-80 instruction did both hex digits in the
>> (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
>> per add/sub, not one per BCD digit. So I'd think that the speedup from
>> doing BCD in an MMX register would "just" be 64 vs 32 bits. No?

>No, AAA works only on the pair of digits stored in the AL register;

I believe AAA (ASCII adjust for addition) works on only one digit.
There should be DAA (Decimal Adjust for Addition) that works on two.
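For readers who have not met these opcodes, here is a hedged C model of what a
binary ADD followed by DAA does to a byte holding two packed BCD digits (the
function name and carry handling are mine; see an x86 manual for the exact
flag rules):

/* Model of ADD AL,src followed by DAA, for one byte of two packed BCD digits. */
unsigned char bcd_add_byte(unsigned char a, unsigned char b, int *carry_out)
{
    unsigned int sum = a + b;                 /* plain binary add */
    if ((a & 0x0F) + (b & 0x0F) > 9)          /* low digit overflowed its nibble (AF case) */
        sum += 0x06;
    if (sum > 0x99)                           /* high digit overflowed (CF case) */
        sum += 0x60;
    *carry_out = sum > 0xFF;                  /* decimal carry out of the byte */
    return (unsigned char)sum;
}

For example, with an int c as the carry variable, bcd_add_byte(0x19, 0x28, &c)
yields 0x47, and bcd_add_byte(0x99, 0x01, &c) yields 0x00 with c set.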

-- glen

Duane Rettig

unread,
Dec 7, 2000, 12:07:58 PM12/7/00
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> In article <4zoi8r...@beta.franz.com>,
> Duane Rettig <du...@franz.com> writes:

> >Yes, a fast trap-handling capability would be very nice. But we already
> >_do_ use a trick that effects tag-checking on any architectures that
> >implement alignment traps (note that this does not include Power, which
> >insists on fixing up misaligned data rather than allowing user traps to
> >do so, and it also excludes IA-32, which does have an alignment-trap
> >enable bit on 486 and newer, but no operating system I know of allows
> >the setting of this bit).
>

> When I tried it on Linux (in 1994, probably with Linux-1.0), it
> worked; the code I used was
>
> __asm__("pushfl; popl %eax; orl $0x40000, %eax; pushl %eax; popfl;");
>
> However, the problem with this was that the C library caused alignment
> faults, for two reasons:

Yes, of course; I view the two statements "operating system A does not
allow operation X" and "operation X cannot be successfully used under
operating system A" to be equivalent statements. I also tried similar
flag-setting on Linux and had the same result. I guess the way I worded
my statement, though, was misleading.

> 1) For routines like memmove the programmer decided that misaligned
> accesses were faster than other ways of implementing them (e.g.,
> bytewise access). There were only a few such routines, and they would
> have been easy to replace.

I suspect that this would have been considered a bug if the alignment
traps had been enabled, because misaligned moves tend to actually be
slower than aligned moves, and possibly even bytewise moves, depending
on the size of the transfer, due to potential breakup of cache lines.

> 2) Many floating-point routines did misaligned accesses to floats,
> because in their infinite wisdom Intel had required 4-byte alignment
> for doubles in their ABI (to which the library conformed), whereas the
> 486 reported a misaligned access unless the double was 8-byte aligned.
> That's why I did not use that feature.

It all comes down to having correct defaults. Intel got it wrong, and
Linux followed suit in the interest of compatibility. N.B. when I say
"got it wrong", I don't mean that it was a conscious decision to do the
wrong thing; rather, when the ABI was created they probably just did not
have the foresight to think that larger objects might want to be
optimized by the hardware when accessed aligned on their natural size.

One more story:

We did a port of our product, under contract, to NT on the Alpha.
The Alpha hardware also has a bit to enable alignment traps, and
Dec-Unix/Tru64 both got the default right, so we were hoping to use
the same code generation for the NT port as we used for Tru64.
However, the default setting of the bit was to "fixup" misaligned
accesses. There was, however, a system setting that could be changed
which disabled the fixups, and thus allowed our alignment traps to get
through to user code, and in fact they had thought far enough ahead to
allow a system call to re-enable these fixups on a per-program basis
for programs that needed fixups.

However, this scheme was completely backwards, because it forced
programs that don't care about alignment to make the per-process call to
re-enable the fixups, which was highly unlikely to happen. As a result,
when I un-set the fixup mode system-wide, most of the system worked, but
WinHelp would die due to an alignment trap fault.

Ironically, in looking at NT documentation, I noticed that NT on MIPS
(which had already died by that time) got the defaults correct: there
was a system-wide fixup _enabling_ setting (which, in fact, was set to
enable fixups by default), and individual programs such as our product
could disable the fixups on a per-process basis (thus allowing us to
receive alignment traps as was desired).

As a result, the code generation for every normal lisp-object access on
Alpha/NT was three or four instructions instead of just one, because we
must check the Tag instead of letting the hardware check it for us.

Defaults are key.
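For readers who haven't seen the low-tag trick being alluded to, here is a
hedged guess at its shape in C (the tag value, struct layout, and names are
invented for illustration; this is not Franz's actual scheme):

#include <stdint.h>
#include <stdlib.h>

#define CONS_TAG 1                   /* made-up: type tag kept in the low 3 bits of a pointer */
typedef struct cons { void *car, *cdr; } cons;

/* Trap-checked access: dereference at an offset that cancels the tag.
 * A wrong tag gives a misaligned address, and with alignment traps
 * enabled the hardware reports the type error for free. */
void *car_trap_checked(char *p)
{
    return ((cons *)(p - CONS_TAG))->car;      /* one load instruction */
}

/* Explicit check: what must be emitted when fixups swallow the trap. */
void *car_explicit_check(char *p)
{
    if (((uintptr_t)p & 7) != CONS_TAG)        /* mask, compare, branch... */
        abort();                               /* stand-in for signalling a Lisp type error */
    return ((cons *)(p - CONS_TAG))->car;
}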

Andi Kleen

unread,
Dec 7, 2000, 12:50:43 PM12/7/00
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>
> When I tried it on Linux (in 1994, probably with Linux-1.0), it
> worked; the code I used was
>
> __asm__("pushfl; popl %eax; orl $0x40000, %eax; pushl %eax; popfl;");

It depends on your BIOS if the Alignment flag is set or not -- Linux never
rewrites CR0 completely, just clears and sets bits as needed.
ftp://ftp.firstfloor.org/pub/ak/smallsrc/alignment.c shows how to set it
always.


-Andi

Leif Svalgaard

unread,
Dec 8, 2000, 3:00:00 AM12/8/00
to

Jim Haynes <hay...@alumni.uark.edu> wrote in message
news:zxPX5.53969$nh5.4...@newsread1.prod.itd.earthlink.net...

> As for why packed decimal, not addressing the question of why decimal, it
> seems to me to be just an artifact of having chosen an 8-bit byte; or
maybe
> an argument in favor of choosing an 8-bit byte. So we could ask which
> came first - the 8-bit byte or packed decimal?
>
> (I think the answer is 8-bit byte but I won't insist on it.)

General Electric 435 of early 1960's vintage had (has actually, as they
are still around - called Bull 9000 now) 9-bit bytes and packed decimal
arithmetic. The sign handily fitted in the 9th bit. No half bytes
wasted as in the IBM-360 (and AS/400) architecture.

Stephen Fuld

unread,
Dec 8, 2000, 3:00:00 AM12/8/00
to
"Del Cecchi" <cec...@signa.rchland.ibm.com> wrote in message
news:90olni$1hg0$1...@news.rchland.ibm.com...

> In article <zxPX5.53969$nh5.4...@newsread1.prod.itd.earthlink.net>,
> hay...@alumni.uark.edu (Jim Haynes) writes:

snip

>
> |>
> |> Another example of solving the previous generation's problems is seen in the
> |> System/360 disk usage, where you were allowed to choose any record size and
> |> blocking you wished so as to maximize the number of bits you could get on a
> |> track. Which means when you change to a newer model disk you have to
> |> recalculate how you are going to pack things and then re-pack all your
> |> data before you can get rid of the old disks. It took IBM a long time to
> |> get around to "inventing" "fixed block architecture" which makes disk space
> |> allocation so much easier and avoids repacking at the expense of wasting a
> |> little space on the disk.

> FBA was invented pretty early. It just took a while to get all the
> count-key-data dependencies out of MVS.
> --

Unless it happened within the last year or so, there are still LOTS of CKD
dependencies in MVS. Besides the variable length record requirement,
partitioned data sets still use search key commands to locate members. The
extended PDSs supposedly don't require this, but are not yet in wide use.
And system-oriented things like VTOCs still require short records, etc.
Compatibility from 25 years ago!

--
- Stephen Fuld

>
> Del Cecchi
> cecchi@rchland

David Hoyt

unread,
Dec 8, 2000, 10:23:11 AM12/8/00
to
Dave Harris wrote:

> ma...@mash.engr.sgi.com (John R. Mashey) wrote (abridged):
> > I heard the following viewpoint:
> > (a) Either give me a LISP Machine
> > OR (b) Give me something generic and fast
> > BUT don't expect a lot from just a few extra features to help.
>
> That was roughly the view of the Self project, at:
> http://self.sunlabs.com/
>
> Some of the papers there look at the cost of polymorphic dispatching.
> With their compiler, they found the costs were low enough that they'd
> rather have faster generic operations than zero-cost specific support.

To quote David Ungar, "Modern CPU's are like drag racers, they go in a
straight line very fast, but they can't turn worth crap." While compilers
are getting pretty good at removing many of the turns, modern languages
(OO, logical, ML like) and large systems like DBMS's still need to turn
often. The single thing that would help the most is to reduce latency.
Seeing new CPU's like the P4 with god-awful pipeline lengths is a bit
depressing. Perhaps the trace cache and large rename stuff will help, but
I'll hold back until someone proves it.

david

Jonathan Thornburg

unread,
Dec 8, 2000, 11:35:32 AM12/8/00
to
Terje Mathisen <terje.m...@hda.hydro.com> pointed out
[[about the x86 BCD-fixup opcodes]]

> However, using these opcodes is now _much_ slower than to work with a
> bunch of digits in parallel in a 64-bit MMX register. :-)

I asked:
% You mean AAA doesn't work on multiple digits in parallel? Funny, I
% thought I remembered the Z-80 instruction did both hex digits in the
% (8-bit) accumulator in parallel, i.e. you only needed one decimal-adjust
% per add/sub, not one per BCD digit. So I'd think that the speedup from
% doing BCD in an MMX register
[[on an x86 with MMX]]
% would "just" be 64 vs 32 bits. No?

In article <3A2EB011...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> replied:


>Nope, 64 vs 8 bits. I.e. 8x speedup to swallow the extra instructions
>needed.
>
>All the AA* opcodes work on just one or two BCD digits.

Yes, that would certainly explain the factor of 8 in performance.

But is there a good reason _why_ the nominally 32-bit x86 models
didn't extend the semantics of the AA* opcodes to work on the 8 BCD
digits one would naturally store in a 32-bit register? I would think
the same schemes used to fudge backwards compatability for old 8-bit
and 16-bit code could be used here...
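For what it's worth, the decimal-adjust step can be done on many digits at
once without any new opcodes; here is a hedged C sketch of the well-known
carry-propagation trick for 7 packed BCD digits in a 32-bit register (the top
digit is left free to catch overflow; the same idea widens to a 64-bit or MMX
register):

#include <stdint.h>

uint32_t bcd_add32(uint32_t a, uint32_t b)
{
    uint32_t t1 = a + 0x06666666;        /* pre-bias every digit by 6 */
    uint32_t t2 = t1 + b;                /* one binary add for all digits */
    uint32_t t3 = t1 ^ b;                /* what the sum bits would be without carries */
    uint32_t t4 = t2 ^ t3;               /* bit 4*i set iff digit i-1 produced a carry */
    uint32_t t5 = ~t4 & 0x11111110;      /* digits that did NOT receive a carry */
    uint32_t t6 = (t5 >> 2) | (t5 >> 3); /* put a 6 back in each such digit */
    return t2 - t6;                      /* remove the bias where it wasn't consumed */
}

For example, bcd_add32(0x0123456, 0x0654321) returns 0x0777777, and
bcd_add32(0x0999999, 0x0000001) returns 0x1000000.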

--
-- Jonathan Thornburg <jth...@thp.univie.ac.at>
http://www.thp.univie.ac.at/~jthorn/home.html
Universitaet Wien (Vienna, Austria) / Institut fuer Theoretische Physik

"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
- quote by Freire / poster by Oxfam

Jim Haynes

unread,
Dec 8, 2000, 1:02:34 PM12/8/00
to
In article <90p0kr$d1b$1...@xuxa.iecc.com>, John R Levine <jo...@iecc.com> wrote:
>The most popular member of the 360 series was the 360/30, which came
>with 8K bytes standard and could be expanded to a maximum of 64K. The
>disks were usually 2311s, which stored 5MB per disk pack. Bytes still
>mattered.

Yeah, but they learned really quickly that 8K bytes was nowhere near enough
for a machine with that architecture and those ambitions for software
compatibility across the line. So the 8K byte target was another case of
using the previous model machine as a basis for the next.


>
>The other reason that packed decimal was useful was that these
>computers were slow, and the memory was slow, too, 1.5 or 2us per byte
>fetched or stored. If you're just doing a few arithmetic operations
>per value, which is typical in commercial programs, it's faster to do
>them in packed decimal than to convert to binary.

I didn't have in mind to convert to binary, but to do decimal arithmetic on
8-bit bytes, analogous to doing it in 6-bit bytes on the 1401.
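For concreteness, a hedged sketch of that digit-per-byte style of arithmetic
in C (representation and names are mine; the 1401 of course did this in
hardware on its character memory):

/* Add two unpacked decimal numbers, one digit (0-9) per byte, most
 * significant digit first, both ndigits long; returns the carry out. */
int unpacked_add(unsigned char *dst, const unsigned char *a,
                 const unsigned char *b, int ndigits)
{
    int carry = 0;
    for (int i = ndigits - 1; i >= 0; i--) {
        int d = a[i] + b[i] + carry;
        carry = (d >= 10);
        dst[i] = (unsigned char)(d - (carry ? 10 : 0));
    }
    return carry;
}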

Jim Haynes

unread,
Dec 8, 2000, 1:08:59 PM12/8/00
to
In article <SUXX5.850$wo2....@typhoon.austin.rr.com>,

Leif Svalgaard <le...@leif.org> wrote:
>
>General Electric 435 of early 1960's vintage had (has actually, as they
>are still around - called Bull 9000 now) 9-bit bytes and packed decimal
>arithmetic. The sign handily fitted in the 9th bit. No half bytes
>wasted as in the IBM-360 (and AS/400) architecture.
>
Well I haven't kept up with the ex-GE machines since I worked on the 635
family over 30 years ago. That family of machines also had 9-bit bytes
as an artifact of the 36-bit word. And it was pretty awful, because IBM
had decreed an 8-bit byte for magnetic tape, so you couldn't write 9-bit
bytes to tape without either dropping one bit for character data or
writing whole 36 bit words to keep the extra bits.

Bruce Hoult

unread,
Dec 8, 2000, 4:22:26 PM12/8/00
to
In article <3A30FCDF...@cognent.com>, David Hoyt
<david...@cognent.com> wrote:

> > Some of the papers there look at the cost of polymorphic dispatching.
> > With their compiler, they found the costs were low enough that they'd
> > rather have faster generic operations than zero-cost specific support.
>
> To quote David Ungar, "Modern CPU's are like drag racers, they go in a
> straight line very fast, but they can't turn worth crap." While compilers
> are getting pretty good at removing many of the turns, modern languages
> (OO, logical, ML like) and large systems like DBMS's still need to turn
> often. The single thing that would help the most is to reduce latency.
> Seeing new CPU's like the P4 with god-awful pipeline lengths are a bit
> depressing. Purhaps the trace cache and large rename stuff will help, but
> I'll hold back until someone proves it.

Well there's always the PowerPC G3 and G4 with teeny-tiny little
pipelines. They corner pretty well.

The masses don't seem to like the low MHz numbers they have even though
the 500 MHz G3 in a current iMac performs about the same as a 700 - 800
MHz Pentium III -- which is more than *most* people have.

-- Bruce

Tim McCaffrey

unread,
Dec 8, 2000, 6:12:23 PM12/8/00
to
In article <90r2kk$uaa$1...@mach.thp.univie.ac.at>, Jonathan Thornburg says...

>
>But is there a good reason _why_ the nominally 32-bit x86 models
>didn't extend the semantics of the AA* opcodes to work on the 8 BCD
>digits one would naturally store in a 32-bit register? I would think
>the same schemes used to fudge backwards compatability for old 8-bit
>and 16-bit code could be used here...
>

I heard in the past that a "big" decimal adder is expensive both in gates and
in its impact on the execution pipeline.

I would like to note here that since the 8087, the x86 line has had two
instructions that helped tremendously in converting BCD to integer and back
again: FBLD (BCD load) and FBSTP (BCD store and pop). The execution times are
as follows:

CPU/FPU      FBLD (cycles)   FBSTP (cycles)
88/87        300             530
286/287      300             530
386/387      45-97           112-190
486DX        75              175
Pentium      48-58           148-154
Pentium II   76*             182*

* The PII numbers were done by a small program that repeatedly loaded/stored a
10-digit number. Therefore, I would consider the numbers approximations at
best, and they are probably a bit high.

(Note: the original 8087 ran at 5 MHz, so an FBSTP took ~106 microseconds to
execute; assuming a Pentium III needs about the same cycle count, it would take
a 933 MHz PIII ~0.2 microseconds to do the same task.)

I wonder if any Cobol compilers ever used these instructions?
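For reference, here is a hedged C sketch of the ten-byte packed-BCD memory
format FBLD reads (the decoder is purely illustrative, not how any compiler's
runtime does it):

#include <stdint.h>

/* x87 packed-BCD operand: bytes 0..8 hold 18 BCD digits, two per byte,
 * least significant byte first; bit 7 of byte 9 is the sign. */
int64_t decode_x87_bcd(const unsigned char tbyte[10])
{
    int64_t v = 0;
    for (int i = 8; i >= 0; i--)
        v = v * 100 + (tbyte[i] >> 4) * 10 + (tbyte[i] & 0x0F);
    return (tbyte[9] & 0x80) ? -v : v;
}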

Tim McCaffrey

Paul DeMone

unread,
Dec 8, 2000, 8:39:09 PM12/8/00
to

David Hoyt wrote:

> Seeing new CPU's like the P4 with god-awful pipeline lengths are a bit
> depressing. Purhaps the trace cache and large rename stuff will help, but
> I'll hold back until someone proves it.

Have you seen how P4 does on gcc in their SPEC2k submission?

My understanding is that gcc is one of the hardest programs in
SPECint to speed up because of all the data and control dependencies.
High clock rate is a powerful tool to attack stubborn scalar code that
resists the brainiac approach.

--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.

Del Cecchi

unread,
Dec 8, 2000, 11:56:43 PM12/8/00
to

Ah yes, the transition from 7 track to 9 track tape was probably
involved in there somewhere. My guess would be that the old 7094/1401
vintage tapes wrote 6 bit+parity characters to the tape. The 360
vintage tapes wrote 8b+P on 9 track tape. Rochester machines started
with System/3, which vaguely resembled a 360 and had 8-bit bytes.

del cecchi

David Hoyt

unread,
Dec 9, 2000, 11:42:01 AM12/9/00
to
Paul DeMone wrote:

> Have you seen how P4 does on gcc in their SPEC2k submission?
>
> My understanding is that gcc is one of the hardest programs in
> SPECint to speed up because of all the data and control dependencies.
> High clock rate is a powerful tool to attack stubborn scalar code that
> resists the brainiac approach.

I don't know for sure, but I doubt the gcc SPEC test is as twisty as a Smalltalk app, or
even a Java simulation code. It might be as twisty as a complex, optimized query/update
code in a DBMS, but it's small enough not to have the same kinds of i-cache pressure that a
DBMS puts on a processor. While I wouldn't be astounded if it's a good match for the codes
I run, I suspect it's not that close a match, either.

But for all that, let's look at the GCC results for the 1.0 GHz P3, the 1.5 GHz P4 and the
0.833 GHz Alpha.

        GHz     base   result   base/GHz   result/GHz
p3      1.0     401    408      401        408
p4      1.5     588    592      392        395
264     0.833   617    687      699        778

Let's assume that gcc is representative of my codes. What do I learn? That it still isn't
as fast as the alpha; and not only that, even with the trace cache and double-pumping the
integer unit, it's no faster on a clock-for-clock basis than the p3! Now the p3 has had lots
of refinement time, and the p4 is just out of the chute, so it might not be a fair
comparison. But I would have expected that with that many more transistors, they would have
been able to speed up the performance per clock.

Now a few disclaimers. I'm doing most of my development (not production) on a P2/300, so
I bet I'd be happy with the performance of any of those processors. I didn't look at
the other processors, so they might be better than anything other than the alpha. Also, for
multi-media and single-precision (graphics) floating point codes, which are very much like
simple vector codes with few twisty paths, the p4 is likely a big win. So if intel's
target for the p4 is low-end gaming machines (that don't have an emotion engine in them),
they might have made the right choices.

I'm still doubtful as to whether they helped my codes out very much.

david

Paul DeMone

unread,
Dec 9, 2000, 4:02:54 PM12/9/00
to

David Hoyt wrote:
>
> Paul DeMone wrote:
>
> > Have you seen how P4 does on gcc in their SPEC2k submission?
> >
> > My understanding is that gcc is one of the hardest programs in
> > SPECint to speed up because of all the data and control dependencies.
> > High clock rate is a powerful tool to attack stubborn scalar code that
> > resists the brainiac approach.
>
> I don't know for sure, but I doubt the gcc spec test is as twisty as a Smalltalk app, or
> even a Java simulation code. It might be as twisty as a complex, optimized query/update
> code in a DBMS; but its small enough to not have the same kinds of i-cache pressure that a
> DBMS puts on a processor. While I wouldn't be astounded if its a good match for the codes
> I run, but I suspect its not that close of a match, either.
>
> But for all that, lets look at the GCC results for the 1.0 GHz P3, the 1.5 P4 and the
> 0.833GHz alpha.
>
> GHz base result base/GHz result/GHz
> p3 1.0 401 408 401 408
> p4 1.5 588 592 392 395
> 264 0.833 617 687 699 778
>
> Lets assume that gcc is representative as my codes. What do I learn? That it still isn't
> as fast as the alpha; and not only that, even with the trace cache and double-pumping the
> integer unit, its no faster on a clock for clock basis than the p3! Now the p3 has lots
> of refinement time, and the p4 is just out of the shoot, so it might not be a fair
> comparison. But I would have expected with that many more transistors, they would have be
> able to speed up the performance per clock.

Careful, you have fallen into the trap of trying to compare
architectural efficiency on the basis of clock-normalized
performance. Here is an example of what I mean. Here are the
gcc results for two processors A and B:

Processor GHz gcc.base gcc.base/GHz

A 1.0 377 377
B 0.8 336 420

Which processor has the most efficient microarchitecture? B, right?
After all, it achieves 11% higher clock-normalized behaviour.

Wrong!

In my example A and B are identical other than clock rate. A and
B are coppermine PIIIs on Intel's VC820 board using IRC 5.0 compiled
SPEC2k binaries. Can the exact same computer be 11% more efficient
than itself? No, obviously this methodology is useless (aside from
enhancing the spirits of Mac enthusiasts ;-). The reason is that the
computer from the MPU's bus interface outwards doesn't scale with
the processor clock rate, so at higher frequencies the external
latencies grow larger in terms of clock cycles.

To compare the microarchitectural efficiency of P4 vs PIII you
either have to estimate how the PIII's performance scales up to
the P4's clock rate or slow down a P4 to P3 clock rates and rerun
SPEC. Please note that the two approaches aren't equivalent because
the two designs don't scale the same way. Running a P4 at 1.0 GHz
reduces its relative architectural efficiency vs PIII.

The P4 was designed to run at high clock rates, so let's see its
strength with home field advantage. The PIII/VC820/IRC 5.0 combo
at 1.133 GHz gives 409 on gcc.base. So a 13.3% increase in clock
rate gives 8.4% more performance. Using the same scaling factor
applied linearly from 1.133 GHz up to 1.4 GHz, the PIII would be
estimated to get 14.9% higher gcc.base performance, or 470. In
reality the PIII would fall a bit short of that due to diminishing
returns as the memory component of CPI increases as a percentage
of overall CPI, but I don't want to go into a more accurate model
because linear scaling is sufficient to show my point.

So a hypothetical PIII running at 1.4 GHz would score under 470
on gcc.base while the P4 at 1.4 GHz yields 573, for more than a
22% IPC advantage. Not quite the same result as you get from
clock normalization, is it? A 22% IPC advantage on gcc isn't
bad in light of how the P4 has been attacked on the basis that
its long pipeline would suffer badly on code with a lot of
control dependencies.
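Just to make that arithmetic explicit, here is a hedged sketch of the estimate
in C (the linear-scaling assumption is the one stated above; the helper is
mine):

#include <stdio.h>

int main(void)
{
    double perf_1133 = 409.0;                  /* PIII gcc.base at 1.133 GHz */
    double scale = 0.084 / 0.133;              /* ~0.63: perf gain per unit of clock gain */
    double clk_step = 1.4 / 1.133 - 1.0;       /* ~23.6% more clock to reach 1.4 GHz */
    double est_1400 = perf_1133 * (1.0 + scale * clk_step);

    printf("estimated PIII gcc.base at 1.4 GHz: %.0f\n", est_1400);              /* ~470 */
    printf("P4 advantage at 1.4 GHz: %.0f%%\n", (573.0 / est_1400 - 1.0) * 100); /* ~22 */
    return 0;
}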

Paul DeMone

unread,
Dec 9, 2000, 4:10:54 PM12/9/00
to

Paul DeMone wrote:

> So a hypothetical PIII running at 1.4 GHz would score under 470
> on gcc.base while the P4 at 1.4 GHz yields 573 for more than a
> 22% IPC advantage.

Safer to say 22% performance advantage at the same clock rate. The
IPC advantage could be different from 22% if the gcc binaries run
on the PIII and P4 were different.

Terje Mathisen

unread,
Dec 10, 2000, 3:10:44 PM12/10/00
to
Jonathan Thornburg wrote:
>
> Terje Mathisen <terje.m...@hda.hydro.com> pointed out
> Terje Mathisen <terje.m...@hda.hydro.com> replied:
> >Nope, 64 vs 8 bits. I.e. 8x speedup to swallow the extra instructions
> >needed.
> >
> >All the AA* opcodes work on just one or two BCD digits.
>
> Yes, that would certainly explain the factor of 8 in performance.
>
> But is there a good reason _why_ the nominally 32-bit x86 models
> didn't extend the semantics of the AA* opcodes to work on the 8 BCD
> digits one would naturally store in a 32-bit register? I would think
> the same schemes used to fudge backwards compatability for old 8-bit
> and 16-bit code could be used here...

One possible/probable reason is that Intel have traces of lots of
'important' programs (for some definition of 'important'), and these
show that the DAx/AAx opcodes are almost totally unused.

This made it almost certain that they would not use more opcode space
for 16/32-bit versions of these functions.

Bernd Paysan

unread,
Dec 10, 2000, 5:41:56 PM12/10/00
to
Paul DeMone wrote:
> So a hypothetical PIII running at 1.4 GHz would score under 470
> on gcc.base while the P4 at 1.4 GHz yields 573 for more than a
> 22% IPC advantage. Not quite the same result as you get from
> clock normalization is it? A 22% IPC advantage on gcc isn't
> bad in light of how the P4 has been attacked for on the basis
> its long pipeline would suffer badly on code with a lot of
> control dependencies.

One of the most noticeable differences between the P3 and the P4 is the
improved bus speed. A good comparison can be found with the Athlon, which is
available with P3-comparable bus speed (the limitation here being the
PC133-SDRAM and the VIA boards, which have the same DRAM interface for
both CPUs), and has also been tested with P4-comparable bus speed
(DDR-SDRAM with double-pumped FSB at 133MHz; the P4 is theoretically a
bit better). The Athlon already gains significant performance at 1GHz,
and scales much better at higher frequencies.

We all know that an underpowered memory interface can dry up the clock
frequency improvements of any CPU; keeping up with the memory interface
is important (enlarge the caches and/or make the bus faster/lower
latency). The GTL+ bus speed went up by about a factor of two during its
entire lifetime, while the core speed went up by a factor of five or
six.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

David Hoyt

unread,
Dec 13, 2000, 10:24:44 AM12/13/00
to
Paul DeMone wrote:

> David Hoyt wrote:
> > Lets assume that gcc is representative as my codes. What do I learn? That it still isn't
> > as fast as the alpha; and not only that, even with the trace cache and double-pumping the
> > integer unit, its no faster on a clock for clock basis than the p3!
>

> Careful, you have fallen into the trap of trying to compare
> architectural efficiency on the basis of clock normalized
> performance. Here is an example of what I mean. Here is the
> gcc results for two processors A and B:

> ...

I wasn't trying to look at anything related to IPC. I do realize one of the reasons that
intel developed the p4 is that it couldn't figure out a way to speed up the p3 line much more.
It's just that most micro-arch. jumps lately have given better than clock-for-clock improvements;
I've gotten jaded.

Perhaps I cut too much from my post, and the context of my response got lost. My initial
comment was that the single most important thing to speed up languages like CLOS, Smalltalk,
Java, Self and the like isn't special instructions like tagged add or cons cell or cdr coding
hardware, but rather the ability to "turn fast." And while I wouldn't be astounded if the
p4's very long pipeline proved to be a win, I didn't expect it to be of much help for my
codes.

Someone else told me to look at the p4's gcc/spec2000 scores. For the 1.5 GHz system submitted,
that system wasn't any faster on a clock for clock basis than a p3 (system), and still worse
than a 264.

It's possible that the p4 system isn't looking that good from a system perspective over a p3
system, because intel doesn't have the deep experience in it that they have in the p3. New
usually means not optimized. It might also be that the gcc is hitting system walls, but that
seems unlikely because the 264 system is that much faster.

My conclusion is still that intel didn't spend the time of very smart engineers and a very large
transistor budget to make my code run like a bat out of hell. At best (which isn't bad) I get
to float up with the rest of the flotsam. Long pipelines will be much more of a help in
multi-media, graphics and other vectorizable codes than it will be for twisty codes like OO
programs and complex simulations. But if I get back the extra cost of a p4 system over a p3,
I'd still be happy to buy one.

Back to what David Ungar said: "Modern microprocessors are like drag racers, they go in a
straight line very fast, but they can't turn very well." Long pipelines are like the long noses
on drag racers, great for drag races but they won't help in a LeMans Grand Prix. If intel cared
about making my codes fast, they would be doing something else.

I still believe that adding the ability to turn quickly is the single most important thing a cpu
designer can add to a microprocessor to help speed up OO codes; don't waste time on tagged adds
and cons cell support. Maybe spend time on fine-grained pages, but really please make them turn
fast.

david

Anton Ertl

unread,
Dec 14, 2000, 11:33:27 AM12/14/00
to
In article <4g0k0n...@beta.franz.com>,
Duane Rettig <du...@franz.com> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
[486 alignment checks]

>> However, the problem with this was that the C library caused alignment
>> faults, for two reasons:
>
>Yes, of course; I view the two statements "operating system A does not
>allow operation X" and "operation X cannot be successfully used under
>operating system A" to be equivalent statements. I also tried similar
>flag-setting on Linux and had the same result. I guess the way I worded
>my statement, though, was misleading.

Especially since it's not clear that a libc limitation would be a big
issue for a Lisp implementation. OTOH, I did my experiments in the
context of a Forth implementation, so I am not surprised.

>> 2) Many floating-point routines did misaligned accesses to floats,
>> because in their infinite wisdom Intel had required 4-byte alignment
>> for doubles in their ABI (to which the library conformed), whereas the
>> 486 reported a misaligned access unless the double was 8-byte aligned.
>> That's why I did not use that feature.
>
>It all comes down to having correct defaults. Intel got it wrong, and
>Linux followed suit in the interest of compatibility. N.B. when I say
>"got it wrong", it probably doesn't mean that it was a conscious decision
>to do the wrong thing, rather when the ABI was created they probably
>just did not have the foresight to think that larger objects might want
>to be optimized by the hardware when accessed aligned on their natural
>size.

Well, it would have been nice then if their hardware checking matched
their ABI (yes, that would have been more expensive to implement).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

unread,
Dec 17, 2000, 7:08:53 AM12/17/00
to
In article <bruce-4B141C....@news.nzl.ihugultra.co.nz>,

Bruce Hoult <br...@hoult.org> writes:
>In article <3A30FCDF...@cognent.com>, David Hoyt
><david...@cognent.com> wrote:
>> To quote David Ungar, "Modern CPU's are like drag racers, they go in a
>> straight line very fast, but they can't turn worth crap." While compilers
>> are getting pretty good at removing many of the turns, modern languages
>> (OO, logical, ML like) and large systems like DBMS's still need to turn
>> often. The single thing that would help the most is to reduce latency.
>> Seeing new CPU's like the P4 with god-awful pipeline lengths are a bit
>> depressing.

>Well there's always the PowerPC G3 and G4 with teeny-tiny little
>pipelines. They corner pretty well.

If you compute the pipeline length of a G3, G4, Pentium-III, Athlon,
and Pentium-IV in ns for the top speed grades available, there is not
as much difference as the cycle numbers suggest:

int. pipe length
CPU clock cycles ns
G3/G4 2ns 4 8
K7 0.83ns ~10 ~8.3
P6 1ns ~10 ~10
Pentium-IV 0.67ns ~20 ~13.3

Also, note that branch prediction lets the K7, P6, and Pentium-IV take
most corners quite fast.

>The masses don't seem to like the low MHz numbers they have even though
>the 500 MHz G3 in a current iMac performs about the same as a 700 - 800
>MHz Pentium III -- which is more than *most* people have.

On some benchmarks yes, but on others the G3 cannot keep up even with
Intel and AMD chips at the same clock frequency (see
http://www.complang.tuwien.ac.at/franz/latex-bench for an example).

BTW, a 500MHz G3 is also more than most people have. It may make
sense to compare the boxes people can buy right now, e.g., compare
prices or market segments. My guess is that in both respects a 500MHz
G3 box is not below a 700-800MHz Pentium-III box.

Niels Jørgen Kruse

unread,
Dec 17, 2000, 12:41:11 PM12/17/00
to
In the article <91iacl$jqb$1...@news.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Also, note that branch prediction lets the K7, P6, and Pentium-IV take
> most corners quite fast.

A branch that is correctly predicted is not a corner at all.

>>The masses don't seem to like the low MHz numbers they have even though
>>the 500 MHz G3 in a current iMac performs about the same as a 700 - 800
>>MHz Pentium III -- which is more than *most* people have.
>
> On some benchmarks yes, but on others the G3 cannot keep up even with
> Intel and AMD chips at the same clock frequency (see
> http://www.complang.tuwien.ac.at/franz/latex-bench for an example).

Are you sure that this benchmark measures CPU speed rather than disk speed
or dram latency?

I notice a 500 Mhz 21264 getting its butt kicked by a 450 Mhz AMD K6-3...

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Anton Ertl

unread,
Dec 19, 2000, 4:49:12 AM12/19/00
to
In article <i27%5.197$V76....@news.get2net.dk>,

"Niels J=?ISO-8859-1?B?+A==?=rgen Kruse" <nj_k...@get2net.dk> writes:
>I artiklen <91iacl$jqb$1...@news.tuwien.ac.at> ,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) skrev:
>
>> Also, note that branch prediction lets the K7, P6, and Pentium-IV take
>> most corners quite fast.
>
>A branch that is correctly predicted is not a corner at all.

That's one viewpoint. It has the disadvantage of making the number of
corners (and the length of straights) dependent on the CPU.

>> On some benchmarks yes, but on others the G3 cannot keep up even with
>> Intel and AMD chips at the same clock frequency (see
>> http://www.complang.tuwien.ac.at/franz/latex-bench for an example).
>
>Are you sure that this benchmark measures CPU speed rather than disk speed
>or dram latency?

Yes, at least in Unix. First of all, we are measuring user (CPU)
time, so waiting for disk is not counted. Moreover, the files are
cached in RAM after the first run, so disk speed is not an issue, as
can be seen by comparing CPU (user+system) and real time:

[~/franz-bench:1008] time latex bench >/dev/null

real 0m2.862s
user 0m2.820s
sys 0m0.030s

BTW, that's the Athlon 800; RedHat-6.2's LaTeX seems to be a little
slower than 5.1's.

If DRAM access was the limit, the Pentium-III box should be king
(lmbench latency 110ns), and the Athlon box (lmbench latency 190ns)
should be about as fast as the XP1000 (lmbench latency ~200ns).

And here are some performance counter results on the Athlon:

tsc : 2299127304
event 00430040: 1382569338 #d-cache acesses
event 00430041: 4270491 #d-cache misses
event 00431f42: 4132287 #d-cache refills from L2
event 00431f43: 408509 #d-cache refills from system
event 00431f44: 4537940 #d-cache writebacks (to L2?)
event 00430080: 977466461 #i-cache accesses
event 00430081: 1000256 #i-cache misses
event 004300c0: 1937771206 #instructions (retired)

Even assuming a write and a read per refill from system (each at
190ns) explains only 155ms of the time.

On MacOS screen output appears to take a significant amount of
time, though.

>I notice a 500 Mhz 21264 getting its butt kicked by a 450 Mhz AMD K6-3...

Yes, the K6-3 surprised us nicely on this benchmark. As for the
21264, this used the default RedHat-5.2 latex binary; using BWX
instructions and/or using ccc to compile TeX might provide some
speedup.

Thomas Womack

unread,
Dec 19, 2000, 6:38:56 AM12/19/00
to
"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote

> And here are some performance counter results on the Athlon:
>
> tsc : 2299127304
> event 00430040: 1382569338 #d-cache acesses
> event 00430041: 4270491 #d-cache misses
> event 00431f42: 4132287 #d-cache refills from L2
> event 00431f43: 408509 #d-cache refills from system
> event 00431f44: 4537940 #d-cache writebacks (to L2?)
> event 00430080: 977466461 #i-cache accesses
> event 00430081: 1000256 #i-cache misses
> event 004300c0: 1937771206 #instructions (retired)

Are those results for the entire system whilst the process was being run, or
actually per-process performance counters?

If the latter, I had no idea (despite Andy Glew having sat here saying how
wonderful performance counters were for, umm, at least five years ...) that
you could get that degree of profiling; what tool do you use for it?

> On MacOS screen output appears to take a significant amount of
> time, though.

Were you running LaTeX on a VGA text screen or in a 1600x1200 large-fonted
xterm with some clever outline-font system providing the characters?

Tom


Christian Bau

unread,
Dec 19, 2000, 9:04:47 AM12/19/00
to
In article <91nl7v$pcs$1...@news8.svr.pol.co.uk>, "Thomas Womack"
<t...@womack.net> wrote:

On MacOS, most programs that use stdout at all redirect their output to a
window with a complete text editor in the background, with all the
overhead of being able to handle multiple fonts, different fontsizes, line
breaks etc. As a result, screen output is usually only twice as fast as
you can read instead of 20 times as fast. Obviously very bad for
benchmarks :-(

On the other hand, nobody cares how fast console output is until someone
uses it as a benchmark.

Jeffrey S. Dutky

unread,
Dec 19, 2000, 9:34:29 AM12/19/00
to
Anton Ertl originally wrote:
>
> the G3 cannot keep up even with Intel and AMD chips at the same
> clock frequency (see
> <http://www.complang.tuwien.ac.at/franz/latex-bench> for an
> example).

To which Niels Jørgen Kruse responded:


>
> Are you sure that this benchmark measures CPU speed rather than
> disk speed or dram latency?

And Anton replied:


>
> Yes, at least in Unix.

But then goes on to say:


>
> On MacOS screen output appears to take a significant amount of
> time, though.
>
> > I notice a 500 Mhz 21264 getting its butt kicked by a 450 Mhz
> > AMD K6-3...
>
> Yes, the K6-3 surprised us nicely on this benchmark. As for the
> 21264, this used the default RedHat-5.2 latex binary; using BWX
> instructions and/or using ccc to compile TeX might provide some
> speedup.

So, you stipulate that you are only certain of what is being
measured when the benchmark is run under Unix, but are perfectly
happy to make scurrilous comments about the G3 based on tests
run under entirely different conditions. Further, you imply that,
at least on some unix-like systems, the choice of compiler may
be a mitigating factor as well.

It seems that your choice of benchmark, and of benchmarking
methodology, leaves much to be desired in the realms of both
accuracy and reproducibility. Why, for example, did you not run
the same OS/compiler suite on all three referenced systems (G3,
K6-3 and Alpha)? At least two unix-variants are readily available
for all three architectures (Linux and NetBSD, off the top of my
head) and would provide far more reliable and convincing numbers.

In your current arrangement, I suspect that what you are actually
measuring are OS/compiler differences, rather than any intrinsic
properties of the CPUs in question.

- Jeff Dutky

Jan Ingvoldstad

unread,
Dec 19, 2000, 1:53:53 PM12/19/00
to
On 17 Dec 2000 12:08:53 GMT, an...@mips.complang.tuwien.ac.at (Anton
Ertl) said:

> On some benchmarks yes, but on others the G3 cannot keep up even with
> Intel and AMD chips at the same clock frequency (see
> http://www.complang.tuwien.ac.at/franz/latex-bench for an example).

That is cute, but what is the benchmark, exactly? Do you have a URL
to the LaTeX document used, and the testing methodology? It would be
nice for others to have a peek at, too ...

(I'm sorry that I don't read German as well as I did a decade ago, and
that I don't trust Babelfish. There may be something there.)
--
In the beginning was the Bit, and the Bit was Zero. Then Someone
said, Let there be One, and there was One. And Someone blessed them,
and Someone said unto them, Be fruitful, and multiply, and replenish
the Word and subdue it: and have dominion over every thing that is.

Stefan Monnier <foo@acm.com>

unread,
Dec 19, 2000, 3:13:48 PM12/19/00
to
>>>>> "Jeffrey" == Jeffrey S Dutky <du...@bellatlantic.net> writes:
> In you current arrangement, I suspect that what you are actually
> measuring are OS/compiler differences, rather than any intrinsic
> properties of the CPUs in question.

That might be, although since it's user CPU time, maybe it is relevant.
Also I'm sure he just ran his benchmark on whatever he had available.
I sure don't expect someone to install an OS on a machine just to run a
benchmark for comp.arch's pleasure.

Whether it's a good CPU benchmark or not is actually rather irrelevant.
It is a system benchmark, just like SPEC, except that it doesn't claim
to be representative of any workloads other than LaTeX-like ones.


Stefan

Anton Ertl

unread,
Dec 19, 2000, 3:38:53 PM12/19/00
to
In article <91nl7v$pcs$1...@news8.svr.pol.co.uk>,
"Thomas Womack" <t...@womack.net> writes:
[performance counter results]

>Are those results for the entire system whilst the process was being run, or
>actually per-process performance counters?

Per-process.

>If the latter, I had no idea (despite Andy Glew having sat here saying how
>wonderful performance counters were for, umm, at least five years ...) that
>you could get that degree of profiling; what tool do you use for it?

http://www.complang.tuwien.ac.at/anton/linux-perfex/ based on Mikael
Pettersson's patches and library
(http://www.csd.uu.se/~mikpe/linux/perfctr/).

>> On MacOS screen output appears to take a significant amount of
>> time, though.
>
>Were you running LaTeX on a VGA text screen or in a 1600x1200 large-fonted
>xterm with some clever outline-font system providing the characters?

On MacOS? Whatever the default on the machine was; i.e. graphics mode.

Anton Ertl

unread,
Dec 19, 2000, 3:56:13 PM12/19/00
to
In article <72ACB55C...@bellatlantic.net>,

"Jeffrey S. Dutky" <du...@bellatlantic.net> writes:
>Anton Ertl originally wrote:
>>
>> the G3 cannot keep up even with Intel and AMD chips at the same
>> clock frequency (see
>> <http://www.complang.tuwien.ac.at/franz/latex-bench> for an
>> example).
...

>So, you stipulate that you are only certain of what is being
>measured when the benchmark is run under Unix, but are perfectly
>happy to make scurrilous comments about the G3 based on tests
>run under entirely different conditions.

Follow the URL above, and you will see that it also contains results
for Macs (including a G3) under Linux.

> Further, you imply that,
>at least on some unix-like systems, the choice of compiler may
>be a mitigating factor as well.

That's usually the case.

>It seems that your choice of benchmark, and of benchmarking
>methodology, leaves much to be desired in the realms of both
>accuracy and reproducability. Why, for example, did you not run
>the same OS/compiler suite on all three referenced systems (G3,
>K6-3 and Alpha)?

We did. There are Linux results for two PPCs, several IA32 boxes, and
several Alphas, and it's extremely probable that on the Linux
distributions TeX was compiled with gcc (we just used whatever binary
package was preinstalled on Linux).

>In you current arrangement, I suspect that what you are actually
>measuring are OS/compiler differences, rather than any intrinsic
>properties of the CPUs in question.

Even with the same compiler, do you truly measure intrinsic properties
of the CPU in question? A compiler may be better tuned for one
architecture than another, or conversely, a specific CPU can be more
sensitive to the presence or absence of certain optimizations that
some compiler may or may not support. You always measure the
combination of CPU/compiler/OS (at least if you want to be
architecture-independent); it's the same with the SPECcpu suite.

OTOH, if most users experience the G3 under MacOS, and the Athlon
under Windows 95+, why should it be inappropriate to benchmark these
CPUs under the respective OSs? Although LaTeX may not be a benchmark
that most users would be interested in.

Niels Jørgen Kruse

unread,
Dec 19, 2000, 8:18:24 PM12/19/00
to
In the article <91nauo$5vj$1...@news.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> In article <i27%5.197$V76....@news.get2net.dk>,
> "Niels J=?ISO-8859-1?B?+A==?=rgen Kruse" <nj_k...@get2net.dk> writes:
>>I artiklen <91iacl$jqb$1...@news.tuwien.ac.at> ,
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) skrev:
>>
>>> Also, note that branch prediction lets the K7, P6, and Pentium-IV take
>>> most corners quite fast.
>>
>>A branch that is correctly predicted is not a corner at all.
>
> That's one viewpoint. It has the disadvantage of making the number of
> corners (and the length of straights) dependent on the CPU.

Of course we can't ignore differences in prediction schemes. Let me rephrase
my statement to "a branch that is easy to predict is not a corner at all".
"Easy to predict" is CPU independent, although not very precise.

>>> On some benchmarks yes, but on others the G3 cannot keep up even with
>>> Intel and AMD chips at the same clock frequency (see
>>> http://www.complang.tuwien.ac.at/franz/latex-bench for an example).
>>
>>Are you sure that this benchmark measures CPU speed rather than disk speed
>>or dram latency?
>
> Yes, at least in Unix. First of all, we are measuring user (CPU)
> time, so waiting for disk is not counted; Moreover, the files are
> cached in RAM after the first run, so disk speed is not an issue, as
> can be seen by comparing CPU (user+system) and real time:
>
> [~/franz-bench:1008] time latex bench >/dev/null
>
> real 0m2.862s
> user 0m2.820s
> sys 0m0.030s
>
> BTW, that's the Athlon 800; RedHat-6.2's LaTeX seems to be a little
> slower than 5.1's.
>
> If DRAM access was the limit, the Pentium-III box should be king
> (lmbench latency 110ns), and the Athlon box (lmbench latency 190ns)
> should be about as fast as the XP1000 (lmbench latency ~200ns).

The long cachelines of the Athlon should help sequential write speed.


>
> And here are some performance counter results on the Athlon:

You left out TLB misses?

> tsc : 2299127304
> event 00430040: 1382569338 #d-cache acesses
> event 00430041: 4270491 #d-cache misses

Pretty good L1 miss rate; it lives mostly in L1.

> event 00431f42: 4132287 #d-cache refills from L2
> event 00431f43: 408509 #d-cache refills from system

Shouldn't refills equal misses? Effect of write-combining?

> event 00431f44: 4537940 #d-cache writebacks (to L2?)

The Thunderbird Athlon has exclusive L2, so it wouldn't make sense to count
refills from L2 twice. I think this is writebacks to system.

> event 00430080: 977466461 #i-cache accesses
> event 00430081: 1000256 #i-cache misses
> event 004300c0: 1937771206 #instructions (retired)

Dividing by 2.82 seconds, we get 687 Minstructions / second or 0.859
instructions / clock. If dram latency was unimportant, we would expect
better. The other explanation, rampant misprediction, would favor the G3.

The G3 stalls on a second write to a cacheline that is still waiting on the
fill triggered by the first. This is one of the major differences between the
G3 and the G4: the G4 supports write-combining and multiple outstanding misses.

> Even assuming a write and a read per refill from system (each at
> 190ns) explains only 155ms of the time.

I would not assume write and read to be the same speed.

> On MacOS screen output appears to take a significant amount of
> time, though.

Antialiasing on? How much output? Sending it to a file (on a ram-disk if
possible) should be faster.

Jeffrey S. Dutky

unread,
Dec 20, 2000, 12:31:24 AM12/20/00
to
Anton Ertl wrote:
>
> In article <72ACB55C...@bellatlantic.net>,
> "Jeffrey S. Dutky" <du...@bellatlantic.net> writes:
> >Anton Ertl originally wrote:
> > >
> > > the G3 cannot keep up even with Intel and AMD chips at
> > > the same clock frequency (see
> > > <http://www.complang.tuwien.ac.at/franz/latex-bench>
> > > for an example).
> ...
> > So, you stipulate that you are only certain of what is
> > being measured when the benchmark is run under Unix, but
> > are perfectly happy to make scurrilous comments about the
> > G3 based on tests run under entirely different conditions.
>
> Follow the URL above, and you will see that it also contains
> results for Macs (including a G3) under Linux.

Ok, I followed the URL and examined the results. From what I
can tell (unfortunately, I don't read German and babelfish is
only partially understandable) you have a very wide range of
test conditions: at least eight unix/linux versions (RH?-alpha,
RH5.1-intel, RH5.2-intel, RH6.0-intel, RH6.1-intel, RH6.2-intel,
S.u.s.e-intel, Debian-1.3.1-intel, Linux?-ppc, Digital Unix,
DEC-Unix and SunOS) as well as at least two versions of MacOS
and possibly other OSs (no mention is made of any Windows
installations, but some of the intel/AMD boxes are unaccounted
for).

All of the systems running G3s seem to be either upgrades,
which have sub-standard memory busses, or laptops. In either
case, there is much we don't know about the configuration and
performance characteristics of these systems that could result
in greatly diminished benchmark scores. (the upgraded PowerMac
7500 may or may not be using any L2 cache and the Powerbooks
are as likely as not to be in power cycling mode).

Overall, there seem to be more than enough unknowns in the test
setups to lend a great deal of skepticism to any claims made
from this data. We don't know what compilers, or what versions
of what compilers, were used to build any of the software on any
of the platforms. We don't know what optimization options were
used in the builds (this will make a big difference between
different models of the various PPC chips, for example).

> ...


> > It seems that your choice of benchmark, and of benchmarking
> > methodology, leaves much to be desired in the realms of both
> > accuracy and reproducability. Why, for example, did you not
> > run the same OS/compiler suite on all three referenced systems
> > (G3, K6-3 and Alpha)?
>
> We did. There are Linux results for two PPCs, several IA32
> boxes, and several Alphas, and it's extremely probably that
> on the Linux distributions TeX was compiled with gcc (we just
> used whatever binary package was preinstalled on Linux).
>
> > In you current arrangement, I suspect that what you are
> > actually measuring are OS/compiler differences, rather
> > than any intrinsic properties of the CPUs in question.
>
> Even with the same compiler, do you truly measure intrinsic
> properties of the CPU in question? ... You always measure the
> combination of CPU/compiler/OS (at least if you want to be
> architecture-independent); it's the same with the SPECcpu
> suite.
>
> OTOH, if most users experience the G3 under MacOS, and the
> Athlon under Windows 95+, why should it be inappropriate to
> benchmark these CPUs under the respective OSs? Although LaTeX
> may not be a benchmark that most users would be interested in.

But you did not make a claim about the performance of PowerMacs
running MacOS, but about the performance of the PowerPC G3 versus
Intel/AMD CPUs. Your data don't support that claim, in that the
data are hopelessly imprecise for any kind of measurement of CPU
performance. I'll grant you that MacOS may not be a very high
performance platform for some applications, but you will have to
provide some more reliable data before I'll believe that a 100MHz
Pentium is in the same performance class as a 300MHz alpha 21064a.

- Jeff Dutky

Maynard Handley

unread,
Dec 20, 2000, 3:49:04 PM12/20/00
to
In article <oasofy8...@fjorir.ifi.uio.no>, Jan Ingvoldstad
<ja...@ifi.uio.no> wrote:

> On 17 Dec 2000 12:08:53 GMT, an...@mips.complang.tuwien.ac.at (Anton
> Ertl) said:
>
> > On some benchmarks yes, but on others the G3 cannot keep up even with
> > Intel and AMD chips at the same clock frequency (see
> > http://www.complang.tuwien.ac.at/franz/latex-bench for an example).
>
> That is cute, but what is the benchmark, exactly? Do you have a URL
> to the LaTeX document used, and the testing methodology? It would be
> nice for others to have a peek at, too ...

I don't think that is especially the issue. I think a better way to
describe the issue is that the PPC can do extremely well on
well-written/well-generated code. Unfortunately (as Intel are well aware)
most programmers are clueless about how CPUs work and so most code
(whether hand-written assembly, or spaghetti pointer-aliased poorly
optimized C) is not very high quality.
I have no idea where OzTex fits in.

So if you want to brag about how great PPC CAN be, there is some truth
there. On the other hand, if you want a CPU that will perform well on
real-world (poorly written/generated) code, x86 is in many ways better
designed for that. Of course you pay a price for that in heat and die
size.

Maynard

Anton Ertl

unread,
Dec 20, 2000, 4:07:10 PM12/20/00
to
In article <oasofy8...@fjorir.ifi.uio.no>,

Jan Ingvoldstad <ja...@ifi.uio.no> writes:
>On 17 Dec 2000 12:08:53 GMT, an...@mips.complang.tuwien.ac.at (Anton
>Ertl) said:
>> http://www.complang.tuwien.ac.at/franz/latex-bench for an example).
>
>That is cute, but what is the benchmark, exactly? Do you have a URL
>to the LaTeX document used, and the testing methodology? It would be
>nice for others to have a peek at, too ...
>
>(I'm sorry that I don't read German as well as I did a decade ago, and
>that I don't trust Babelfish. There may be something there.)

There wasn't. But I got the permission to put the stuff up, and you
can find it on http://www.complang.tuwien.ac.at/anton/latex-bench/,
along with instructions.

Anton Ertl

unread,
Dec 20, 2000, 4:36:35 PM12/20/00
to
In article <IOT%5.380$7S.2...@news.get2net.dk>,

"Niels J=?ISO-8859-1?B?+A==?=rgen Kruse" <nj_k...@get2net.dk> writes:
>I artiklen <91nauo$5vj$1...@news.tuwien.ac.at> ,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) skrev:
>> If DRAM access was the limit, the Pentium-III box should be king
>> (lmbench latency 110ns), and the Athlon box (lmbench latency 190ns)
>> should be about as fast as the XP1000 (lmbench latency ~200ns).
>The long cachelines of the Athlon should help sequential write speed.

Bandwidth yes. They may also help the miss rate.

>> And here are some performance counter results on the Athlon:
>You left out TLB misses?

As you wish:

event 00430045: 31451866 ;L1 DTLB miss, L2 hit
event 00430046: 133382 ;L1 and L2 DTLB miss
event 00430084: 2006436 ;L1 ITLB miss and L2 hit
event 00430085: 7960 ;L1 and L2 ITLB miss

>> event 00430040: 1382569338 #d-cache acesses
>> event 00430041: 4270491 #d-cache misses
>Pretty good L1 missrate, it lives mostly in L1.
>
>> event 00431f42: 4132287 #d-cache refills from L2
>> event 00431f43: 408509 #d-cache refills from system
>Shouldn't refills equal misses? Effect of write-combining?

No idea.

>> event 00431f44: 4537940 #d-cache writebacks (to L2?)
>The Thunderbird Athlon has exclusive L2, so it wouldn't make sense to count
>refills from L2 twice. I think this is writebacks to system.

The performance counters probably also worked on the pre-Thunderbird
Athlons, and there it made sense. On the Thunderbird you do indeed
see that the number of writebacks is about equal to the number of
refills.

I don't think the Thunderbird would write back directly from the
D-cache; and if these are write-backs from L2 to system, why does it
say "D-cache writebacks"?

>> event 00430080: 977466461 #i-cache accesses
>> event 00430081: 1000256 #i-cache misses
>> event 004300c0: 1937771206 #instructions (retired)
>Dividing by 2.82 seconds, we get 687 Minstructions / second or 0.859
>instructions / clock. If DRAM latency were unimportant, we would expect
>better. The other explanation, rampant misprediction, would favor the G3.

Maybe you'll feel better if you see the number of operations :-):

event 004300c1: 2495259879 ;operations (retired)
event 004300c2: 382672272 ;branches
event 004300c3: 39784858 ;mispredicted branches

I guess there are lots of little reasons that pull the performance
down from the theoretical 2400 MIPS. D-cache port contention and data
dependencies are probably the most important ones; mispredictions, TLB
misses, and cache misses also contribute. Judging from some of the
papers I have seen, this IPC does not look too far off.
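
For reference, here is how those counters convert into rates (a sketch
only; the 800 MHz clock is inferred from the 0.859 IPC figure, and the
2.82 s runtime is the one quoted above):

/* Back-of-the-envelope rates from the quoted Athlon counters.
   Clock frequency and runtime are assumptions taken from the thread,
   not additional measurements. */
#include <stdio.h>

int main(void)
{
    double seconds  = 2.82;
    double clock_hz = 800e6;
    double insns    = 1937771206.0;     /* retired instructions */
    double ops      = 2495259879.0;     /* retired operations */
    double branches = 382672272.0;
    double mispred  = 39784858.0;
    double dc_acc   = 1382569338.0;     /* d-cache accesses */
    double dc_miss  = 4270491.0;

    printf("IPC              %.3f\n", insns / (seconds * clock_hz));
    printf("ops per clock    %.3f\n", ops / (seconds * clock_hz));
    printf("mispredict rate  %.1f%%\n", 100.0 * mispred / branches);
    printf("d-cache miss     %.2f%%\n", 100.0 * dc_miss / dc_acc);
    return 0;
}

That works out to roughly 0.86 IPC, 1.11 ops per clock, a 10.4% branch
misprediction rate, and a 0.31% d-cache miss rate.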

Anton Ertl

unread,
Dec 23, 2000, 6:55:44 AM12/23/00
to
In article <72AD8788...@bellatlantic.net>,

"Jeffrey S. Dutky" <du...@bellatlantic.net> writes:
>All of the systems running G3s seem to be either upgrades,
>which have sub-standard memory busses, or laptops. In either
>case, there is much we don't know about the configuration and
>performance characteristics of these systems that could result
>in greatly diminished benchmark scores. (the upgraded PowerMac
>7500 may or may not be using any L2 cache

Under MacOS it certainly does, and since the Linux results are better,
either Linux uses the L2 cache or the cache is not important (I guess
it's the former).

As for the substandardness of that system: with a 1MB L2 cache, the memory
bus speed should be unimportant, and the L2 cache speed (the same as the CPU
speed) is actually better than that of comparable PowerMac G3s (IIRC
1/2 CPU speed).

>Overall, there seem to be more than enough unknowns in the test
>setups to lend a great deal of skepticism to any claims made
>from this data. We don't know what compilers, or what versions
>of what compilers, were used to build any of the software on any
>of the platforms. We don't know what optimization options were
>used in the builds (this will make a big difference between
>different models of the various PPC chips, for example).

You are invited to run the benchmark on a G3/G4 under conditions you
control and optimize and submit a result. I am surprised that you
have not already done so (if you missed my posting about the
availability of the reference input, here's the URL again:
http://www.complang.tuwien.ac.at/anton/latex-bench/). Until more
results are available, I'll continue to assume that the results we
have are representative of PPC performance.

>> OTOH, if most users experience the G3 under MacOS, and the
>> Athlon under Windows 95+, why should it be inappropriate to
>> benchmark these CPUs under the respective OSs? Although LaTeX
>> may not be a benchmark that most users would be interested in.
>
>But you did not make a claim about the performance of PowerMacs
>running MacOS, but about the performance of the PowerPC G3 versus
>Intel/AMD CPUs.

The remark above was meant in general. We also have data available
for Linux on PPCs, so there is no need to restrict the conclusions in
the way mentioned above.

> Your data don't support that claim, in that the
>data are hopelessly imprecise for any kind of measurement of CPU
>performance.

Your hyperbolic "hopelessly imprecise" suggests that you won't be
satisfied with the data until your favourite processor comes out
on top.

E.g., look at the Pentium-133 numbers, and you will see that the
different Linux distributions make little difference; and the
Pentium-100 numbers indicate that even a change to OS/2 and a
different TeX distribution makes little difference. I also received
results for MiKTeX on WNT and W2K that are in the expected range:

Pentium-166 21.5s
Athlon 900 2.3s

>I'll grant you that MacOS may not be a very high
>performance platform for some applications, but you will have to
>provide some more reliable data before I'll believe that a 100MHz
>Pentium is in the same performance class as a 300MHz Alpha 21064a.

If you are interested in the CPU itself, you should look at the
fastest result for each CPU, and there the 21064a-300 performs like a
Pentium-133 on this benchmark.

Anyway, the same speed would be easy to explain: "The bottleneck is
elsewhere"; but the other data contradicts that explanation.

In particular, how do we explain that the K6-2 300 is 3 times
faster than the 21064a-300? My explanation is:

- L1 is write-through on the 21064a, write-back on the K6-2 (see the
store-loop sketch after this list).

- L2 is slow (~45MHz bus clock or so) and asynchronous on 21064a, making
the write-throughs even more expensive; by contrast, the K6-2 has a
100MHz pipelined burst cache.

- The larger L1 cache size (32+32KB on the K6-2, 16+16KB on the 21064a)
probably gives the K6-2 an additional benefit (though the 21064a would
probably hardly benefit from a larger L1, since the write-through traffic
probably dominates already).

- The K6-2 can perform byte access instructions; it has fewer dispatch
restrictions, more resources, and out-of-order execution.
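
To make the write-policy point concrete, a store-heavy loop of the
following kind is where it shows up (only a sketch; the buffer size and
pass count are arbitrary): with a write-through L1 backed by a slow L2,
every store has to be absorbed downstream, while a write-back L1 keeps
the traffic on-chip until lines are evicted.

/* Hypothetical store-heavy microbenchmark: sequential stores to a
   buffer that fits in either L1 D-cache. On a write-through L1
   (21064a) every store must be passed on to the slow L2; on a
   write-back L1 (K6-2) the stores stay in L1. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (8 * 1024)            /* fits in either L1 D-cache */
#define PASSES    50000

int main(void)
{
    volatile char *buf = malloc(BUF_BYTES);
    if (buf == NULL)
        return 1;

    clock_t t0 = clock();
    for (int p = 0; p < PASSES; p++)
        for (int i = 0; i < BUF_BYTES; i++)
            buf[i] = (char)p;           /* pure store stream */
    clock_t t1 = clock();

    printf("%.0f stores in %.2f s\n",
           (double)PASSES * BUF_BYTES,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    free((void *)buf);
    return 0;
}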

Back to the Pentium-133 vs. 21064a comparison: Looking at the
SPECint95 tables, I find:

                                                  peak  base
Digital Equipment    AlphaStation 255/300         5.23  4.31
Dell Computer Corp.  Dell Dimension XPS (133MHz)  3.96  3.96

So, the base numbers are not that far apart (and base is probably more
representative than peak of the way TeX was compiled on the boxes we
measured). Looking at specific benchmarks, the base results of gcc,
li, perl, and vortex are better for the Dell box. Even the 100MHz
Dell box is faster than the AlphaStation on li and vortex.

Is SPECint "hopelessly imprecise" and unbelievable, too?

Peter Seebach

unread,
Dec 23, 2000, 8:57:54 PM12/23/00
to
In article <9223s0$82s$1...@news.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>In article <72AD8788...@bellatlantic.net>,
> "Jeffrey S. Dutky" <du...@bellatlantic.net> writes:
>>All of the systems running G3s seem to be either upgrades,
>>which have sub-standard memory busses, or laptops. In either
>>case, there is much we don't know about the configuration and
>>performance characteristics of these systems that could result
>>in greatly diminished benchmark scores. (the upgraded PowerMac
>>7500 may or may not be using any L2 cache

>As for the substandardness of that system: with a 1MB L2 cache, the memory
>bus speed should be unimportant, and the L2 cache speed (the same as the CPU
>speed) is actually better than that of comparable PowerMac G3s (IIRC
>1/2 CPU speed).

Except that the L2 cache on a 7500 is *NOT* full speed if the machine
has been upgraded.

-s
--
Copyright 2000, All rights reserved. Peter Seebach / se...@plethora.net
C/Unix wizard, Pro-commerce radical, Spam fighter. Boycott Spamazon!
Consulting & Computers: http://www.plethora.net/

Anton Ertl

unread,
Dec 24, 2000, 5:40:05 AM12/24/00
to
In article <3a455821$0$89536$3c09...@news.plethora.net>,

se...@plethora.net (Peter Seebach) writes:
>Except that the L2 cache on a 7500 is *NOT* full speed if the machine
>has been upgraded.

The G3 uses a back-side L2 cache. On this upgrade (and probably all
other G3 upgrades) the cache is on the CPU card and is completely
independent of the front-side bus speed of the system.

Peter Seebach

unread,
Dec 25, 2000, 10:02:39 PM12/25/00
to
In article <924jq5$aje$1...@news.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>In article <3a455821$0$89536$3c09...@news.plethora.net>,
> se...@plethora.net (Peter Seebach) writes:
>>Except that the L2 cache on a 7500 is *NOT* full speed if the machine
>>has been upgraded.

>The G3 uses a back-side L2 cache. On this upgrade (and probably all
>other G3 upgrades) the cache is on the CPU card and is completely
>independent of the front-side bus speed of the system.

In that case it'll probably be half speed, but several people have pointed
out (in Mac groups) that the 7500's ability to take a G3 depends on the exact
cache unit in use.

Anton Ertl

unread,
Dec 28, 2000, 11:19:56 AM12/28/00
to
In article <3a480a4f$0$89537$3c09...@news.plethora.net>,

se...@plethora.net (Peter Seebach) writes:
>In article <924jq5$aje$1...@news.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>The G3 uses a back-side L2 cache. On this upgrade (and probably all
>>other G3 upgrades) the cache is on the CPU card and is completely
>>independent of the front-side bus speed of the system.
>
>In that case it'll probably be half speed,

It's full-speed (as benchmarked; it supports various speeds of
processor and cache).

> but several people have pointed
>out (in Mac groups) that the 7500's ability to take a G3 depends on the exact
>cache unit in use.

The G3 has an on-chip cache controller, so this statement looked
nonsensical to me at first.

The only plausible interpretation I have for this statement is that
these people were trying to use a G3 with a system-level cache board
(instead of, or in addition to, the G3's back-side cache). The
measured system had no such cache board, and given the 1MB back-side
cache on the G3 card, such a system-level cache would not be useful for
most applications, including latex-bench.
