The current Alpha processor chip is called the 21064. According to the
sales information I've seen, samples are available for customer
evaluation now.
I don't want to be a salesman on the net here, but to avoid the
flood of mail I'd otherwise receive, I include a phone number
for more information.
To learn more about pricing and availability of the 21064
microprocessor in its 150 MHz or faster clock rate versions, contact
your local Digital sales representative. Or, in the United States, call
1-800-DEC-2717; 1-800-DEC-2515 (TTY).
Some specs on the chip follow the architecture summary below.
- Jim Gettys
ALPHA ARCHITECTURE TECHNICAL SUMMARY
Dick Sites, Rich Witek
[NOTE: "Alpha" is an internal code name. An official name will be announced
soon.]
WHAT IS ALPHA?
Alpha is a 64-bit RISC architecture, designed with particular emphasis on
speed, multiple instruction issue, multiple processors, software migration
from VAX VMS and MIPS ULTRIX, and long lifetime. The architects rejected
any feature that did not appear to be usable for at least 25 years.
The first chip implementation runs at up to 200 MHz. The speed of Alpha
implementations is expected to scale up from this by at least a factor of
1000 over the next 25 years.
FORMATS
Data Formats
Alpha is a load/store RISC architecture with all operations done between
registers. Alpha has 32 integer registers and 32 floating registers, each
64 bits. Integer register R31 and floating register F31 are always zero.
Longword (32-bit) and quadword (64-bit) integers are supported. Four
floating datatypes are supported: VAX F-float, VAX G-float, IEEE single
(32-bit), and IEEE double (64-bit). Memory is accessed via 64-bit virtual
little-endian byte addresses.
Instruction Formats
Alpha instructions are all 32 bits, in four different instruction formats
specifying 0, 1, 2, or 3 register fields. All formats have a 6-bit opcode.
+-----+-------------------------+
| OP | number | PALcall
+-----+----+--------------------+
| OP | RA | disp | Branch
+-----+----+----+---------------+
| OP | RA | RB | disp | Memory
+-----+----+----+----------+----+
| OP | RA | RB | func. | RC | Operate
+-----+----+----+----------+----+
PALcalls specify one of a few dozen complex operations to be performed.
Conditional branches test register RA and specify a signed 21-bit
PC-relative longword target displacement. Subroutine calls put the return
address in RA.
Loads and stores move longwords or quadwords between RA and memory, using
RB plus a signed 16-bit displacement as the memory address.
Operates use source registers RA and RB, writing result register RC. There
is an extended opcode in the 11-bit function field. Integer operates can use
the RB field and part of the function field to specify an 8-bit
zero-extended literal.
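The field layout above can be sketched in C. This is a hypothetical decoder for the Operate format only, assuming the field widths given in this summary (6-bit opcode, two 5-bit register fields, an 11-bit function field, and a 5-bit result register, packed high to low):

```c
#include <stdint.h>

/* Sketch: pull the fields out of a 32-bit Operate-format instruction.
   Field positions are inferred from the widths in the summary:
   op[31:26] ra[25:21] rb[20:16] func[15:5] rc[4:0]. */
static void decode_operate(uint32_t insn, unsigned *op, unsigned *ra,
                           unsigned *rb, unsigned *func, unsigned *rc)
{
    *op   = (insn >> 26) & 0x3f;   /* 6-bit opcode          */
    *ra   = (insn >> 21) & 0x1f;   /* source register A     */
    *rb   = (insn >> 16) & 0x1f;   /* source register B     */
    *func = (insn >>  5) & 0x7ff;  /* 11-bit function field */
    *rc   =  insn        & 0x1f;   /* result register       */
}
```

The widths sum to 6+5+5+11+5 = 32, so the decode is a pure shift-and-mask with no leftover bits.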
INSTRUCTIONS
PALcall Instructions
The Privileged Architecture Library call instructions specify one of a few
dozen complex functions to be performed. These functions deal with
interrupts and exceptions, task switching, virtual memory, and other
complex operations that must be done atomically. PALcall instructions
vector to a privileged library of software subroutines (using the same Alpha
instruction set) that implement an operating-system-specific set of these
complex operations.
Branch Instructions
Conditional branch instructions can test a register for positive/negative
or for zero/nonzero. They can also test integer registers for even/odd.
Unconditional branch instructions can write a return address into a
register. There is also a calculated jump instruction that branches to an
arbitrary 64-bit address in a register.
Load/Store Instructions
Load and store instructions can move either 32- or 64-bit aligned
quantities. The VAX floating-point load/store instructions swap words to
give a consistent register format for floats. Memory addresses are flat
64-bit virtual addresses, with no segmentation. A 32-bit integer datum is
placed in a register in a canonical form that makes 33 copies of the high
bit of the datum. A 32-bit floating datum is placed in a register in a
canonical form that extends the exponent by 3 bits and extends the fraction
with 29 low-order zeros. 32-bit operates preserve these canonical forms.
There are no 8- or 16-bit load/store instructions, but there are facilities
for doing byte manipulation in registers.
Alpha has no 32/64 mode bit or other such device. Compilers, as directed by
user declarations, can generate any mixture of 32- and 64-bit operations.
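As a minimal sketch of the canonical 32-bit form described above: in C terms it is ordinary sign extension, which makes the upper 33 bit positions of the 64-bit register (bits 63..31) copies of bit 31 of the datum.

```c
#include <stdint.h>

/* Sketch: the canonical in-register form of a 32-bit integer datum is
   just sign extension into 64 bits -- bits 63..31 end up as 33 copies
   of bit 31 of the datum. */
static int64_t canonical32(int32_t datum)
{
    return (int64_t)datum;
}
```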
Integer Operate Instructions
The integer operate instructions manipulate full 64-bit values, and include
the usual assortment of arithmetic, compare, logical, and shift
instructions. There are just three 32-bit integer operates: add, subtract,
and multiply. These differ from their 64-bit counterparts ONLY in overflow
detection and in producing 32-bit canonical results.
There is no integer divide instruction.
In addition to the operations found in conventional RISC architectures,
there are scaled add/subtract for quick subscript calculation, 128-bit
multiply for division by a constant and multiprecision arithmetic,
conditional moves for avoiding branches, and an extensive set of
in-register byte manipulation instructions for avoiding single-byte writes.
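A rough C model of the in-register byte manipulation idea (the helper names below are made up for illustration, not Alpha mnemonics): load the containing aligned quadword, then shift and mask using the low 3 address bits to extract or insert the byte, and store the whole quadword back.

```c
#include <stdint.h>

/* Sketch: byte handling via 64-bit register operations, little-endian.
   The low 3 bits of the byte address select the byte within the
   quadword. */
static uint64_t extract_byte(uint64_t qw, uint64_t addr)
{
    return (qw >> ((addr & 7) * 8)) & 0xff;          /* pull byte out  */
}

static uint64_t insert_byte(uint64_t qw, uint64_t addr, uint64_t byte)
{
    unsigned s = (addr & 7) * 8;
    return (qw & ~(0xffULL << s)) | (byte << s);     /* mask, then merge */
}
```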
Rather than keeping a global state bit for integer overflow trap enable,
the enable is encoded in the function field of each instruction. Thus, both
ADDQ/V and ADDQ opcodes exist for specifying 64-bit add with and without
overflow checking. This makes pipelined implementations easier.
Floating-point Operate Instructions
The floating operate instructions include four complete sets of VAX and
IEEE arithmetic, plus conversions between float and integer.
There is no floating square root instruction.
In addition to the operations found in conventional RISC architectures,
there are conditional moves for avoiding branches, and merge sign/exponent
instructions for simple field manipulation.
Rather than keeping global state bits for arithmetic trap enables and
rounding mode, these enable and mode bits are encoded in the function field
of each instruction.
SIGNIFICANT DIFFERENCES BETWEEN ALPHA AND CONVENTIONAL RISC PROCESSORS
First, Alpha is a true 64-bit architecture, with a minimal number of 32-bit
instructions. It is not a 32-bit architecture that was later expanded to 64
bits.
Second, Alpha was designed to allow very high-speed implementations. The
instructions are very simple (no load-four-registers-unaligned-and-check-
for-bytes-of-zero). There are no special registers that would prevent
pipelining multiple instances of the same operations (no MQ register and no
condition codes). The instructions interact with each other ONLY by one
instruction writing a register or memory, and another one reading from the
same place. This makes it particularly easy to build implementations that
issue multiple instructions every CPU cycle. (The first implementation
in fact issues two instructions every cycle.) There are no
implementation-specific pipeline timing hazards, no load-delay slots, and
no branch-delay slots. These features would make it difficult to maintain
binary compatibility across multiple implementations and difficult to
maintain full speed on multiple-issue implementations.
Alpha is unconventional in the approach to byte manipulation. Single-byte
stores found in conventional RISC architectures force cache and memory
implementations to include byte shift-and-mask logic, and sequencer logic
to perform read-modify-write on memory words. This approach is awkward to
implement quickly, and tends to slow down cache access to normal 32- or
64-bit aligned quantities. It also makes it awkward to build a high-speed
error-correcting write-back cache, which is often needed to keep a very
fast RISC implementation busy. It also can make it difficult to pipeline
multiple byte operations.
Instead, the byte shifting and masking is done in Alpha with normal 64-bit
register-to-register instructions, crafted to keep the sequences short.
Alpha is also unconventional in the approach to arithmetic traps. In
contrast to conventional RISC architectures, Alpha arithmetic traps
(overflow, underflow, etc.) are imprecise -- they can be delivered an
arbitrary number of instructions after the instruction that triggered the
trap, and traps from many different instructions can be reported at once.
This makes implementations that use pipelining and multiple issue
substantially easier to build.
If precise arithmetic exceptions are desired, trap barrier instructions can
be explicitly inserted in the program to force traps to be delivered at
specific points.
Alpha is also unconventional in the approach to multiprocessor shared
memory. As viewed from a second processor (including an I/O device), a
sequence of reads and writes issued by one processor may be arbitrarily
reordered by an implementation. This allows implementations to use
multi-bank caches, bypassed write buffers, write merging, pipelined writes
with retry on error, etc. If strict ordering between two accesses must be
maintained, memory barrier instructions can be explicitly inserted in the
program.
The basic multiprocessor interlocking primitive is a RISC-style
load_locked, modify, store_conditional sequence. If the sequence runs
without interrupt, exception, or an interfering write from another
processor, then the conditional store succeeds. Otherwise, the store fails
and the program eventually must branch back and retry the sequence. This
style of interlocking scales well with very fast caches, and makes Alpha an
especially attractive architecture for building multiple-processor systems.
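The load_locked/modify/store_conditional retry pattern can be sketched in C. Here a compare-and-swap (a GCC builtin) stands in for the LL/SC pair, since C has no direct LL/SC primitive; the retry structure is the same:

```c
#include <stdint.h>

/* Sketch of the interlocking primitive: read the old value, compute the
   new one, and conditionally store it; if another processor interfered,
   the conditional store fails and we branch back and retry. */
static long atomic_increment(long *p)
{
    long old, next;
    do {
        old  = *p;          /* load_locked            */
        next = old + 1;     /* modify                 */
    } while (!__atomic_compare_exchange_n(p, &old, next, 0,
                                          __ATOMIC_SEQ_CST,
                                          __ATOMIC_SEQ_CST));
                            /* store_conditional; loop on failure */
    return next;
}
```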
Alpha includes a number of HINTS for implementations, all aimed at allowing
higher speed. Calculated jumps have a target hint that can allow much
faster subroutine calls and returns. There are prefetching hints for the
memory system that can allow much higher cache hit rates. There are also
granularity hints for the virtual-address mapping that can allow much more
effective use of translation lookaside buffers for big contiguous
structures.
Alpha includes a very flexible privileged library of software for operating-
system-specific operations, invoked with PALcalls. This library allows Alpha
to run full VMS using one version of this software library that mirrors many
of the VAX operating-system features, and to run OSF/1 using a different
version that mirrors many of the MIPS operating-system features, and
similarly for NT. Other versions could be tailored for real-time, teaching,
etc. The PALcalls allow Alpha to run VMS with hardly more hardware than
a conventional RISC machine has (the PAL mode bit itself, plus 4 extra
protection bits in each TB entry). This library makes Alpha an especially
attractive architecture for multiple operating systems.
Finally, Alpha is not strongly biased toward only one or two programming
languages. It is an attractive architecture for compiling at least a dozen
different languages.
SUMMARY
Alpha is designed to be a leadership 64-bit architecture.
--------------------
Specifications (150 MHz version).
Process Technology .75 micron CMOS
Cycle Time 150 MHz (6.6 ns)
Die Size 13.9mm x 16.8mm
Transistor Count 1.68 million
Package 431 pin PGA
Number of Signal Pins 291
Power Dissipation 23 W at 6.6 ns cycle
Power Supply 3.3 volts
Clocking Input 300 MHz differential
On-chip D-cache 8 Kbyte, physical, direct-mapped,
write-through, 32-byte line, 32-byte fill
On-chip I-cache 8 Kbyte, physical, direct-mapped,
32-byte line, 32-byte fill, 64 ASNs
On-chip DTB 32-entry; fully-associative; 8-Kbyte,
64-Kbyte, 256-Kbyte, 4-Mbyte page sizes
On-chip ITB 8-entry, fully associative, 8-Kbyte page
plus 4-entry, fully-associative, 4-Mbyte page
Floating Point Unit On-chip FPU supports both IEEE and VAX
floating point
Bus Separate data and address bus.
128-bit/64-bit data bus
Serial ROM Interface Allows the chip to directly
access serial ROM
Virtual Address Size 64 bits checked; 43 bits
implemented
Physical Address Size 34 bits implemented
Page Size 8 Kbytes
Issue Rate 2 instructions per cycle to A-box,
E-box, or F-box
Integer Pipeline 7-stage pipeline
Floating Pipeline 10-stage pipeline
>[NOTE: "Alpha" is an internal code name. An official name will be announced
> soon.]
Mongoose and "SPARCkiller" come to mind.
"First thing after we take over Canada, we turn the University of Toronto
into a production facility for Intel... "
-- > SYS...@CADLAB.ENG.UMD.EDU < --
Seriously, when IBM announced the System/360 architecture in 1964, it
was with the admission or declaration that it was to last TEN years.
It still exists in extended form nearly thirty years later. But to
propose that one has made decisions today that will scale in some sense
over the next 25 years, in the face of unknown market and technology
(hardware and ESPECIALLY SOFTWARE) forces and constraints is, I think,
premature. And especially if the Alpha is to be used as a node in a
massively parallel system, one has to address such issues as pipelines
being too long and figuring out exactly when and where an error
occurred.
> Second, Alpha was designed to allow very high-speed implementations. The
> instructions are very simple (no load-four-registers-unaligned-and-check-
> for-bytes-of-zero). There are no special registers that would prevent
> pipelining multiple instances of the same operations (no MQ register and no
> condition codes). The instructions interact with each other ONLY by one
> instruction writing a register or memory, and another one reading from the
> same place. This makes it particularly easy to build implementations that
> issue multiple instructions every CPU cycle. (The first implementation
> in fact issues two instructions every cycle.) There are no
> implementation-specific pipeline timing hazards, no load-delay slots, and
> no branch-delay slots. These features would make it difficult to maintain
> binary compatibility across multiple implementations and difficult to
> maintain full speed on multiple-issue implementations.
If you read the above excerpt from the summary I posted, you will see
that pipe depths are not visible to user programs.
Please don't confuse the Alpha architecture with the specific details
of the current 21064 implementation.
- Jim
I considered how the following code fragment might compile:
f(char *a, char *b)
{
*a = *b;
}
Assuming that the parameters are in r1, r2, we might get something like:
r3 <- r1 >> 5
r4 <- r2 >> 5
r5 <- load(r3)
r6 <- load(r4)
r7 <- extract(r5 using r1)
r8 <- shift(r7 using r2)
r9 <- insert(r8 using r2)
store(r9)
What might have been one instruction on a VAX becomes eight. Am I missing
some clever optimizations?
Arguably this has to be done at some level anyway, but it looks like life
will be pretty painful for compiler writers. I don't suppose DEC will
have to worry about GCC n.0 causing loss of compiler sales for Alpha for
quite some time.
-castor fu
cas...@drizzle.stanford.edu
In <1992Feb25.1...@crl.dec.com> j...@crl.dec.com (Jim Gettys) writes:
>The first chip implementation runs at up to 200 MHz. The speed of Alpha
>implementations is expected to scale up from this by at least a factor of
>1000 over the next 25 years.
1000x seems very ambitious. Was there any (non-marketing) reason for
choosing this number and expecting it to be realistic?
>There is no integer divide instruction.
And from the sounds of it, no divide step. Any estimates on the
expected cost of a division routine (i.e. best/avg/worst cycle
times)?
> There are no
>implementation-specific pipeline timing hazards, no load-delay slots, and
>no branch-delay slots.
Is it fully interlocked? What's the load latency assuming a cache
hit? a cache miss? How many simultaneous pending load/store operations?
>Instead, the byte shifting and masking is done in Alpha with normal 64-bit
>register-to-register instructions, crafted to keep the sequences short.
I.e. byte access is read-modify-write. How many cycles to update a
random byte in memory assuming the address is in a register variable,
the new byte is in a register variable, and the memory word is in cache?
> On-chip ITB 8-entry, fully associative, 8-Kbyte page
> plus 4-entry, fully-associative, 4-Mbyte page
Seems awfully small. What's the expected TLB miss cost?
> Issue Rate 2 instructions per cycle to A-box,
> E-box, or F-box
A-box = integer, E-box = address, F-box = float???
> Integer Pipeline 7-stage pipeline
> Floating Pipeline 10-stage pipeline
Latency for FADD, FMUL, FDIV? How many independent functional units?
--
First there was Unix... Now there's AIX, AU/X, BSD, BSDI, Dynix, EP/IX, FTX,
Hurricane, HP-UX, Irix, Linux, Mach, Minix, Open Desktop, OSF/1, OSx, PC/IX,
Plan 9, Polyx, Posix, QNX, Risc/OS, Risc/ix, SCO Unix, Sinix, Solaris,
Sprite, SunOS, SVRx, Topaz, Tunis, Ultrix, Unicos, V, v10, Xenix, ..."
>but i was wondering
>about how hard people will find the transition to systems where
>(char *) != (int *) ?
It will give programmers a convenient reason for debugging their
*incorrect* programs ;-)
Seriously, what kind of support for *fast* byte swapping is there?
If the processor is going to be used as an MP coprocessor building
block, byte-swapping overhead will be significant.
P.S. In the original article, the following statement was made:
"Memory is accessed via 64-bit virtual little-endian byte addresses. "
I assume that this is saying that 64 bit load/store instructions
are little-endian, but it is a little confusing, since *addresses*
are neither little- nor big-endian. ? And, if they added
word-swapped loads/stores to handle VAX FP formats, surely they
could have included big-endian loads/stores as well ?
--
Hugh LaMaster, M/S 233-9, UUCP: ames!lamaster
NASA Ames Research Center Internet: lama...@ames.arc.nasa.gov
Moffett Field, CA 94035 With Good Mailer: lama...@george.arc.nasa.gov
Phone: 415/604-1056 #include <std.disclaimer>
|> propose that one has made decisions today that will scale in some
|> sense over the next 25 years, in the face of unknown market and techno-
|> logy (hardware and ESPECIALLY SOFTWARE) forces and constraints is, I
|> think, premature.
I'd agree with this part of your flame. All you have to do is try to imagine designing a chip for today given the state of your knowledge in 1967. <snicker> But just do s/25/5/g and move on....
--
Paul K. Rodman
rod...@sgi.com
jg> There is no integer divide instruction.
borasky> Does this mean I can't use integer divides for 25 years?
jg> There is no floating square root instruction.
borasky> Does this mean I can't use floating square roots for 25 years?
yes, that's right. programs that do integer division will *not* work
on any alpha-based machines. ever. you can't do square roots either.
and if you try to work around this by writing your own sqrt function,
the compiler detects it, and panics the machine.
jg> Alpha arithmetic traps (overflow, underflow, etc.) can be
jg> delivered an arbitrary number of instructions after the
jg> instruction that triggered the
borasky> Then I guess for the next 25 years I'd better not write any
borasky> programs with bugs in them, either.
overflows will not go undetected. the only difference is that if you
are going to *handle* the overflow (which is very rare (like in
hand-coded bignum functions)) then you need to do a trap-sync.
the alpha feature that i find most interesting is:
jg> a sequence of reads and writes issued by one processor may be
jg> arbitrarily reordered by an implementation... If strict ordering
jg> between two accesses must be maintained, memory barrier
jg> instructions can be explicitly inserted in the program.
this sounds kick-ass for vectors, but when using pointer-heavy
structures, it seems almost every write needs a barrier. am i wrong?
if not, that detracts from the final claim of the post:
jg> Finally, Alpha is not strongly biased toward only one or two
jg> programming languages. It is an attractive architecture for
jg> compiling at least a dozen different languages.
yea, FORTRAN/IV, FORTRAN77, FORTRAN90, FORTRAN-D, SISAL, APL, J ... :)/2
--
heart
scott draves liver
sp...@cs.cmu.edu tongue
I heard that Sun admitted at ISSCC that Viking only ran at 40 MHz. Did
anyone else hear this?
If true, it sounds like Sun is going to be really hurting for performance.
--
= = = = = = = = = = = = = = = = = = = = = = = = = = =
Andrew C. Payne, N8KEI UUCP: ...!cornell!batcomputer!payne
INTERNET: pa...@tc.cornell.edu
Yes. What's missing is the first part of jg's quote: "..as viewed from
a second processor...". In other words, if you are poking at an I/O
device, you have to add a memory barrier for correct operation. For example,
consider the sequence:
[store to set I/O mode]
{memory barrier needed here}
[load to read I/O status]
The barrier is needed because you don't want to read the status before
setting the mode--which could happen because of reordering.
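A minimal C sketch of that sequence (mode_reg and status_reg are hypothetical pointers into a memory-mapped device's register space, and __sync_synchronize, a GCC builtin, stands in for the Alpha memory barrier):

```c
#include <stdint.h>

/* Sketch: keep the mode store ordered before the status load, as seen
   by the I/O device. Without the barrier, an implementation could
   reorder the two accesses. */
static volatile uint32_t *mode_reg;    /* would map a device register */
static volatile uint32_t *status_reg;  /* would map a device register */

static uint32_t set_mode_then_read_status(uint32_t mode)
{
    *mode_reg = mode;        /* store to set I/O mode    */
    __sync_synchronize();    /* memory barrier (MB)      */
    return *status_reg;      /* load to read I/O status  */
}
```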
What about languages with exception features? [Ada, Modula 3 and soon
C++.] When a compiler generates code for these, it has to keep
distinct the regions of code that have different exception handlers.
When code execution crosses a region boundary, the Alpha's pending
conditions will have to be checked for.
How expensive will that check be?
--
Don D.C.Lindsay Carnegie Mellon Computer Science
Sounds fine. Except, surely, you require sequences of writes to the
same location not to occur out of order? I can't see how a barrier
technique would fix that.
>I heard that Sun admitted at ISSCC that Viking only ran at 40 MHz. Did
>anyone else hear this?
Yes, they did. And yes, someone else heard it. I refrain from further
comment.
---Pete
kai...@heron.enet.dec.com
+33 92.95.62.97
>Sounds fine. Except, surely, you require sequences of writes to the
>same location, to not occur out of order?
Why must they? The crucial phrase concerns shared memory "as viewed by a
second processor". On a single processor writes to memory can occur in any
order so long as that processor can, for instance, get the latest value
from cache; I agree, though, that cache must be flushed in proper order
even on a single processor. That, however, is a detail of implementation.
---Pete
kai...@heron.enet.dec.com
+33 92.95.62.97
[ where do these fives come from? ]
> r5 <- load(r3)
> r6 <- load(r4)
> r7 <- extract(r5 using r1)
> r8 <- shift(r7 using r2)
> r9 <- insert(r8 using r2)
> store(r9)
>
>What might have been one instruction on a vax becomes eight. Am I missing
>some clever optimizations?
That's not what I see. When doing bytes, it's *nominally* a
byte-addressed processor, or to look at it another way, it's a
64-bit-word-addressed cpu with 3 bits after the low end of the address
used as a byte offset within the word. (My reading says it's this,
and not 32-bit-word-addressed/2 bits.) So, an address looks just like
it's a 64 bit byte-aligned address, but the load/store happens
differently. i.e., a pointer is:
[63...address...3|2..byteoffset...0]
I also think there are instructions to insert/extract bytes (and, I
hope, 16-bit words) using these pointers. So what I see for your
instruction sequence is:
r3 <- load(r1)
r4 <- load(r2)
; byte goes in low 8 bits of r5
r5 <- extract8(r3 using low bits of r1)
r4 <- insert8(r5 using low bits of r2)
@r2<- store(r4)
; hmm... still five instructions
So, (char *) == (int *), modulo alignment restrictions. Remember, it
IS still the Daughter-Of-Vax. :) They could have chosen to do 32-bit
words this same way, but I guess they didn't for efficiency reasons.
All this implies that either 1) loads/stores ignore the bottom 3 bits
and round down a la IBM RT, or 2) there are special (char *)
load/stores that ignore the bottom 3 bits, and the regular load/stores
trap on those three bits != 0.
Has anybody else done something like this?
I rather like what I see of Alpha. (software-hacker-speaking alert)
--jh
--
John Hood, CU student, CU employee, and sometime BananaOp
jh...@albert.mannlib.cornell.edu,jh...@banana.ithaca.ny.us,j...@crnlvax5.bitnet
By any reasonable computer's standard, I'm a virus. So are you.
1000x in 25 years ~= doubling every 30 months.
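As a quick check of that arithmetic, assuming steady exponential growth: the doubling period is 25 years times 12 months, divided by log2(1000).

```c
#include <math.h>

/* 1000x over 25 years: months per doubling = (25 * 12) / log2(1000) */
static double doubling_period_months(void)
{
    return 25.0 * 12.0 / (log(1000.0) / log(2.0));
}
```

This comes out to about 30.1 months, so "every 30 months" is the right rounding.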
----------------------------------
Ed Kubaitis (e...@ux2.cso.uiuc.edu)
Computing & Communications Services Office - University of Illinois, Urbana
Is "PALcall" just another name for chmk or svc, or does it do more than
other system call instructions?
Does DEC consider Alpha compilers advanced enough to get a good estimate of
the average instructions/cycle count as compared to the maximum of 2?
--
John Carr (j...@athena.mit.edu)
Note that this is not going to slow down the program, since the cost of
the mb's will be buried by the cost of actually doing the off-chip cycles.
I don't know that any sensible implementation would, in the absence of
barriers, deliver writes to a single location out of order, but earlier
writes might not be delivered at all. However, delivering writes to a single
location out-of-order is perfectly OK, provided that the last write is delivered
last.
As seen by the instruction stream executing on a given processor, the memory
system is serialized. As seen by other processors or I/O devices, arbitrary
merging and reordering can happen, unless you use mb.
-Larry Stewart, DEC Cambridge Research Lab
I would certainly hope that, yes, writes to the same location by the same
processor are seen in the same order everywhere. The reason is as follows:
Any program that could detect the difference in the ordering of writes to
different locations is using those writes to convey synchronization
information, and they should be using the "barrier" instructions. If a
barrier is used after the writes, and no other processors access those
locations until after the writes, then no one will be the wiser.
By contrast, if a barrier is used after multiple writes to a single
location, the final value of that location will still be wrong unless the
earlier writes completed in the same order.
--
=============================================================================
Pete Keleher "Relax! Don't worry! Have a homebrew!" pe...@cs.rice.edu
=============================================================================
Probably not too bad, since the regions tend to be quite large (at least
in Ada), often covering an entire procedure body. Note that this problem
appears in architectures with imprecise FP exceptions, too, such as the
Moto 88K.
-- Jerry Callen
jca...@world.std.com
--
Jerry Callen, Thinking Machines Corp.
jca...@think.com
{uunet,harvard}!think!jcallen
Hmm, I thought performance, at some generic dollar figure, was doubling
every 18 months for a RISC box.
I don't have the Digital NewsblurReview in front of me, but they've already
got people working on stuff up to 800 MHz (from the current 200 MHz).
No matter how you slice it, 1000 MIPS on a desktop is scary.
In most cases they don't need to. But if you have a frame buffer with
a built-in ALU, writes can have side effects and must be done in order.
Frame buffers with built-in ALUs are getting more and more common in
the world of X11-based graphics.
So, if the writes have side effects, there must be a way to ensure
that writes are done in order. But, store barrier instructions handle
the problem.
Bob P.
--
| Bob Pendleton | Engineering Anethema: |
| bo...@hal.com | 1) You've earned an "I told you so." |
| Speaking only for myself. | 2) Our customers don't do that. |
How should data dependencies be resolved in such a long-pipelined
architecture? Is it fully interlocked, or is there full forwarding?
|> If precise arithmetic exceptions are desired, trap barrier instructions can
|> be explicitly inserted in the program to force traps to be delivered at
|> specific points.
|>
|> with retry on error, etc. If strict ordering between two accesses must be
|> maintained, memory barrier instructions can be explicitly inserted in the
|> program.
Could anyone tell me what a barrier instruction is (it's a new concept for me!)?
|> The basic multiprocessor interlocking primitive is a RISC-style
|> load_locked, modify, store_conditional sequence. If the sequence runs
|> without interrupt, exception, or an interfering write from another
|> processor, then the conditional store succeeds. Otherwise, the store fails
|> and the program eventually must branch back and retry the sequence. This
|> style of interlocking scales well with very fast caches, and makes Alpha an
|> especially attractive architecture for building multiple-processor systems.
|>
Should Alpha be used to construct shared-memory or message-passing
multiprocessor systems? How is the latency problem handled in a
multiprocessor environment? Are there special instructions to support
efficient multiprocessing (for example, split-phase instructions)? Is
Alpha also a multithreaded processor?
|> Integer Pipeline 7-stage pipeline
How many pipeline stages does Alpha contain?
Instruction fetch, decode, register fetch, ...?
Xiaoming Fan
Dept. of Computer Science
Univ. of Hamburg
Troplowitzstr. 7
W-2000 Hamburg 54
Germany
> Could anyone tell me what a barrier instruction is (it's a new concept for
> me!)?
It's an instruction whose effect is to prevent further writes to memory
until all writes to memory from the same processor have completed. This
forces sequentiality of writes to memory as seen from another processor.
VAX, and many other processors, do this implicitly with every instruction;
Alpha removes this limitation, which is a hindrance to performance (you pay
to flush the cache on every memory write, or you have to use expensive
caching schemes between processors).
The Alpha architecture leaves the ordering of writes to memory under the
control of designers and compilers, where it can be better controlled, and
used selectively rather than having it forced on you at all times.
---Pete
kai...@heron.enet.dec.com
+33 92.95.62.97
>Frame buffers with builtin ALUs are getting more and more common in
>the world of X11 based graphics.
How common? The IBM RT monochrome graphics adapter ("apa16", introduced
around 1986) is the only example I know of. Until recently, the X server
for this display didn't make much use of the ability for writes to be
mapped to a read-modify-write operation. Most of the reason X on the
apa16 is fast is that the apa16 has hardware support for line drawing
and rectangle copy/fill.
Probably as a side effect of the smart frame buffer, access to the apa16
adapter frame buffer takes nearly twice as long as access to the color
graphics adapter (a 16 bit read takes 3.2 microseconds compared to 1.8).
Is X11 the only popular window system that supports multiple logical
operations for drawing commands?
--
John Carr (j...@athena.mit.edu)
I wonder if write barriers are all that is needed to handle memory
hierarchy coherence between processors... take the following example:
Processor A runs a producer process... it does some number of writes,
then issues a write barrier, then signals (using a semaphore or some
such mechanism) that a consumer process (which may run on another
processor) may continue.
Processor B runs a consumer process... after being signalled by the
producer process, it does some number of reads, then waits until it
can read again.
Does the write barrier issued by the producer process on Processor A
allow the consumer process to safely read the words which were just
written (by the write barrier), or does it need to worry that there
may be old data in its own processor's cache which won't be accurate
anymore... in other words, does a compiler ever need to generate a
"cache bypass" read or "cache flush" for a process to be sure that it
will get data just written to memory by another processor?
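One way to phrase the question in code, using modern C atomics as stand-ins: the producer's write barrier corresponds to a release store of the flag, and the read-side ordering the question suspects is also needed corresponds to an acquire load before touching the data.

```c
#include <stdatomic.h>

/* Sketch of the producer/consumer handoff. The producer orders its data
   writes before raising the flag (write barrier); the consumer orders
   its data reads after seeing the flag (the read-side counterpart). */
static int        shared_data;
static atomic_int ready;

static void produce(int v)
{
    shared_data = v;                                         /* the writes  */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* wmb + flag  */
}

static int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                    /* wait + rmb  */
    return shared_data;
}
```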
--
John R. Grout
University of Illinois, Urbana-Champaign
Center for Supercomputing Research and Development
INTERNET: j-g...@uiuc.edu
Someone (I can't remember who) once suggested PDP-64.
Silly me.
X := 1;
X := 2;
<barrier>
could possibly leave the remote processor thinking X == 1, so, do
X := 1;
<barrier>
X := 2;
"Doctor, it hurts when I do THIS."
"Then don't do that."
Paul Rodman writes:
> agree...just try to imagine designing a chip for today given
> the state of your knowledge in 1967
Let's stand the question on its head. Is it a good thing to build
systems that will be architecturally obsolete within 5 years of product
introduction? Or to design chips for general purpose use that will be
architecturally obsolete within, say, 10 years of first availability?
Or to define architectures that will be unable to take advantage of
technology that arrives 15, 20, or 25 years from now?
There are so many examples of this to study, surely one could come to a
definite conclusion about the matter :-) Designing for the short term
and the fast buck is a well explored set of engineering questions.
Of course we don't know enough about the next 25 years of technology to
say with confidence that any architecture will scale well. We do know
enough about the previous 40 years to identify some architectural
features that won't scale well. DEC has tried to avoid them in Alpha.
Will it really make 25 years? Who knows? It seems worth aiming for.
-- Jon
--
Jon Krueger j...@ingres.com Oh we pulled the handle out,
and we pushed the handle in, and the steam went to the pistons just the same
--Jeff
--
Jeff Weinstein - X Protocol Police
Silicon Graphics, Inc., Entry Systems Division, Window Systems
j...@xhead.esd.sgi.com
Any opinions expressed above are mine, not sgi's.
Obviously, the first piece of code does the right thing. Hopefully, the
architecture manual specifies this in unambiguous language. (It isn't as
easy as you might think to describe such semantics precisely.)
--
Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94088
zal...@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman (408) 524 8395
School? Oh no. I wanted to answer suicide calls for Christmas! -- Spalding Gray
Digital always has been a conservative company when making performance
claims :-).
- Jim Gettys
--
Digital Equipment Corporation
Cambridge Research Laboratory
Perhaps I wasn't clear enough. I agree with all the goals, I even agree with most of the choices. I just can't imagine somebody believing that it really WILL be optimal in 25 years. (I think that today's CPUs will be about as interesting as a cat-whisker germanium diode radio in 25 years....)
Just to fan the flames, look at the Vax arch. and ask yourself "For how many years was this an optimal arch. for current technology?". How easy was it to pipeline? How long did Venus take, anyway, and why?
Again, let me restate for those that don't catch my drift:
I am NOT casting any aspersions on any engineering...past, present or future. I'm just stating that engineering is hard, it IS NOT a science, and I believe nobody can come even _close_ to predicting the future so far out.
I tend to think that it wasn't the low-level engineers that thought up this 25 years nonsense...
-paul
I think this is an indication that using synchronous traps to report
floating point errors is becoming a problem in heavily pipelined
implementations, and that other mechanisms should be used to get maximum
performance (explicit test-and-branch instructions, perhaps, or asynchronous
traps containing more context...). Are more details of the Alpha trap
mechanism available?
--
-- Peter da Silva, Ferranti International Controls Corporation
-- Sugar Land, TX 77487-5012; +1 713 274 5180
-- "Have you hugged your wolf today?"
If this is true, then potentially any store/assignment to a non-local
variable ("static" for the C fans) must be prefixed with the <barrier>
instruction, because the compiler has no way of telling whether that
same memory location was just stored into previously. Consider the
two separate source files:
File A.c                    File B.c
extern int X;               extern int X;
main() {                    sub() {
    X = 1;                      X = 2;
    sub();                  }
}
When compiling file B.c, the compiler has no way of knowing whether there
was a store into X "recently", and so it must generate the <barrier> to
insure the proper sequencing of the memory writes. Am I missing something
here?
By the way, is there some "guarantee" that a memory write will be retired
after so many cycles? If not, then EVERY single memory write may have to be
prefixed with the <barrier> because most variables are assigned to more
than once in most "interesting" programs. If there is no way of telling
that the previous memory write was retired, then a <barrier> is needed
to insure proper sequencing again.
Finally, just how expensive is the <barrier> instruction in terms of cycles?
--
Dan Lau, Intel Corp.
Wow! a 200GHz clock ("what's that fat pin? Oh ... it's the waveguide for
the clock ..."). I presume you mean that it will scale by 1000, including MP
systems ....?
Paul
--
Paul Campbell UUCP: ..!mtxinu!taniwha!paul AppleLink: CAMPBELL.P
"In the future in the new world order if we don't like your trade policies we
will puke on you ..." "The problem with the market economy is that it delivers
the most efficient quality of life" "Jennifer Fitz. comes clean, news at 11 ..."
So what does this mean for a C compiler? Will long actually mean a
64-bit int?
> Integer Pipeline 7-stage pipeline
What exactly does this mean? Will an integer add take 7 cycles?
Will there have to be 13 instructions (at two a cycle) that occur in
parallel to take full advantage of the 21064's performance?
--
Paul Beusterien pa...@ssd.csd.harris.com
Harris Computer Systems
Ft. Lauderdale, FL voice: (305) 973 5270
> What about languages with exception features?
> How expensive will that check be?
>Probably not too bad, since the regions tend to be quite large (at least
>in Ada), often covering an entire procedure body.
While the scope of the exception handler is often an entire procedure,
don't most languages with exception features (or at least most programmers
using these features) assume that the exception will be raised before
execution of the statement after the one that caused the exception? For
instance, if the following statements were in a procedure:
a = b + c;
d = f(a);
and the addition raises an overflow exception, the call of f() (and any
side effects it might cause) and assignment to d, and perhaps also the
assignment to a, should not take place. This implies that trap-sync
instructions would have to be placed all over the place in this procedure.
They might not have to be quite as common as checks of an overflow flag,
but neither would they only be once per handler scope.
--
Barry Margolin
System Manager, Thinking Machines Corp.
bar...@think.com {uunet,harvard}!think!barmar
I doubt it. Not to start the "volatile" flame war again, but although the
ANSI C rationale claims that volatile variables are intended for
communication with memory mapped I/O devices, the only use the standard
itself makes of them is for communication between mainline routines and
routines invoked by asynchronous signals. Barriers there would merely
slow down the program without doing anything useful. This is not new to
Alpha, the IBM 370 has had a similar barrier instruction for 20 years
which is equally ill-matched to volatile variables. Even on computers
that have memory mapped I/O, volatiles are rarely sufficient to access I/O
registers without a #pragma or the equivalent to specify byte vs. word
access, write vs. read/modify/write, etc.
C, along with nearly every other known language, does not have a well
defined way to control access to shared memory. It's not a simple
problem. A few years ago most shared memory multiprocessors had snoopy
caches or the like to provide consistent views. Now we've discovered that
we can't provide fully consistent memory semantics at high performance, so
there are various hacks to implement partially consistent memory, such as
barriers, uncached memory pages, and even special banks of sharable
memory. I'm at a loss to propose a scheme which is general enough to
handle all of these along with whatever comes along in the next few years.
Regards,
John Levine, jo...@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl
>What exactly does this mean? Will an integer add take 7 cycles?
>Will there have to be 13 instructions (at two a cycle) that occur in
>parallel to take full advantage of the 21064's performance?
Of course not. The pipeline length and the latencies are somewhat
unrelated. The latencies for the 21064 *implementation* of the
Alpha *architecture* (subject to change in future implementations
with some notice on this newsgroup :-) ) are generally 1 cycle for
integer operates, 3 for loads and 6 for the floating point unit.
All of these are in the ISSCC paper.
Burkhard Neidecker-Lutz
Multimedia Engineering, CEC Karlsruhe
Software Motion Pictures
Digital Equipment Corporation
neid...@nestvx.enet.dec.com
In article <1992Feb25.1...@crl.dec.com> j...@crl.dec.com (Jim Gettys) writes:
>Alpha includes a number of HINTS for implementations, all aimed at allowing
>higher speed. Calculated jumps have a target hint that can allow much
>faster subroutine calls and returns. There are prefetching hints for the
>memory system that can allow much higher cache hit rates. There are also
>granularity hints for the virtual-address mapping that can allow much more
>effective use of translation lookaside buffers for big contiguous
>structures.
Can somebody fill in a little bit how these hints work ? I can make some
guesses (I don't have any idea what they really are):
target hint: A fake branch instruction that tells the machine that
the next *real* branch will use the same target as
this one.
prefetching hints: A fake load instruction that says to the cache
to load the cache line without stalling the
machine because the result is not put anywhere.
granularity hints: An instruction that tells the machine that something
huge starts here and it should assign one of the
huge 4 Mbyte TLB entries to it.
Am I right ?
> On-chip I-cache 8 Kbyte, physical, direct-mapped,
> 32-byte line, 32-byte fill, 64 ASNs
What's an ASN ?
Like the Alpha, the AMD29050 does not deal with bytes
directly either. Rather, it uses "insert-byte-in-long" and
"extract-byte-from-long" instructions which base the byte position on
the low bits of the previous ld instruction.
(From memory) on the AMD29K as soon as you do a "ld byte" instruction
(really, "ld" with a flag bit set), the entire long word at
address&0xfffffffc gets loaded into your register. Simultaneously, a
register tied to the barrelshifter gets loaded with the low 2 bits
from the address in the "ld" instruction. A subsequent "extract"
instruction will rip out the correct byte in the register you specify
and put it where you like. Likewise, an "insert" instruction will take
the byte in your source register and after shifting it replace the
corresponding byte in the destination register. (I refer you to the
AMD29K docs if you find this vague..)
(Again, from memory), the code sample for *(char *)p = *(char *)q
becomes something like this.....
; assume r1=p, r2=q
ld "byte", addr=r2, dst=r3 ; word around q, call it "near-q"
extract src=r3, dst=r3 ; now r3 holds the byte *(char *)q
ld "byte", addr=r1, dst=r4 ; word around p, call it "near-p"
insert src=r3, dst=r4 ; insert the byte into "near-p"
st "byte", r1, r4 ; store "near-p" back
That's not so bad. To compare, I'm wondering how Alpha really deals
with bytes. Thanks.
<tdr.
I sure hope so.
>I also think there are instructions to insert/extract bytes (and, I
>hope, 16-bit words) using these pointers. So what I see for your
>instruction sequence is:
The speed of this is probably not so bad; there are lots of optimizations one
could do. I think making software do it well is going to be a mess,
though. Still, if it lets them make 200MIPS chips it's worth the
inefficiency.
>So, (char *) == (int *), modulo alignment restrictions. Remember, it
>IS still the Daughter-Of-Vax. :) They could have chosen to do 32-bit
>words this same way, but I guess they didn't for efficiency reasons.
I think this was a mistake. The summary says that it's not trying to add
things which will make it hard to make fast versions in the future. This
does just that: it requires alignment hardware which is in a critical path.
If they're going to have alignment hardware at all, I would have preferred if
they forgot about 32-bit words and did bytes instead. Perhaps in future
versions these will be software traps. Maybe they made the 32-bit handling
instructions look more like the operating system trap instructions than real
instructions.
>I rather like what I see of Alpha. (software-hacker-speaking alert)
Me too. I think the most interesting thing is its lack of condition flags.
Most importantly, no overflow or carry. Both of these flags are in critical
speed paths, so I think it was probably a good idea for DEC to eliminate
them. Also having zero test in an instruction eliminates that ugly big NOR
gate from a critical speed path. It will be interesting to see how the
software is made to handle signed compares. I think you might have to mask
bits out of the operands before doing the test (or just use a three operand
instruction).
Not having delayed branches is interesting too. I like the idea behind it,
and I think it might make for more interesting branch optimization hardware.
Perhaps future versions could detect and remember the tops of loops and do
some kind of automatic delayed branches by having extra execution hardware-
once we have a hundred times as many transistors available :-)
Now what they really should have done is completely eliminate integer
support :-) It should be possible to do everything using floating
point instructions only. Then they could just concentrate on making those
fast.
--
/* rca...@wpi.wpi.edu */ /* Amazing */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
> Now we've discovered that
>we can't provide fully consistent memory semantics at high performance, so
>there are various hacks to implement partially consistent memory, such as
>barriers, uncached memory pages, and even special banks of sharable
>memory.
Yes, once again, the "wrong answers fast" crowd are up to their tricks.
Weinberg said it years ago: you can make it run as fast as you like,
if it doesn't have to work.
> I'm at a loss to propose a scheme which is general enough to
>handle all of these along with whatever comes along in the next few years.
Suggestion: invest time, money and resources into doing the right thing,
rather than into repeatedly fudging around the fact that you've done
the wrong thing.
John> I wonder if write barriers are all that is needed to handle memory
John> hierarchy coherence between processors... take the following example:
No, not only must all writes have completed, but they must have completed to a
coherent level of the memory hierarchy. For example, suppose a chip has an on
chip, non coherent cache, and an external, coherent cache. The write barrier
must flush dirty items from the on chip cache to the external cache.
(Presumably there is some mechanism for the external cache to invalidate or
update the on chip cache, too.) The point is that the write barrier is an
explicit request for a certain kind of coherence at a certain point. This is
thought to be more efficient than enforcing coherence at every write (which
would also imply that writes to different locations could not be reordered).
--
J. Eliot B. Moss, Assistant Professor
Department of Computer Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206, 545-1249 (fax); Mo...@cs.umass.edu
>So what does this mean for a C compiler? Will long actually mean a
>64-bit int?
>
If you choose so, yes. You could imagine doing different C compilers
(like the ones on a 80386, shudder) that would treat this as either
long == int (and have long long be 64) or make long 64.
In principle. In practice, however, many programs out there assume
that a (void *) fits in a long (and even worse, an int), so such a
compiler would make many programs hard to bring up.
> prefetching hints: A fake load instruction that says to the cache
> to load the cache line without stalling the
> machine because the result is not put anywhere.
If the interlocks are done right, a load to the permanent zero register will
do this.
On some versions of the sparc, a load to the zero register followed by a use
of the register will result in a delay even though the loaded data is not
used. The IBM RT has a similar problem: r0 is zero when used in an address
but the processor waits for a load into r0 to complete before ignoring the
value. Older MIPS implementations stall the processor on a cache miss so
you can't use this method for prefetching a cache line.
--
John Carr (j...@athena.mit.edu)
1. Ignore it -- extremely BAD idea! Nothing your code does afterwards
can be trusted.
2. Prevent it -- extremely GOOD idea (in THEORY)! The earlier an error
is caught and fixed the less it costs the users. But nearly 25
years of research in proving correctness of programs, decades of
institutionalized structured methodologies and the accumulated wisdom
of millions of programmer-years since the days of Grace Hopper have
NOT yielded perfect programs of any significant size in ANY
application area. I don't suppose they ever will as long as there
are unsolvable and solvable but very complex problems about abstract
computers. What we do NOW is a very great improvement over the IAS
machine days; compilers, operating systems and 300,000 line finite
element codes would not exist if we didn't have "make", revision
control, chief programmer teams, structured walk-throughs and all
the other bug-prevention devices.
3. Find it after it has happened and fix it. Yes, there is a tradeoff
between performance/pipelining and error detection, but once an error
has occurred, the duty of the implementation is to shut down the
process as soon as possible and give the programmer as much help as
it can to find his mistake. Once an error has occurred, the fact
that succeeding "computation" is fully pipelined at 3.5 massively
parallel gigaflops is downright insulting to me as a programmer.
Not necessarily - the semantics of call/return could be defined to include
an implicit trap-sync. This is the way that the 80960 family handles the
problem. An explicit instruction (SYNCF) is provided, which guarantees that
any faults generated by preceding instructions have occurred. An implicit
SYNCF is performed at the start of call/return instructions and certain other
operations. This was sufficient to allow the correct implementation of the
Ada exception model.
Bob Bentley
Intel Corp., M/S JF1-19 E-mail: r...@ichips.intel.com
5200 N.E. Elam Young Parkway Phone: (503) 696-4728
Hillsboro, Oregon 97124 Fax: (503) 696-4515
Or maybe A-box = Address box, E-box = Execution box and F-box = floating point.
That seems more likely.
Bengt L.
--
Bengt Larsson - ben...@maths.lth.se
>In article <1992Feb26.1...@cs.cmu.edu> lind...@cs.cmu.edu (Donald Lindsay) writes:
>>I wrote:
>>>>As viewed from a second processor (including an I/O device), a
>>>>sequence of reads and writes issued by one processor may be arbitrarily
>>>>reordered by an implementation.
>>>Sounds fine. Except, surely, you require sequences of writes to the
>>>same location, to not occur out of order? I can't see how a barrier
>>>technique would be a fix for that.
There are two problems here: missed writes (i.e. writeback cache) for
a single variable and out-of-order writes for multiple variables. I can't
imagine ANY implementation getting writes to a single word out of order,
but some might be missed -- eliminated by a write-back cache. The same
write-back cache might cause writes to two different cells to occur out
of order.
>By the way, is there some "guarantee" that a memory write will be retired
>after so many cycles? If not, then EVERY single memory write may have to be
>prefixed with the <barrier> because most variables are assigned to more
>than once in most "interesting" programs. If there is no way of telling
>that the previous memory write was retired, then a <barrier> is needed
>to insure proper sequencing again.
The way to "guarantee" such is typically to use a write-thru cache.
However MOST variables should not need this protection (at least
that is the assumption behind the Alpha's semantics). You do need
to protect (a) memory mapped registers, (b) semaphores and locks.
That is, what you would usually term *volatile* variables. Since all
reads/writes are properly synchronized w.r.t. the CPU, it only
matters when *another* processor (PPU or CPU) gets involved. If
you want all writes to be synchronous... use another architecture.
(It seems a fair tradeoff to me.)
--
First there was Unix... Now there's AIX, AU/X, BSD, BSDI, Dynix, EP/IX, FTX,
Hurricane, HP-UX, Irix, Linux, Mach, Minix, Open Desktop, OSF/1, OSx, PC/IX,
Plan 9, Polyx, Posix, QNX, Risc/OS, Risc/ix, SCO Unix, Sinix, Solaris,
Sprite, SunOS, SVRx, Topaz, Tunis, Ultrix, Unicos, V, v10, Xenix, ..."
There is an explicit "FETCH address" command:
"This address is used to designate an aligned 512-byte block of data. An
implementation may optionally attempt to move all or part of this block (or
a larger surrounding block) of data to a faster-access part of the memory
hierarchy, in anticipation of subsequent Load or Store instructions that
access that data."
In other words, the FETCH will always designate at least 512-bytes. Since
the optimum size may change over the next 25 years, implementations may
interpret the FETCH to designate "larger surrounding blocks".
--
Ramsey Haddad <had...@decwrl.dec.com>
[example deleted]
>By the way, is there some "guarantee" that a memory write will be retired
>after so many cycles? If not, then EVERY single memory write may have to be
>prefixed with the <barrier> because most variables are assigned to more
>than once in most "interesting" programs. If there is no way of telling
>that the previous memory write was retired, then a <barrier> is needed
>to insure proper sequencing again.
I think people are getting a little carried away with this. This is
a non-problem because
1) you can't allow a memory location to be (visibly) updated out
of program order any more than you can allow a register to be
updated out of program order - you simply wouldn't be able
to program the beast.
2) you'd really have to try hard to design a system where this
could happen. What *can* easily happen, in multi-ported
phased-bank split-transaction memory systems that have been
common in mainframe design for the past 15-20 years, is writes
to *different* locations out of order. Thus when updating
a multi-word structure, barrier synchronization is needed
between producer & consumer. But writes to the same location
tend to get serialized in order, because the store instructions
are executed in order, they're in the same bank, etc. I'm sure
it is this type of situation that the Alpha spokesperson was
referring to, i.e. Alpha supports mainframe-type memory systems.
I'm sure one could design a memory system where out-of-order same-location
writes could occur; I'm also sure that it wouldn't buy anything in
performance and hence would be pointless. As Dan alludes to above,
it would hurt performance when used with the sequential execution
paradigm that we all know and love.
---------------------------------
Dave Christie My opinions, generalizations, and gut feelings only.
I'm always happy to be enlightened.
1) It seems perfectly clear that the current Alpha implementations are
targeted at highest-performance applications for certain kinds of systems,
and certainly not at truly high-volume designs. This is instantly
obvious if you talk to any chip person who's looked at second-sourcing
the chips, because the process is sufficiently unlike high-volume
merchant chips. All of this makes perfect sense for DEC to do, of course.
2) It also seems clear that the current implementations, and some of
the architecture are aimed at high-performance, with some limitations in
flexibility. For example, from past history, if you want to use chips in
designs where you DON'T have complete control over the peripherals,
but want access to a wide variety of low-cost ones, then the chip must adapt
to the environment, rather than the reverse. In particular, the straightforward
way to use many peripherals, in practice, requires:
a) Write C device drivers, that use:
b) Simple structs to describe the device register layout
c) Containing whatever mixture of chars, shorts, ints, etc is needed.
d) Use volatile pointers to these.
e) Generate ordinary code that does not cause extra accesses
either read or write.
In general, from PAINFUL experience, any environment where a volatile
variable reference doesn't generate exactly one load or store to exactly
the size of the data object, will cause some poor systems programmer
nightmares. Even simple write-buffering has been known to cause great pain,
if it can take effect on uncached accesses.
3) So, this [having no byte/short operations] is probably reasonable
for building large machines [which aren't directly close to device operations
anyway]; or workstations with constrained environments [like, don't expect
to use 3rd-party drivers and boards]. Most user-level code will be fine,
as the number of (especially) short operations is fairly low.
Kernel-level code will suffer a little, of course, as it uses more 16-bit
operations, given the dense packing of data structures, networking code, etc.
If the overall performance makes up for it, that's fine.
There will be some classes of machines that will be difficult to
build, AND be able to use in practice, i.e., if a board vendor supplies
a driver, and you tell them they've got to massively restructure it,
you're either going to have to spend a lot of money, or it won't happen.
There are ways to work around this stuff, but sometimes it means you
don't get to use widely-available and/or standard support chips, or you have
to stick in extra translations, etc.
4) This discussion came up, in essentially the same form, several years ago with
the AMD 29K, I think.
5) It might be useful in any further discussion on this to identify
one's background, i.e., given the point of view expressed above,
I've found a common separation:
a) hardware designers would love to get rid of short loads/stores
b) systems programmers would rather not, because they have to
deal with the consequences...
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems M/S 5-03, 950 DeGuigne, Sunnyvale, CA 94086-3650
>I doubt it. Not to start the "volatile" flame war again, but although the
>ANSI C rationale claims that volatile variables are intended for
>communication with memory mapped I/O devices, the only use the standard
>itself makes of them is for communication between mainline routines and
An interesting data point: I did a grep -c volatile in a few directories
of our UNIX; in 5 minutes, I found 800 instances....
In general, all of the device-descriptive info uses this a lot.
>prefetching hints
(FETCH) This instruction goes off the chip to tell the outside caches that
in the near future the CPU will access a certain page. If the external memory
subsystem is smart enough, it will prefetch that page, so by the time the CPU
needs it the cache is ready to deliver.
>granularity hints
PTEs (page table entries) can point to different page sizes; the granularity
hint bits specify the page size.
>ASN
Address Space Number, each process has a unique ASN, so upon context switch
it is possible to avoid full TLB flush or ICache flush. (the method is left
as an exercise for the reader)
jg> There is no integer divide instruction.
borasky> Does this mean I can't use integer divides for 25 years?
jg> There is no floating square root instruction.
borasky> Does this mean I can't use floating square roots for 25 years?
There are many algorithms for doing fast integer division. (left as an
exercise for the reader)
Method 1: (memory is cheap and it is getting cheaper) generate a table of
reciprocals and use multiply.
Method 2: use the Fbox (floating point box) division.
And finally I want to repeat Jim Gettys:
"Please don't confuse the Alpha architecture with the specific details of the
current 21064 implementation."
bor...@ogicse.ogi.edu (M. Edward Borasky) writes:
>propose that one has made decisions today that will scale in some
>sense over the next 25 years, in the face of unknown market and techno-
>logy (hardware and ESPECIALLY SOFTWARE) forces and constraints is, I
>think, premature. And especially if the Alpha is to be used as a node
Let history prove/disprove it.
--
-------------------------------------------------------------------------------
Homayoon Akhiani "Turning Ideas into ... Reality"
Digital Equipment Corporation "Alpha, The New Beginning"
77 Reed Rd. Hudson, MA 01701 "All Rights Reserved. Copyright(c)1992"
Email: akh...@ricks.enet.dec.com "I talk for myself and not for my company"
"Feb 25, 1992: The day of the eclipse"
-------------------------------------------------------------------------------
If your memory system involves a network with randomized routing
(as already exists on the CM-5, I believe), writes to the same
destination may be reordered.
Rishiyur Nikhil
DEC Cambridge Research Lab
> On-chip I-cache 8 Kbyte, physical, direct-mapped,
> 32-byte line, 32-byte fill, 64 ASNs
Burkhard> What's an ASN ?
ASN = Address Space Number. If you're not doing multiple threads per address
space, this is the "process id". Anyway, it distinguishes between different
"user's"/"job's" addressing contexts.
How many ASN bits (aka PID tags in the TLB)?
--
Andy Glew, gl...@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Pkwy,
Hillsboro, Oregon 97124-6497
This is a private posting; it does not indicate opinions or positions
of Intel Corp.
Intel Inside (tm)
>>> In article <1992Feb25.1...@crl.dec.com>, j...@crl.dec.com (Jim Gettys) writes:
>>Someone (I can't remember who) once suggested PDP-64.
Ummm...., I like this.
But PDP stands for Programmed Data Processor; "Programmed" means microcode...
How about DECsystem64?
--
Intuition, experience, show, and bluff
Hiroaki Sasaki | 2630 Walsh Ave.
Kubota Pacific Computer Inc.| Santa Clara CA 95051 USA
Customer Support | E-Mail hi...@galle.kpc.com
|> Probably not too bad, since the regions tend to be quite large (at least
|> in Ada), often covering an entire procedure body. Note that this problem
|> appears in architectures with imprecise FP exceptions, too, such as the
|> Moto 88K.
|>
And it was such a pain to deal with in the Moto 88100 they made all
exceptions precise in the 88110. Debugging a segmentation violation
when you don't know exactly where or when it occurred is a pain.
The SciFi author James P. Hogan (former DEC Salescritter) had PDP-64's
in one of his books.
poster#2 >>Yes, they did. And yes, someone else heard it.
poster#2 >>I refrain from further comment.
Actually what they said was, the Viking chips run at 50MHz
on the chip-tester. The computer systems in the lab presently
(i.e. on 20 Feb 1992) are being operated at 40MHz.
Along this line, the DEC speaker said that on the chip-tester,
a few Alpha parts fall into the "fast bin" which is 200MHz.
Most of the parts on the chip-tester fall into the "normal bin"
which is 170MHz.
The "fact sheet" published by DEC as part of its press kit on the
21064 microprocessor (Alpha) calls it 150MHz, with a "Peak
instruction execution of 300 MIPS." The fact sheet doesn't
state what clock speed the computer systems in DEC's lab are
operating at (on 20 Feb 1992).
--
Mark Johnson ma...@microunity.com (408) 734-8100
>All this implies that either 1) loads/stores ignore the bottom 3 bits
>and round down a la IBM RT, or 2) there are special (char *)
>load/stores that ignore the bottom 3 bits, and the regular load/stores
>trap on those three bits != 0.
>Has anybody else done something like this?
Yes: the original Stanford MIPS had only word load/store, and used byte
extract/insert instructions to do byte ops. The commercial MIPS
abandoned this approach after measurements showed about 15% performance
loss (or maybe I'm remembering measurements I did).
The DEC Titan processor also used this approach. I don't recall
whether there were separate forms of load/store that ignored the
low bits of the address & didn't trap.
Note that the cost of loading/storing bytes is not quite as bad as it
appears; while an individual byte load/store might take 5 or 6 cycles,
when the bytes are part of a byte string, and accessed in a loop,
unrolling the loop can make a lot of the intermediate load/stores go
away. It might be nice to have some kind of test to see if incrementing or
decrementing an address crossed a word boundary. Conditional ops could
use this to avoid a lot of the cost.
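For concreteness, here is roughly what the synthesized sequences do, sketched
in C (little-endian, function names invented; the real Alpha sequences use
dedicated extract/insert/mask instructions rather than generic shifts):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of how byte loads/stores are synthesized on an ISA with only
   aligned quadword loads and stores (little-endian; names invented). */

uint8_t load_byte(const uint64_t *mem, size_t addr)
{
    uint64_t q = mem[addr >> 3];              /* aligned quadword load */
    return (uint8_t)(q >> ((addr & 7) * 8));  /* extract the byte      */
}

/* The store is the expensive direction: read, merge, write back. */
void store_byte(uint64_t *mem, size_t addr, uint8_t v)
{
    unsigned shift = (unsigned)(addr & 7) * 8;
    uint64_t mask  = (uint64_t)0xff << shift;
    mem[addr >> 3] = (mem[addr >> 3] & ~mask) | ((uint64_t)v << shift);
}
```

In a loop over a byte string, the quadword load and the read-modify-write
can be hoisted and merged across iterations, which is why unrolling
recovers much of the cost.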
The IBM RT can generate imprecise interrupts: loads and stores are
overlapped with other instructions and memory faults are not detected when
the instruction is started. Floating point hardware is memory mapped, so
FP exceptions are also imprecise. On a memory error, the processor saves 3
words of state: the access type (load/store, byte/halfword/word), the
register being loaded or stored, the memory address, and the data being
stored.
When "-g" is used, the RT C compiler puts a sync instruction after every C
source line. This makes sure that even though the processor may be
executing a different machine instruction when it detects a fault, the
debugger will show you the correct source line.
It is not impossible to debug code compiled without -g on the RT. It does
require some understanding of the processor to find the instruction that
caused an exception, but this is usually true of assembly-level debugging.
Often it is sufficient to know what the bad memory address was without
knowing the exact instruction.
It would take only one addition to the information recorded on a memory
error to make debugging much easier: the address of the instruction that
generated the exception (in general this would require passing a few extra
bits per instruction through the pipeline and having the instruction
dispatch unit maintain a table mapping these tags to instruction addresses;
if this had been done for the RT it would have been easier because only
memory accesses can generate imprecise interrupts).
--
John Carr (j...@athena.mit.edu)
Posted: Thu Feb 27 12:30:10 1992
]
] In article <> her...@crl.dec.com (Maurice Herlihy) writes:
]
] >>> In article <>, j...@crl.dec.com (Jim Gettys) writes:
] >>Someone (I can't remember who) once suggested PDP-64.
]
] Ummm...., I like this.
] But PDP stands for Programmed Data Processor; "Programmed" means maicrocode...
]
] How about DECsystem64.
How about VRAX-11/mmm : VAX RISC Architecture Xtension to the PDP-11.
(remembering that VAX stands for Virtual Architecture Extension: I do
love nested acronyms. :-) ) Although I doubt the Alpha has a PDP-11
emulation mode, the way the original VAXen did.
--
Dennis O'Connor doco...@sedona.intel.com
Not an official representative of Intel.
>But PDP stands for Programed Data Prcessor, "Programed" means maicrocode...
"Macrocode" or "microcode"? If "microcode", then it only means that if
it includes both "microcode" in the sense of "code that implements an
interpreter for the machine's official instruction set" and
"instructions in the regular instruction set with individual fields to
control individual operations", because there are PDPs that have
"microcode" in the first sense of the word but not the second (e.g.,
most of the PDP-11s - but not all, I think) and "microcode" in the
second sense of the word but not the first (e.g., the PDP-8, with its
"operate"-class instructions; most, if not all, implementations of the
PDP-8 architecture were hardwired, as far as I know).
How does the newly announced superscalar HP PA-RISC 7100 match up against
the Sun Viking and the DEC Alpha?
Initially, the 7100 will run at 100 MHz.
Bo
--
^ Bo Thide'--------------------------------------------------------------
|I| Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
|R| Phone: (+46) 18-303671. Fax: (+46) 18-403100. IP: 130.238.184.19
/|F|\ INTERNET: b...@irfu.se UUCP: ...!mcvax!sunic!irfu!bt
~~U~~ ----------------------------------------------------------------sm5dfw-
Boy, is this *sour grapes* or what? It's obvious that if every other
instruction for the Alpha is a barrier, then you'll get fully consistent
memory at a cost of significant slowdown. The point is: *not* *every*
*application* *needs* fully consistent memory after each instruction.
Alpha's semantics allow such applications to take advantage of the speedup
that results when consistency can be relaxed. Sounds like a fine idea to
me.
The Intel i960CA has a bit that can be set to force precise
exceptions, or cleared to allow imprecise exceptions. The danger
with imprecise faults (on the 960CA) is that the sources of
the operation that faulted may have been modified by subsequent
instructions, and might therefore be unrecoverable.
I imagine any processor with sophisticated scoreboarding logic
(e.g. a pipelined super-scalar processor) can furnish this
flag-controllable capability in the scoreboard logic.
STAR-100 data flag branch, anyone ???
:-)
Rob
---
-legal mumbo jumbo follows-
This mail/post only reflects the opinions of the poster (author),
and in no manner reflects any corporate policy, statement, opinion,
or other expression by Network Systems Corporation.
Is that right? I thought VAX stands for
Virtual Address Indexing. Correct me if I am wrong.
--
K J Chang, Hewlett-Packard ICBD R & D, (())_-_(())
Palo Alto, CA 94304 | (* *) |
Internet: kjc...@hpl.hp.com a UCLA Bruin --> { \_@_/ }
voice mail: (415)857-4604 `-----'
In a current R3000 kernel [with a huge # of drivers], I found these static
frequencies:
  358,424 instructions, of which
   68,118  19.0% are lw  (load word)
   38,342  10.7% are sw  (store word)
  -------
  106,460  29.7% of total are 32-bit load/stores
                                                     Alpha extra  % x
                                                       cycles     cycles
    2,236  0.62% are lh  (load halfword)                  2        1.24
    3,306  0.92% are lhu (load halfword, zero-extend)     1        0.92
    3,893  1.08% are sh  (store halfword)                 4        4.32
      705  0.02% are lb  (load byte)                      2        0.04
    3,730  1.21% are lbu (load byte, zero-extend)         1        1.21
    4,417  1.23% are sb  (store byte)                     4        4.92
  -------                                                         -----
   18,270  5.1 % of total are partial-word load/stores            12.65
It is of course not a good assumption that dynamic frequency == static;
I'll do it anyway, and in this case I believe it's actually reasonable.
The usual rule of thumb is 20% loads, 10% stores for dynamic frequency,
but kernel code is (predictably) higher in load/store usage, both dynamic
and static. The numbers at right above are quick approximations of the
additional cycle counts for using the Alpha sequences in place of the MIPS
ones, using the sequences from Alpha manual A.4.1
[which are the normal case, assuming
that you know the offset within the word of the halfword or byte]; some cases
will be worse, some will be better, for various reasons.
As can be seen, the major penalties arise from the extra instructions for stores.
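The right-hand column is just static frequency times extra cycles per
reference; a throwaway sketch recomputing the total as a sanity check
(percentages copied from the table above):

```c
/* Recompute the right-hand column of the table: static frequency
   (percent of instructions) times extra Alpha cycles per reference,
   summed over the six partial-word operations. */
double extra_cycles_total(void)
{
    double pct[6]    = { 0.62, 0.92, 1.08, 0.02, 1.21, 1.23 }; /* lh lhu sh lb lbu sb */
    int    cycles[6] = { 2,    1,    4,    2,    1,    4    };
    double total = 0.0;
    for (int i = 0; i < 6; i++)
        total += pct[i] * cycles[i];
    return total;  /* extra cycles per 100 static instructions */
}
```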
>Note that the cost of loading/storing bytes is not quite as bad as it
>appears; while an individual byte load/store might take 5 or 6 cycles,
>when the bytes are part of a byte string, and accessed in a loop,
>unrolling the loop can make a lot of the intermediate load/stores go
>away. It might be nice to have some kind of test to see if incrementing or
>decrementing an address crossed a word boundary. Conditional ops could
>use this to avoid a lot of the cost.
I'd suspect that relatively few of the above static loads/stores are
byte copies, but rather simple accesses to variables, most commonly to
a) elements of structures accessed through pointers
b) u-area [per-process data] elements
c) local variables.
The APA16 was an interesting device. I thought that the interface to the
hardware was pretty good. There was an area below the screen where you could
queue up instructions. (Actually you can execute the stuff on the screen too,
but i wouldn't recommend it.) Then you write to a location and it increments
the number of instructions to execute.
It had some weird features, for example you could address the screen with the
bits grouped vertically. When you did a copy, you could specify a 90-degree
rotation, although i wonder if anyone uses this. I'm not sure what these
features say about the hardware.
The thing i found most useful was that you can cache some characters in the
area below the screen. Then when you want to draw a character, you can tell
the hardware to do the blit, and you don't have to wait for it to finish. I
got it down to about a dozen instructions to draw a character in the case when
it is cached. Even redrawing a full page looked almost instantaneous.
Then we could talk about the Viking, which had downloadable microcode.
--
Joe Keane, professional programmer
j...@osc.com (uunet!amdcad!osc!jgk)
>Is X11 the only popular window system that supports multiple logical
>operations for drawing commands?
>--
> John Carr (j...@athena.mit.edu)
As far as I know, my homebrew ``sort-of-like-NeWS'' PostScript window
system is the only one, popular or otherwise, which doesn't allow
all those nasty logical operations. The nice thing is that it forced me
to do real software ``overlays'' for rubber-banding, which means
that rubber band lines don't disappear over complex backgrounds, like
those terrible XOR thingies:-)
--
Ian D. Kemmish Tel. +44 767 601 361
18 Durham Close uad...@dircon.co.uk
Biggleswade uknet!dircon!uad1077
Beds SG18 8HZ United Kingdom
I beseech Digital and any other Alpha C compiler writers to have mercy
upon us poor software engineers who are stuck with sizeof(long int) == 4
code (heck, forget code, how about DATA FILES that are just structs
written to disk!?!?). Please provide compatibility switches or whatever
that makes sizeof(long int) == sizeof(char *) == sizeof(int) == 4.
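The data-file hazard is concrete. A sketch (field names invented) of the
kind of code that breaks: a struct dumped to disk with fwrite() silently
changes size when long grows from 32 to 64 bits:

```c
#include <stdio.h>

/* Hypothetical on-disk header, written as a raw struct. With 32-bit
   long this occupies 8 bytes; compile the same source where long is
   64 bits and it becomes 16, so old data files no longer read back. */
struct hdr {
    long magic;
    long count;
};

void write_hdr(FILE *f, const struct hdr *h)
{
    fwrite(h, sizeof *h, 1, f);   /* the layout is baked into the file */
}
```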
True 64 bit integer math isn't worth the portability headaches of
making long int be 64 bits.
The real world has 32 bits on the brain and it will take a long time
before that changes.
Perry
P.S.
I didn't see anything in the ANSI C spec that allowed "long long" or
such other new syntax. Are we going to have a plethora of nonstandard
extensions?!?
--
ca...@adobe.com ...!{sun,decwrl}!adobe!caro Contents: my opinions, no others
Could I deduce from this that Alpha is also a multithreaded processor? According to
Sites's summary, Alpha is also suitable for shared-memory multiprocessors.
I would imagine that multithreading would be used for tolerating
memory latency! But in the technical summary of the Alpha architecture, no
mechanisms for tolerating latency are mentioned, except for the issue order of
memory accesses! Am I right? Please correct me!
Thanks in advance
Xiaoming Fan
Dept. of Computer Science
Univ. of Hamburg
Germany
Just off the top of my head, it seems that it would be easy to use
most third-party devices. Just have them ignore the low-order address
bits and change your "struct dev {char control, data;}" to
"struct dev {int control, data;}". The other points (about packed
data structures for networking) are well taken.
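A sketch of that suggestion (register names invented): widen each byte-wide
device register to a full word so a word-only CPU reaches it with an
ordinary aligned access, and wire the device to ignore the low-order
address bits:

```c
/* Hypothetical device register block (names invented). */

/* Before: byte-wide registers packed at adjacent byte addresses --
   unreachable with only word-sized loads and stores. */
struct dev_old {
    char control;
    char data;
};

/* After: each register widened to a full word, so a word-only CPU
   reaches it with an ordinary aligned load/store, and the device
   simply ignores the low-order address bits it is wired past. */
struct dev_new {
    int control;
    int data;
};
```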
[more stuff deleted]
>5) It might be useful in any further discussion on this to identify
>one's background, i.e., given the point of view expressed above,
>I've found a common separation:
> a) hardware designers would love to get rid of short loads/stores
> b) systems programmers would rather not, because they have to
> deal with the consequences...
>--
>-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
For the record, I'm an undecided systems programmer.
--
Iain Bason
..uunet!caliban!ig
It's certainly an interesting idea, and not one that DEC thought up on
its own. Another example is the 88110, which allows reads to bypass writes
in the load/store unit (provided they're separate addresses).
There's been a lot of work looking at relaxed consistency models (e.g.,
weak consistency, processor consistency, release consistency) and how
one would program an application so that it *appears* sequentially
consistent to that application. Following this is a subset of a
bibliography posted to comp.os.research by John Carter from Rice Univ.
-Erich
------------------------------------------------------------------------------
Erich Nahum A305 Lederle Graduate Research Center
Real-Time Systems Group Department of Computer Science
na...@cs.umass.edu University of Massachusetts at Amherst
(413) 545-4753 Amherst, MA 01003
----------------------------- cut here ---------------------------------------
@CONFERENCE{advehill90b,
AUTHOR = {S. Adve and M. Hill},
TITLE = {Implementing Sequential Consistency in Cache-Based Systems},
BOOKTITLE = ICPP90,
PAGES = {47-50},
MONTH = aug,
YEAR = 1990}
@CONFERENCE {attiyawelch91,
AUTHOR = {H. Attiya and J. L. Welch},
TITLE = {Sequential Consistency versus Linearizability},
BOOKTITLE = SPAA91,
ADDRESS = {Hilton Head, South Carolina},
MONTH = jul,
YEAR = 1991,
PAGES = {304-315}}
@TECHREPORT{bershadzekauskas91,
AUTHOR = {B.N. Bershad and M.J. Zekauskas},
TITLE = {Midway: Shared Memory Parallel Programming with Entry
Consistency for Distributed Memory Multiprocessors},
INSTITUTION = {Carnegie-Mellon University},
YEAR = {1991},
NUMBER = {CMU-CS-91-170},
MONTH = sep}
@CONFERENCE{duboisscheurich86,
AUTHOR = {Michel Dubois and Christoph Scheurich and Fay\'{e}~A. Briggs},
TITLE = {Memory access buffering in multiprocessors},
BOOKTITLE = SIGARCH86,
MONTH = may,
YEAR = 1986,
PAGES = {434-442}}
@ARTICLE{duboisscheurich88,
AUTHOR = {M. Dubois and C. Scheurich and F.A. Briggs},
TITLE = {Synchronization, coherence, and event ordering in
multiprocessors},
JOURNAL = {{IEEE} Computer},
VOLUME = 21,
NUMBER = 2,
PAGES = {9-21},
MONTH = feb,
YEAR = 1988}
@ARTICLE{duboisscheurich90,
AUTHOR = {M. Dubois and C. Scheurich},
TITLE = {Memory Access Dependencies in Shared-Memory Multiprocessors},
JOURNAL = IEEE-TC,
VOLUME = {16},
NUMBER = {6},
PAGES = {660-673},
MONTH = jun,
YEAR = 1990}
@CONFERENCE {gharachorloolenoski90,
AUTHOR = {K. Gharachorloo and D. Lenoski and J. Laudon and
P. Gibbons and A. Gupta and J. Hennessy},
TITLE = {Memory Consistency and Event Ordering in Scalable
Shared-Memory Multiprocessors},
BOOKTITLE = sigarch90,
ADDRESS = {Seattle, Washington},
MONTH = may,
YEAR = 1990,
PAGES = {15-26}}
@CONFERENCE{gharachorloogupta91,
AUTHOR = {K. Gharachorloo and A. Gupta and J. Hennessy},
TITLE = {Performance Evaluations of Memory Consistency Models
for Shared-Memory Multiprocessors},
BOOKTITLE = ASPLOS4,
YEAR = 1991,
MONTH = apr}
@CONFERENCE {gibbonsmeritt91,
AUTHOR = {P.B. Gibbons and M. Merritt and K. Gharachorloo},
TITLE = {Proving Sequential Consistency of High-Performance Shared Memory},
BOOKTITLE = SPAA91,
ADDRESS = {Hilton Head, South Carolina},
MONTH = jul,
YEAR = 1991,
PAGES = {292-303}}
@TECHREPORT{goodman91,
AUTHOR = {J.R. Goodman},
TITLE = {Cache consistency and sequential consistency},
INSTITUTION = {University of Wisconsin-Madison},
YEAR = 1991,
NUMBER = {CS-1006},
MONTH = feb}
@CONFERENCE {huttoahamad90,
AUTHOR = {P.W. Hutto and M. Ahamad},
TITLE = {Slow Memory: Weakening Consistency to Enhance Concurrency
in Distributed Shared Memories},
BOOKTITLE = DCS90,
ADDRESS = {Paris, France},
MONTH = may,
YEAR = 1990,
PAGES = {302-311}}
@UNPUBLISHED{kelehercox91,
AUTHOR = {P. Keleher and A. Cox and W. Zwaenepoel},
TITLE = {Lazy Consistency for Software Distributed Shared Memory},
NOTE = {To appear at the 18th Annual International Symposium on
Computer Architecture},
MONTH = may,
YEAR = 1992}
* Me too. I think the most interesting thing is its lack of condition flags.
* Most importantly, no overflow or carry. Both of these flags are in critical
* speed paths, so I think it was probably a good idea for DEC to eliminate
* them.
Unfortunately overflow is important for certain languages. Lisp
implementations need to determine when word-integer operations have
overflowed in order to use the bignum (extended integer) operations.
Other languages have finite range integers but specify that overflow
exceptions must be signalled by the implementation.
Are overflow and carry inherently hard, or only because a hidden
register (condition code register) is needed to store them? I can
easily imagine an architecture with two add (and subtract)
instructions. One set would compute the same values that the current
set does. The other would compute only the carry and overflow bits.
I believe it is only that condition codes are needed. But are condition
codes hard? There are ways of computing overflow and carry without
doing the addition, but is the cost of the extra instruction worth
the minor additional cost of including them in the addition? Or are
we so determined that a procedure can only have one result that we
insist on cutting off our noses to spite our faces?
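Either style is cheap in plain code. A sketch of the usual result-based
tests, plus the add-free carry test Rubin alludes to (64-bit operands
assumed; function names invented):

```c
#include <stdint.h>

/* Carry out of a+b, derived from the wrapped sum. */
int add_carry(uint64_t a, uint64_t b)
{
    return (uint64_t)(a + b) < a;
}

/* The same test without performing the addition at all:
   a + b >= 2^64  iff  a > ~b. */
int add_carry_noadd(uint64_t a, uint64_t b)
{
    return a > (uint64_t)~b;
}

/* Signed overflow: the operands agree in sign but the sum does not.
   The add itself is done unsigned to keep the C well-defined. */
int add_overflow(int64_t a, int64_t b)
{
    int64_t s = (int64_t)((uint64_t)a + (uint64_t)b);
    return ((a ^ s) & (b ^ s)) < 0;
}
```

This is the kind of sequence a Lisp implementation would wrap around a
fixnum add to decide when to spill into bignums on a flag-less machine.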
If hardware can produce something at a cost of C in hardware, and
can produce that and other useful things at a cost of 1.01*C, including
the other things should not be dismissed because of complications in
languages not providing for them, etc.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@pop.stat.purdue.edu (Internet, bitnet)
{purdue,pur-ee}!pop.stat!hrubin(UUCP)
Uh, you haven't been reading the fine print. Early releases of the 29000
did not do byte accesses. Later ones *do*, although they retained the
facilities the older ones had for synthesizing them -- it remains an
option for the system designer whether he wants to implement them in
hardware or software. I'd be very surprised if the 29050 didn't do
them, although my docs are not handy.
--
The X Window system is not layered, and | Henry Spencer @ U of Toronto Zoology
it was not designed. -Shane P. McCarron | he...@zoo.toronto.edu utzoo!henry
There's been a lot of work looking at relaxed consistency models (e.g.,
weak consistency, processor consistency, release consistency) and how
one would program an application so that it *appears* sequentially
consistent to that application. Following this is a subset of a
bibliography posted to comp.os.research by John Carter from Rice Univ.
-Erich
And the beat goes on. From the advanced program for the upcoming
computer architecture symposium in Australia.
Session 1: Consistency for Shared Memory Multi-Processors
Arvind, MIT (USA)
Richard Noah Zucker, Jean-Loup Baer
A Performance Study of Memory Consistency Models
Pete Keleher, Alan L. Cox, and Willy Zwaenepoel
Lazy Consistency for Software Distributed Shared Memory
Kourosh Gharachorloo, Anoop Gupta, John Hennessy
Hiding Memory Latency using Dynamic Scheduling in
Shared-Memory Multiprocessors
See you in Australia!
Allan Gottlieb
New York University
Program Committee Chairman
1992 International Symposium on Computer Architecture (ISCA)
I wish to register my displeasure at this. The whole point of
producing a 64-bit processor is to be able to do 64-bit arithmetic.
If you still have programmers that hardwire sizeof(long int) ==
sizeof(char *) == sizeof(int) == 4 into code, then I suggest you have
two alternatives:
1. Fire the bastards.
2. Lock them in a room with a version of K&R that is not more than 4 years
old.
If you have old code that was written under these assumptions, then you do
indeed have a problem, but you should not try to hold the rest of the world
back because of some poor design decisions made by someone else.
>I didn't see anything in the ANSI C spec that allowed "long long" or
>such other new syntax. Are we going to have a plethora of nonstandard
>extensions?!?
The ANSI C spec also does not specify sizeof(long int) == sizeof(char*)
== sizeof(int) == 4. long long has been adopted by a number of people,
including gnu.
Kevin McCurley
Sandia National Laboratories
>I beseech Digital and any other Alpha C compiler writers to have mercy
>upon us poor software engineers who are stuck with sizeof(long int) == 4
>code (heck, forget code, how about DATA FILES that are just structs
>written to disk!?!?). Please provide compatibility switches or whatever
>that makes sizeof(long int) == sizeof(char *) == sizeof(int) == 4.
>True 64 bit integer math isn't worth the portability headaches of
>making long int be 64 bits.
>
>The real world has 32 bits on the brain and it will take a long time
>before that changes.
>I didn't see anything in the ANSI C spec that allowed "long long" or
>such other new syntax. Are we going to have a plethora of nonstandard
>extensions?!?
Personally, I think longlong is an awful syntax, but it's out there in
GNU C and some other compilers, and various others are coming out with it,
and it's a portable way to talk about 64-bit ints that works across both
32- and 64-bit machines, so here it comes.
There have been many arguments here [and among a small group of companies
that has a very strong interest in this topic] about the right defaults
and sets of options for going from 32-bit C to 64-bit C.
Although not everyone agrees with me, I conclude that regardless of the
defaults, you need to be able to handle things with storage layouts that
look the same as the 32-bit versions, at least for everything except pointers,
which fortunately seldom show up in data structures that get exchanged among
programs frequently. You will certainly run into programs that will compile
trivially if you can support the standard 32-bit model. On the other hand,
code had better start getting cleaned up to take advantage of 64-bit addresses...
One way or another, all of the widely-used RISCs either do, or will
within the next few years, support 64-bit programming models.
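In practice, "storage layouts that look the same as the 32-bit versions"
means pinning field widths explicitly instead of spelling them long. A
sketch (typedef and field names invented; it assumes int is 32 bits on
both targets, which holds for the RISCs under discussion):

```c
/* Pin the record layout to explicit widths instead of 'long', so the
   bytes are identical under 32- and 64-bit compilation models. */
typedef int i32;   /* assumption: int is 32 bits on both targets */

struct record {
    i32 id;
    i32 offset;
    i32 length;
};  /* 12 bytes either way; no pointers, so it can be exchanged freely */
```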
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
Overflow and carry are not exceptionally hard, although they are at
the end of the carry network in the adder.
The condition code side effect is usually the problem in pipelined
implementations, because of the prioritization required for accesses.
However, condition codes can be rather cleanly handled by advanced
implementations, to the point where I almost wish that there were more
things that could be done with condition codes. The problem with the
traditional condition codes is that most of them contain information
that can be directly deduced from the result, and those that don't, OF
and CF, aren't really very frequently used.
--
Andy Glew, gl...@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Pkwy,
Hillsboro, Oregon 97124-6497
This is a private posting; it does not indicate opinions or positions
of Intel Corp.
Intel Inside (tm)
4.11.3 Memory Barrier
Format:
MB !Memory format
Operation:
{Guarantee that all subsequent loads or stores will not access
memory until after all previous loads and stores have accessed
memory, as observed by other processors.}
Exceptions:
None
Instruction mnemonics:
MB Memory Barrier
Qualifiers:
None
Description:
The use of the Memory Barrier (MB) instruction is required only in
multiprocessor systems.
In the absence of an MB instruction, loads and stores to different
physical locations are allowed to complete out of order on the
issuing processor as observed by other processors. The MB
instruction allows memory accesses to be serialized on the issuing
processor as observed by other processors. See Chapter 5 [which I
have no intention of reproducing here! PK] for details on using the
MB instruction to serialize these accesses. Chapter 5 also details
coordinating memory access across processors.
Note that MB ensures serialization only; it does not necessarily
accelerate the progress of memory operations.
4.11.5 Trap Barrier
Format:
TRAPB !Memory format
Operation:
{Stall instruction issuing until all prior instructions are
guaranteed to complete without incurring arithmetic traps.}
Exceptions:
None
Instruction mnemonics:
TRAPB Trap Barrier
Qualifiers:
None
Description:
The TRAPB instruction allows software to guarantee that in a
pipelined implementation, all previous arithmetic instructions will
complete without incurring any arithmetic traps before any
instructions after the TRAPB are issued. For example, TRAPB should
be used before changing an exception handler to ensure that all
exceptions on previous instructions are processed in the current
exception-handling environment.
These sections are typical of the Alpha architecture in that they specify
behavior without specifying mechanisms.
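The canonical use of MB is flag-passing between processors. A sketch
(variable names invented), written with a full C11 memory fence standing
in for the MB instruction:

```c
#include <stdatomic.h>

static int shared_data;     /* payload: written before the flag      */
static atomic_int ready;    /* flag observed by the other processor  */

/* Producer: without the fence, another processor could legally see
   ready == 1 before it sees the write to shared_data. */
void publish(int v)
{
    shared_data = v;
    atomic_thread_fence(memory_order_seq_cst);  /* stands in for MB */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer (on another processor): the matching fence orders the
   flag read before the data read. */
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;  /* spin */
    atomic_thread_fence(memory_order_seq_cst);  /* stands in for MB */
    return shared_data;
}
```

Note that both sides need a barrier: MB orders the issuing processor's
accesses as observed by others, so the producer alone is not enough.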
---Pete
kai...@heron.enet.dec.com
+33 92.95.62.97
Your basic point has merit. Unfortunately, if you take the mathematical
limit of your function NC = 1.01*C, you get infinity if you "add" enough
features. Crays have not had condition codes from day one, and the
"different" way of testing things has not restricted their usefulness
in the least. In fact, it probably has a beneficial effect on program
correctness.... it is often too easy to do a comparison followed by a
few other instructions (one of which affects the condition code) and
then make a bogus branch. Can't do this on a Cray.
It would also seem that no matter how deep the pipeline, if you need to
"trap" an overflow (or whatever) you can contrive an instruction stream
that will allow you to do so. Granted, you might have to insert a bunch
of nops or whatever so that only one overflow can occur in the pipe, but
at least you will know WHAT overflowed. Then, inserting BIGNUM or anything
else would be straightforward.
--
!Robert Hyatt Computer and Information Sciences !
!hy...@cis.uab.edu University of Alabama at Birmingham !
> Personally, I think longlong is an awful syntax, but it's out there in
> GNU C and some other compilers, and various others are coming out with it,
> and it's a portable way to talk about 64-bit ints that works across both
> 32- and 64-bit machines, so here it comes.
Well, when I was on X3J11 I suggested allowing a syntax something like
FORTRAN's: int*4, int*8, etc., where N could be any power of two
and there was no hardware support implied; it could be as slow as needed
as long as it worked. A few people liked the idea, but most didn't like using
something from FORTRAN, thought it would add a lot of size to the
library, etc, etc.
So now we have a bunch of indeterminate-sized data objects which people
diddle to try and get the compiler to use as they want.
It was pointed out to me that we did add signed bitfields to the
language, so you can get around file I/O problems to a reasonable extent
by using them in a structure. However the keyword 'packed' for structures
was not added, so some compilers have a pragma, some have a command line
option, and some don't support it. The argument was that it shouldn't be
in the language because it's not a common problem. I guess it wasn't
then, but right now I don't care how slow access to badly aligned
data might be; it can't be slower than doing it in C explicitly.
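The signed-bitfield workaround looks like this (a sketch; field names
invented, and bitfield allocation order is implementation-defined, which
is exactly the residual portability hole a 'packed' keyword would have
closed):

```c
/* Hypothetical file record: bitfields pin each field's width no matter
   how big 'long' is on the target. Allocation order and padding of
   bitfields remain implementation-defined, so this only goes so far. */
struct filerec {
    signed int   id    : 32;
    unsigned int flags : 16;
    unsigned int kind  : 16;
};
```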
I would like to see the next version of the language include unaligned
access library routines at least, and conversion to/from IEEE 32/64 bit
floats. Then they could be used in a portable way.
--
- bill davidsen (davi...@crd.ge.com)
GE Corp. R&D Center; Box 8; Schenectady NY 12345
>Crays have not had condition codes from day one, and the
>"different" way of testing things has not restricted their usefulness
>in the least.
Crays are not general purpose computers; they are already so different
that the lack of condition codes is a minor difference. Processors
like Alpha aim at a wider market.
Does anyone run lisp on a cray? Any comments?
--
John Carr (j...@athena.mit.edu)
While this is good advice to people employing such programmers, note that
the original complaint was about code, not programmers. Firing the clots
will not make their code go away (more's the pity). In practice, however
deplorable such practices may be, you will sell more machines if you offer
at least a slow compatibility mode that will let such code run unfixed.
For example, assuming that you can do unaligned loads and stores is sinful
and unwise, but you'll nevertheless find an option in the Mips compilers
to generate alignment-insensitive load/store code. It's slower than normal
load/store operations, but it means you can get the code running without
having to scrub it clean first. According to John Mashey, the software
vendors *strongly* approve of this. I can see why -- it means they can
have working code to test or even ship, with cleanup and performance work
done when convenient rather than under forced draft.
Like it or lump it, there is a lot of badly-written code out there, nobody
has time to fix it all, and you or your customers probably want to run
some of it.
Crays are not general purpose computers; they are already so different
that the lack of condition codes is a minor difference. Processors
like Alpha aim at a wider market.
Does anyone run lisp on a cray? Any comments?
If you're running code that doesn't vectorize, does it really pay to
use a Cray?
-Mike
Absolutely nothing to prevent lisp on a Cray. Sure runs C, Pascal, Fortran,
etc. pretty well...... If a Cray isn't general purpose, then how is the
Alpha? (or the I860, RS4000, .....) While a Cray might burn up a vectorizable
loop, they have never "choked" on my spaghetti code.... :^)
Bob
Isn't cost effective, but it *is* faster. Crays are very fast scalar
processors too. Just ask Bob Hyatt, who runs chess programs on Crays
and posts regularly here.
I've used Lisp on a Cray, but not seriously. I just ran some bignum
code to see how fast it was. This was on a development machine at Cray
(back when I lived in MN where all my friends work for Cray), and it
was a port of one of the commercial Common Lisps (probably allegro or
lucid, can't remember). I think the machine was an X-MP but it might
have been a -2.
Anyway, it was really fast. Lots faster than Lisp on say, a SPARC.
I don't have any exact numbers though. Probably at least 5 times faster.
I don't think anyone would ever buy a Cray just to run Lisp, but if you
happen to have a Cray sitting around in the closet somewhere, why not?
Those beasties are *fast*.
I am interested in any research/documentation of exception models
and their implementation on superscalar microprocessors.
I'm looking at manuals for a few superscalar machines that are
already out (MC88110, i860, ...), but am also interested in research
documentation on this topic. Any references would be much appreciated!
Please reply via email and I'll summarize if there's any interest.
-----------------------------------------------------------------------
| Matt Holle Motorola Microprocessor |
| RISC Applications & Memory Technology Group |
| (ma...@oakhill.sps.mot.com) Austin, TX |
-----------------------------------------------------------------------
>Viking implements the stbar instruction, which is needed in every mutex lock
>primitive and context switch primitive. There are code samples in the manual.
>Adrian Cockcroft - adrian.c...@uk.sun.com or adrian.c...@sun.co.uk
>Sun Microsystems, 306 Science Park, Milton Road, Cambridge CB4 4WG, UK
>Phone +44 223 420421 - Fax +44 223 420257 Sun Mailstop: ECBG01
--
Don D.C.Lindsay Carnegie Mellon Computer Science
When Sun introduced the Sparc-based machines, few people (H. Rubin excepted)
complained about the architectural change, since floating point format, byte
ordering, etc. were preserved. Both machines looked the same to C code. One
was just faster, hence better.
DEC's Alpha supports BOTH Vax-float and IEEE-float. Presumably, they could
have found other uses for the gates required for this. This support, in an
increasingly IEEE-fp-only world, suggests that they are sensitive to the use
of computers in applying existing solutions.
With this sensitivity, it is hard to imagine (although it would please me
greatly!) that DEC would fail to provide a compiler mode that extends
the life of 32-bit-or-die programs.
-- Michael Jones m...@sgi.com 415.335.1455 M/S 7L-552
Silicon Graphics, Advanced Systems Division
2011 N. Shoreline Blvd., Mtn. View, CA 94039-7311
Putting aside the Cray and a potential argument about applications:
provided that the Alpha (or any other processor) offers a neat and
efficient mechanism for detecting errors without condition codes,
why should this mechanism exclude such a chip from the "wider market"?
--
Rupert Pigott, impoverished student at Warwick University.
If you so wish, you may degrade yourself by mailing me at :
cst...@csv.warwick.ac.uk