Later in the article, Figure 2 is particularly interesting; its caption reads
"Processor performance relative to the 68020 versus cache size
(where the 68020 equals 1)."
For the cache sizes actually used in the 68040 (4Kbytes), the
performance plotted in Figure 2 [68040 normalized to 68020] is in
the range 3.6X to 4.3X, depending upon the workload. Most of the
benchmarks shown are at 4.1X.
So, the data and the claim that 68040 == 20 VAX MIPS imply that the earlier
68020 has a "sustained performance level of 4.9 VAX-equivalent MIPS"
(4.9 = 20/4.1). Does anybody seriously believe this?
About the most impartial data I could find was for the Hewlett Packard
HP9000 model 370 machine. This uses a 68030 (not 68020) at 33 MHz (not
25 MHz) and achieves a geometric mean of 3.9 SPECmarks [ref. SPEC
newsletter v1#1]. It seems reasonable to suspect the 68020 is no
better than the 030 in performance {else who'd want the 030?}, so we
conclude that the 020's performance is, at most, 3.9 VAX-equivalent
MIPS. This makes the 68040 a 16 VAXmips machine (at most), not 20 VAXmips
as advertised.
Of course the best method would be to lay hands on an actual computer
system that uses the 68040 and benchmark it; presumably Motorola
and/or NeXT and/or HP will do this someday. Prediction: the SPECmark
will be significantly below 20.0.
Disclaimer: I'm biased. Check out the SPEC newsletter and the issue
of IEEE Micro to see if I've distorted the
facts. (I assert that I haven't)
--
-- Mark Johnson
MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94086
(408) 524-8308 ma...@mips.com {or ...!decwrl!mips!mark}
It's good for portability to develop code on the _least_ forgiving
machine that you can find.
I once did it the other way around, to my sorrow. We had a large
compiler that ran on a DEC-20, and was supposedly working. I ported
it to a VAX with BSD 4.2, and there were no end of seg faults. Why?
Well, the 20 had a virtual address space of 256 KW. So, a large
compiler which generated a wild address would reference something
that existed. In a 4 GB space, that changed. Further, it turned out
that the average wild pointer was used for reads, gathering data
which resulted in a boolean decision. So, there had always been a
fair chance that the branch would go the right way!
Then I ported it from the VAX to a Sun-3. Another disaster! Sun had
intelligently made page zero illegal, so that null pointers would
trap.
If the original development had been done on the most unforgiving
machine, the ports would have been started a bit later - but would
have been finished sooner.
In general, the best way to develop portable code is to force the
original development to happen on two machines at once. This means
that the original team is still there during that crucial first port.
Plus, portability issues can surface _before_ the design is frozen.
--
Don D.C.Lindsay
** Opinions wanted, raging debate expected. Join now! Avoid the rush! **
In article <40...@mips.mips.COM>, ma...@mips.COM (Mark G. Johnson) writes:
> The June 1990 issue of _IEEE_Micro_ contains an article about the
> Motorola 68040, written by some of its designers. The article agrees
> with some of the advertising copy, saying "The sustained
> performance level is 20 VAX-equivalent MIPS and 3 Mflops at a clock
> speed of 25 MHz." (1st paragraph, 4th sentence).
. . .
> So, the data and the claim that 68040==20VAXmips implies that the earlier
> 68020 has a "sustained performance level of 4.9 VAX-equivalent MIPS"
> (4.9 = 20/4.1). Does anybody seriously believe this?
...
> MIPS. This makes the 68040 a 16 VAXmips machine (at most), not 20 VAXmips
> as advertised.
> Of course the best method would be to lay hands on an actual computer
> system that uses the 68040 and benchmark it; presumably Motorola
> and/or NeXT and/or HP will do this someday. Prediction: the SPECmark
> will be significantly below 20.0.
Background: This gives me an opportunity to raise a
question that I would like addressed. I am systems manager
for a reasonably sized comp.sci. department (about 60
machines, with a new lab or two under construction that
will take this to about 120 by the start of term). We have
a couple of Sun 3/280s, a couple of Sun 4/280s, the usual
passel of Sun 3/50s, our first SPARCs in place with more
on the way, about a half-dozen NeXTs with more on the way,
plus the usual ragtag collection of widows and orphans (a
MIPS, a few HPs, a uVax II, etc).
My home machine used to be a Sun 3/280 (now I have a NeXT;
yes, there are a few out there!). Most of our students are on
the Sun 4/280 pair, along with the usual collection of
grad student projects, imported freeware, professors'
code, etc, etc. I myself have done fair-sized programming
projects on our earlier (now departed) Vax 11/780s and the
Sun 3's (plus, GASP!, PCs) but my exposure to programming on
the SPARC architecture has been mostly through word of
mouth from my staff and the user community. (Sigh, I must
be a _REAL_ manager now. :-( )
Preamble: From my own exposure to user complaints about
unportable software, etc., I would allege that the SPARC
architecture appears to be inherently less forgiving of the
programmer than the 68k architecture. I keep receiving
reports of software that ran fine on machine 'x' (often
68k machines, Vaxen, etc.) but that either won't compile or
won't run on the Sun 4's. I have had this explained to me
(by a defender of the SPARC architecture) as being due to
the fact that the SPARC architecture is less forgiving of
poor programming (e.g. byte alignment problems in structures,
ignoring warning messages, etc.). It was stated to me that
"well-written programs work; if not, it's the programmer's
fault."
Now to the question: First, is it true that SPARC is
inherently less forgiving? If so, is it due to
"limitations" of the architecture (eg. byte-alignment
restrictions) or is this a "feature" through which sloppy
programmers are now being taught the error of their ways?
More specifically, if I accepted the rumours that porting
to the SPARC was harder as an axiom, could I argue that
this was a "bug" of the architecture, not a feature, in the
same way that the difficulty of addressing a 65k array on an 8088
machine is a residual bug due to the limited segment size of
that chip?
Now, a few caveats, disclaimers, etc. Almost all the
programming problems I have heard about were in C, which
is an inherently low-level language (although at least one
very large Pascal program, a compiler project, never was
ported successfully) and perhaps this should have been
sent to comp.lang.c. I _would_ like answers to be centred
around the problems of C programmers if this is a
reasonable thing to do.
The original motivation for this posting was as a followup
to a discussion I had that originated from my observation
that I seem to hear of more porting problems with the Sun
4's than the Sun 3's. One of my staff disputed my claim,
claiming that the problem was with poor C programmers, not
the SPARC (or RISC in general).
As I have not attempted a serious program on the machine,
nor am I all that familiar with its internals, I do _NOT_
claim to be capable of making the comparison, but I have
trouble swallowing the claim that bugs due to byte-alignment
restrictions showing through (if, in fact, that is what is
happening) are the programmer's fault. I think the
"near/far" abomination I used to have to use in the 8088 C
compilers showed a fault of the 8088 architecture, rising
up through the language to bite the programmer (and I
understand that smarter compilers can now hide this from
me, or so my poor friends still programming DOS machines
claim).
Given the (admittedly second-hand) reports of problems, is
this not a similar hardware problem rising up through the
language? Or am I really out to lunch here? If so, I
apologize in advance, so don't flame.
Note, I am _NOT_ disputing the worth of RISC
architectures, which have given us dramatic performance
gains. I'm just asking if we've paid a price in
"serviceability"?
So, here's the clincher. My colleague claims there are NO C
programs that will:
a) Pass lint on the Sun 3
b) Pass lint on the Sun 4
c) Run, giving a "correct" answer on the Sun3, without crashing.
d) Either crash or give the "wrong" answer on the Sun 4.
In effect, any crash would be due to a faulty programmer,
not a faulty architecture.
Is this true? Is the reverse true (ie crash on a Sun 3,
but not on the Sun 4)? Can we draw valid conclusions about
the frequency of either? Both?
Enquiring minds want to know. If the appropriate programs
exist, I would especially like to see them (email please).
Otherwise, I'll follow comments in the groups.
I now await the verdict of others more knowledgeable than
myself...
- peterd
>Now to the question: First, is it true that SPARC is
>inherently less forgiving?
No, not always. The following program reflects a bug that actually
occurred on software developed on a Sun-4. Things ran fine until the
software was ported to a Sun-3:
---------------
main()
{
    pr3( "Hello", "world", "!\n" );
    pr3( "This", "is wrong" );      /* one parameter short, deliberately */
}

pr3( s1, s2, s3 )
char *s1, *s2, *s3;
{
    printf( "%s %s %s", s1, s2, s3 );
}
--------------End of program
As you see, the second call to pr3() lacks one parameter. When run
on a Sun-4, compiled with or without -O, the output is:
Hello world !
This is wrong !
When run on a Sun-3, the output is:
Hello world !
This is wrong (null)
Now you might argue that the Sun-3's answer is more correct than the
Sun-4's, but in the abovementioned software the first call had left the
correct parameter on the register stack for a subsequent call. Of course
all this has more to do with parameter-passing conventions than
RISC vs. CISC.
Now this is just one counterexample. In general my feeling towards
software development is: "If it runs on a SPARC, it runs everywhere",
i.e., if you want to send reliable software out to the world, develop on
a machine with a highly optimizing compiler, not one of those
C-is-just-a-macro-assembler kinds of things.
>So, here's the clincher. My colleague claims there are NO C
>programs that will:
>a) Pass lint on the Sun 3
>b) Pass lint on the Sun 4
>c) Run, giving a "correct" answer on the Sun3, without crashing.
>d) Either crash or give the "wrong" answer on the Sun 4.
Lint flags the example program above, so it fails condition (a) straight away:
pr3: variable # of args. ts.c(9) :: ts.c(4)
printf returns value which is always ignored
++marcel beem...@fwi.uva.nl
"So they destroyed their planet, why?", "That was better for the economy..."
>** Opinions wanted, raging debate expected. Join now! Avoid the rush! **
-Preamble: From my own exposure to user complaints about
-unportable software, etc I would allege that the SPARC
-architecture appears to be inherently less forgiving of the
-programmer than the 68k architecture. I keep receiving
-reports of software that ran fine on machine 'x' (often
-68k machines, Vaxen etc) but that either won't compile or
-won't run on the Sun 4's. I have had this explained to me
-(by a defender of the SPARC architecture) as being due to
-the fact that the SPARC architecture is less forgiving of
-poor programming (eg byte alignment problems in structures,
-ignoring warning messages, etc). It was stated to me that
-"well-written programs work; if not, it's the programmer's
-fault."
SPARC requires fullword alignment for 32-bit operands and DOUBLEword
alignment (i.e., 8 bytes) for 64-bit operands, in memory. This is more
restrictive than the 68K, which was prepared (at the 68020 and above) to
accept byte alignment for all operand sizes. The 68000 only required 16-bit
alignment at worst, but then it was only a 16-bit machine. I am not surprised
that there is trouble trying to use a data structure on the SPARC that was
created on a machine with less restrictive alignment requirements.
It seems to me that, if downward compatibility is absolutely required, it is
the C (or other language) compiler's fault, not the user's. After all, the
SPARC is fast enough that executing extra code to piece together misaligned
data operands is a small price to pay. :-)
Data misalignment problems are independent of how well a program is written,
especially if the program was written for another target architecture.
Marc Kaufman (kau...@Neon.stanford.edu)
You are comparing apples to oranges: SPECmarks and VMIPS are not the
same at all.
HP 9000/350 (68020 @ 25 MHz) is rated at 5.4 VMIPS and 1.7 SPECmarks.
If (crudely) the 040 is 4.1X this machine it will rate at 22 VMIPS and
7 SPECmarks.
Sanjay Uppal
NN9T phone (408) 447-3864
Hewlett-Packard (IND) uucp: ...!hplabs!hpda!uppal
NS arpa: uppal%hp...@hplabs.hp.com
My experience with porting code between DEC-20, VAX BSD 4.3, Sun-3,
NeXT, Sun-4, DEC RISC, Macintosh, MS-DOS, and Sequent suggests that
all of these systems are "unforgiving" in their own unique ways. The
DEC-20 will nail you if you think that bytes are in any way related to
the unit of addressing or if you think you know in what direction the
stack grows. MS-DOS will nail you if you think that pointers will fit
inside an int or if you use strings larger than 64K. Macintosh will
nail you on 64K limits and memory management too -- malloc()/free() is
not how you want to use memory on a Mac. VAXen and other
little-endian machines will nail you if you think that a 16-bit
quantity can be copied to a pair of bytes. And so on.
It seems to me that using an ANSI compiler and ANSI-style prototyping
saves an amazing amount of time, more than initially developing on any
particular architecture.
_____ | ____ ___|___ /__ Mark Crispin, 206 842-2385, R90/6 pilot, DoD#0105
_|_|_ -|- || __|__ / / 6158 Lariat Loop NE "Gaijin! Gaijin!"
|_|_|_| |\-++- |===| / / Bainbridge Island, WA "Gaijin ha doko ka?"
--|-- /| |||| |___| /\ USA 98110-2098 "Niichan ha gaijin."
/|\ | |/\| _______ / \ "Chigau. Gaijin ja nai. Omae ha gaijin darou"
/ | \ | |__| / \ / \"Iie, boku ha nihonjin." "Souka. Yappari gaijin!"
Hee, dakedo UNIX nanka wo tsukatte, umaku ikanaku temo shiranai yo.
>The June 1990 issue of _IEEE_Micro_ contains an article about the
>Motorola 68040, written by some of its designers. The article agrees
>with some of the advertising copy, saying "The sustained
>performance level is 20 VAX-equivalent MIPS and 3 Mflops at a clock
>speed of 25 MHz." (1st paragraph, 4th sentence).
>
>Later in the article, Figure 2 is particularly interesting; its caption reads
> "Processor performance relative to the 68020 versus cache size
> (where the 68020 equals 1)."
>
>For the cache sizes actually used in the 68040 (4Kbytes), the
>performance plotted in Figure 2 [68040 normalized to 68020] is in
>the range 3.6X to 4.3X, depending upon the workload. Most of the
>benchmarks shown are at 4.1X.
>
>So, the data and the claim that 68040==20VAXmips implies that the earlier
>68020 has a "sustained performance level of 4.9 VAX-equivalent MIPS"
>(4.9 = 20/4.1). Does anybody seriously believe this?
>
>About the most impartial data I could find was for the Hewlett Packard
>HP9000 model 370 machine. This uses a 68030 (not 68020) at 33 MHz (not
>25 MHz) and achieves a geometric mean of 3.9 SPECmarks [ref. SPEC
>newsletter v1#1]. It seems reasonable to suspect the 68020 is no
>better than the 030 in performance {else who'd want the 030?}, so we
>conclude that the 020's performance is, at most, 3.9 VAX-equivalent
>MIPS. This makes the 68040 a 16 VAXmips machine (at most), not 20 VAXmips
>as advertised.
For much of the stuff we run, a 16 MHz 68020 + 68881 runs at about .8 VAX
MIPS (scientific/technical applications). So it seems to me that
25 MHz could be no faster than 25/16 * .8 = 1.25 VAX MIPS.
Ron Fox | F...@MSUNSCL.BITNET | Where the name
NSCL | F...@CYCVAX.NSCL.MSU.EDU | goes on before
Michigan State University | MSUHEP::CYCVAX::FOX | the quality
East Lansing, MI 48824-1321 | | goes in.
USA
Agreed, and do it on machines as drastically different as you possibly
can: two-word default integers and four-word default integers, and integers
stored big- and little-endian, as examples. If you really want to take a
bruising, try an EBCDIC and an ASCII machine :-).
The only thing I don't like about this approach is that you have to live with
the poorest machine/os/compiler implementations you want to run your software
on.
--
Kenneth Ng: Post office: NJIT - CCCC, Newark New Jersey 07102
uucp !andromeda!galaxy!argus!ken *** NOT k...@bellcore.uucp ***
bitnet (preferred): k...@orion.bitnet or k...@orion.njit.edu
SPARC requires fullword alignment for 32-bit operands and DOUBLEword
alignment (i.e., 8 bytes) for 64-bit operands, in memory.
Are you sure? I think doubles can be loaded from 4 byte boundaries.
Witness the following code, which runs fine on this SS1:
main()
{
    char data[12];
    double *dp;

    dp = (double *)(data + 0);
    *dp = 66.6;
    printf("%g\n", *dp);

    dp = (double *)(data + 4);
    *dp = 66.6;
    printf("%g\n", *dp);
}
Data misalignment problems are independent of how well a program is written,
especially if the program was written for another target architecture.
No way. A correctly written program will have no problems with
alignment on any architecture. Can you give an example of correct
code that will fail? (I am assuming that well-written programs are
correct. :)
Marc Kaufman (kau...@Neon.stanford.edu)
--
Scott Draves Space... The Final Frontier
w...@cs.brown.edu
uunet!brunix!wsd
In article <WSD.90Ju...@unnamed.cs.brown.edu> w...@cs.brown.edu (Wm. Scott `Spot' Draves) writes:
>In article <1990Jul13.0...@Neon.Stanford.EDU> kau...@Neon.Stanford.EDU (Marc T. Kaufman) writes:
>> SPARC requires fullword alignment for 32-bit operands and DOUBLEword
>> alignment (e.g. 8-bytes) for 64-bit operands, in memory.
>Are you sure? I think doubles can be loaded from 4 byte boundaries.
>Witness the following code, which runs fine on this SS1:
>[sample C code removed]
This really depends on how doubles get loaded/stored. If the compiler
uses the load-double instruction, the data better be aligned on a
double (8 byte) boundary. If two load-word instructions are used,
word alignment is fine. I haven't figured out when compilers use
which instructions, but I have seen both.
---
Thorsten von Eicken (t...@sprite.berkeley.edu)
More correctly, the SPARC implementation of C under SunOS, at least, is,
in general, less forgiving of the programmer than the 68K implementation
of C under SunOS. Much of this is a characteristic of the architecture,
although some could perhaps be worked around in software.
The same is true of many other pairs of UNIX C implementations as well,
for reasons having to do with the architecture, with the compiler, with
the OS, etc..
>I keep receiving reports of software that ran fine on machine 'x' (often
>68k machines, Vaxen etc) but that either won't compile or
>won't run on the Sun 4's. I have had this explained to me
>(by a defender of the SPARC architecture) as being due to
>the fact that the SPARC architecture is less forgiving of
>poor programming (eg byte alignment problems in structures,
>ignoring warning messages, etc). It was stated to me that
>"well-written programs work; if not, it's the programmer's
>fault."
I'd agree with said defender. I'm not sure what the "byte alignment
problem in structures" is; the SPARC C compiler puts structure members
on the boundaries they require, so it's not as if a "poorly-written
structure" can cause the program to blow up by violating an alignment
restriction.
In the Sun C implementation, by default you cannot, say, have a buffer
full of structures that are not necessarily aligned on the proper
boundary for that structure (basically, the most restrictive alignment
boundary for all the members of the structure), and just convert a
pointer to an arbitrary byte in that buffer into a pointer to a
structure and expect it to work. However, you can specify the
"-misalign" flag to the compiler, and it will generate code that lets
you do this (basically, it tests the alignment of the pointer before
using it; if it's properly aligned, it uses it normally, otherwise it
calls a library routine to access the members of the structure).
However, there's nothing unique to SPARC about this; there's nothing
even unique to RISC about this! I can think of machines generally
thought of as "CISC" that impose alignment restrictions similar to those
of the SPARC, e.g. all but the most recent of the AT&T WE32K chips.
Even the PDP-11 and 68K prior to the 68020 required 2-byte alignment for
quantities larger than one byte.
>More specifically, If I accepted the rumours that porting
>to the SPARC was harder as an axiom could I argue that
>this was a "bug" of the architecture, not a feature, in the
>same way that difficulty addressing a 65k array on an 8088
>machine is a residual bug due to limited segment size of
>that chip?
You can argue anything you want. Whether others will agree with you is
another matter.
I, for one, would not agree with you if you made that argument. As
shown by the "-misalign" flag to the compiler, you certainly *can*
access misaligned quantities on a SPARC; it just takes more code, and
extra work by the compiler.
While this does impose a performance penalty, so does accessing
misaligned data on the architectures with which I'm familiar that let
you do so directly. The performance hit of doing so on SPARC (or other
strict-alignment architectures) is probably greater than that on
68020-and-up (or other loose-alignment architectures); I don't have any
numbers for that, though.
However, if references like that are sufficiently rare in most cases
that whatever performance gain was obtained by leaving hardware support
for that out of SPARC (either by devoting the circuitry to some other
function, or by removing an extra delay from the data path, or by
getting a basically-faster chip out the door sooner, or...) outweighs
the performance loss of doing misaligned accesses in software in most
cases, it may be the right thing to do to leave it out, at least for
those cases.
There may well be problems for which the cost of leaving alignment
restrictions in the instruction set outweighs whatever benefits you got
from leaving it out, but that just argues against using strict-alignment
machines for those particular problems.
Now, the alignment restriction isn't the *only* characteristic of SPARC
and its C implementations that punishes sloppy programming. Another is
the calling sequence used for passing structures and unions to
subroutines; in most such calling sequences, you can get away with
passing a member of a union to a routine that expects the union, or vice
versa. That's not the case with the SPARC calling sequence; I think
this is, however, the result of the choice of calling sequence used for
structures and unions 4 bytes or less long, rather than something
imposed by the instruction set architecture.
>Now, a few caveats, disclaimers, etc. Almost all the
>programming problems I have heard about were in C, which
>is an inherently low level language (although at least one
>very large Pascal program, a compiler project, never was
>ported successfully)
What machine was the target for the compiler? If the target machine was
the same as the machine on which the compiler was being run, "porting"
is more than just recompiling - 68K machine code doesn't generally run
on SPARC machines without some extra work being involved.
>The original motivation for this posting was as a followup
>to a discussion I had that originated from my observation
>that I seem to hear of more porting problems with the Sun
>4's than the Sun 3's. One of my staff disputed my claim,
>claiming that the problem was with poor C programmers, not
>the SPARC (or RISC in general).
The bulk of the problems I've seen in porting code to SPARC have been
due to 1) the aforementioned "passing unions to routines" problem and 2)
code that "knows" how C argument lists are laid out in memory, rather
than using the "varargs" mechanism. In neither case are those due to
alignment problems.
Neither of these are due to anything peculiar to RISC (CISC compilers
have been known to pass arguments in registers, rather than on the
stack, as well, and use of a register-based calling sequence
contributed, at least in part, to both problems), and neither would have
been problems had the programmer not cheated.
>Given the (admittedly second-hand) reports of problems, is
>this not a similar hardware problem raising up through the
>language?
The alignment problem can be hidden by the language, at least with more
recent compilers, as I'd noted.
>Note, I am _NOT_ disputing the worth of RISC
>architectures, which have given us dramatic performance
>gains. I'm just asking if we've paid a price in
>"servicability"?
We probably have paid some such price; I suspect it was worth it.
And you know of some other approach to running your software on the same
set of machine/os/compiler implementations that *doesn't* require this?
(No, "choose a smaller set of implementations" doesn't count; that's the
same approach, just with a smaller set of machines you consider
"interesting".)
Sounds like either we don't agree on what VAX MIPS mean or there's something
seriously wrong with your system or benchmarks. The 68000 at 8 MHz is rated at
about .8 MIPS. The 68020 at 16 MHz should run about 3-4 MIPS (factor of 2 for
the clock rate, factor of 2-3 for the better design). OK, processors run below
rated performance with depressing frequency for reasons like not enough cache
memory, badly written code etc. but by a factor of 4??? Are you sure you
weren't running a benchmark that proves your compiler produces code that only
runs 25% as fast as it should?
Russell Wallace, Trinity College, Dublin
rwal...@vax1.tcd.ie
"To summarize the summary of the summary: people are a problem"
I thought that originally, a VAX-mip (= speed of Vax 11/780 on j
random code) was .4 million instructions per second (.4 "native MIPS"?).
This would put the 68020 at 4.9 * .4 ~ 2.0 "native MIPS". This is
a widely quoted figure (2-3 MIPS) for the 68020, and believable for a
16 MHz system, judging from cycle counts in the Motorola manual.
On the other hand, a VAX-mip is an 11/780 mip, isn't it? A VUP is
something entirely different, based on microvaxes, n'est-ce pas?
disclaimer: This may be completely wrong, please don't flame me.
Don W. Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801
ARPA: gil...@cs.uiuc.edu UUCP: {uunet,harvard}!uiucdcs!gillies
The problem is, of course, how do you define "correctly written
program". If you define code that is not portable as poorly
written, then your statement is meaningless.
Stanley Chow BitNet: sc...@BNR.CA
BNR UUCP: ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831 ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.
I'm not suggesting that some companies misrepresent their products.
Nooooooooooo!
"of course I don't speak for HP, I can't even speak for myself"
I would be tempted to conclude that the relationship of MIPS to
SPEC is not a linear one (for example MIPS ?= SPEC^2). Might
this resolve the differences?
Paul Chamberlain | I do NOT represent IBM tif@doorstop, sc30661@ausvm6
512/838-7008 | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif
Some machines' SPEC numbers and MIPS ratings are almost the same.
Remember that the VAX 11/780 has a SPEC mark of 1.
...
This is true by construction. The speed of a VAX 11/780 running the
SPECmark suite is defined to be 1 SPECmark.
It would be amazing if other machines' SPECmarks were closely correlated
... it would mean that one would not need a SPEC suite ....
--
Keith H. Bierman |*My thoughts are my own. !! kbie...@Eng.Sun.COM
It's Not My Fault | MTS --Only my work belongs to Sun* k...@chiba.Eng.Sun.COM
I Voted for Bill & | Advanced Languages/Floating Point Group (415 336 2648)
Opus<k...@eng.sun.com> "When the going gets Weird .. the Weird turn PRO"
I recently posted a fairly tedious and involved question
concerning the difficulties of programming on the SPARC as
compared to the 68k architectures. I have had a number of
users complain that SPARC seems inherently less forgiving
of the programmer, and I questioned whether this might not
reflect problems with the architecture. I would like to
thank all those who took the time to answer, either through
email (about 15 to date!) or by posting.
I also would like to clarify somewhat my original posting,
plus comment on some of the many emails I received.
The general consensus among the SPARC programmers who
wrote seems to be "Yes, the architecture is more
demanding, but this was a design trade-off which yielded a
faster machine. It is documented, thus it is a feature. If
your program fails, it is invariably due to programmers
violating the rules, and is thus poor programming practice
and the fault of the programmer." (my apologies for this
gross simplification, I know the issue is more complex
than that).
The general consensus also seems to be that the
architecture may be somewhat unforgiving, but yields good
results if the rules are followed, ie it is not inherently
"buggy". One writer (I can't recall if it was in email or
a posting) alleged that SPARC is an excellent development
platform, as if your program behaves on SPARC, it should
port to almost any less rigid architecture with a minimum
of effort.
Most who wrote or posted seemed to have missed one part of
my question, where I compared the documented alignment
restrictions of SPARC with the segment pointer problems of
early C compilers on 8088 machines. Here again, a
documented hardware feature has "risen into" the language,
in this case leading to language extensions ("near" and
"far" pointers) as a "workaround". In both cases, smarter
compilers can hide much of this, but probably not all if
you program in C.
It's been a while since I programmed PCs and I'm told the
newer compilers are smart enough to work around a lot of
this, but I was also told at the time we had to worry
about it that this was a "feature" of the architecture (in
the sense that segments were a feature) and they were to
be endured as part of the price of having such a cheap,
readily accessible machine.
A couple of people took my posting as a shot at Sun, or
SPARC, which it really wasn't. I do feel that such things
as structure alignment restrictions, union passing
problems, the need for "near" or "far" pointers, etc all
reflect hardware issues rising into the software layer
and, although probably inevitable in a language such as C,
are a strike against machine architectures where they
occur.
Of course, the benefits of such compromises may outweigh
the costs (as seems to be the case in SPARC) but they
certainly don't qualify for my rigid definition of a
feature (ie if I was offered the identical machine, but
without the "feature" would I be more inclined or less
inclined to buy the machine?). I suspect that most users,
given the choice, would prefer that they not have to worry
about how to access structures, preferring to leave that
to the compiler.
For what it's worth, the SPARC alignment "feature" _does_
qualify for my less rigid definition of a feature (ie if
it's documented it's a feature, if it's not, it's a bug. ;-)
So, without wanting to prolong a discussion by prompting a
flamewar, I would thank all those who wrote in favour of
the SPARC and say that, for my vote, the known, documented
restrictions on alignment are _NOT_ a feature, but a bug.
They are to be endured as the price to pay for a faster
machine, but I wouldn't argue (as some have done) that
they constitute a feature. Your mileage may vary...
While here, I would also like to answer a couple of points
raised in one posting, so here goes:
In article <36...@auspex.auspex.com>, g...@auspex.auspex.com (Guy Harris) writes:
. . .
> problem in structures" is; the SPARC C compiler puts structure members
> on the boundaries they require, so it's not as if a "poorly-written
> structure" can cause the program to blow up by violating an alignment
> restriction.
>
> In the Sun C implementation, by default you cannot, say, have a buffer
> full of structures that are not necessarily aligned on the proper
> boundary for that structure (basically, the most restrictive alignment
> boundary for all the members of the structure), and just convert a
> pointer to an arbitrary byte in that buffer into a pointer to a
> structure and expect it to work. However, you can specify the
> "-misalign" flag to the compiler, and it will generate code that lets
> you do this (basically, it tests the alignment of the pointer before
> using it; if it's properly aligned, it uses it normally, otherwise it
> calls a library routine to access the members of the structure.
>
> However, there's nothing unique to SPARC about this; there's nothing
> even unique to RISC about this! I can think of machines generally
> thought of as "CISC" that impose alignment restrictions similar to those
> of the SPARC, e.g. all but the most recent of the AT&T WE32K chips.
> Even the PDP-11 and 68K prior to the 68020 required 2-byte alignment for
> quantities larger than one byte.
Here was a particularly clear (to me!) exposition of the
issue involved (thanks!). To summarize (hopefully without
changing the poster's meaning :-), I can't easily
manipulate data down to the byte level on a SPARC architecture,
without telling the compiler I want to do this, and
without taking a performance hit. Acceptable? Maybe. A
feature? Hardly. A bug? I'd say "perhaps", one we've been
conditioned to accept as a feature because the designers
documented it. And the argument that other designers
pulled the same stunt is hardly a rousing defense.
> >More specifically, If I accepted the rumours that porting
> >to the SPARC was harder as an axiom could I argue that
> >this was a "bug" of the architecture, not a feature, in the
> >same way that difficulty addressing a 65k array on an 8088
> >machine is a residual bug due to limited segment size of
> >that chip?
>
> You can argue anything you want. Whether others will agree with you is
> another matter.
Well, I've cast my vote on this one, that's enough for
me...
> I, for one, would not agree with you if you made that argument. As
> shown by the "-misalign" flag to the compiler, you certainly *can*
> access misaligned quantities on a SPARC; it just takes more code, and
> extra work by the compiler.
>
> While this does impose a performance penalty, so does accessing
> misaligned data on the architectures with which I'm familiar that let
> you do so directly. The performance hit of doing so on SPARC (or other
> strict-alignment architectures) is probably greater than that on
> 68020-and-up (or other loose-alignment architectures); I don't have any
> numbers for that, though.
>
> However, if references like that are sufficiently rare in most cases
> that whatever performance gain was obtained by leaving hardware support
> for that out of SPARC (either by devoting the circuitry to some other
> function, or by removing an extra delay from the data path, or by
> getting a basically-faster chip out the door sooner, or...) outweighs
> the performance loss of doing misaligned accesses in software in most
> cases, it may be the right thing to do to leave it out, at least for
> those cases.
I know I'm bucking an entire industry trend here, as RISC
seems to have been a huge commercial success, but this
argument strikes me as suspiciously like that of a car
salesman who argues that if I'm willing to buy a car
without brakes or seatbelts, I can get one that is that
much faster or cheaper. Sure, but I _am_ getting less car.
I for one am waiting with considerable interest to see if
the 68040 arrives too late to resist this trend to
minimalist computing... :-)
> There may well be problems for which the cost of leaving alignment
> restrictions in the instruction set outweighs whatever benefits you got
> from leaving it out, but that just argues against using strict-alignment
> machines for those particular problems.
Agreed, and I agree that a lot of workstations with these
restrictions are being sold. I just can't escape the
feeling that I'm being fed a drop of snake oil here ("Sure
it's tough on you, but the machine is that much faster so
it's a feature!") To be fair, that does not appear to be
this poster's attitude, but I do sense this attitude in the
general thread I started [WARNING - GROSS GENERALIZATION! SORRY! ]
. . .
> >Now, a few caveats, disclaimers, etc. Almost all the
> >programming problems I have heard about were in C, which
> >is an inherently low level language (although at least one
> >very large Pascal program, a compiler project, never was
> >ported successfully)
>
> What machine was the target for the compiler? If the target machine was
> the same as the machine on which the compiler was being run, "porting"
> is more than just recompiling - 68K machine code doesn't generally run
> on SPARC machines without some extra work being involved.
Actually, the program was a multi-thousand line Pascal
program (!!) written on a Sun 3 to implement a compiler
for Sisal, a specialized language for either dataflow or
parallel processing (I can't remember which and the prof
concerned is involved in research in both areas). A
successful compile would give us an executable that would
take Sisal programs and output something closer to a SPARC
executable. The program, to my knowledge, was never
successfully compiled but it may _all_ be due to a buggy
Pascal compiler (no, no version numbers available). I
can only state with confidence that they had a lot of
problems with a program that worked (apparently correctly)
on Sun 3's.
Sorry this is so hazy, but I wasn't directly involved in
any of this, I tried to make that clear in the original
posting. Much of my concerns arose from "hearsay"
comments, from a variety of users. I am really not
qualified to teach, I'm actually soliciting testimony from
those who are.
. . .
> The bulk of the problems I've seen in porting code to SPARC have been
> due to 1) the aforementioned "passing unions to routines" problem and 2)
> code that "knows" how C argument lists are laid out in memory, rather
> than using the "varargs" mechanism. In neither case are those due to
> alignment problems.
>
> Neither of these are due to anything peculiar to RISC (CISC compilers
> have been known to pass arguments in registers, rather than on the
> stack, as well, and use of a register-based calling sequence
> contributed, at least in part, to both problems), and neither would have
> been problems had the programmer not cheated.
I left these in as they appear to be valid comments from
someone "who knows". I figured that if you made it this
far, you were entitled to a knowledgeable comment or two!
> >Given the (admittedly second-hand) reports of problems, is
> >this not a similar hardware problem raising up through the
> >language?
>
> The alignment problem can be hidden by the language, at least with more
> recent compilers, as I'd noted.
>
> >Note, I am _NOT_ disputing the worth of RISC
> >architectures, which have given us dramatic performance
> >gains. I'm just asking if we've paid a price in
> >"servicability"?
>
> We probably have paid some such price; I suspect it was worth it.
Fair enough. But can I assume that on everyone's wish list
for Xmas would be a SPARC architecture but without the
alignment hit? Does it matter? Some people who wrote
talked of 20k lines of code that compiled without error,
then a few hundred lines that caused grief for weeks.
There _IS_ a price being paid for this...
- peterd
The compiler seems to generate ldd/std instructions for doubles that it
knows to be aligned properly: automatics, statics and globals. Accessing
a double through a pointer though generates a 2 instruction sequence. There
is a new compiler switch in 4.1 though, '-dalign' that causes the compiler
to always use ldd/std for doubles.
--
Larry Noe (n...@unx.sas.com) "And I saw a sign on Easy Street,
SAS Institute Inc. 'Be Prepared to Stop'" -- Don Henley
Michael Slater, Microprocessor Report msl...@cup.portal.com
707/823-4004 fax: 707/823-0504
I generally refer to features that I don't like as "misfeatures". A "bug"
is when the system doesn't conform to the vendor's description. A
misfeature is when it does conform, but the vendor's description doesn't
describe a system that behaves as I'd prefer. Sometimes a misfeature is
due to a design choice (e.g. making aligned memory accesses faster at the
expense of unaligned accesses); other times it's simply the lack of a
feature that a user would like (e.g. a COMPILE-ALGOL-IMMEDIATE
instruction).
>I know I'm bucking an entire industry trend here, as RISC
>seems to have been a huge commercial success, but this
>argument strikes me as suspiciously like that of a car
>salesman who argues that if I'm willing to buy a car
>without brakes or seatbelts, I can get one that is that
>much faster or cheaper. Sure, but I _am_ getting less car.
It frequently depends on the application whether this is a feature or
misfeature. Most people wouldn't buy a bicycle without brakes or multiple
speeds. But racing bicycles have neither. It's not a matter of "less
bicycle", but merely that the bicycle is optimized for speed rather than
user-friendliness.
Or, returning to your car analogy, what if it were the emission control
system rather than the brakes or seatbelts that were being proposed for
removal. An environmentalist would consider this a misfeature, while a
performance enthusiast would consider it a feature. Tradeoffs frequently
have to be made, and whether the chosen path is best depends on your
perspective.
--
Barry Margolin, Thinking Machines Corp.
bar...@think.com
{uunet,harvard}!think!barmar
We have a vax 780 with FPA (== 1 MIPS) and sun 3/60 at 16 MHz with 68881, that
is the same configuration as above. All my comparisons show the sun to be
between .5 and .8 of the vax for pure computational tasks (no io). One
exception, TeX runs about twice as fast on the sun ... why ? In all cases,
I used optimization, with the fortran compiler and standard libraries (TeX
is written in pascal or C), on real problems we have run on many other machines
too. So Russell's data are correct for me. Who has a 68020/68881 that runs
3-4 times faster than a 780 ?
Paul Bartholdi, Geneva Observatory
Yup. I have been non-disclosed on the new NeXT product line
and thus was allowed to see their new colour machine when
I was in California in June. It was running a beta version
of their 2.0 release of NeXTstep and has a separate board
to do colour (I can say this as I've read this elsewhere,
I'm not giving out any big secrets). Their 68040 upgrade
(already announced) should be shipping as soon as chips
come in quantity.
I did not have the chance to do anything like
benchmarking and it was a little difficult to assess
performance with so many changes (colour, 68040, beta
software, etc) but there was no question that it's faster
than a 68030 NeXT (!) :-) I have the quoted numbers from NeXT, but
I suspect I'm on thin ice here so will hold my tongue. You
should have numbers within a month or two...
I understand NeXT is just about ready to start the
production line on their 68040 products, and are waiting
for 68040's in quantity from Motorola. Expect announcements
at the end of the summer, machines soon after, if all goes
well (and I hope it does, as we plan to buy a passel of
these things, and I'm in the ring if they are very late
with all of this! :-0
- peterd
However, the classic "MIPS" rating, which gradually became standardized as
VAX 11/780 MIPS, doesn't include floating point performance. SPECmarks
contain quite a bit of floating point information. Even if the SPECmarks,
when designed, were scaled to make a VAX 11/780 equal to 1 at some point
(arguably an impossible task given compiler/OS variations), that is still not
the same as permanently equating VAX MIPS and SPECmark ratings.
Certainly the SPECmark is a better number for overall machine performance,
though as I understand it, the reason that all the SPEC benchmarks are
quoted in a report, as well as the composite that gives you a SPEC number,
is that no one considers the single SPECmark number to be all-telling. It
is also meaningless to quote a "SPECmark for the 68040", since that's a
system benchmark. Certainly "SPECmark for the VAX 11/780" and "SPECmark
for the HP 9000/360" are meaningful numbers, at least as far as these
things go.
--
Dave Haynie Commodore-Amiga (Amiga 3000) "The Crew That Never Rests"
{uunet|pyramid|rutgers}!cbmvax!daveh PLINK: hazy BIX: hazy
"I have been given the freedom to do as I see fit" -REM
Trivial example: consider the std libc bcopy which takes two pointers and a
count. Most machine specific implementations move the data in units larger
than a character at time. Under what conditions should the implementor of
this commonly used library worry about checking the alignment of the
pointers before starting the copy?
But do you also want to pay for it? As an example take the 68040.
Suppose it has to load the following misaligned Long Word:
Address Data
$0 xLLL
$4 Lxxx
To load this Long Word starting at address $1, it actually does 3 memory
cycles (assuming cache miss). The first one to load:
$0 xLxx, Reading a byte.
The second one loads:
$0 xxLL, Reading a word.
And the third one to load:
$4 Lxxx, Reading the last byte.
If you don't believe me, check Section 8.2.1, page 8.8, Fig 8.4 of the
user manual. I guess they couldn't afford the smarter way (out of
microcode or something like that). And no, the second read of $0 will
not be a cache hit.
Now the question is, if you as a programmer know how misaligned access
is going to cost precious memory cycles, are you going to avoid them?
I know I do, and find the (software development) cost not that high.
++marcel beem...@fwi.uva.nl
"So, they destroyed their planet. Why?" "That was better for the economy."
I'd go with confusion over what "MIPS" measures.
>We have a vax 780 with FPA (== 1 MIPS) and sun 3/60 at 16 MHz with 68881, that
>is the same configuration as above. All my comparisions show the sun to be
>between .5 and .8 of the vax for pure computational tasks (no io). One
MIPS is a measure of integer CPU performance, so I wonder why people are
being so careful to tell us about the floating point hardware. If you want
to talk floating point, the metric is MFLOPS, not MIPS.
Gerry Gleason
And do not forget that the Vax 11/780 is technically not a 1 mips machine.
The clock cycle is 200ns. I guess assuming a performance rating of unity
makes life easier for the marketing types to synthesize other more 'useful'
performance numbers :)
--
Peace,
/pgr
"Most of my heroes don't appear on no stamps" - Public Enemy (Chuck D.)
{ames,prls,pyramid,decwrl}!mips!paulr or pa...@mips.com
Misaligned reads (or writes to copy back pages) that are cachable
will generate a line read to load the cache. For reads, data will
be sent directly to the IU/FPU as it is received from the bus.
For a longword read at address 1, the IU will only be stalled until
the second long word of the burst is received. The 68040 will
generate two line reads if a misaligned operand crosses a cache line
boundary.
The manual does not clearly explain how the 68040 handles these cases
(i.e. cachable, non-cachable, no allocate on write to write through).
I have sent a correction to the appropriate people. BTW I know the
68040 data cache works this way because I designed it.
Essentially always, unless the count is very small. Even on machines that
handle misalignment, if the alignment on the two areas is compatible, it
is better to copy enough initial bytes to align the pointers and then do
an aligned copy for the bulk of the data.
(Also, a quibble: bcopy may be "std", but the *standard* routine for
doing this is memcpy. :-))
--
NFS: all the nice semantics of MSDOS, | Henry Spencer at U of Toronto Zoology
and its performance and security too. | he...@zoo.toronto.edu utzoo!henry
>And do not forget that the Vax 11/780 is technically not a 1 mips machine.
>The clock cycle is 200ns.
Well, if the critter did one instruction per clock cycle, like a RISC
machine, it would only have to have a 1000ns clock cycle to be a 1 MIPS
machine. Obviously it takes multiple clocks for the average VAX instruction
anyway, but the bigger issue is what, exactly, constitutes an instruction
in the first place. I mean, the Transputer people have been claiming 17MIPs
for years, yet running real code a single T800 does integer stuff about
as fast as a 68020. So for real comparisons, the further away from the
concept of MIPS you get, the better.
Strangely enough, the marketing folks at most companies still quote MIPS
and Dhrystone 1.1 figures, since they produce larger and more amazing numbers
than SPECmarks. "These go to 11", etc.
>Peace,
>/pgr
And I'm sitting here typing this on a 25MHz 68030 based machine named "kahuna",
telnetted to a Mips-based computer sold by DEC, named "cbmvax". All these
MIPS flying around, and a Commodore C64 could probably get this job done...
>And do not forget that the Vax 11/780 is technically not a 1 mips machine.
>The clock cycle is 200ns.
Depends what you mean by MIPS, other than the accurate Meaningless Index of
Processor Speed. It's true, a 780 typically executes about 600K instructions
per second. The 1 MIPS figure came about when someone observed that the 780
is about as fast as an IBM mainframe (370/158?) that was rated 1 MIPS.
Evidently it accomplished somewhat more work per instruction than a 370 did.
IBM mainframes have been rated in MIPS for 25 years. IBM 360/370 MIPS are
somewhat more meaningful than others since the comparison is among machines
that implement the same architecture and at least at first was calibrated to
reality, i.e. a 1 MIPS 360 was one that actually executed a million
instructions per second in some instruction mix. It was also easier to do
such benchmarks on 360s since pesky things like caches and restartable string
moves didn't mess up the measurements.
--
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
jo...@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
Marlon Brando and Doris Day were born on the same day.
No, you can't easily access *data not aligned on its natural boundary*
without telling the compiler you want to do that. You can load a single
byte from memory from any (byte) boundary you want.
As for the performance hit, well, even though e.g. the 68020 lets you
load up unaligned 2-byte and 4-byte quantities "in hardware", you still
take a performance hit for it (as I noted in the posting to which you're
responding).
>I know I'm bucking an entire industry trend here, as RISC
>seems to have been a huge commercial success, but this
>argument strikes me as suspiciously like that of a car
>salesman who argues that if I'm willing to buy a car
>without brakes or seatbelts, I can get one that is that
>much faster or cheaper. Sure, but I _am_ getting less car.
Nope. It's more like "If you're willing to buy a car without an
automatic transmission, you can get one that's faster or cheaper";
that's not an *exact* analogy - can anybody suggest a more exact one?
How *do* you do shifts at compile-time rather than run-time? :-)
>Actually, the program was a multi-thousand line Pascal
>program (!!) written on a Sun 3 to implement a compiler
>for Sisal, a specialized language for either dataflow or
>parallel processing (I can't remember which and the prof
>concerned is involved in research in both areas). A
>successful compile would give us an executable that would
>take Sisal programs and output something closer to a SPARC
>executable. The program, to my knowledge, was never
>successfully compiled but it may _all_ be due to a buggy
>Pascal compiler (no, no version numbers available).
From "never successfully compiled" can I infer, then, that the program
was never successfully *executed*, either, but that a version of the
code generator (whatever it was that generated "something closer to a
SPARC executable") had been written to generate SPARC code?
If so, it sounds as if either:
1) the problem was, indeed, a problem with the Pascal compiler;
or
2) they never got the code generator to compile.
In both cases, the most likely problem would seem to be generating code
for SPARC - whether in the Pascal compiler (Sun's compilers, at least,
have a mostly common front end for all architectures, so it's the back
end where I'd expect big differences between Sun-3 and SPARC) or in
their own code generator. SPARC is
different enough from the 68K (as are a number of other architectures,
some of which even allow unaligned accesses) that there's no reason to
assume *a priori* that the big problem there had anything to do with
alignment.
>Fair enough. But can I assume that on everyone's wish list
>for Xmas would be a SPARC architecture but without the
>alignment hit?
If I could get it with the same performance (i.e., no performance loss
for properly-aligned references), without sacrificing any capabilities,
and at a minimal extra cost, I'd take it; however, the lack of hardware
support for unaligned references has, as I noted, not really gotten in
my way, so I can think of lots of things I'd put before that on my Xmas
list.
I agree with everything henry says above.
I should have said "well known" instead of "std" bcopy :-) :-)
I also see I did not make the question I wanted answered clear. Let me try
again. I was assuming (but should have said as henry points out) that the
source pointer was initially aligned with a partial word move before the
block transfer began. What I intended to ask was when is it worth the
trouble to load/shift/store or load partials or whatever it takes to avoid
off-alignment accesses when the src and dest do not match alignment?
But what I really was interested in is the following question:
What guidelines would you give an architect about how slow he can
get away with making off-aligned accesses before it starts:
(1) causing internal library people to rewrite code to avoid the problem?
(2) causing compiler code generation oddities to avoid the problem?
(3) causing customer application code people to recode assembly libraries
to avoid the problem?
(4) causing a general stink in the marketplace because of the problem?
We already have some evidence that an infinite time for off-alignment
does cause at least some customers to be unhappy. (like Sparc :-) :-))
However, such evidence shows that it is not fatal either. There are a lot
of Sparc's out there.
Part 2 of the question: does the answer depend on market segment?
Are software performance expectations for a workstation different from a
supercomputer? (I suspect they are)
How about a case I have personal experience with: a language interpreter
that executes precompiled binary code sequences. The program originally ran
on PCs that make no requirement for int and long alignments, so the code
sequences were packed together in the obvious way (making use of sizeof()
operators to determine how far to advance the execution pointer). When
this program was ported to a 68000, the alignment requirement caused a lot
of changes.
It's not a hard problem to solve in any particular case, but we had to make
sure that every case was found. Not trivial. This was written in C, by
the way.
Something to keep in mind when writing code that might need to be portable!
--Jonathan Griffitts
AnyWare Engineering
(303) 442-0556
> The problem is, of course, how do you define "correctly written
> program". If you define code that are not portable as poorly
> written, than your statement is meaningless.
If the program, when run through lint, does not print "possible pointer
alignment problem" except for pointer values returned by malloc, and does
not generate any other lint messages, it will not crash when compiled on
another architecture because of an alignment problem.
C programs that are UNNECESSARILY non-portable are poorly written. Programs
that blithely assume int, long, and pointers are the same thing out of
programmer laziness or that throw casts around with abandon are poorly
written.
Sometimes machine-specific code is required. Such code should be isolated
and commented well. Even if, when you write the program, you think it
doesn't need to be portable, chances are at some point down the road
you, or someone else, will need to port it. Either make this job easy,
or leave your name off the code, because some future programmer will
curse your name otherwise.
--
Joe Buck
jb...@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
For what it's worth here are 2 programs that pass lint and run on a 68000
based Sun-3 and abort due to alignment errors on a SPARC Sun-4:
main() { int i[2], *j = i;
(void) printf("%d\n", *(int *) ((int)j + 2)); }
union u { int i; int *ip; } un; int ai[10];
main() { un.ip = ai; un.i++; (void)printf("%d\n", *un.ip); }
Lint never guarantees correctness. Even still, I prefer the CPU that runs
fast over the one that hides stupidisms from me.
Don then gives two programs -- the first one casts back and forth between
a pointer and an int to force incorrect alignment of the pointer; the second
screws around with unions, writing the union as an int and then reading it
as a pointer. His second program would also crash a PDP-11.
Good point, Don. Writing a union as one type and reading it as another is
a non-portable operation, and lint won't complain. This type of thing should
never be done without surrounding comments explaining the assumptions (for
example, is the machine big-endian or little-endian? What required alignment
is assumed? What is assumed about the relative sizes of data types? Etc).
Casting between ints and
pointers is also dangerous, as I said elsewhere in my article. Lint isn't
perfect, but it's a valuable tool, and I question the professionality of
Unix programmers who don't use it.
-----------
I noticed in IBM advertisements that 1 MIPS relative to the VAX-11/780
was based upon the VAX-11/780 doing 1757 (?) Dhrystones/sec. This was
Version 1.1 of the Dhrystone. I don't know what particular compiler
or what degree of optimizations were used (ahem, just minor points of
course :-)). One VAX-11/780 may be different from another too ...
In any event IF we ASSUME the 1757 Dhrys/sec is accurate for a 1 MIPS
VAX-11/780 reference then we can go on and derive other possibly
equally meaningless numbers (based upon other compilers with various
sorts of optimizations applied).
Well, I know from personal measurements that the Amiga with 25 MHz
68030 and fast nibble mode dram (with burst mode on) can do about
8200 Dhrys/sec (this is Dhrystone V2.1 which is said to be
less optimizable than version V1.1 :-), but otherwise it is the same
benchmark).
These results indicate that the 68030 at 25 MHz is a 8200/1757 = 4.6 MIPS
machine.
------------
Actually 5 MIPS for the 25 MHz 68030 is not all that unreasonable. After all
most 68030 instructions execute in 3 Clock Cycles (implies 8.3 MIPS peak
operation). Also if one operates in "synchronous" mode then many
instructions execute in 2 CC's (implies 12.5 MIPS peak performance). However
if we could assume (there I go again) typically 5 CC's per 68030 instruction
then this implies a 5 MIPS machine. This seems very reasonable to me, but
I have no hard data to examine .....
Al Aburto
abu...@marlin.nosc.mil
How do processors that handle off alignment deal with getting a page
fault in between the multiple transfers? Couldn't this get really
hairy? I mean, consider this:
$A : $XDDD <- Page 1
$A + 4 : $DXXX <- Page 2
Where "A" is the address of the last longword on page 1. "D" represents the
misaligned word that you want to load and "X" is don't care.
What if page 2 is paged out? When does the CPU notice that it's paged
out and what does it do about it? Does it abort the transaction
entirely, bring page 2 back in, and restart? If so, what if, because
of the page replacement algorithm, page 1 gets blown away by bringing
in page 2? If you do bring in the first three bytes and put them into
one of the registers, what kind of state does that leave the paging
code to deal with, since paging isn't handled by the hardware? This
has nasty implications for instruction fetch of variable length
instructions also. In this case, what if the misaligned transfer is
an instruction fetch? I mean, ick!
I've gotta think that all these corner cases have got to add a lot of
checking to the CPU design which has to slow it down, not to mention
the possibility of having a case that you didn't think of and
introducing a bug. I think I'd rather deal with the alignment, have
my CPU run fast, and have a greater assurance that it was designed
correctly.
Dave Roberts
d...@hpfcla.fc.hp.com
In article <104...@convex.convex.com> pat...@convex.com (Patrick F.
McGehearty) writes:
>Trivial example: consider the std libc bcopy which takes two pointers and a
>count. Most machine specific implementations move the data in units larger
>than a character at time. Under what conditions should the implementor of
>this commonly used library worry about checking the alignment of the
>pointers before starting the copy?
Essentially always, unless the count is very small. Even on machines that
handle misalignment, if the alignment on the two areas is compatible, it
is better to copy enough initial bytes to align the pointers and then do
an aligned copy for the bulk of the data.
Doing aligned moves of aligned blocks of storage is a win on most
machines, as Spencer says. Not only a memory copy routine should detect
and exploit the (hopefully fairly common) case where the source and
destination are already naturally aligned, it should also, on machines
that make it easy, try to artificially align the bulk of the copy
operation.
One problem is that when destination and source are aligned
differently you have to choose whether to align the copy w.r.t. the
source or the destination. It is best to align the destination,
especially on write thru cache machines, and sometimes by a spectacular
margin.
Example: if we have to copy 73 bytes from address 102 to address 245, we
should (assuming 4 bytes is the optimal block copy word size):
split the 73 bytes in three segments, of 3, 17x4=68, 2 bytes.
copy 3 bytes from address 102 to 245
copy 17 words from address 105 to address 248
copy 2 bytes from address 173 to address 316
Note that the word by word copy has a source that is not word aligned,
but the destination is. Many machines can cope with unaligned fetches
fairly well, but unaligned stores are usually catastrophic.
My usual example is the VAX-11/780, which had an 8 bytes buffer
between the CPU and the system bus leading to memory, and write thru.
Each byte written could cause an 8 byte read from memory, and an 8 byte
write back to memory, ...
As Spencer and myself have already remarked, this means that a suitable
sw memory copy operation can easily beat hardware memory copies, for
suitably large copy sizes, and by a large margin.
Yet another reason for having simple CPUs and avoiding microprograms (if
you can afford the instruction fetch bandwidth, or use compact
instruction encodings, e.g. a stack architecture).
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.abe...@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: p...@cs.aber.ac.uk
In a word, yes. This is one of the major reasons why essentially all RISC
processors insist on "natural" alignments: so that no operand can cross
a page boundary.
If you want *real* fun, consider that unaligned operands can overlap.
Think about the implications of overlapping operands that span a page
boundary between a normal page and a paged-out read-only page in a
machine with two levels of virtual-address cache and a deep pipeline...
with an exception-handling architecture that was nailed down in detail
on much slower and simpler implementations, and can't be changed. This
is the sort of problem that makes chip designers quietly start sending
their resumes to RISC manufacturers... :-)
Regrettably, simply choosing to build a RISC doesn't obviate all those
awful problems. In machines of any flavor (RISC or CISC) where the backend
which detects the pagefaults can't talk to the front-end which needs
to stop fetching the wrong instructions because, say, the speed of light
on real circuit boards is so pokey, traps are nightmares of astounding
proportions. Interrupts aren't as bad because they are not required
to be particularly synchronized with the instruction stream.
But those atomic, synchronous traps are genuine gut busters in both
hardware and software.
-Mike
I had to port some tcp/ip stuff to a strict alignment machine, and there
is a 32-bit field on a non-32 bit boundary somewhere in the IP header
(I think, the details might be wrong, it's long ago).
This meant I had to pack and unpack the field by hand. Yuck.
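The by-hand pack/unpack amounts to byte-and-shift accessors like these (a sketch; the names are mine). They work regardless of the host's alignment rules or byte order, because network fields are defined big-endian byte by byte:

```c
#include <stdint.h>

/* Fetch a big-endian 32-bit field from any byte offset. */
uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/* Store a 32-bit value big-endian at any byte offset. */
void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >>  8);
    p[3] = (unsigned char)v;
}
```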
--
Een volk dat voor tirannen zwicht | Oral: Jack Jansen
zal meer dan lijf en goed verliezen | Internet: ja...@cwi.nl
dan dooft het licht | Uucp: hp4nl!cwi.nl!jack
| What if page 2 is paged out? When does the CPU notice that it's paged
| out and what does it do about it? Does it abort the transaction
| entirely, bring page 2 back in, and restart? If so, what if, because
| of the page replacement algorithm, page 1 gets blown away by bringing
| in page 2?
You simply restart the instruction. Using an LRU scheme the 1st page
would not get paged out unless the physical mapping was 1 page/process.
As a reasonable constraint you want 8 pages/process anyway, so you avoid
this.
Note 1: yes, "simply" is a relative term in this case, relative to doing
half the instruction and then handling the fault.
Note 2: pages for code, stack, source of copy instruction, dest of
copy instruction. If you allow unaligned access you need two pages in
each area to handle access over a boundary; that totals eight. No, I
wouldn't want to actually run a system with that little memory.
--
bill davidsen (davi...@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"Stupidity, like virtue, is its own reward" -me
The "correct answer" to this seems to be something of a religious question.
It depends on whether you think network packets are defined as structures
or as byte streams. If you call them structures you have both alignment
problems and big-endian/little-endian problems. If you call them byte
streams you have neither problem, but the code looks uglier.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
jeff kenton --- temporarily at jke...@pinocchio.encore.com
--- always at (617) 894-4508 ---
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
If you want *real* fun, consider that unaligned operands can overlap.
Think about the implications of overlapping operands that span a page
boundary between a normal page and a paged-out read-only page in a
machine with two levels of virtual-address cache and a deep pipeline...
with an exception-handling architecture that was nailed down in detail
on much slower and simpler implementations, and can't be changed. This
is the sort of problem that makes chip designers quietly start sending
their resumes to RISC manufacturers... :-)
For the interested, the MU5 supercomputer from Manchester was virtually
like this (except that they were designing it from scratch). They solved
the problem by strict design discipline, as you can find in "The MU5
computer system", Ibbett & Morris (MacMillan). Basically they found out
that the only sensible way out is to have restartable, not continuable,
instructions. Saving processor state on a fault is a very bad idea if it
is complicated. You substitute for this the easier problem of
idempotency.
> In article <884...@hpfcso.HP.COM> d...@hpfcso.HP.COM (Dave Roberts) writes:
> >How do processors that handle off alignment deal with getting a page
> >fault in between the multiple transfers? Couldn't this get really
> >hairy? ...
>
> In a word, yes. This is one of the major reasons why essentially all RISC
> processors insist on "natural" alignments: so that no operand can cross
> a page boundary.
>
> If you want *real* fun, consider that unaligned operands can overlap.
> Think about the implications of overlapping operands that span a page
> boundary between a normal page and a paged-out read-only page in a
> machine with two levels of virtual-address cache and a deep pipeline...
> with an exception-handling architecture that was nailed down in detail
> on much slower and simpler implementations, and can't be changed. This
> is the sort of problem that makes chip designers quietly start sending
> their resumes to RISC manufacturers... :-)
Sorry guys. I didn't mean to sound so dumb when I wrote that. :-) I work
with RISC products at a board level, so I already knew the answer was
"yes", I was just looking for a somewhat quantitative explanation of
what a designer has to go through to make this work and what
techniques were used. How much of a slowdown does this cause in
the processor? The original argument was that not supporting
misalignment was a misfeature (or a bug, in the original author's
vocabulary). Anyway, I was just wondering how much faster you could
go with a given design if you didn't have to support this stuff. What
I was trying to convey was the idea that it isn't just as easy as
supporting multiple accesses to memory but that the problem actually
can go all the way down to the memory management and exception
handling level, and that stuff seems to be the most gut wrenching and
error prone level of a design. It also tends to impact performance
pretty heavily the more complicated you make it.
Anyway, I was just wondering if there was any CISC designer out there
that could say something like, "Yea, well, we could have made that 25
MHz part run at 40 MHz if we hadn't had to support all the
misalignment gunk." Some people will still want to have the
misalignment, but it's nice to see what price you're paying to get it
so that you can evaluate whether you really need it.
Dave Roberts
Hewlett-Packard Co.
d...@hpfcla.fc.hp.com
Sigh. I don't have hard data on the relative merits of VAX, IBM, and other
MIPS, but I first started seeing this canard when some trade-press
idiot divided the clock rate by the average cycle count of the instruction
set. This is, of course, totally bogus. You have to pay attention to
the pipeline depth and weight the cycle counts according to relative
frequency of use, interlock delays, etc. In short, on something of the
complexity of a VAX, you have to run a performance benchmark. I haven't kept up with
the SPEC ratings of IBM versus DEC boxes; would someone care to follow-up with
real numbers?
Jim Leinweber (608)262-0736 State Lab. of Hygiene/U. of Wisconsin - Madison
ji...@sente.slh.wisc.edu uunet!uwvax!uwslh!jiml fax:(608)262-3257
> In article <884...@hpfcso.HP.COM> d...@hpfcso.HP.COM (Dave Roberts) writes:
>
> ----------
Yeah, I agree with you if LRU is the strategy, but how do you know what
page replacement strategy is being used? This isn't usually something
that the hardware specifies. It is usually (read always :-) left to
the O/S to implement. The O/S could implement a FIFO strategy and
page 1 could have been the first page in. I'm not disagreeing, just
raising some questions.
I understand that the problem can be solved and many of the techniques
for solving it. What I'm interested in is: how much you pay for the
solution? How much time is spent trying to solve it? How much does
it cost in terms of overall performance? How often is it used? If
it's more expensive than doing aligned transactions even for
processors that support it, do users tend to try and make all their
transactions aligned? If so, why have it and thereby slow down
everything? Why not leave software to deal with the case when a user
has to do unaligned transactions? Do users really want to pay the
price all the time for this support, or would they rather take a big
hit every so often? Obviously some will and some won't, but that's
the case for any architectural decision. These are the questions I'm
really trying to answer.
Dave Roberts
d...@hpfcla.fc.hp.com
In my opinion, almost all of the whining about alignment comes from people who
write code with `write(fd, some_struct_pointer, sizeof (struct foo))' and
expect it to be portable. Never mind alignment, think about machines with
different type sizes, different byte order, or different floating-point
formats. You can do it, but don't cry to me when it stops working.
Network packets are defined as a sequence of bytes. The second arguments to
read and write are char pointers. This suggests that you should only read and
write arrays of bytes. You can convert between these and whatever internal
form you want.
It may seem like a pain to do this, but it pays off. I wrote programs which
read and write complicated data formats, with weird floating-point numbers and
all. Then I went to a machine with a different byte order and floating-point
format, and they ran fine, without changing a single line of code.
In article <884...@hpfcso.HP.COM> d...@hpfcso.HP.COM (Dave Roberts) writes:
| Yea, I agree with you if LRU is the strategy, but how do you know what
| page replacement strategy is being used? This isn't usually something
| that the hardware specifies.
A fair question. Just as the hardware spec doesn't say the compilers
have to align things on word boundaries, if the hardware is such that
certain software practices are needed, then they *are* implicitly given
in the spec.
| It is usually (read always :-) left to
| the O/S to implement. The O/S could implement a FIFO strategy and
| page 1 could have been the first page in.
In which case the restart would cause a page fault, the first page
would come back in, and the second restart would complete.
|
| I understand that the problem can be solved and many of the techniques
| for solving it. What I'm interested in is: how much you pay for the
| solution? How much time is spent trying to solve it? How much does
| it cost in terms of overall performance? How often is it used? If
| it's more expensive than doing aligned transactions even for
| processors that support it, do users tend to try and make all their
| transactions aligned? If so, why have it and thereby slow down
| everything? Why not leave software to deal with the case when a user
| has to do unaligned transactions?
If you believe that when writing systems programs you will sometimes
have to access data which is not aligned, and I do, then the
question is only whether it should be done in hardware or software. This
arbitrary data can come from another machine (not always even a
computer), or be packed to keep volume down.
If it is being done in software, the source code has to contain a check
for misalignment, which in turn means that the format of a pointer *on
that machine* must be known, as well as the alignment requirements. Bad
and non-portable. Or, you can simply access every data item larger than
a byte using the "fetch a byte and shift" method. This requires that the
byte order of the data, rather than the machine, be known. I think
that's probably the only portable way.
Alternatively, the hardware can support unaligned fetch. It doesn't
have to be efficient, because you would have to make an effort to make
the fetch logic slower than software; it just has to work. This makes
the program a bit smaller, and, assuming that the chip logic is right, it
prevents everyone from implementing their own try at access code.
If the hardware could produce a distinct trap for unaligned access (not
the general bus fault, etc.), the O/S could do software emulation. From
the user's view that would look like a hardware solution. This is like
emulating f.p. instructions in the O/S when the FPU is not present, and
does not represent a major change in O/S technology.
Note that this is not a RISC issue, in that the bus interface unit
already may be doing things like cache interface, multiplexing lines,
controlling status lines, etc. The BIU is not really RISC in that sense;
it functions like a coprocessor if you draw a logic diagram, whose
function is to provide data, which can go in the pipeline or into the
CPU.
| Do users really want to pay the
| price all the time for this support or would they rather take a big
| hit every so soften? Obviously some will and some won't but that's
| the case for any architectural decision. These are the questions I'm
| really trying to answer.
You assume that there is a price all the time, and I'm pretty well
convinced that the BIU in processors which have this capability, such as
the 80486, doesn't have a greater latency for aligned access than the
equivalent unit in SPARC or 88000. Not having the proprietary on-chip
timing, I can't be totally certain, obviously.
I think the real question is "should unaligned access be provided
outside the user program?" I think the answer is yes. Obviously it can
be done better in hardware, but if a chip is so tight on gates that it
can't be without compromising performance elsewhere, then just a
separate trap for quick identification of the problem by the o/s would
be a reasonable alternative.
| In article <11...@carol.fwi.uva.nl> beem...@fwi.uva.nl
| (Marcel Beemster) writes:
| >In general my feelings towards
| >software development are: "If it runs on a SPARC, it runs everywhere",
|
| It's good for portability to develop code on the _least_ forgiving
| machine that you can find.
Back when I worked at Data General, some of us (and a lot of
customers) thought the DG MV/Eclipse computers were the least
forgiving. Among its features:
1) Four different representations for pointers (2 different bit,
1 byte, and 1 word/double word pointers). The bit pointers
were rarely used in high level languages. One format used two
double words, one for the word, and the other as a positive
offset from that word. The other format had the word address
shifted left 4 bits, and the bit offset inserted. The word
pointer format used the top bit for indirection if the
instruction itself used indirection, and the next 3 bits were
the segment/ring bits, with 0 being the OS, and 7 being user
programs. The byte pointer format shifted everything left one
bit, dropping the indirection bit. This meant that for user
programs the use of a byte pointer where a word pointer was
expected, or vice versa, would cause a segmentation violation.
Because multiple indirections were not used in the high level
languages, it meant that for user programs you could check at
runtime whether a pointer was of the appropriate sex, by
looking at the top bit. This led to several C compiler
options to insert such checks in user code, and to enable the
checks in the library. Typically though if a program had been
linted, it would be passing type correct pointers (except to
qsort though) and all would be fine. It was really amazing to
find how many sloppy programs there are out there.
Because of 'all struct pointers smell the same' rules, to allow
pointing to an unknown structure, all structures and unions
were required to be aligned on a 16-bit word boundary.
2) Dereferencing a null pointer always caused a segmentation
violation (inward address trap in MV-speak), since ring 0 was
the operating system, and protected from outer rings except
for calls to a protected gate list. There was no way with
linker switches to get it to change behavior.
3) Characters were unsigned by default, since there was no sign
extending load byte sequence. I did put in a compiler option
for VAX weanies who couldn't read the warning in K&R that
plain chars might be unsigned.
4) Shifting signed ints right was logical rather than arithmetic
(again K&R allows this) because the arithmetic shift was
slower than logical (which in itself was no speed demon).
5) The MV hardware stack grew upwards from low memory, rather
than downwards as it is in nearly every other machine in
existence. It was also surprising the number of people who
still do not use varargs/stdargs (I just got a request in
today to add a GCC option to add a warning for this case,
since we are uncovering it in porting commands to OSF/1).
Another side effect of the stack growing the 'wrong' way, is
that it causes havoc for sbrk, since it wants to grow in the
same direction.
6) The compiler and linker reordered variables based on the size
and alignment of the variables. Thus programs that expect
successive extern (or auto) declarations to be contiguous were
surprised.
7) The floating point format used is the IBM 360 format rather
than the newer IEEE format. This meant you got more exponent
range for floats, and less for doubles as compared to IEEE.
Also, because the base is 16 rather than 2, it leads to a
loss of average precision of 2-3 bits. The one kludge it did
allow was that the single-precision format was exactly the same as
the double-precision format, except for the extra bits.
8) The MV is big endian which causes the usual problems when
importing little-endian VAX code (i.e., taking the address of an
int and treating the pointer as a pointer to a character).
We did have some companies who ported their software to our machines
thank us for finding the bugs in their code. Of course we had the
others who said that their code ran on two different flavors of VAX,
and therefore was portable.
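Two of the pitfalls in the list above (items 3 and 4) can be shown in a few lines. This is a hedged sketch of my own, not DG code: spell out char signedness instead of relying on plain char, and do the arithmetic (floor) shift by hand instead of assuming `>>` sign-extends.

```c
/* item 3: plain char may be unsigned, so sign-extend a byte by hand */
int sign_extend_byte(unsigned char b)
{
    return (b & 0x80) ? (int)b - 256 : (int)b;
}

/* item 4: right-shifting a negative signed int may be logical rather
 * than arithmetic; this computes the floor-shift portably, since
 * ~n is non-negative when n is negative and so shifts predictably */
int floor_shift_right_2(int n)
{
    return (n >= 0) ? (n >> 2) : ~(~n >> 2);
}
```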
--
Michael Meissner email: meis...@osf.org phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA
Do apple growers tell their kids money doesn't grow on bushes?
The IBM RS/6000 has hardware support for misaligned accesses that
fall within one of their (large) cache lines. They trap for misaligned
accesses that span two cache lines.
In other words, they pay for a shift network, but not the
sequencing.
--
Andy Glew, andy...@uiuc.edu
Propaganda:
UIUC runs the "ph" nameserver in conjunction with email. You can
reach me at many reasonable combinations of my name and nicknames,
including:
andrew-fo...@uiuc.edu
andy...@uiuc.edu
stick...@uiuc.edu
and a few others. "ph" is a very nice thing which more USEnet
sites should use. UIUC has ph wired into email and whois (-h
garcon.cso.uiuc.edu). The nameserver and full documentation are
available for anonymous ftp from uxc.cso.uiuc.edu, in the net/qi
subdirectory.
Note that there are two degrees of misalignment:
1) within a word, and
2) crossing a word (& possible page) boundary.
For 1):
If the realignment hardware is not in your main fetch path because it
would impact your cycle time, then it will likely mean an extra stage
of processing for instructions which use it, which can add various bits
of complexity. Considering that, plus
1) a 4-way mux isn't a serious time sink, and
2) how much, or even whether, it influences the cycle time is
technology and implementation dependent
then you are likely just going to stick it in the main fetch path
and do it efficiently, w.r.t. layout, etc. Now, if the end user
does pay for this, it isn't likely to be in performance: even
though the mux might influence the cycle time in principle, in practice it won't. Chips come in
"standard" operating frequencies these days (e.g. 16,20,25,30,40,50);
The difference that a 4-way mux might make would tend to be taken care
of by the process tweaking that's done to get to the desired frequency.
In this case, the realignment hardware influences yield rather than
cycle time, hence cost rather than the performance. I can't think of
any processor that doesn't support this degree of realignment (some
better than others).
For 2):
This, IMHO, is one of the more significant things that differentiates
"RISC" from "CISC". The notion of one instruction making multiple
references to memory tends to make RISC designers get red in the face
and jump up and down. (Yes, I'm well aware of the 29K's load and store
multiple instructions, and while I'm not fond of them, there are some
significant differences between that and handling unaligned accesses.)
The extra control complexity this introduces is a significant increment,
especially considering all the nightmarish endcases that have already
been described in this thread. The added complexity is dependent on
architecture and implementation, and tends to be worse for stores, but
at any rate it tends to increase design/debug time, and more importantly
can cause much hair pulling and resume writing when one attempts really
high performance implementations. (I've known people who thrive on such
complexity, for complexity's sake; they should be removed from the gene
pool. 0.5 :-) With the real estate one has to play with these days,
you can find room for the complexity to keep the performance up, but it
still influences the cost (and the number of errata after release).
I don't know of any "new" architecture chips with decent performance
that support realignment across words in one instruction. Why do
the common CISC chips support it?
1) it's not as big an increment in complexity (no smiley)
2) backwards compatibility (i.e. they have no choice)
In summary, the cost you will tend to see will be $ more than performance,
although at the high end of the performance spectrum you might pay in
performance as well - that's hard to say, since processors which support
word-crossing accesses tend to have a lot of other complexities which
influence cost/performance as well.
What makes sense depends on the intended applications, of course. It
may indeed make some network software run significantly faster, for
instance. But if that network software consumed 5% of all the cycles
of all the processors I had sold, and such hardware support would
*double* the network software's performance, I still wouldn't risk screwing
up everything else to go for an aggregate 2.5% performance improvement.
----------------------------
Dave Christie
My humble opinions only.
I've wondered about that. Getting naturally aligned operands requires a
4:1 mux on the low byte, a 2:1 on the next byte, and nothing on the
upper bytes. To get all alignments requires a 4:1 mux on all bytes. It
merely makes all bytes be just as bad as the low byte. In theory.
In practice, loading and wire delays probably have as much impact as logic.
--
ba...@apple.com (408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
No one has mentioned the solution used by the MIPS R3000.
Quite simply, they have two "partial load" instructions: between them
they constitute an unaligned load.
If unaligned data is untypical, this sounds like an ideal compromise.
Besides, maybe Herman Rubin can use it in his assembler programs ...
--
Don D.C.Lindsay
>No one has mentioned the solution used by the MIPS R3000.
>Quite simply, they have two "partial load" instructions: between them
>they constitute an unaligned load.
I hate to sound like Herman Rubin, but it would be nice if they provided
a way to access these instructions from C. "#pragma misaligned" perhaps.
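One later-era way compilers exposed this from C (an assumption on my part, not anything documented at the time) is an attribute rather than a pragma: GCC's packed attribute marks data as potentially misaligned, and on targets with partial loads (like the R3000's lwl/lwr pair) the compiler can then emit the unaligned-access sequence itself instead of trapping.

```c
#include <string.h>

/* Hypothetical helper: reading through a packed struct tells the
 * compiler the field may sit at any byte address, so it generates a
 * safe (possibly multi-instruction) unaligned load. */
struct unaligned_u32 { unsigned int v; } __attribute__((packed));

unsigned int read_unaligned_u32(const void *p)
{
    return ((const struct unaligned_u32 *)p)->v;
}
```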
-- Richard
--
Richard Tobin, JANET: R.T...@uk.ac.ed
AI Applications Institute, ARPA: R.Tobin%uk.a...@nsfnet-relay.ac.uk
Edinburgh University. UUCP: ...!ukc!ed.ac.uk!R.Tobin
1) VUP: 1 VAX 11/780 == 1, on some (internal) set of about 100 benchmarks
that DEC uses, mixture of languages, with fair amount of floating-point.
As far as I know, DEC has generally given VAXen VUP ratings, NOT
mips-ratings.
2) MVUP = MicroVAX II Unit of Performance, used by Digital Review magazine,
based on timings of about 30 FORTRAN codes.
------------
Regarding mips-ratings, and SPEC, and such:
1) Unadorned mips-ratings are essentially useless, and ought to be
eradicated forever. Read the June 20, 1990 Wall Street Journal
article ("MIPS Keeps Slipping as Speed Standard") for a succinct
summary of the rationale for this statement.
2) Although vendors are (at least mostly) internally consistent,
each has its own idea of mips-ratings, such that:
if vendor A has a machine rated 10 mips
and vendor B has one rated also at 10 mips
it is quite possible for one of the two machines to run 1.75X faster
on the SPEC integer subset. Put another way, SPEC integer performance
ranges anywhere from 55% to 97% of the corresponding published mips-ratings,
depending on vendor & product. (This was from 2Q90 data, may have
changed by now.)
3) Any time you see, in the press, a statement like:
"Vendor A announces new XX-mips machine", unless you have other data,
or are very familiar with conversion factors from that product line
to others, you would be wise to replace it with:
"Vendor A announces machine, for which no meaningful performance
rating has been given."
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
Actually, it is (or can be) even worse than that. Consider the
following VAX gem:
addl3 *0(r1),*0(r2),*0(r3)
Assume that the instruction itself (which is 7 bytes long) crosses
a page boundary, that r1, r2, and r3 contain 0x1ff, 0x3ff, and 0x5ff
respectively, that the longword at 0x1ff..0x203 contains 0x7ff, that
the longword at 0x3ff..0x403 contains 0x9ff, and that the longword at
0x5ff..0x603 contains 0xbff. Then we need:
2 pages for the instruction
2 pages for 0(r1) (0x1ff..0x203)
2 pages for 0(r2) (0x3ff..0x403)
2 pages for 0(r3) (0x5ff..0x603)
2 pages for *1ff (0x7ff..0x803)
2 pages for *3ff (0x9ff..0xa03)
2 pages for *5ff (0xbff..0xc03)
--
total 14 pages for one `simple' `addl3' instruction.
(Imagine the fun with a six-argument instruction like `index'!)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: ch...@cs.umd.edu Path: uunet!mimsy!chris
(New campus phone system, active sometime soon: +1 301 405 2750)
| Actually, it is (or can be) even worse than that. Consider the
| following VAX gem:
|
| addl3 *0(r1),*0(r2),*0(r3)
|
| Assume that the instruction itself (which is 7 bytes long) crosses
| a page boundary, that r1, r2, and r3 contain 0x1ff, 0x3ff, and 0x5ff
| [ details ]
| total 14 pages for one `simple' `addl3' instruction.
I suspect that you could actually get something like this in actual
programs. It happens on machines which don't force the instructions to
be an even size. This is probably a worst worst case, but I could
believe that it would happen.
There seem to be enough advantages and penalties to unaligned access
to prevent a definitive good or bad decision. Certainly it will take a
few gates on the chip, but probably will not slow aligned access. It
will be a lot faster than doing the same thing in software, but it's not
a common thing to do. There doesn't seem to be a portable way to decide
if an address is aligned or not, since pointer formats vary, so a
software solution has to assume all accesses are unaligned.
> addl3 *0(r1),*0(r2),*0(r3)
...
> total 14 pages for one `simple' `addl3' instruction.
>(Imagine the fun with a six-argument instruction like `index'!)
I'm told this is why the VAX has a 512-byte page -- a 128KB box
was planned (but never made), and in the worst case (an instruction
straddling 40 pages) would have deadlocked at <40 pages available.
The page-size thus had to be less than one-fortieth of the page-list
of a bare-minimum (1977!) system.
--
Gideon Yuval, gid...@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)
MIPS processors generate a trap with an Address Error Exception.
In RISC/os, this is turned into a SIGBUS signal.
--
Charlie Price cpr...@mips.mips.com (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086
Mostly they trap. There are probably one or two design groups that were
foolish enough to just ignore the low bits.
--
The 486 is to a modern CPU as a Jules | Henry Spencer at U of Toronto Zoology
Verne reprint is to a modern SF novel. | he...@zoo.toronto.edu utzoo!henry
The second least useful thing I can think of is to cause a hardware trap to
software, which would presumably be horrendously slow. I would be completely
happy with that. Write a software trap handler to deal with it, and write your
compiler such that it never happens if avoidable.
--
Roland McGrath
Free Software Foundation, Inc.
rol...@ai.mit.edu, uunet!ai.mit.edu!roland
On the Motorola 88000 a misaligned access causes a trap into the kernel. There
is a bit in the status word which can override this, in which case the CPU
assumes zero for the appropriate number of low order bits. Kernels can either
signal the offending process on a fault, or complete the offending access and
continue.
The VAX does this for quad word loads. VAX LISP takes advantage of
this feature to get two free tag bits on LISP pointers that don't
need to be masked before being dereferenced. Seems useful to me.
>The second least useful thing I can think of is to cause a hardware trap to
>software, which would presumably be horrendously slow. I would be completely
>happy with that.
I'd like this behavior to be defined by the user process. Perhaps as
a signal? Seems similar to SIGBUS. Anyway, set a bit someplace so you
trap or not, based on what kind of process you are. You probably
always want to take the trap in system mode, and panic (at least
in UN*X) since it indicates a bug of some kind.
>Write a software trap handler to deal with it, and write your compiler such
>that it never happens if avoidable.
If your compiler is doing the right thing, the only time you'll take
an interrupt is if your code is buggy. This sort of thing can be caught
statically by lint, but there may be some situations in which
dynamic behaviour can produce the same event, like in programs
that generate their own code.
Interesting idea.
Jim
| documented), the least useful thing I can think of for them to do (within the
| bounds of reason) is to do the access wrong (which seems likely since if they
| want the low-order two bits of the address to always be zero, they might well
| not pay attention to them). Is this what is done?
On some machines, yes. On others a trap results, allowing the kernel
to complete the access in software if desired. I believe that the 88K
has a flag to trap or just zero the low bits or the address, I know the
Honeywell DPS series zeroed the low bit of the address on a doubleword
access. This resulted in many amusing "learning experiences."
Just out of curiosity, can anyone give some live examples where software
takes advantage of the mode where the CPU just zeroes the low-order
bits and continues, as in the 88K? (or, I think(?), in the RT/PC).
The only case I've seen is a low level, PROM based, debugger on the 88k which
runs in this mode. Presumably, this is done to save error checking or trap
handling when the user types a misaligned address.
I don't see that this gains much, and it runs the risk of masking real bugs
in the program itself.
>Just out of curiosity, can anyone give some live examples where software
>takes advantage of the mode where the CPU just zeroes the low-order
>bits and continues, as in the 88K? (or, I think(?), in the RT/PC).
The optimizing compilers I used to work on did all of their dynamic
memory management through IDL (Interface Description Language). Our
IDL implementation gave us nice things - garbage collection, debug
support, and the ability to move any rooted data structure to/from a
file.
IDL objects were tagged, and tags contained a two-bit code which was
stored in the low end of a pointer. The IDL runtimes had to pack and
unpack them. (Guy Steele determined which code was commonest, and
represented it as 00.)
Would we have used the hardware mode? No. The IDL runtimes initially
consumed about half the cycles of a big compile, but I fixed that in
the conventional way (amortization,inlining,caching,etc). After the
fix, our cycles were elsewhere, and a special hardware mode would have
made no difference.
Plus, I would not want to turn off the hardware checks while the rest
of the code was running. If this meant constantly turning a mode on
and off, then forget it.
--
Don D.C.Lindsay
There was at least one design group (an SEL machine) that noticed that they
could encode the operand width using the low order address bits. In this
way they were able to save a bit in the instruction thus providing an
additional address bit. Of course, there was no such concept as a
misaligned access under this scheme.
Marv Rubinstein
In Lucid's Common Lisp, they used the two low-order bits as the tag.
On a machine that simply ignored the low-order bits on dereferencing,
they could avoid masking those bits off. I don't know whether they
did so.
Another advantage of this scheme is that the tag bits for
short-integers are 0. No masking is needed before addition, and
overflows can be detected by hardware. You do have to remember to
shift integers before output.
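The arithmetic property is worth spelling out. In a sketch (representation invented for illustration, but matching the scheme described): with the integer tag being the two zero low-order bits, a small integer n is represented as n*4, so machine addition of two tagged integers yields the correctly tagged sum with no masking, hardware overflow detection still works, and you only shift (here, divide) when converting out, e.g. for output.

```c
#include <stdint.h>

typedef intptr_t fixnum;   /* tagged small-integer representation */

/* Tag: value shifted up two bits; low bits are 00. */
static inline fixnum  fix(intptr_t n)    { return n * 4; }
/* Untag: exact for properly tagged values, positive or negative. */
static inline intptr_t unfix(fixnum f)   { return f / 4; }
/* Addition needs no masking: (a*4) + (b*4) == (a+b)*4. */
static inline fixnum  fix_add(fixnum a, fixnum b) { return a + b; }
```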
The Gould Powernode machines had 32 bit words, 24 bit addresses. The
top 8 bits were ignored. In our Lisp, we put the tag bits there.
Again, there was no need to mask before dereferences. The tag bits
for positive short ints were 0x0 and for negative short ints were
0xff, so machine addition could be used without masking. However, you
had to check for overflow into the tag bits.
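A sketch of that high-byte layout (constants and names invented, following the description above): with 32-bit words and 24-bit addresses, the top 8 bits are ignored on dereference and can hold the tag. Choosing 0x00 for positive and 0xff for negative short ints means a tagged short int is just the two's-complement integer itself, so machine addition works directly; the price is checking that the result hasn't overflowed into the tag byte.

```c
#include <stdint.h>

#define ADDR_MASK 0x00ffffffu   /* low 24 bits: the actual address */

static inline uint32_t hi_tag(uint32_t v)  { return v >> 24; }
static inline uint32_t addr_of(uint32_t v) { return v & ADDR_MASK; }

/* A value is still a valid short int if the tag byte is 0x00
 * (positive) or 0xff (negative); anything else means overflow
 * into the tag bits, or some other type entirely. */
static inline int short_int_ok(uint32_t v) {
    uint32_t t = hi_tag(v);
    return t == 0x00u || t == 0xffu;
}
```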
(Note: I didn't work for Lucid. As I recall, I learned this with
their debugger. I think the trick has been floating around for a
while.)
Brian Marick
Motorola @ University of Illinois
mar...@cs.uiuc.edu, uiucdcs!marick
SEL became Gould CSD, and then...
The last Gould machines used the low order bits of the offset field
(which was present on all memory access instructions) as part of the
width encoding. They did not, however, use the low order bits of the
registers, and they did produce misaligned traps if the low order bits
of the final address, ignoring the low order bits of the offset
literal, were incorrect.
This meant that you could not keep an odd address in a register and
add 1 to it via the addressing mode to make a correctly aligned even address.
It never caused me any problems - it even found a few bugs.
--
Andy Glew, a-g...@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
In article <MCGRATH.90...@homer.Berkeley.EDU> mcg...@homer.Berkeley.EDU (Roland McGrath) writes:
| On machines that "don't handle misaligned accesses", what do they do when one
| happens anyway?...
he...@zoo.toronto.edu (Henry Spencer) writes:
| Mostly they trap. There are probably one or two design groups that were
| foolish enough to just ignore the low bits.
Ah yes, many older mainframes... I remember DPS-8s with affection, but
not for all the decisions they made. A doubleword register load from an
odd address used the floor(address), causing an optimization that used
double loads to load the wrong two variables.
**MY** that was hard to spot.
--dave
--
David Collier-Brown, | dav...@Nexus.YorkU.CA, ...!yunexus!davecb or
72 Abitibi Ave., | {toronto area...}lethe!dave
Willowdale, Ontario, | "And the next 8 man-months came up like
CANADA. 416-223-8968 | thunder across the bay" --david kipling
When emulating other (older) architectures that permit non-aligned accesses,
there are two basic choices with architectures of this type:
1 examine the nonaligned memory access and read ALIGNED bytes,
ALIGNED halfwords and/or ALIGNED words as appropriate and
shift and mask them together.
2 issue two reads at the NON-ALIGNED addresses
( non-aligned-addr and non-aligned-addr+4 ); the hardware ignores
the low order bits, so these fetch the two aligned words the
nonaligned word straddles, and then shift and mask.
The low order bits are ignored by the read itself, but are used by
the masking and shifting code.
My memory of this is fuzzy, but in most cases, (2) is more efficient.
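The shift-and-mask step is the same either way; only where the aligned fetch happens differs. Here is a sketch of option (1) for a 32-bit little-endian load from a simulated memory image (function names invented). Option (2) would be this same computation with the aligned fetch done implicitly by hardware that ignores the low-order address bits.

```c
#include <stdint.h>

/* Fetch the ALIGNED 32-bit little-endian word containing addr. */
static uint32_t load_aligned32(const uint8_t *mem, uint32_t addr) {
    const uint8_t *p = mem + (addr & ~3u);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Emulate a possibly nonaligned 32-bit load: read the two aligned
 * words the target straddles, then shift and mask them together. */
static uint32_t load_unaligned32(const uint8_t *mem, uint32_t addr) {
    uint32_t lo = load_aligned32(mem, addr);
    unsigned sh = (addr & 3u) * 8;       /* byte offset within the word */
    if (sh == 0) return lo;              /* already aligned: one read */
    uint32_t hi = load_aligned32(mem, addr + 4);
    return (lo >> sh) | (hi << (32 - sh));
}
```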
===============================================================================
Steve Schlesinger, NCR/Teradata Joint Development 619-597-3711
11010 Torreyana Rd, San Diego, CA 92121 steve.sc...@sandiego.ncr.com
===============================================================================
>The only case I've seen is a low level, PROM based, debugger on the 88k which
>runs in this mode. Presumably, this is done to save error checking or trap
>handling when the user types a misaligned address.
I can't imagine a *worse* place to turn off a trap detecting possible
errors than in a debugger.
Jonathan Ryshpan <...!uunet!hitachi!jon>
>jke...@pinocchio.encore.com (Jeff Kenton) writes:
>> The only case I've seen is a low level, PROM based, debugger on the 88k
>> which runs in this mode.
In article <4...@hitachi.uucp> j...@hitachi.UUCP (Jon Ryshpan) writes:
> I can't imagine a *worse* place to turn off a trap detecting possible
> errors than in a debugger.
I think we can assume that it restores the state while running the debugged
code; it just turns off the trap for internal use. I don't think detecting
errors in the PROM monitor does you much good, anyway, so why worry? :-}
--
-Colin
Unfortunately, it doesn't. As I said in my (partially quoted) posting, I'm
not convinced that the small saving in debugger code is worth the loss of
general error checking it costs you.
It's possible to make use of the fact that most machines *don't*
ignore the low-order bits to get "free" type checking. Suppose you
make the tag for cons cells 3, and address them something like this:
move -3(r1), r0
This will cause a trap if the tag is not 3. Whether the type check is
really free depends on whether you would have been able to use a faster
addressing mode if you didn't want the check.
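In C-ish terms (struct and names invented for illustration), the trick looks like this: the displacement of -3 exactly cancels the tag, so the load address is aligned precisely when the tag really was 3. On hardware that traps misaligned word loads, any other tag leaves the low bits nonzero and traps, giving the type check for zero extra instructions.

```c
#include <stdint.h>

#define CONS_TAG 3u   /* hypothetical tag chosen for cons cells */

typedef struct cons_cell { uintptr_t car, cdr; } cons_cell;

/* Fetch car from a tagged cons pointer. tagged - 3 is the true cell
 * address only if the tag is 3; otherwise, on trapping hardware,
 * the misaligned load faults -- a free type check. */
static inline uintptr_t car_of(uintptr_t tagged) {
    return ((cons_cell *)(tagged - CONS_TAG))->car;
}
```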
}The Gould Powernode machines had 32 bit words, 24 bit addresses. The
}top 8 bits were ignored. In our Lisp, we put the tag bits there.
}Again, there was no need to mask before dereferences. The tag bits...
As someone who went through a conversion from MVS to MVS/XA, I find
this perfectly horrible.
You don't need to wait; the Gould NP line had a lot of compatibility
with the Powernodes; but they did use all 32 bits of address.
Has anyone mentioned GNU Emacs using high order address bits yet?
mike
--
Michael Fischbein, Technical Consultant, Sun Professional Services
Sun Albany, NY 518-783-9613 sunbow!msf or mfisc...@east.sun.com
These are my opinions and not necessarily those of any other person
or organization. Save the skeet!
One of the easier parts of porting Powernode lisp to the NP was
dealing with the fact that tag bits were now valid address bits. We
knew how we'd do it well in advance.
Taking advantage of architectural quirks isn't automatically evil.
You just have to weigh the costs against the benefits.
Unless you know what the architects might come up with in the future
in the way of architectural extensions, you can't know all the possible
costs.
--------------------
Dave Christie My opinions only.
If your software design makes use of information hiding, then all code
that depends on the architectural feature will be localized to a very
small portion of code, so converting to a new architecture will be easy.
If you don't use this kind of discipline, yes, porting your software to
a new architecture, or even the next generation of the same architecture,
will be extremely difficult, because cruft that depends, say, on the
meanings of particular bits will appear all over the place.
Yet another reason I'm growing to like C++ more and more...
--
Joe Buck
jb...@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
>>Taking advantage of architectural quirks isn't automatically evil.
>>You just have to weigh the costs against the benefits.
>
>Unless you know what the architects might come up with in the future
>in the way of architectural extensions, you can't know all the possible
>costs.
Within a company, "architects" should have long range plans for the
direction of the architecture, and should be able to give software
developers for that architecture an idea of what the costs will be.
Of course, the plans will change, sometimes breaking previously
stated cost goals (but hopefully not breaking previously stated
compatibility rules).
If you haven't got a rough idea of where you are headed in 5 years'
time (I'd like to say 10 years' time, but most US companies don't
think that far ahead (except maybe IBM)) you aren't architecting,
you're implementing.
Fine. I don't imagine anyone in this business is just 'implementing'.
And nobody is just 'architecting'. The two are very much interdependent.
Of course you have long range plans, but the further out you go the
less focused they are, because all the things that drive architectural
development (implementation technology and techniques, software directions,
customer needs) aren't terribly predictable beyond five years.
I assume the architectural quirks that Brian was referring to are the
grey areas in any architectural definition covered by such phrases as
"undefined operation", "implementation dependent", "reserved", and my
personal favorite, "try this and you'll be shot" (never get this one
past the publications dept, though...). They might better be called
implementation quirks. These grey nooks and crannies exist because
their behavior doesn't need to be defined for most people's purposes
(except maybe to say they won't compromise any protected mode of
operation), and the tighter you tie up an architectural definition, the
tighter you tie your hands for future implementations and extensions. So
grey areas that don't need to be defined will remain grey in future plans.
(And then there are the corners you haven't even covered because in the
limited amount of time you have to put together a user's guide you can't
think of all the unobvious ways clever software developers will try
something.)
As for communicating future plans to software developers, if your product
is successful at all, you can't possibly review everyone's use of the
quirks. Major developers will get access to the architects and future
plans under non-disclosure, and anyone really counting on a quirk
behaving a certain way will be given serious consideration, or
shown how such a seemingly innocuous little thing could have significant
impact on your planned super-duper hyper-scalar biological implementation
in sea-moss. (But then, I don't think major developers are prone to
counting on quirks.) A lot of smaller developers will generally have to
live with the grey areas and assume worst-case potential cost for using a
quirk: the cost of doing it all over again without the quirk on the very
next implementation of the target architecture. (Which may very well be
trivial, but then, you often don't know the true cost of doing something
until you've actually done it.) To tie in another couple of threads,
use of quirks is an extension of the HLL/assembler portability tradeoffs,
and is akin to breaking the timing rules.
Sorry to be long-winded, but what the heck, traffic's been light lately.
-------------------------------
Dave Christie I don't speak for AMD, and I'm sure they appreciate that.
"I don't think major developers are prone to counting on quirks."
Ahh, if only things were as simple in the PC world as they appear
to be in the mini/workstation world...
Probably the classic example is the ongoing battle between Apple
and many developers of software for the Macintosh over Adherence
to the Rules (as laid down in Inside Mac, the tech notes, the
human interface notes and other miscellaneous pieces of documentation)
versus the urge to take hardware-specific shortcuts. The most famous
offender is Microsoft, and as far as I recall the most stupid thing
they ever did (and there were some humdingers) was some scheme
in Excel 1.x for determining if a floating-point coprocessor was
present: the standard environment query wasn't good enough for them,
instead they came up with their own ingenious test that worked with
a 68881 but failed with a 68882.
In the DOS world, one example that comes to mind is one of the
problems that came to light as the 80486 chip was making its
first few public appearances: it turned out that AutoCAD was
neglecting to clear a "reserved" bit before loading a new value
into an 80386 status register, and the '486 didn't like this
at all. So they changed the relevant '486 instruction to ignore
the setting of that bit. I guess this isn't a case of "counting on
quirks" so much as neglecting to check certain low-level sources
of possible future incompatibility *very* thoroughly. The end
result being that the hardware vendor has to work around the software
vendor's mistakes...
PS: regarding "try this and you'll be shot"--read any Apple
documentation lately? It's full of statements like this--and worse.
Lawrence D'Oliveiro fone: +64-71-562-889
Computer Services Dept fax: +64-71-384-066
University of Waikato electric mail: l...@waikato.ac.nz
Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
..that time of year when the kennels become a melting pot of molting pets...
All "reserved bits" should be implemented "Read as zero, trap on
nonzero write". Or, at least, "Read as zero, software should write old
value or zero".
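The second policy, done in software, is just a read-modify-write that carries the reserved bits through untouched. A sketch (the register layout is invented for illustration):

```c
#include <stdint.h>

#define RESERVED_MASK 0x0000ff00u   /* hypothetical reserved field */

/* Update a control register's defined bits while writing back the
 * reserved bits exactly as read, so a future implementation that
 * assigns them a meaning isn't silently clobbered. */
static inline uint32_t ctrl_update(uint32_t old_val, uint32_t new_bits) {
    return (old_val & RESERVED_MASK) | (new_bits & ~RESERVED_MASK);
}
```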
But nobody gets it right. The FUTUREBUS+ draft didn't have the
reserved bit usage policy defined (for its CSRs). Hopefully they will
in the next draft.
and Lawrence D'Oliveiro replies:
> Ahh, if only things were as simple in the PC world as they appear
> to be in the mini/workstation world...
Sometimes the reliance on quirks seems almost intentional, as for
example, the IBM PC ROM-BIOS. The early, slick, products (Lotus123,
MS flight sim.) didn't stick to the rules because the code that ran
according to the rules was slow. By using undocumented interrupts and
system calls the software got speed, but the user got stuck with the
copyrighted IBM BIOS. Was someone in marketing smart enough to actually
lay this deviousness out ahead of time, or did it just sort of happen?
I don't know, but IBM sold a lot of PCs while the clone makers tried to
come up with a ROM-BIOS that would support all the undocumented "features".
David States
What I meant is simple. Tying software architecture to a support
architecture is always a matter of evaluating risks and benefits.
There is no difference in kind between the decisions to depend on
POSIX, on SunOS, on the 68K, or on a particular architectural feature
of the Gould PN machines.
I suspect there were once pre-released versions of Lotus123 and MS flight
simulator that may have pretty much stuck to the rules. When they saw the
performance wasn't as good as they wanted, they looked for ways to speed
it up. That is likely to be when the oddities mentioned before may have
been added.
Another possibility is that the writers of Lotus123 and MS flight
simulator didn't realize those things were against "the rules", but I'd
like to think of Kapor and Gates as being good enough software design
engineers and software project managers, to know when they are breaking
"the rules".
Paul Scherf, Tektronix, Box 1000, MS 60-850, Wilsonville, OR, USA 97070
pau...@orca.WV.Tek.com 503-685-2734 ...!tektronix!orca!paulsc
[Someone asks what machines that require alignment do with unaligned
addresses in loads. His `least desirable' scenario is that the
processor completely ignores the low address bits.]
In article <140...@sun.Eng.Sun.COM> jpu...@raptor.Eng.Sun.COM
(James M. Putnam) writes:
>The VAX does this for quad word loads. VAX LISP takes advantage of
>this feature to get two free tag bits on LISP pointers that don't
>need to be masked before being dereferenced.
Since there is no mention of this in the VAX architecture handbook, I
tried it out to make sure. An unaligned movq reads the unaligned quadword.
If r0 contains 0x1003 and memory at 0x1000 looks like this:
0x1000: 0x00
0x1001: 0x01
0x1002: 0x02
r0 -> 0x1003: 0x03
0x1004: 0x04
0x1005: 0x05
0x1006: 0x06
0x1007: 0x07
0x1008: 0x08
0x1009: 0x09
0x100a: 0x0a
then after a `movq (r0),r0', r0 contains the value 0x06050403 and
r1 contains the value 0x0a090807.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain: ch...@cs.umd.edu Path: uunet!mimsy!chris
[Quoting a claim that VAX quad-word loads ignore the low-order
bits, and that this is a useful feature for LISP.]
>Since there is no mention of this in the VAX architecture handbook, I
>tried it out to make sure. An unaligned movq reads the unaligned quadword.
>--
>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
>Domain: ch...@cs.umd.edu Path: uunet!mimsy!chris
I stand corrected and thanks to Chris for pointing it out. During a
conversation I had with Barry Margolin some time ago about this very
topic (non-lispm architecture implementations of LISP pointers), I got
the impression that VAX quad-word loads ignored bits in the address.
Sincere apologies for the misinformation. Does anyone know of a machine
that has this "feature", or did I construct this out of whole cloth?
Jim
The 88000 will let you run in this mode. It still seems like laziness to me.
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
----- jeff kenton --- temporarily at jke...@pinocchio.encore.com -----
----- --- always at (617) 894-4508 --- -----
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
Jim,
What may well have happened is that some early LISP implementer just
"tried it" and found that on his vax the low order bits were ignored.
This makes a dandy feature if you're trying to fake a tagged architecture,
so maybe he went ahead and used it. *BUT* since it isn't in the manual,
DEC is free to change it whenever they feel the need, and they *do* feel
that need from time to time, so the code in question wouldn't even be
portable within the vax line. You may both have been right. :-)