
RISC versus CISC


Andy "Krazy" Glew

Feb 12, 2011, 3:14:16 PM
http://semipublic.comp-arch.net/wiki/RISC_versus_CISC

* [[RISC (Reduced Instruction Set Computer)]]
* [[CISC (Complex Instruction Set Computer)]]

[[RISC]] and [[CISC]] have been discussed elsewhere, outside this wiki,
in great detail.
I will not try to add much to what has already been written.

However, although I am a RISC sympathizer, I may take a contrarian
stance, talking about some of the issues and problems with RISC.
The RISC philosophy was certainly influential. However, in many ways it
failed.
I may try to talk about those failures.
And how the so-called [[RISC Revolution]] warped a generation of
computer architects.

= Conclusion =

I did not really want to write this article. There is not much to add
to all that has been written about the so-called [[RISC Wars]] except to say
that they caused a lot of thrashing, amounted to less than promised,
although they did bring some improvements.

However, I could not really write a wiki on computer architecture
without mentioning [[RISC versus CISC]].
The topic has been sitting on the top of my to-do list since I started
writing this wiki/book.

I would prefer, however, to discuss interesting aspects of RISC computer
architecture that may not be covered in many other places,
rather than rehash this old debate.
These pages have mostly not yet been written:

* [[Breaking out of the 32-bit RISC instruction set straitjacket]]
** [[variable length RISC instruction sets]]
** [[16-bit RISC instruction sets]]
** [[RISC instruction sets with prefix instructions]]

* [[post-RISC proliferation of instructions]]

* [[How do you count instructions in an instruction set architecture?]]

* [[Microcoded instructions and RISC]]
* [[Hardwired instructions and RISC]]
* [[Hardware state-machine sequenced instructions and RISC]]

Many of these latter issues are of the form "XXX and RISC", and really
should be only of the form "XXX",
except that one of the leftovers of the [[RISC era]] is that it is
considered obligatory to explain
how a feature supports or opposes the [[RISC philosophy]].


= Background =

The late 1970s and early 1980s were an era of increasing capability and
complexity in computers.
The mainframe IBM 360 family was reaching the peak of its influence.
The relatively simple PDP-11 minicomputer
was supplanted by the more complex DEC VAX.
Microprocessors were marching forward: the Intel 8086 started along the
road to world domination in 1978,
the Motorola 68000 started along the road to elegant failure in 1979.
Intel had been talking about the
[http://en.wikipedia.org/wiki/Intel_432 Intel iAPX 432]
microprocessor for a while, with features such as bit granular instructions,
garbage collection in hardware and microcode,
a software object model supported by hardware, etc.
People were trying to solve the so-called software gap with hardware, by
building computers that more closely mapped onto
whatever language implementation they were targeting.


Complexity seemed to be increasing.
In actuality, much of the software complexity was moved into microcode.
Woe betide languages and operating systems that did not match the
language and operating system use cases the computer was designed to
support.

= The RISC Revolution =

Into this came a bolt of sanity:
<UL>
David A. Patterson and David R. Ditzel. 1980. The case for the reduced
instruction set computer. SIGARCH Comput. Archit. News 8, 6 (October
1980), 25-33. DOI=10.1145/641914.641917
http://doi.acm.org/10.1145/641914.641917
</UL>
Heck - it was not even published in a refereed journal!

One must also mention the [[IBM 801 minicomputer]], arguably the first RISC.

The late 1980s and early 1990s were full of RISC computers, especially
microprocessors: [[Power]], [[PowerPC]], [[MIPS]], [[SPARC]], [[Motorola
88K]].
Not to mention many more failed companies.
All seemed destined to replace the IBM mainframe, the VAX minicomputer and,
then, as the importance of the PC market grew, the PC microprocessor.
But only the 68000-based Apple Macintosh fell to the PowerPC RISC onslaught.
The VAX and other minicomputers died out.
But the IBM mainframe, and the Intel x86, continued, the latter
spectacularly.

Now, by 2010,
* IBM is slowly transitioning to [[Power]], but the IBM Z-series
mainframe stays strong
* the PC world is still ruled by Intel [[x86]]
** Intel's RISC and VLIW projects fell by the wayside
* [[PowerPC]] and [[MIPS]] have been relegated to [[embedded computing]]
* [[ARM]] is the biggest RISC success story, in embedded and,
particularly, in low power, cell phones, etc.
** [[ARM]] is likely to be Intel's big competition
* Sun [[SPARC]] survives, barely - but Sun itself got swallowed by Oracle
** Sun/Oracle x86 boxes predominate even here
* DEC is long dead, as is [[Alpha]]

= What Happened? =

Many RISC companies went after the high priced, high profit margin,
workstation or server markets.
But those markets got killed by [[The March of the Killer Micros]],
specifically, the x86 PC microprocessor.
It is telling that ARM is the most successful "RISC", and that ARM
targeted embedded and low power, the low end,
lower than the PC market, rather than the high end.

[[PowerPC]] and [[MIPS]] made a concerted attack on the Intel x86 PC
franchise.
Microsoft even provided Windows NT support.
But when Intel proved that they could keep the x86 architecture
competitive,
and stay the best semiconductor manufacturer, the "RISC PC" withered.

Intel proved they could keep the x86 PC microprocessor competitive in
several stages:
* the i486 proved that the x86 could be pipelined.
** Up until then one of the pro-RISC arguments was that CISCs were too
complicated to pipeline. But, see the next section
** I was thinking about Motorola 88Ks around this time, when the i486
started being talked about, and I realized that RISC had no fundamental
advantage
* the Intel P5/Pentium did in-order superscalar
* the Intel P6, first released as the Pentium Pro, then Pentium II, did
out-of-order
** briefly, the fastest microprocessor in the world, beating even DEC Alpha

Some say that the P6 killed RISC.

A more nuanced view is that RISC was a response to a short term issue:
the transition from board level to chip level integration.
When only a small fraction of a board level computer could fit on a chip,
RISC made more sense. When more could fit on a chip, RISC made less sense.

Not no sense at all. The RISC principles always have value.
But less sense, less of a competitive advantage.
Unnecessary complexity is always wasteful.

Moving on...

= The CISCs that survived were not that CISCy =

The CISCs that failed - the DEC VAX and the Motorola 68000 - were the most CISCy.
Most instructions were variable length.
Some frequently used instructions could be very long.
Many instructions had microcode.
Many operations had side effects.
They had complicated addressing modes - elegant in their generality,
but complicated, sometimes necessitating microcode just to calculate
an address.


The CISCs that survived - the IBM 360 mainframe family, and the Intel
x86 - were not that CISCy.

Sure, they had some complicated, almost impossible to implement without
microcode, instructions.
Just consider the x86 FAR CALL through a CALL GATE.

However, most of their instructions were simple:
ADD dst_reg += src_reg_or_mem.
True, that is not [[load-store]].
But [[load-op]] pipelines were not that hard for the IBM mainframes, and
the Intel i486 and P5, to implement.
And the Intel P6 showed that the microarchitecture could be
[[load-store]], even though the ISA was not.
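
For concreteness, here is a minimal sketch (mine, in C; the structure, field names, and register numbers are invented for illustration, not Intel's actual micro-op format) of cracking a load-op macroinstruction into load-store style micro-ops:

    /* Illustrative only: not a real micro-op encoding.                     */
    enum uop_kind { UOP_LOAD, UOP_ALU_ADD };

    struct uop {
        enum uop_kind kind;
        int dst, src1, src2;   /* register identifiers; -1 means "unused"   */
        long disp;             /* address displacement for loads            */
    };

    /* "ADD EAX, [EBX+8]" (register += memory) cracked into two micro-ops:
     *   uop 0:  tmp64 <- load mem[EBX + 8]
     *   uop 1:  EAX   <- EAX + tmp64
     * Here 0 stands for EAX, 3 for EBX, and 64 for a temporary register.   */
    static const struct uop cracked_add[2] = {
        { UOP_LOAD,    64,  3, -1, 8 },
        { UOP_ALU_ADD,  0,  0, 64, 0 },
    };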

The IBM 360 family has relatively simple instruction decoding.

The Intel x86 has painful instruction encodings, ranging from a single
byte up to 15 bytes.
But the most frequently used instructions are small.
The really complicated instruction encodings, with prefixes, proved
possible to (a) implement in hardware, but (b) at a significant
performance cost, on the order of 1 cycle per prefix originally.

Most important, these instruction sets had few side effects.
Yes, the x86 has condition codes.
But most x86 instructions overwrite all of the arithmetic condition
codes (INC and DEC not affecting the carry flag being the notable
exception).
This avoided the sort of RAW hazard on the condition codes that would
have been required for an OOO implementation of, say, the Motorola 68000.

= Didn't RISC win anyway? =

Did not RISC win anyway? After all, doesn't a modern CISC processor
translate to RISC uops internally?

Well, yes and no. Let's look at some of the RISC principles proposed in
the early papers:

* fixed length instructions - 32 bit
** at the [[macroinstruction set level]], not so much
** at the [[microinstruction set]] or [[UISA]] level, maybe
*** maybe - but definitely not "small". Is it a RISC if the
(micro)instructions are 160 bits wide?
*** even here, the recent trend is toward compressing microcode. Some are
small, some take more bits.
** the most popular surviving RISC instruction sets have 16 bit subsets
to increase code density

* simple instruction decoding
** ISA level, no
** UISA level - undocumented. Probably. But, again, very wide!!

* software floating point
** nope!!

* large uniform register set
** in the early days, not so much
** over time, the register set has grown. As has the complexity of the
ISA encoding, [[REX bytes]] etc.

* small number of instructions
** definitely not!!!
** microcode instruction sets have long been full of many instructions,
particularly widget instructions
** even macroinstruction sets have increased dramatically in size since
1990. More than quadrupled.

Some have said that the point of RISC was not reduced instruction count,
but reduced instruction complexity.
This may be true - certainly, this was always the argument that I used
*against* rabid RISC enthusiasts who were trying to reduce the number of
instructions in the instruction set.
But nevertheless, there were many, many, RISC zealots and managers who
evaluated proposals by counting instructions.


= What's left? =

The most important intellectual survivor of the RISC philosophy has
been the aversion to complicated microcode sequences.
Expanding instructions to 1 to 4 [[uop]]s may be okay,
but taking a cycle or more to branch into a [[microcode sequencer]],
perform the operation, and branch back is a performance penalty nobody
wants to take.

The number of instructions in instruction sets has increased
dramatically over the past decade.
But the vast majority of these are instructions that can be implemented
directly by hardware,
by combinational logic,
or by simple state machines in the case of operations such as divide.

One may conjecture that the original RISC aversion to floating point
was actually an aversion to microcoded floating point:
when the most important floating point operations like [[FADD]] and
[[FMUL]] became pipelined,
capable of being started one or more per clock
even though the operation took several cycles of latency,
the objections to FP on a RISC died away.

== Bad Effects ==

We are still paying the cost of certain RISC-era design decisions.

For example, Intel MMX is irregular.
It does not have all the reasonable combinations of
datasize={8,16,32},
unsaturated, signed and unsigned saturation.
This irregularity was NOT because of hardware complexity,
but because management was trying to follow RISC principles by counting
instructions.
Even when providing all regular combinations would have made the
hardware simpler rather than harder.
(Aside: validation complexity is often used as an argument here, against
regularity. The complexity of validating all regular combinations grows
combinatorically.)
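
For concreteness, a sketch (mine) of the regular cross-product being described, using the standard MMX C intrinsics from <mmintrin.h>; the function name is made up, but the intrinsics, and the gaps, are real:

    #include <mmintrin.h>                  /* MMX intrinsics */

    void mmx_add_grid(__m64 a, __m64 b, __m64 out[7])
    {
        out[0] = _mm_add_pi8(a, b);        /*  8-bit, wraparound        */
        out[1] = _mm_adds_pi8(a, b);       /*  8-bit, signed saturate   */
        out[2] = _mm_adds_pu8(a, b);       /*  8-bit, unsigned saturate */
        out[3] = _mm_add_pi16(a, b);       /* 16-bit, wraparound        */
        out[4] = _mm_adds_pi16(a, b);      /* 16-bit, signed saturate   */
        out[5] = _mm_adds_pu16(a, b);      /* 16-bit, unsigned saturate */
        out[6] = _mm_add_pi32(a, b);       /* 32-bit, wraparound        */
        /* 32-bit signed and unsigned saturating adds: missing from MMX */
        _mm_empty();                       /* leave the x87/MMX state clean */
    }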

AMD x86-64 is not a bad instruction set extension.
But life might be easier in the future, if x86 does not die away, if it
had been more regular.
More RISC-like.
But in this way RISC was its own enemy:
RISC did not achieve the often hoped for great performance improvements
over CISC.
RISC reduces complexity, which does not directly improve performance.
So people went chasing after a Holy Grail of VLIW performance, which
also did not pan out.

= Conclusion =

I did not really want to write this article. There is not much to add
to all that has been written about the so-called [[RISC Wars]] except to say
that they caused a lot of thrashing, amounted to less than promised,
although they did bring some improvements.

However, I could not really write a wiki on computer architecture
without mentioning [[RISC versus CISC]].

I would prefer, however, to discuss interesting aspects of RISC computer
architecture that may not be covered in many other places,
rather than rehash this old debate:

* [[Breaking out of the 32-bit RISC instruction set straitjacket]]
** [[variable length RISC instruction sets]]
** [[16-bit RISC instruction sets]]
** [[RISC instruction sets with prefix instructions]]

* [[post-RISC proliferation of instructions]]

* [[How do you count instructions in an instruction set architecture?]]

* [[Microcoded instructions and RISC]]
* [[Hardwired instructions and RISC]]
* [[Hardware state-machine sequenced instructions and RISC]]

Many of these latter issues are of the form "XXX and RISC", and really
should be only of the form "XXX",
except that one of the leftovers of the [[RISC era]] is that it is
considered obligatory to explain
how a feature supports or opposes the [[RISC philosophy]].

MitchAlsup

Feb 12, 2011, 4:32:12 PM
On Feb 12, 2:14 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

In general, a nice fairly balanced article.

> AMD x86-64 is not a bad instruction set extension.
> But life might be easier in the  future, if x86 does not die away, if it
> had been more regular.

I'm going to argue that minor irregularities are irrelevant once you
get a data path that contains millions of transistors.

Back in the RISC days, we used to argue and fret about whether we
should encode these 4 instructions in this order or that order to save
a single gate of logic. I have come to believe that all of this is
completely unnecessary, both back then and now.

In effect, the data paths do not process native instructions ANYWAY;
they process the regurgitations of what the decoder spits out. Since
both I and A have gone down the path of rewriting the bit patterns so
that the data path operates on a fairly uniform instruction encoding,
all the horseplay concerning prefixes, variable lengths,
irregularities is a decoder problem that seldom makes its presence
known to the vast majority of the chip. The decoders on these modern
Great Big Out of Order chips are tiny compared to almost anything else.

Even microcode is often the PROPER way to embody certain functionality
(Wilkes notwithstanding). If nothing else, microcode enables backward
compatibility, at low cost, and especially for instructions that are
not used very often, and sometimes provides a means to implement
certain instruction set functionality BETTER than what can be done by
(micro)instruction expansion. REP MOVS is the obvious example, where
microcode can check in parallel all the various boundary conditions
and execute the widest width REP MOVS operation the data path can
support based on the actual addresses and lengths. Doing this in SW
cannot make use of the parallelism that HW can provide. Nor do the
primitives that support these need to be exposed to the instruction
set and become fixed forever more.

We had a tree of logic that could take the arbitrary x86 instruction
and end up with a 17-bit pattern that fully represented the embodiment
of the instruction, could be used to index into the microcode ROM, and
drive the data path instruction/operation logic directly. The totality
of this logic was significantly smaller than a 32-bit adder. Once
engineered, it really did not need much diddling in going from MMX
based 32-bit x86 through SSE-4 based 64-bit x86. So, after this tree
of logic, the rest of the machine had all the RISC principles anyone
could annotate or desire.

As a side note, I might add that the typical first generation RISC
processor had a 6-bit major opcode and an 11-bit minor opcode which
just so happens to be the same number of bits in the internal
representation. Yet the internal representation can represent more
than 2000 instructions, while the RISC form represents less than a
couple hundred.

Mitch.

Anne & Lynn Wheeler

Feb 12, 2011, 5:56:37 PM

early 70s IBM had the "future system" effort to completely replace
360/370 ... with complex hardware to address all sorts of issues:
http://en.wikipedia.org/wiki/IBM_Future_Systems_project

above references this item
http://www.jfsowa.com/computer/memo125.htm

some more information here:
http://www.cs.clemson.edu/~mark/fs.htm

this has quote about trying to compete with the clone controller
vendors:
http://www.ecole.org/Crisis_and_change_1995_1.htm

from above:

IBM tried to react by launching a major project called the 'Future
System' (FS) in the early 1970's. The idea was to get so far ahead that
the competition would never be able to keep up, and to have such a high
level of integration that it would be impossible for competitors to
follow a compatible niche strategy. However, the project failed because
the objectives were too ambitious for the available technology. Many of
the ideas that were developed were nevertheless adapted for later
generations. Once IBM had acknowledged this failure, it launched its
'box strategy', which called for competitiveness with all the different
types of compatible sub-systems. But this proved to be difficult because
of IBM's cost structure and its R&D spending, and the strategy only
resulted in a partial narrowing of the price gap between IBM and its
rivals.

... snip ...

Other references are that during the FS period ... all sorts of internal
efforts (viewed as possibly competitive) were killed off ... including
370 hardware&software products (since FS was going to completely replace
360/370) ... which allowed (370) clone processor vendors to gain market
foothold. Then, after the FS demise there was mad rush to replenish the
370 software&hardware product pipeline. misc. past posts mentioning
FS
http://www.garlic.com/~lynn/submain.html#futuresys

There have been a number of articles that the corporation lived under
the dark shadow of the FS failure for decades (deeply affecting its
internal culture).

I've periodically claimed that the example of FS motivated John to go to
exact opposite extreme for 801/risc in the mid-70s.
http://www-03.ibm.com/press/us/en/pressrelease/22052.wss

misc. past emails mentioning 801, iliad, romp, rios, power, etc
http://www.garlic.com/~lynn/lhwemail.html#801

The corporation had a large number of different microprocessors
... developed for controllers, engines used in low-end & mid-range 370s,
various other machines (series/1, 8100, system/7, etc). In the late 70s
there was an effort to converge all of these microprocessors to 801. In
the early 80s, several of these efforts floundered and some number of
the engineers leave and show up on risc efforts at other vendors.

There is folklore that after the FS demise, some number of participants
retreated to Rochester and did the S/38 with some number of FS features.
Then the S/38 follow-on (AS/400) was one of the efforts that was to have
one of these 801 micro-engines. That effort floundered (also) and there
was a quick effort to do a CISC engine. Then a decade later, AS/400
finally did migrate to 801 (power/pc).

There was a presentation by the i432 group at annual Asilomar SIGOPS
... which claimed a major problem with i432 was it was a) complex and b)
silicon; all "fixes" required brand new silicon.

I had done a multiprocessor machine design in the mid-70s (never
announced or shipped) that was basically 370 with some advanced features
somewhat akin to some of the things in i432 ... but it was a heavily
microcoded engine ... and fixes were a new microcode floppy disk.

--
virtualization experience starting Jan1968, online at home since Mar1970

Venkatesh Srinivas

Feb 12, 2011, 7:43:57 PM
On Feb 12, 2:14 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> * [[Breaking out of the 32-bit RISC instruction set straitjacket]]
> ** [[16-bit RISC instruction sets]]

The Berkeley RISC II (back in 1981?) gained support for 16-bit
instructions, which were expanded to the same format as the 'normal'
32-bit width instructions very early in the pipeline. I imagine this
was a fairly easy way to add this to the front-end of the RISC I.

I imagine the wins from allowing variable instruction widths would
only be from the increased code density (and the lighter load on the
memory buses and caches); is there anything else to consider?

Thanks,
-- vs

Andy "Krazy" Glew

Feb 12, 2011, 9:40:56 PM


Several of the early superscalar designs could go superscalar on 2
16-bit instructions that fit within an aligned 32 bit packet.

IIRC Gould NP1? 2? was like this. I think also perhaps Tandem.

Quadibloc

Feb 14, 2011, 1:50:40 PM
I prefer CISC to RISC, but I still want a simple and elegant
instruction set, like that of the IBM 360 or the 68020, not a messy
one like that of the x86.

The success of the x86, though, has nothing to do with any technical
issue, so it is not due to a failure on the part of the RISC
community. It is the availability of software for Windows and DOS
running on the x86 architecture that has led to this success - nothing
more, and nothing less.

And of course RISC machines, to be competitive, now that technology
permits the construction of processors like the Pentium - with not
just hardware floating-point, but elaborate high-efficiency floating-
point, reminiscent of the IBM 360/195 - could not stick to a purist
approach of single-cycle instructions.

Instead of looking at the "failures" of RISC, despite favoring CISC, I
think it's more useful to look at the successes of RISC.

Look at an IBM System/360 Model 65.

Big computer. Fills a room.

Now, look at document A22-6884-3, System/360 Model 65 Functional
Characteristics, page 9.

If you don't have a copy handy, while Bitsavers is down, some of their
mirrors are still up...

<http://bitsavers.trailing-edge.com/pdf/>

But you don't need to download the book. The main thing is, it shows
what the arithmetic circuitry of that computer consists of.

Two adders. One that adds two 60-bit numbers, and one that adds two 8-
bit numbers.

That will mean that elaborate sequences of multiple-precision
instructions aren't required to do double-precision floating-point
mathematics.

But multiplication is done step by step - with microcode, since it
does have multiply instructions. If you want a Wallace tree, you need
to get at least a Model 91, although one is also an optional add-on
for the Model 85.

Even using the small-scale integration of the time - and remembering
that IBM used less-dense SLT, not monolithics - a 60-bit adder is
still just a handful of circuit boards.

So, while the core memory filled a lot of those boxes... what is left
over after the core memory is *mostly all control logic*, not the
arithmetic unit that gets the actual _work_ done.

Given that, architectures like SIMD, that let one control unit control
a large number of arithmetic units in doing the same thing - and you
could have features like "each ALU set local flag 4 if the number in
it is negative", and then later "each ALU multiply, or do nothing if
flag 4 is clear" - have an obvious attraction.

And so does RISC, which simplifies that awful control unit a bit.

And so did microcode (which, of course, the System/360 was already
using).
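
To make the per-ALU flag example above concrete, here is a sketch in plain C (illustrative only; the loop and names are mine, not from the original post), with each array element standing in for one ALU's local value and flag:

    enum { LANES = 64 };

    void predicated_multiply(double a[LANES], const double b[LANES])
    {
        int flag4[LANES];
        for (int i = 0; i < LANES; i++)
            flag4[i] = (a[i] < 0.0);       /* "set local flag 4 if the number is negative" */
        for (int i = 0; i < LANES; i++)
            if (flag4[i])
                a[i] *= b[i];              /* "multiply, or do nothing if flag 4 is clear" */
    }

A SIMD machine performs each of those loops in lockstep across all the ALUs under a single control unit.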

The reason why everything else but the x86 is RISC, though, isn't
because of the electricity wasted by the transistors in the control
unit. It's because RISC simplifies design, and hastens time-to-market.

Nobody but Intel - and, just barely, AMD - has a big enough market to
justify the extra expenditure in designing a microprocessor that's
more complicated than it has to be.

So, RISC, far from having failed, is victorious - even if it might not
seem so from the prominence of the one (or two - remember System/360,
now z/Architecture or System z) exceptions kicking around because of
inertia.

John Savard

Torben Ægidius Mogensen

Feb 15, 2011, 9:39:45 AM
"Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> writes:

> http://semipublic.comp-arch.net/wiki/RISC_versus_CISC

As you hint, counting the number of instructions in an ISA is futile: If
you use a bit to select two alternative behaviours, is it one
parameterised instruction or two separate instructions?

Since counting instructions is futile, many have (as you say) argued
that RISC is really about reducing instruction set complexity. This has
much more merit, and the early RISC designs certainly avoided many
sources of complexity, such as having several non-sequential memory
accesses, which implies possible abort at several stages during
execution and so on. Even so, the game was (IMO) not about making the
absolutely simplest possible instruction set -- you had to weigh this
against how many instructions (or cycles) you need to implement common
high-level language constructs. But this was achieved not by having a
direct mapping from high-level constructs (such as function calls or
multi-dimensional array lookup) to single instructions, but by providing
general instructions that could be combined to make relatively short
sequences of instructions that implement these _and_ exploit special
cases to generate shorter or more efficient sequences.

The latter is significant: You rely on compiler technology to select
instruction sequences optimised for specific instances rather than using
a single general instruction that covers all possible instances. While
hardware can detect special cases at runtime, this requires runtime
tests which cost power, time and hardware complexity. With increasing
transistor budgets, the complexity is not so much of a problem, and with
translation into micro-instructions, the time overhead is also minimal,
but you can argue that you pay a bit in power.
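
A small illustration (mine, not from the original post) of exploiting a specific instance rather than relying on one general instruction: a compiler can turn a multiply by the constant 5 into a shift-and-add sequence.

    int mul_general(int x, int y) { return x * y; }        /* general multiply, any y      */
    int mul_by_5(int x)           { return (x << 2) + x; } /* special case: x*5 = 4*x + x  */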

Nevertheless, the compiler issue is worth examining. I would argue
that many of the failed RISC designs did so by overestimating what the
compiler could do. For example, DEC had to add byte-access instructions
after failing to make the compiler generate efficient code for strings
etc. using only word-access instructions and Intel definitely relied far
too much on speculative (i.e., untried) compiler technology with their
EPIC design.

So maybe a restating of the RISC principle is codesign of compilers and
architecture, so you solve issues where they are easiest to solve
without losing significant efficiency.

Torben

Anne & Lynn Wheeler

Feb 15, 2011, 10:17:58 AM
tor...@diku.dk (Torben Ægidius Mogensen) writes:
> As you hint, counting the number of instructions in an ISA is futile: If
> you use a bit to select two alternative behaviours, is it one
> parameterised instruction or two separate instructions?

re:
http://www.garlic.com/~lynn/2011c.html#7 RISC versus CISC

the statement in the 70s about (801/)RISC was that it could be done in a
single chip. later in the 80s, (801/)RISC was instructions that could be
executed in single machine cycle. Over the decades, the definition of
RISC has been somewhat fluid ... especially as the number of circuits in
a chip has dramatically increased.

Andy "Krazy" Glew

Feb 15, 2011, 8:00:41 PM
On 2/15/2011 6:39 AM, Torben Ægidius Mogensen wrote:
> "Andy \"Krazy\" Glew"<an...@SPAM.comp-arch.net> writes:
>
>> http://semipublic.comp-arch.net/wiki/RISC_versus_CISC
>
> As you hint, counting the number of instructions in an ISA is futile: If
> you use a bit to select two alternative behaviours, is it one
> parameterised instruction or two separate instructions?
>
> Since counting instructions is futile, many have (as you say) argued
> that RISC is really about reducing instruction set complexity. This has
> much more merit, and the early RISC designs certainly avoided many
> sources of complexity, such as having several non-sequeintial memory
> accesses, which implies possible abort at several stages during
> execution and so on.

One of the names I suggested for RISC was RAMM - Reduced Addressing Mode
Machine. Inspired by funky addressing modes - didn't the 68K have such
a mode?


MitchAlsup

Feb 16, 2011, 12:08:57 AM
On Feb 15, 7:00 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> One of the names I suggested for RISC was RAMM - Reduced Addressing Mode
> Machine.  Inspired by funky addressing modes - didn't the 68K have such
> a mode?

The 68K was rather clean in its address modes, pretty much like the
PDP-11.
The 68020 had some terrible extensions that could be performed in
straight-line code faster than the processor itself could perform the
address mode semantics.

Mitch

Quadibloc

Feb 16, 2011, 9:27:29 AM
On Feb 15, 10:08 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> Th 68K was rather clean in its address modes pretty much like the
> PDP-11
> The 68020 had some terrible extensionis that could be performed in
> straightline code faster than the processor itself could perform the
> address mode semantics.

And here, the way I saw it was that the 68000 had very nice addressing
modes, but with one horrible, stupid omission that made it like the
Philco 2000 - a machine whose design I didn't even believe when it was
first described to me - that when you specified both a base and index
register, the displacement field was shorter.

Which meant, of course, that you couldn't use the standard method of
allocating base registers to handle arrays, but instead would
basically need an address constant for each array that you would have
to load into a base register temporarily.

It was unfortunate that fixing this in the 68020 meant long
instructions, but if this was slow, that was an implementation defect,
not an architectural problem.

There was a right way to do this. The IBM System/360 showed the right
way.

John Savard

Tim McCaffrey

Feb 16, 2011, 6:33:07 PM
In article
<133bd133-cb8d-4129...@s11g2000yqh.googlegroups.com>,
Mitch...@aol.com says...
>
>On Feb 12, 2:14 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
>wrote:
>
>In general, a nice fairly balanced article.
>
>> AMD x86-64 is not a bad instruction set extension.
>> But life might be easier in the future, if x86 does not die away, if it
>> had been more regular.
>
>I'm going to argue, that minor irregularities are irrelevent once you
>get a data path that contains millions of transistors.
>
Having written a code generator for AMD64, I don't think it is all that great.

First, it unnecessarily brings along some baggage from 386 that gives a
compiler headaches: dynamic shifts can only use CL, RAX/EAX/AX/AL is special
to several instructions. Limiting immediates and address offsets to 32 bits
is also annoying (except for a few cases), causing the need to load a register
with a 64 bit address in lots of cases (for the code generator I worked on).
With the demise of INC/DEC reg, there are basically no useful 1 byte
instructions. IF the instruction set had been redone (basically, same
functionality with a few instructions more regular, 64 bit offset &
immediates, and a couple new string instructions that emulated things like
strcpy & strcmp, and 16 bit offset conditional jumps) I think the code size
would have been substantially reduced over what current AMD64 allows. Not
because there are fewer instructions, but because the code density would be
higher. That would result in better I-Cache hit rate, hence better
performance.

Oh well.

- Tim

Chris Gray

Feb 16, 2011, 7:48:34 PM
Quadibloc <jsa...@ecn.ab.ca> writes:

> And here, the way I saw it was that the 68000 had very nice addressing
> modes, but with one horrible, stupid omission that made it like the
> Philco 2000 - a machine whose design I didn't even believe when it was
> first described to me - that when you specified both a base and index
> register, the displacement field was shorter.

Mode "Address Register Indirect with Displacement" has a signed 16 bit
displacement

Mode "Address Register Indirect with Index" has a signed 8 bit
displacement

I assume those are what you are referring to?

> Which meant, of course, that you couldn't use the standard method of
> allocating base registers to handle arrays, but instead would
> basically need an address constant for each array that you would have
> to load into a base register temporarily.

Eh? Given that the "large" displacement is only 16 bits, you would only
be doing something like that if you are using the "small model" of
memory, where you can only address 64K of memory. Very few programs
did that. The machine has mode "Absolute Long Address" that directly
provides a 32 bit address. You can also load 32 bit immediates directly.

The short, 8 bit, displacement only has an effect when you have a pointer
to a structure in a register, and wish to access an element of an array
within that structure, and the offset of the array within the structure
is > 127. With a big enough structure, even the long displacement isn't
big enough. In either case, you simply use an alternate instruction
sequence that allows the needed offset size. That's what I did in my
Amiga Draco compiler.

--
Experience should guide us, not rule us.

Chris Gray

Andy "Krazy" Glew

Feb 17, 2011, 12:10:34 AM


Looking at http://www-scm.tees.ac.uk/users/u0000408/020adr/020mode.htm

I think that the addressing mode that caused me concern was e.g.
([12,A2],D3,9)

or, roughly, in C syntax, with char *A2 and long D3, the operand is

*( *(char **)(A2+12) + D3 + 9 )

i.e. a double indirect addressing mode, a register indirection, some
offsets, followed by a memory indirection, and some more offsets.

For a long time my motto was "a RAMM/RISC/SEISM processor should not do
pointer chasing."
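
Spelled out as explicit loads in C (a sketch; the function and variable names are mine), the mode above involves two dependent memory accesses - exactly the pointer chasing in question:

    /* The 68020 memory-indirect mode ([12,A2],D3,9) as a load-store machine
     * would have to perform it.                                             */
    long fetch_memory_indirect(char *A2, long D3)
    {
        char *p  = *(char **)(A2 + 12);   /* first access: load the pointer stored at A2+12 */
        char *ea = p + D3 + 9;            /* add the index register and the outer offset    */
        return *(long *)ea;               /* second access: load the operand itself         */
    }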

Michael S

Feb 17, 2011, 2:48:45 AM
On Feb 17, 7:10 am, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 2/16/2011 6:27 AM, Quadibloc wrote:
>
>
>
> > On Feb 15, 10:08 pm, MitchAlsup<MitchAl...@aol.com>  wrote:
>
> >> Th 68K was rather clean in its address modes pretty much like the
> >> PDP-11
> >> The 68020 had some terrible extensionis that could be performed in
> >> straightline code faster than the processor itself could perform the
> >> address mode semantics.
>
> > And here, the way I saw it was that the 68000 had very nice addressing
> > modes, but with one horrible, stupid omission that made it like the
> > Philco 2000 - a machine whose design I didn't even believe when it was
> > first described to me - that when you specified both a base and index
> > register, the displacement field was shorter.
>
> > Which meant, of course, that you couldn't use the standard method of
> > allocating base registers to handle arrays, but instead would
> > basically need an address constant for each array that you would have
> > to load into a base register temporarily.
>
> > It was unfortunate that fixing this in the 68020 meant long
> > instructions, but if this was slow, that was an implementation defect,
> > not an architectural problem.
>
> > There was a right way to do this. The IBM System/360 showed the right
> > way.
>
> > John Savard
>
> Looking at http://www-scm.tees.ac.uk/users/u0000408/020adr/020mode.htm

>
> I think that the addressing mode that caused me concern was e.g.
> (12,A2],D3,9)
>
> or in C syntax, with char* A2, c
>
> *(*(A2+12))+D3+9
>
> i.e. a double indirect addressing mode, a register indirection, some
> offsets, followed by a memory indirection, and some more offsets.
>
> For a long time my motto was "a RAMM/RISC/SEISM processor should not do
> pointer chasing."

An x86 indirect call through an in-memory pointer is also double indirect. Is
it materially simpler?

nm...@cam.ac.uk

Feb 17, 2011, 6:36:05 AM
In article <Z7mdnUmqNbiEd8vQ...@giganews.com>,

Andy \"Krazy\" Glew <an...@SPAM.comp-arch.net> wrote:
>
>However, although I am a RISC sympathizer, I may take a contrarian
>stance, talking about some of the issues and problems with RISC.
>The RISC philosophy was certainly influential. However, in many ways it
>failed.

As I have posted before, I regard one of the main reasons for
that to be that it was treated as a religious dogma rather than an
engineering principle by too many people who should have known
better. The main problem that I have seen it cause is the
following:

'Hard' cases are often left to software, without the hardware core
providing the software with the primitives needed to implement the
requirement efficiently and reliably, even if the primitives were
easy to specify and trivial to implement. And then the 'solution'
was to extend the RISC by putting the requirement in hardware,
because the requirement is too slow, rather than providing the
primitives.

As you say, like most of the other aspects, that is not limited
to nominally RISC architectures, but particularly affected them.
It has been the cause of a lot of the claims of poor performance
of RISC versus CISC.


Regards,
Nick Maclaren.

Anton Ertl

Feb 17, 2011, 11:25:10 AM
Michael S <already...@yahoo.com> writes:
>On Feb 17, 7:10 am, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
>wrote:
[68020 addressing mode for things like]

>> *(*(A2+12))+D3+9
>>
>> i.e. a double indirect addressing mode, a register indirection, some
>> offsets, followed by a memory indirection, and some more offsets.
>>
>> For a long time my motto was "a RAMM/RISC/SEISM processor should not do
>> pointer chasing."
>
>x86 indirect call through in-memory pointer also double indirect. Is
>it materially simpler?

Stuff like

jmp [eax]

can be seen as a load-and-operate instruction (where the operation is
the jump) with the ordinary indirect addressing mode. There is no
data memory access to memory whose address comes from memory, unlike
the example above. So I see it as a single indirection, not as a
double indirection.
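
In C terms (an illustrative sketch, not from the original post), this is the familiar dispatch through a table of function pointers: one data load to fetch the pointer, then a transfer of control to the loaded address.

    void dispatch(void (*const handlers[16])(void), unsigned op)
    {
        handlers[op & 15]();   /* roughly: load the pointer, then an indirect call/jmp */
    }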

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Quadibloc

Feb 17, 2011, 11:37:32 AM
On Feb 16, 5:48 pm, Chris Gray <c...@GraySage.com> wrote:
> The machine has mode "Absolute Long Address" that directly
> provides a 32 bit address.

Yes, but you couldn't index one of those.

You can have:

32 bit displacement
16 bit displacement + contents of one address register
8 bit displacement + contents of one address register + contents of
another address register

The first two modes are all right as two different ways to access
simple variables.

You set up your address registers to point to the 64K areas, in a much
larger area of main memory, that you are going to use.

But if you also want to use an array, suddenly it all falls apart. You
can't take those modes, and index them without otherwise changing
them, because you do NOT have available, on the 68000 (unlike the
68020) the modes

32 bit displacement + contents of one address register
16 bit displacement + contents of one address register + contents of
another address register

So, instead of just handling arrays the same way as regular variables,
but with the index pointing to an element within the array, you have
to access array elements in a special manner - load an address
register with the address of the start of the array, and use the 8 bit
displacement mode, for example.

This is precisely the mess that the System/360 architecture avoids.

John Savard

MitchAlsup

Feb 17, 2011, 11:54:00 AM
On Feb 17, 10:25 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Michael S <already5cho...@yahoo.com> writes:
> >x86 indirect call through in-memory pointer also double indirect. Is
> >it materially simpler?
>
> Stuff like
>
> jmp [eax]
>
> can be seen as a load-and-operate instruction (where the operation is
> the jump) with the ordinary indirect addressing mode.  

{Thanks Anton:}

The data path looks at this like a Load that delivers a result.
The fetch path looks at this as "wait for a result before taking the
jump".

Since the data path and the fetch path HAVE TO HAVE these functions
anyway, YES it is simpler--significantly simpler--than a double
indirect.

Mitch

Tim McCaffrey

Feb 17, 2011, 11:59:58 AM
In article <2011Feb1...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at says...

>
>Michael S <already...@yahoo.com> writes:
>>On Feb 17, 7:10 am, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
>>wrote:
>[68020 addressing mode for things like]
>>> *(*(A2+12))+D3+9
>>>
>>> i.e. a double indirect addressing mode, a register indirection, some
>>> offsets, followed by a memory indirection, and some more offsets.
>>>
>>> For a long time my motto was "a RAMM/RISC/SEISM processor should not do
>>> pointer chasing."
>>
>>x86 indirect call through in-memory pointer also double indirect. Is
>>it materially simpler?
>
>Stuff like
>
>jmp [eax]
>
>can be seen as a load-and-operate instruction (where the operation is
>the jump) with the ordinary indirect addressing mode. There is no
>data memory access to memory whose address comes from memory, unlike
>the example above. So I see it as a single indirection, not as a
>double indirection.
>

How about jmp far ptr[eax] ?

You have to bounce through the segment tables for that one, and if it is a
task gate...


- Tim

Andy "Krazy" Glew

Feb 17, 2011, 12:06:06 PM

x86 is a fairly simple instruction set, except for all of the segment
(and task gate, and ..) stuff. Which is nearly always relegated to
microcode.

Nowadays jmp far ptr[eax] is relatively uncommon. It is penalized wrt
performance.

My answer might have been different in the 1980s, when far calls were
much more common.

Terje Mathisen

Feb 17, 2011, 1:34:09 PM
Tim McCaffrey wrote:
> How about jmp far ptr[eax] ?
>
> You have to bounce through the segment tables for that one, and if it is a
> task gate...

Task gate jumps and calls are very specialized operations that just
happens to overload the normal jmp and call opcode names.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Chris Gray

Feb 17, 2011, 2:05:17 PM
Quadibloc <jsa...@ecn.ab.ca> writes:

> You can have:

> 32 bit displacement
> 16 bit displacement + contents of one address register
> 8 bit displacement + contents of one address register + contents of
> another address register

No, you have 8 bit displacement + contents of one address register +
contents of one data register. I think you know that.

> The first two modes are all right as two different ways to access
> simple variables.

> You set up your address registers to point to the 64K areas, in a much
> larger area of main memory, that you are going to use.

Huh?

That's a strange way to do code generation.

You will typically use one address register as a frame pointer (I tried not
doing that in Draco, but the debuggers really didn't like it). You will
use A0 for return values. That leaves 6. You often need 2 for scratch
(e.g. in C "*p++ = *q++"). That leaves 4. You usually have better uses for
those registers than to have them sitting around pointing at some fixed
area of memory that has some variables in it.

> But if you also want to use an array, suddenly it all falls apart. You
> can't take those modes, and index them without otherwise changing
> them, because you do NOT have available, on the 68000 (unlike the
> 68020) the modes

> 32 bit displacement + contents of one address register
> 16 bit displacement + contents of one address register + contents of
> another address register

(Again, you mean a data register for the second one.)

> So, instead of just handling arrays the same way as regular variables,
> but with the index pointing to an element within the array, you have
> to access array elements in a special manner - load an address
> register with the address of the start of the array, and use the 8 bit
> displacement mode, for example.

So you use a scratch address register. Do an LEA to get the address of
the start of the array into an address register, then do an indexed load
from there. You often have to scale the index value separately. (The X86
is actually a bit better here in that it can scale the index by 1, 2, 4, 8.
But, I'll still take a 68020 over an X86 if I'm writing a code generator!)


The only recollection I have of people using a 68000 address register to
point to a fixed area of memory is when the program uses the "small model",
where all non-local variables fit in 64K. Then one A reg is used as a
base register, and points within the small static variable area (since
offsets are signed, it can't always point at the beginning of the area).

Digging out my ancient "System/370 Reference Summary", I see that offsets
are always 12 bits - you can only access 4K of offset without having to
load constants from memory (on 360/370 immediates are always 8 bits).
Now that I think of it, that was one of the nuisances of the 360/370 -
you are quite often loading constants from memory.

> This is precisely the mess that the System/360 architecture avoids.

I don't think so.

--
Experience should guide us, not rule us.

Chris Gray c...@GraySage.COM

Quadibloc

Feb 17, 2011, 3:44:44 PM
On Feb 17, 12:05 pm, Chris Gray <c...@GraySage.com> wrote:

> No, you have 8 bit displacement + contents of one address register +
> contents of one data register. I think you know that.

I wasn't sure of that, but that is better, since you really want the
address register to function as a base register, while the data
register is what you would normally prefer to use as an index register
so that you can calculate indexes.

> > The first two modes are all right as two different ways to access
> > simple variables.
> > You set up your address registers to point to the 64K areas, in a much
> > larger area of main memory, that you are going to use.
>
> Huh?
>
> That's a strange way to do code generation.

DRAND    START 0
* ON ENTRY TO SUBROUTINE, REGISTER 15 CONTAINS THE
* START ADDRESS OF THE SUBROUTINE
         BC    15,10(15)        BRANCH AROUND EMBEDDED ROUTINE NAME
         DC    X'5'
         DC    CL5'DRAND'       USED FOR DEBUGGING
* REGISTERS 13 AND 14 ARE ALSO DEDICATED TO THE
* STANDARD CALLING SEQUENCE - REGISTER 13 CONTAINS
* THE SAVE AREA OF THE CALLING PROGRAM
         STM   14,12,12(13)     SAVE ALL REGISTERS OF CALLER
* NOW, BEGIN ESTABLISHING YOUR OWN ENVIRONMENT
         BALR  12,0             SUBROUTINE JUMP TO NEXT INSTRUCTION
         USING *,12             ASSEMBLER DIRECTIVE
* REGISTER 12 NOW CONTAINS THE STARTING ADDRESS OF THE REST
* OF THE PROGRAM
         LM    9,11,BASES
         USING DATA,9,10,11
         ...

BASES    DC    A(DATA,DATA+4096,DATA+8192)

Strange? This is _standard_. So much so that the USING directive is
actually designed so that if you list multiple registers after the
address, it automatically indicates that the first register is assumed
to point to the given address, the second one to the address plus
4096, and so on, so that one doesn't even have to explicitly say

USING DATA,9
USING DATA+4096,10
USING DATA+8192,11

> Digging out my ancient "System/370 Reference Summary", I see that offsets
> are always 12 bits - you can only access 4K of offset without having to
> load constants from memory (on 360/370 immediates are always 8 bits).

It is true that the offsets, being only 12 bits in length instead of
16, were uncomfortably short, so that one ended up using too many of
the general registers for base registers. This was a limitation of the
architecture - a different limitation from the one the 68000 was
saddled with.

John Savard

Paul A. Clayton

Feb 17, 2011, 4:14:34 PM
On Feb 12, 3:14 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:
> http://semipublic.comp-arch.net/wiki/RISC_versus_CISC
[snip]
I am looking forward to your "Why not RISC: Atomicity, Security,
Compatibility" article. I _think_ atomicity can be made RISC-like
by providing something like a generalization of AMD's ASF. If
'compatibility' refers to providing support for differences in
processor state and differences in how the state is managed, I
think a PALcode like mechanism could be as effective. (I do not
like what I perceive to be the x86 solution since it seems to
require that memory be reserved for saving context which the
application may never use. It seems to me that the executable
should provide the context size information in much the same
way that it might supply endian or other modal information.)
Fixed length instructions do seem to make extending an ISA more
difficult. Stateful processing--which could help in extending
an ISA--might violate RISC metaprinciples, although I do not
recall reading any RISC supporters criticizing such (perhaps
because it was generally recognized as funky even in the 1980s).

It seems that RISC philosophy was not only an effect of the
ability to have a single-chip processor if the processor is
sufficiently simple. The reduction in the cost of manufacturing
a processor may also have played a significant role. When bits
and gates are expensive, design effort is less expensive so a
more complex design is more attractive. In addition, when the
total cost of producing a system is reduced, more organizations
can afford to produce research systems. Advances in computing
performance and software sophistication also reduced the cost of
design by means of automation. (There may have been an
additional discrete point of resource adequacy here as well. A
simple design might be better handled with the limits of
processing performance and software sophistication.) The
transition to higher level languages (the plural aspect may
have had some significance), the availability of source code
(and the freedom to modify such--whether from more liberal
licensing or it being developed in-house), and the increase in
the volume of software (computers becoming more commonly used
[due to decreasing cost, increasing familiarity, etc.], software
development costs decreasing, and computers becoming more capable
allowing more sophisticated software)--these three factors may
have given a special window of opportunity for new hardware
Architectures and made more apparent how powerful a lever a good
compiler can be (and the greater and cheaper computing power
allows a compiler to be more sophisticated). (RISC may have
had the further advantage of separating concerns, enabling
more parallel development. If software developers are more
available than hardware developers, offloading complexity to
software could be advantageous as well.)

NOTE: the above is merely reasonable (I believe) speculation
based on very limited information.

(The increased commonness of Free/Open Source Software and
increased sophistication and availability of binary translation
along with some increasing concerns--the memory latency/bandwidth
wall, the thermal wall, the energy [battery life] wall--, the
costs of production, and perhaps other factors--these may
open a window of opportunity for substantial innovation.
[Compilers also seem to be getting more clever and there
appears to be more willingness to consider new programming
languages.])

From Hennessey and Patterson's Computer Architecture: a
Quantitative Approach, features that make a compiler easier
to write include:
1. Regularity (e.g., avoid special purpose registers)
2. Provide primitives, not solutions (avoid excess work; "solutions"
also tend to violate regularity)
3. Simplify trade-offs among alternatives
4. Provide instructions that bind the quantities known at compile time
as constants

(It is a little surprising that the software schedulability advantage
of primitives over solutions was not mentioned. Such does not make
correctness easier but does allow the compiler to do work so that the
processor does not have to do that work, which is what property #4 is
about. I tend to disagree on the VAX call mask; H&P present it as an
example of inappropriate design since it "interprets a mask saying
what registers to save". While reencoding the register numbers does
involve extra work, such can reduce the amount of memory accessed--if
more than 4 registers are saved, the 16-bit mask is smaller. Such
also introduces an implicit operation which reduces code size. On
the negative side, such would typically add temporary state that
must be retained through a page fault [the PC might revert to the
CALLS instruction and redo all the work, but that strategy seems less
attractive for a VAX because of the amount of work thrown out;
similar issues come with delayed branches], such also sacrifices
specialization of save/restore [not all paths through the function
may need the maximum number of free registers] and scheduling of
memory accesses [assuming a moderately clever compiler]. The ARM
designers chose to provide a load/store multiple with a mask, and
the PowerPC designers chose to provide a load/store multiple that
expresses a range of registers by specifying the lowest register
number and ending the range at the highest possible register number;
so the code density benefit may be greater than the other costs for
some tasks [and compiler abilities].)
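
As a concrete sketch (mine; simplified, ignoring the exact push order and stack direction) of what the mask form above amounts to:

    /* Save every register whose bit is set in a 16-bit mask, as in the VAX
     * CALLS mask or an ARM store-multiple.                                  */
    unsigned long *store_multiple(unsigned short mask,
                                  const unsigned long regs[16],
                                  unsigned long *sp)
    {
        for (int r = 0; r < 16; r++)
            if (mask & (1u << r))
                *--sp = regs[r];           /* one memory store per selected register */
        return sp;                         /* new stack pointer after the saves      */
    }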

Adding the hardware aspects for ease of pipelining (which
discourages complex instructions and complex encodings) and the
memory hierarchy (register count was also largely influenced
by seeking to make register allocation easier for the
compiler), pushes the design philosophy toward RISC.

Complex instructions have some advantages. An instruction with
greater semantic content can often be implemented more efficiently
in hardware and can be encoded more concisely. The variability
of semantic content tends to encourage greater support for
variability in instruction length. Complex instructions do
tend to increase design complexity and the size of hardware.
(Interestingly, accelerators [and ideas like "Conservation Cores"]
seem to be becoming mainstream.) If infrequently used (e.g.,
due to the semantics not quite matching the desired operation),
complex instructions can actually reduce code density by
requiring more common encodings to be larger. Complex instructions
also tend to increase the minimum size/complexity of an
implementation
unless the Architecture can be cleanly subsetted. (The ARM M0 is
surprisingly small/simple but even the ARM subset that it implements
may have excess baggage. It is less clear how x86 would approach
that size.)


Paul A. Clayton
truly just a technophile

Chris Gray

Feb 17, 2011, 4:23:54 PM
Quadibloc <jsa...@ecn.ab.ca> writes:

> LM 9,11,BASES
> USING DATA,9,10,11
> ...
>
> BASES DC A(DATA,DATA+4096,DATA+8192)

> Strange? This is _standard_. So much so that the USING directive is
> actually designed so that if you list multiple registers after the
> address, it automatically indicates that the first register is assumed
> to point to the given address, the second one to the address plus
> 4096, and so on, so that one doesn't even have to explicitly say
>
> USING DATA,9
> USING DATA+4096,10
> USING DATA+8192,11

I admit to having forgotten about that stuff. I didn't do enough 360/370
assembler to ever get into multiple USING statements. One very important
aspect here is that the 360/370 has around a dozen registers that it can
use as base registers. The 68K only has about 4. That makes a huge
difference as to whether the technique makes sense or not.

I'm pretty sure that neither of my compilers that emitted 360/370 code
used the technique. I didn't disassemble enough code to know whether
IBM's Fortran compiler would do so or not.

> It is true that the offsets, being only 12 bits in length instead of
> 16, were uncomfortably short, so that one ended up using too many of
> the general registers for base registers. This was a limitation of the
> architecture - a different limitation from the one the 68000 was
> saddled with.

It seems to me that the limitation in the 68K is the smaller number of
registers usable for pointers, and not the lack of an indexed addressing
mode with more than 8 bits of offset. I could point out that 8 bits is
just as close to the 360/370's 12 bits as 16 bits is. :-)

As soon as a modern programmer throws in a few decent sized static arrays,
the technique isn't going to work very well - you would end up with your
base registers only addressing one or two variables each, which is a very
poor use for valuable registers. Have the compiler sort the variables by
size and frequency of access, and then use one base register - that could
make sense.


Anyway, this conversation long since ceased to be terribly relevant to
comp.arch .

Michael S

unread,
Feb 17, 2011, 5:14:45 PM2/17/11
to

It all depends on the user's expectations.
For example, a jump through a far pointer, mentioned in a post below, is
expected to be slow. That makes the HW implementation simple and
straightforward - relegate the work to microcode and don't worry.
On the other hand, jmp [eax] on a modern OoO processor is expected
to be at least as fast as the alternative two-instruction sequence of
load + indirect jump through a register. Under all possible
circumstances, no less.
I.e., if the microarchitecture allows potentially speculative or
out-of-order issue of the separate load higher up in the instruction
stream, then the same should be allowed for a load that is part of an
indirect jump through memory.
It seems to me it would require, at the very least, the addition of a
"quasi-architected" register which has all the properties of a truly
architected register except visibility to software. And, likely, to
cover more cases you would want more than one such register.
If your microarchitecture has that type of register anyway, then
fine - you won. But then you can use almost exactly the same mechanism
for an efficient implementation of unequivocal double indirection of '20
style.
Am I missing something?

MitchAlsup

unread,
Feb 17, 2011, 5:21:42 PM2/17/11
to
On Feb 17, 3:14 pm, "Paul A. Clayton" <paaronclay...@gmail.com>
wrote:

> Adding the hardware aspects for ease of pipelining (which
> discourages complex instructions and complex encodings) and the
> memory hierarchy (register count was also largely influenced
> by seeking to make register allocation easier for the
> compiler), pushes the design philosophy toward RISC.

I have to disagree with you here, with limits.

Once a pipeline has been designed, that pipeline is capable of
dealing with fairly complex activities in a single "fell swoop". So,
for example, if one were to design a pipeline that had a stage to read
the data cache and a later stage to write the data cache, it is fairly
easy to make that pipeline support read-modify-write instruction
encodings. In addition, these instructions will flow through the
pipeline in a single cycle and appear to be efficient. In a typical
RISC instruction set, these will take three instructions instead of
just one.

However, that same pipeline will have considerable difficulty if the
instruction set wants to read indirect through the data cache (use it
twice in a single cycle), or deliver multiple results (load a register
and autoincrement another, or store a register and autodecrement a
register).

So simplicity and complexity are a spectrum not a binary choice. As
Andy pointed out a couple days ago, the simpler complex instruction
set survived (won), while the simple simple instruction sets
(university RISC) and complex simple instruction sets (MIPS, ALPHA,
1108, CDC 6600, CRAY) and the complex complex instruction sets (VAX,
68K, 432) did not.

Therefore the instruction set should (to the limits possible) be
designed for a pipeline that can be expressed in small, medium, and
large forms. This is where RISC failed architecturally. RISC also
failed managerially when all (7) design teams "bit off more than they
could chew" in the second or third iteration of their architectures.
Brooks mentioned this in his famous book as the "second system
syndrome".

On the x86 side, one of the reasons it won is the expressivity of the
address modes--even if these are tricky for the compiler, and even if
they are a little hard for the pipeline. Having two registers and a
displacement (and memory-based operands) partially makes up for the
lack of registers. (#60 also has this property).

Mitch

Quadibloc

unread,
Feb 17, 2011, 6:28:27 PM2/17/11
to
On Feb 17, 2:23 pm, Chris Gray <c...@GraySage.com> wrote:

> It seems to me that the limitation in the 68K is the smaller number of
> registers usable for pointers, and not the lack of an indexed addressing
> mode with more than 8 bits of offset.

I thought that with 8 data registers, and 8 address registers, the
68000 wasn't that far behind the 360, except perhaps in terms of
flexibility.

http://www.quadibloc.com/arch/arcint.htm

discusses how there is enough room to avoid both limitations:

(opcode: 7 bits)(dest reg: 3 bits)(index reg: 3 bits)(base reg: 3
bits)

fits into a 16 bit halfword, with a 16 bit displacement in the next
one.

When the base register field contains 000, it's a register-to-register
instruction, with the index register field now used to indicate the
source register.

The web page referenced is a later development of an imaginary
architecture which started from that instruction format, but which has
been modified for more compact encoding of instructions.

John Savard

Mike Hore

unread,
Feb 17, 2011, 10:17:25 PM2/17/11
to
On 18/02/11 8:14 AM, Paul A. Clayton wrote:

>...


> The ARM
> designers chose to provide a load/store multiple with a mask, and
> the PowerPC designers chose to provide a load/store multiple that
> expresses a range of registers by specifying the lowest register
> number and ending the range at the highest possible register number;
> so the code density benefit may be greater than the other costs for
> some tasks [and compiler abilities].)

Just a small interesting point here - the PowerPC's lmw and stmw
instructions were never provided in 64-bit versions - there's no "lmd"
or "stmd". Maybe when the 64-bit extensions were being designed, the
designers felt these instructions weren't such a good idea. I actually
had an email from one of the designers strongly recommending I didn't
ever generate lmw or stmw, since they might later get relegated to software.

Cheers, Mike.

---------------------------------------------------------------
Mike Hore mike_h...@OVE.invalid.aapt.net.au
---------------------------------------------------------------

Andy "Krazy" Glew

unread,
Feb 17, 2011, 11:00:52 PM2/17/11
to
On 2/17/2011 1:23 PM, Chris Gray wrote:
> Quadibloc<jsa...@ecn.ab.ca> writes:
>
>> LM 9,11,BASES
>> USING DATA,9,10,11
>> ...
>>
>> BASES DC A(DATA,DATA+4096,DATA+8192)
>
>> Strange? This is _standard_. So much so that the USING directive is
>> actually designed so that if you list multiple registers after the
>> address, it automatically indicates that the first register is assumed
>> to point to the given address, the second one to the address plus
>> 4096, and so on, so that one doesn't even have to explicitly say
>>
>> USING DATA,9
>> USING DATA+4096,10
>> USING DATA+8192,11
>
> I admit to having forgotten about that stuff. I didn't do enough 360/370
> assembler to ever get into multiple USING statements. One very important
> aspect here is that the 360/370 has around a dozen registers that it can
> use as base registers. The 68K only has about 4. That makes a huge
> difference as to whether the technique makes sense or not.

One reason for base registers: before virtual memory, you could not
necessarily assume that your data would be loaded at any particular
fixed address. So, instead of putting the address into constants
embedded in the instruction, you loaded a base register - pointing to
the data block for the program, library, whatever - and ran with that.

More so: some multitasking OSes could move programs around on the fly.
They understood base register usage, so if they moved a "segment" (not
to be confused with an x86 segment, although not so far off either) they
could go and look for the base register containing that address, and
change that.
You were discouraged, on such systems, from storing segment bases
into memory. Or, if you did, you wanted to store them into a fixed
place, a segment base table, that the OS could also change.
Sound familiar?

This even survived for some time after virtual memory, because it was
cheaper to change such base register values than it was to do a full
linker or relocating loader.

But eventually virtual memory allowed people the luxury of assuming
fixed base addresses.

... Until DLLs, shared libraries, and ASLR (Address Space Layout Randomization)
changed that.

Back to the future.

EricP

unread,
Feb 18, 2011, 12:02:18 AM2/18/11
to
Paul A. Clayton wrote:
>
> (It is a little surprising that the software schedulability advantage
> of primitives over solutions was not mentioned. Such does not make
> correctness easier but does allow the compiler to do work so that the
> processor does not have to do that work, which is what property #4 is
> about. I tend to disagree on the VAX call mask; H&P present it as an
> example of inappropriate design since it "interprets a mask saying
> what registers to save". While reencoding the register numbers does
> involve extra work, such can reduce the amount of memory accessed--if
> more than 4 registers are saved, the 16-bit mask is smaller. Such
> also introduces an implicit operation which reduces code size. On
> the negative side, such would typically add temporary state that
> must be retained through a page fault [the PC might revert to the
> CALLS instruction and redo all the work, but that strategy seems less
> attractive for a VAX because of the amount of work thrown out;

In the VAX CALLG/CALLS/RET register save mask mechanism,
in order for the RET instruction to know what to pop
the original mask must also be saved on the stack.
RET pops the saved mask, and that tells it which registers to restore.
So in essence the list of pop instructions comes off the
read/write stack - which borders on self modifying code.
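
A minimal C sketch of the restore step described above, purely to make
the "pop list comes off the read/write stack" point concrete; the frame
layout here is simplified and hypothetical, not the actual VAX format:

  #include <stdint.h>

  /* Illustrative only: RET walks the entry mask that CALLS/CALLG saved
     on the stack and pops exactly the registers the mask names.  The
     real VAX frame also holds condition codes, AP, FP and PC, and only
     some registers are maskable; all of that is omitted here. */
  static uint32_t *ret_restore(uint32_t *sp, uint32_t reg[16])
  {
      uint32_t mask = *sp++;            /* pop the saved register mask */
      for (int r = 0; r < 16; r++)
          if (mask & (1u << r))
              reg[r] = *sp++;           /* pop each register named by the mask */
      return sp;                        /* resulting stack pointer */
  }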

The problem is speculatively executing a RET because the
saved frame and mask might be written by intervening code.
So RET must have a mechanism to watch for stack changes
into the saved frame while it executes ahead.

Speculatively executing an x86 RET instruction also must consider
the possibility of writes to the stack's return address,
but the VAX was just _that_ much more complicated.

(DEC eventually began using JSB/RSB instructions at least for
run time library routines because all they did was push/pop the PC).

Eric

Stephen Fuld

unread,
Feb 18, 2011, 1:13:07 AM2/18/11
to
On 2/17/2011 2:21 PM, MitchAlsup wrote:

snip

> So simplicity and complexity are a spectrum not a binary choice. As
> Andy pointed out a couple days ago, the simpler complex instruction
> set survived (won), while the simple simple instruction sets
> (university RISC) and complex simple instruction sets (MIPS, ALPHA,
> 1108, CDC 6600, CRAY) and the complex complex instruction sets (VAX,
> 68K, 432) did not.

Yes, but . . . I agree on what won/survived (except see below) but I'm
not sure about the implication (that perhaps neither you nor Andy made)
that the reason for what survived versus what didn't was the ISA
differences. There were other reasons unrelated to ISA, or even any
purely technical reason. For example, if Moto had been ready and
willing, and IBM chose it instead of the 8088 for the original PC,
things might have turned out differently.

Also note that ARM has survived quite well, and I guess you would
classify it as complex simple.

Also saying X86 "won" is a bit misleading. It surely was the winner for
the desktop/laptop segments, as well as the small server segments, but
it hasn't made any inroads in phones or even (despite Intel's efforts)
tablets (at least so far). And while it is certainly "in the hunt" for
large servers, it isn't the clear winner.

With Windows 8 supporting ARM, the next few years should be interesting
as tablets become more and more powerful and the need for full blown
laptops and even desktops gets diminished.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Paul A. Clayton

unread,
Feb 18, 2011, 3:28:52 AM2/18/11
to
On Feb 17, 10:17 pm, Mike Hore <mike_hore...@OVE.invalid.aapt.net.au>
wrote:
[snip]

> Just a small interesting point here - the PowerPC's lmw and stmw
> instructions were never provided in 64-bit versions - there's no "lmd"
> or "stmd".  Maybe when the 64-bit extensions were being designed, the
> designers felt these instructions weren't such a good idea.  I actually
> had an email from one of the designers strongly recommending I didn't
> ever generate lmw or stmw, since they might later get relegated to software.

I _suspect_ that the non-support and possible deprecation may come
from an emphasis on performance over code density, particularly for
the 64-bit mode (compiler maturity may also play a role). (The
Variable Length Encoding mode does provide load/store multiple word,
so it seems there was at least some incentive to keep it even in a new
instruction encoding.)

I think ARM's Thumb2 (which, unlike the original Thumb, is a complete
instruction set) excludes the load/store multiple instruction.

(For the early ARM processors, a complex instruction had an advantage
since there were no caches--and burst accesses had much greater
bandwidth.)

The tradeoffs are certainly different in the 2010s from those in the
1980s, but I have almost no idea about what an optimal Architecture
would be for near future systems and possibly even less idea about
how existing Architectures would best be evolved.


Paul A. Clayton
just a technophile

nm...@cam.ac.uk

unread,
Feb 18, 2011, 3:04:41 AM2/18/11
to
In article <ijl2ld$8rs$1...@news.eternal-september.org>,

Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>
>Also saying X86 "won" is a bit misleading. It surely was the winner for
>the desktop/laptop segments, as well as the small server segments, but
>it hasn't made any inroads in phones or even *despite Intel's efforts)
>tablets (at least so far). And while it is certainly "in the hunt" for
>large servers, it isn't the clear winner.

Also, it was Intel wot won it, not the x86!

A more serious point is that the x86 could be toppled within 5 years,
though I doubt that it will be - but it's unlikely to last another 25!
This has relatively little to do with architecture, though, and a hell
of a lot to do with international politics (including patent abuses
and similar practices).


Regards,
Nick Maclaren.

Mike Hore

unread,
Feb 18, 2011, 5:05:32 AM2/18/11
to
On 18/02/11 7:28 PM, Paul A. Clayton wrote:
> On Feb 17, 10:17 pm, Mike Hore<mike_hore...@OVE.invalid.aapt.net.au>
> wrote:
> [snip]
>> Just a small interesting point here - the PowerPC's lmw and stmw
>> instructions were never provided in 64-bit versions - there's no "lmd"
>> or "stmd". Maybe when the 64-bit extensions were being designed, the
>> designers felt these instructions weren't such a good idea. I actually
>> had an email from one of the designers strongly recommending I didn't
>> ever generate lmw or stmw, since they might later get relegated to software.
>
> I _suspect_ that the non-support and possible deprecation may come
> from an emphasis on performance over code density, particularly for
> the 64-bit mode (compiler maturity may also play a role).

Yes, this is probably right. (This was around '92 or '93.) All the
other PowerPC instructions, I think, only do at most 3 reads from the
registers and one write. lmw/stmw could do 32, depending on which way
you're going. Probably needing microcode, on any implementation, in
that timeframe? (Hardware's not my area, but I read here and learn
gratefully :-).

Torben Ægidius Mogensen

unread,
Feb 18, 2011, 5:32:23 AM2/18/11
to
"Paul A. Clayton" <paaron...@gmail.com> writes:


> Complex instructions have some advantages. An instruction with
> greater semantic content can often be implemented more efficiently
> in hardware and can be encoded more concisely. The variability
> of semantic content tends to encourage greater support for
> variability in instruction length. Complex instructions do
> tend to increase design complexity and the size of hardware.
> (Interestingly, accelerators [and ideas like "Conservation Cores"]
> seem to be becoming mainstream.) If infrequently used (e.g.,
> due to the semantics not quite matching the desired operation),
> complex instructions can actually reduce code density by
> requiring more common encodings to be larger. Complex instructions
> also tend to increase the minimum size/complexity of an
> implementation unless the Architecture can be cleanly subsetted.

There is also a disadvantage: Complex instructions are rarely
parameterised with all the possible special cases that can allow simpler
execution. A typical example is a CALL instruction that saves the
return address on a stack: It doesn't optimise for tail calls or leaf
calls. Using a jump-and-link instruction that just saves the return
address in a register allows the callee to just keep it there if it is a
leaf function, thus saving memory traffic. And at a tail call, you can
use a jump that doesn't modify stack or link register.
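
A trivial C illustration of those two cases; the code-generation notes
in the comments describe typical jump-and-link output and are
assumptions, not any particular compiler's behaviour:

  /* Leaf function: under a jump-and-link convention the return address
     can simply stay in the link register for the whole body, so there
     is no stack traffic at all. */
  static int leaf(int x)
  {
      return x + 1;        /* e.g. an add, then a jump via the link register */
  }

  /* Tail call: the final call can be emitted as a plain jump that
     reuses the caller's return address instead of saving a new one. */
  int caller(int x)
  {
      return leaf(x);      /* e.g. a plain jump to leaf, link register untouched */
  }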

So I would, like Hennessy & Patterson, prefer to have primitives that
can be combined in various ways to take advantage of special cases. Up
to a point, anyway. I would not want to implement, say, multiplication
in software, as a hardware implementation can be a lot faster than
anything you could do in software, and software implementations
typically need extra registers and can have hard-to-predict jumps. But
I would want instructions for special cases such as shifts for
muliplication by powers of two.

Torben

Paul A. Clayton

unread,
Feb 18, 2011, 6:00:35 AM2/18/11
to
On Feb 17, 5:21 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> On Feb 17, 3:14 pm, "Paul A. Clayton" <paaronclay...@gmail.com>
> wrote:
>
> > Adding the hardware aspects for ease of pipelining (which
> > discourages complex instructions and complex encodings) and the
> > memory hierarchy (register count was also largely influenced
> > by seeking to make register allocation easier for the
> > compiler), pushes the design philosophy toward RISC.
>
> I have to disagree with you here, with limits.
>
> Once a pipeline has been designed, that paipeline is capable of
> dealing with fairly complex activities in a single "fell swoop". So,
> for example, if one were to design a pipeline that had a stage to read
> the data cache and a later stage to write the data cache, it is fairly
> easy to make that pipeline support read-modify-write instruction

Wouldn't this add 'significant' depth to early pipelines and so make
branches more expensive? The "superpipelined" MIPS R4000
only had eight stages--2 Icache, RF, EX, 2 Dcache, tag check,
and WB. Skewed pipelines (as used in Cell PowerPC core/PPE
and the XBox360 Xenon cores) are interesting (and especially
useful when there is more than one instruction of delay for a load),
but such might have been suboptimal for the mid-to-late 1980s.
(I have 20/40 [20/80?] hindsight!)

(IC, RF, AG, DC, EX, WB would seem to almost double the
branch misprediction penalty compared to IC, RF, EX, DC, WB--
even worse when a delayed branch can be used. The greater
work per instruction could compensate for this, and your
hindsight is likely significantly better than John L. Hennessy's
foresight, but I am still skeptical/do not understand.)

> encodings. In addition, these instructions will flow through the
> pipeline in a single cycle and appear to be efficient. In a typical
> RISC instruction set, these will take three instructions instead of
> just one.
>
> However, that same pipeline will have considerable difficulty if the
> instruction set wants to read indirect through the data cache (use it
> twice in a single cycle), or deliver multiple results (load a register
> and autoincrement another, or store a register and autodecrement a
> register).

Why would store and update be a problem? Two register reads
(address base and store value) and one register result (the other
result goes to memory) should not be a problem, correct?

How great is "considerable" for load and update? In my ignorance,
stalling the pipeline when there is a write-port hazard does not seem
like it would be very difficult. Would using read/write phasing of
two register file ports add "considerable difficulty"? (Would it be
noticeably easier if software had to guarantee that the address
register was not the destination of the load--at the cost of an
unpredictable result value on violation, not hardware damage or
privilege violation?)

>
> So simplicity and complexity are a spectrum not a binary choice. As
> Andy pointed out a couple days ago, the simpler complex instruction
> set survived (won), while the simple simple instruction sets
> (university RISC) and complex simple instruction sets (MIPS, ALPHA,
> 1108, CDC 6600, CRAY) and the complex complex instruction sets (VAX,
> 68K, 432) did not.

I do not like complexity even being viewed as a spectrum because
the costs of complexity are not simple. Some aspects of complexity
have a somewhat persistent cost and others have more of a
learning curve. Simplification can also merely relocate complexity--
and as Nick points out this can greatly increase complexity. (The
lower in the interface stack something is implemented, the more
broadly it can be exploited [this alone encourages complexity]; yet
it is more critical that the lower parts of the interface stack be
correct and understandable [this discourages complexity].)

> Therefore the instruction set should be (to the limits possible) be
> designed for a pipeline that can be expressed in small, medium, and
> large forms. This is where RISC failed architecturally. RISC also

It seems that RISC fits the 'small' form in terms of core size (but
not code size) and the 'large' form generally seems to make the
Architecture less relevant--at least if the Architecture is expected
to be somewhat reasonable for the 'small' form.

x86 does not seem to fit the small form as well as RISC (correct?).

If early adoption of an Architecture is at the small form (early RISC
and most new Architectures for embedded systems?), one may
need later-burdens like delayed branches to bootstrap the adoption
of a new Architecture.

> failed managerially when all (7) design teams "bit off more than they
> could chew" in the second or third iteration of their architectures.
> Brooks mentioined this in his famous bok as the "second system
> syndrome".

I thought the MIPS R4000 (after R2000 and R3000) was decent--were
you referring to the R8000 (third significantly different
microarchitecture)? (http://en.wikipedia.org/wiki/MIPS_architecture :
"Although its FPU performance fit scientific users quite well, its
limited
integer performance and high cost dampened appeal for most users")

Was the overreaching Alpha the 21164 (second) or 21264 (third)?
Both had issues--the 21264's lower capacity but more sophisticated
branch predictor was less suited to some commercial workloads and
its direct-mapped off-chip L2 cache may have provided irregular
performance (it was certainly a large bite, but it seems to have been
chewed reasonably well), the 21164 had restricted integer issue for
a 4-wide superscalar (but it seems like a smaller bite if less well
chewed).

You have mentioned some of the aggressiveness of the Motorola
88120 (third implementation).

HP PA-8000?? ARM3, ARM6, or ARM7?? POWER's "Bellatrix"??
(http://en.wikipedia.org/wiki/IBM_POWER : "extremely ambitious")
SuperSPARC??

Paul A. Clayton

unread,
Feb 18, 2011, 6:22:56 AM2/18/11
to
On Feb 18, 5:05 am, Mike Hore <mike_hore...@OVE.invalid.aapt.net.au>
wrote:
[snip]

> Yes, this is probably right.  (This was around '92 or '93.)  All the
> other PowerPC instructions, I think, only do at most 3 reads from the
> registers and one write.  lmw/stmw could do 32, depending on which way
> you're going.  Probably needing microcode, on any implementation, in
> that timeframe?  (Hardware's not my area, but I read here and learn
> gratefully :-).

My (ignorant) guess is that such would be implemented with a small
state machine that generates load/store instructions by incrementing
through the register numbers, though microcode might not be
unreasonable.
(It is sad to me that these instructions [like x86 repeat move
instructions] are not aggressively microarchitecturally optimized.)
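
A minimal C sketch of that kind of sequencer, just to make the guess
concrete; the emit_load callback and the expansion into simple
per-register load operations are hypothetical, though the rD..r31
range follows the load/store-multiple definition discussed upthread:

  #include <stdint.h>

  /* Hypothetical expansion of lmw rD,d(rA): one simple word-sized load
     per register from rD up to r31, at consecutive word addresses. */
  static void expand_lmw(int rd, int ra, int32_t disp,
                         void (*emit_load)(int reg, int base, int32_t off))
  {
      int32_t off = disp;
      for (int r = rd; r <= 31; r++) {   /* range ends at the highest register */
          emit_load(r, ra, off);         /* one load micro-op per register */
          off += 4;                      /* word-sized stride */
      }
  }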

Neither software nor hardware is my area, so I get to learn from
almost everyone. (Sometimes the generosity of others is a bit
overwhelming.)

Paul A. Clayton

unread,
Feb 18, 2011, 7:06:03 AM2/18/11
to
On Feb 18, 5:32 am, torb...@diku.dk (Torben Ægidius Mogensen) wrote:
[snip]

> There is also a disadvantage: Complex instructions are rarely
> parameterised with all the possible special cases that can allow simpler
> execution.  A typical example is a CALL instruction that saves the
> return address on a stack: It doesn't optimise for tail calls or leaf
> calls.  Using a jump-and-link instruction that just saves the return
> address in a register allows the callee to just keep it there if it is a
> leaf function, thus saving memory traffic.  And at a tail call, you can
> use a jump that doesn't modify stack or link register.

I generally agree and that was part of the meaning of "semantics not
quite matching the desired operation".

> So I would, like Hennessy & Patterson, prefer to have primitives that
> can be combined in various ways to take advantage of special cases.  Up
> to a point, anyway.  I would not want to implement, say, multiplication
> in software, as a hardware implementation can be a lot faster than
> anything you could do in software, and software implementations
> typically need extra registers and can have hard-to-predict jumps.  But
> I would want instructions for special cases such as shifts for
> muliplication by powers of two.

Idiom detection can be used to optimize sequences of general
operations, but I do not know how expensive such is.
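
As a trivial C illustration of the kind of sequence at issue: a
compiler already strength-reduces a multiply by a small constant into
shift/add primitives, and a hardware idiom detector would have to
recognize and re-fuse exactly such patterns (the function name is just
for illustration):

  #include <stdint.h>

  /* What a compiler might emit for x * 10 when multiplies are costly:
     two shifts and an add.  An idiom detector would need to spot this
     sequence and treat it as a single multiply again. */
  static uint32_t mul10(uint32_t x)
  {
      return (x << 3) + (x << 1);    /* 8*x + 2*x == 10*x */
  }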

The tradeoffs in Architecture seem to be immensely complex.
I do not understand how even microarchitecture can be done well;
condensing benchmarks into a single figure of merit, predicting five
years ahead of time what general microarchitecture will be ready
for production, suited for the available technology, and competitive
(even ignoring the complexity of the value of being 'best' even if
not price competitive), managing a largish team of workers
(moderating communication--quantity, quality and content--,
balancing workloads, developing teamwork, etc.), not expecting
perfection, and perhaps a half dozen other things I have forgotten
and dozens of things I have never learned. The failures can be
disheartening even to those on the sidelines, but the achievements--
even though very imperfect--can be extraordinary (at least to those
on the sidelines).

nm...@cam.ac.uk

unread,
Feb 18, 2011, 6:46:40 AM2/18/11
to
In article <38bdafdd-dbf8-461c...@o7g2000prn.googlegroups.com>,

Paul A. Clayton <paaron...@gmail.com> wrote:
>On Feb 18, 5:32 am, torb...@diku.dk (Torben Ægidius Mogensen) wrote:
>
>> So I would, like Hennessy & Patterson, prefer to have primitives that
>> can be combined in various ways to take advantage of special cases. Up
>> to a point, anyway. I would not want to implement, say, multiplication
>> in software, as a hardware implementation can be a lot faster than
>> anything you could do in software, and software implementations
>> typically need extra registers and can have hard-to-predict jumps. But
>> I would want instructions for special cases such as shifts for
>> muliplication by powers of two.
>
>Idiom detection can be used to optimize sequences of general
>operations, but I do not know how expensive such is.

That's not the problem - it doesn't work! If the language is very
inflexible and the compiler very dumb, then it can help a lot - but
that is merely removing inefficiencies that have been unnecessarily
included. Other than that, there just isn't enough commonality to
get more than minor gains.

A lot of experts and others (like me!) have been sounding the trumpet
of primitives for many decades - Hennessy & Patterson may have been
among them, or may have been converts. But the dual attack of
benchmarketing and the dogma of the RISC fanatics have blocked that
approach.

Multiplication is a clear case of something that can be broken down
into primitives, usefully and efficiently, but only if the actual
engineering requirements are paramount. The benchmarketeers demand
that the detailed operations they need are completely in hardware,
and the RISC fanatics were not prepared to provide the functionality
needed. And, yes, the primitives are at a fairly high level.

Floating-point is another.

What are NOT, are virtual memory mapping and inter-thread signalling,
but both are far too often excluded from the architecture and left
to the operating systems to kludge up, unreliably and inefficiently.


Regards,
Nick Maclaren.

Anders....@kapsi.spam.stop.fi.invalid

unread,
Feb 18, 2011, 8:08:27 AM2/18/11
to
Paul A. Clayton <paaron...@gmail.com> wrote:
>
> I think ARM's Thumb2 (which, unlike the original Thumb, is a complete
> instruction set) excludes the load/store multiple instruction.
>
> (For the early ARM processors, a complex instruction had an advantage
> since there were no caches--and burst accesses had much greater
> bandwidth.)

No, the load/store multiple instructions are still there. The biggest
users of the Thumb-2 ISA are the Cortex-M microcontrollers, so the
bandwidth advantage is still there as well.

-a

Quadibloc

unread,
Feb 18, 2011, 11:29:56 AM2/18/11
to
On Feb 18, 1:04 am, n...@cam.ac.uk wrote:

> A more serious point is that the x86 could be toppled within 5 years,
> though I doubt that it will be - but it's unlikely to last another 25!
> This has relatively little to do with architecture, though, and a hell
> of a lot to do with international politics (including patent abuses
> and similar practices).

Unlike Mac users, Windows users aren't likely to tolerate not being
able to run the same software natively when they upgrade their
computer. It's true that computer power continues to increase, though,
and so eventually the dissatisfaction with Windows 7 and successors
for not booting up as quickly, and not running with as little
overhead, as Windows 98 may go away, and so Windows 8, or some distant
successor, on ARM might eventually be fast enough and compatible
enough...

Of course, if Windows is toppled, then it's perfectly possible for
people to choose Linux on ARM instead of Linux on x86 as its
successor. Perhaps that could happen within 25 years, but I can't
predict something like that, as I can't see the reason for the market
to tolerate such a breakdown in compatibility.

I mean, if IBM started manufacturing a line of personal computers that
came with Linux or OS/2 pre-installed - or, better yet, if they
*bought Apple* and merged OS X with OS/2 to produce THE standard
operating system of the future - and it had clear advantages, like:

a) booting up as quickly as, and consuming as few resources as, Windows 3.1 if you
choose (if you want multimedia, you can choose a slower Windows 98
equivalent mode),

b) ironclad Internet security (Surf the web? You're watching a video
of your sandbox surfing the web...),

then, **maybe**, with the magic of the IBM name behind it, enough
third-party developers would be attracted to enable a steadily growing
market share versus Microsoft.

It would help if they used a CPU which could natively run 68020/68882,
PowerPC, x86, *and* System/370 binaries. It might not run current
Windows programs, but it has another pool of pre-existing software.

John Savard

Quadibloc

unread,
Feb 18, 2011, 11:50:31 AM2/18/11
to
On Feb 18, 4:00 am, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> On Feb 17, 5:21 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> > Once a pipeline has been designed, that paipeline is capable of
> > dealing with fairly complex activities in a single "fell swoop". So,
> > for example, if one were to design a pipeline that had a stage to read
> > the data cache and a later stage to write the data cache, it is fairly
> > easy to make that pipeline support read-modify-write instruction
>
> Wouldn't this add 'significant' depth to early pipelines and so made
> branches more expensive?  The "superpipelined" MIPS R4000
> only had eight stages--2 Icache, RF, EX, 2 Dcache, tag check,
> and WB.  Skewed pipelines (as used in Cell PowerPC core/PPE
> and the XBox360 Xenon cores) are interesting (and especially
> useful when there is more than one instruction of delay for a load),
> but such might have been suboptimal for the mid-to-late 1980s.
> (I have 20/40 [20/80?] hindsight!)

Let's say to start with you have circuits which can perform a
particular operation in a particular time.

As a specific example, let's say that a Wallace tree to do a double-
precision floating-point multiply can do it in 15.36 nanoseconds,
because you have a gate delay of only 60 picoseconds.

If you pipeline this operation as much as is reasonable, so that not
only is your clock cycle 960 picoseconds long (for a clock rate of
1.04 GHz) but you can start a new double-precision floating-point
multiply on every single cycle, where's the problem?

Of course if you have a branch right after you've started a new double-
precision floating-point multiply, and you won't know which way to go
until it is finished, you might have a mess to clean up.

But even if you always took the maximum penalty - i.e., with the
simplest hardware, always just wait until you don't need to predict
anything before doing a branch - with just a little loop unrolling you
have still gained far more than you have lost by having a 16-stage
pipeline for a floating-point multiply.

And there's no reason to even waste time waiting. That's what
multithreading is for. Get instructions from another thread while
you're waiting to branch. They can even be unrelated double-precision
floating-point multiplies.

Of course, since the chip also has addition circuitry, and fixed-point
circuitry, one actually wants to initiate one operation for every
functional unit on the chip on every cycle if possible. With enough
threads, this does not require messy stuff like out-of-order
execution. There may be a problem keeping the chip cool, there may be
a problem with die area to decode all these instructions, there may be
a problem with having enough cache space to actually get all those
instructions, but those are different issues.

Naturally, though, one still wants to reduce single-thread latency.
After all, if all one's problems were embarrassingly parallel, one
wouldn't need to design a fancy CPU, one could just use lots of simple
and cheap ones. But pipelines are a more tightly-coupled form of
parallelism than separate CPUs with separate caches and paths to
memory.

John Savard

nm...@cam.ac.uk

unread,
Feb 18, 2011, 11:10:51 AM2/18/11
to
In article <ef3ea9ff-69c6-4375...@t16g2000vbi.googlegroups.com>,

Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>> A more serious point is that the x86 could be toppled within 5 years,
>> though I doubt that it will be - but it's unlikely to last another 25!
>> This has relatively little to do with architecture, though, and a hell
>> of a lot to do with international politics (including patent abuses
>> and similar practices).
>
>Unlike Mac users, Windows users aren't likely to tolerate not being
>able to run the same software natively when they upgrade their
>computer. ...

I am thinking a LOT more radically than that! If a shift starts,
they will not merely tolerate such a change, but migrate by droves.
I have seen the 'compatibility' argument shown to be entirely
emotional twice before - all that is needed is that the applications
are ported, and they accept most of the users' old data.

The toppling of Microsoft would be a secondary effect.

>I mean, if IBM started manufacturing a line of personal computers that
>came with Linux or OS/2 pre-installed - or, better yet, if they
>*bought Apple* and merged OS X with OS/2 to produce THE standard
>operating system of the future - and it had clear advantages, like:

That's not radical enough, though it would be a start.

>b) ironclad Internet security (Surf the web? You're watching a video
>of your sandbox surfing the web...),

Starting from a Unix design? You know better than that! While
bulletproof security is possible, I don't see the interest.

>then, **maybe**, with the magic of the IBM name behind it, enough
>third-party developers would be attracted to enable a steadily growing
>market share versus Microsoft.

Precisely.

No - the dark horse is China. Japan COULD have rocked the boat, but
has shown itself too timid. Let's say that China produces a design
that costs half as much, uses a quarter of the power and delivers
three times the aggregate performance, with an open specification.
That is very possible, especially if it were done by collaboration
with India and parts of Europe.

That might well take over much of the Asian market, and a great deal
of the European and even North American Linux one, and cause Intel,
Microsoft and even the USA IT business to go into a tailspin. Yes,
there would be a major political upheaval, with all that that implies,
but it's not out of the question.


Regards,
Nick Maclaren.

Quadibloc

unread,
Feb 18, 2011, 12:04:04 PM2/18/11
to
On Feb 18, 4:46 am, n...@cam.ac.uk wrote:

> Multiplication is a clear case of something that can be broken down
> into primitives, usefully and efficiently, but only if the actual
> engineering requirements are paramount.  The benchmarketeers demand
> that the detailed operations they need are completely in hardware,
> and the RISC fanatics were not prepared to provide the functionality
> needed.  And, yes, the primitives are at a fairly high level.
>
> Floating-point is another.

Break down multiplication into primitives?

When your small processor does it by adding and shifting, and your big
processor has a Wallace tree, exactly *which* primitives are you
going to break it down into?

Of course, these days, even the "small" processors are big ones, so
that objection is no longer valid.

On the other hand, it's still a tossup as to whether one is going to
do SRT division or Goldschmidt division. Is everyone going to have to
recompile all their programs, because the new generation of processors
would divide more efficiently the other way?

And it's not as if different operating systems or different languages
might want to do different parts of multiplication or division. (The
Itanium actually did division in software.)

Things like "virtual memory mapping and inter-thread signalling" have
different requirements in different operating systems. There are
essential assists that hardware must provide (an atomic swap with
memory instruction at the very lowest level) but balancing that
against flexibility is a problem.

It's different if you're IBM, and you're writing your own operating
system for the hardware. Intel isn't IBM, and doesn't have this
luxury. Oracle, now that it owns Sun, does, however - and I think that
the reason they bought Sun was to go head to head with IBM. If Unisys
can't afford to make their own chips, I think they're crazy, but
that's another matter.

If Microsoft decides it needs to have a talk with Intel so that it can
turn Windows 9 into a serious mainframe operating system, that would
be nice. What's your estimate on the likelihood of that happening?

John Savard

Jason Riedy

unread,
Feb 18, 2011, 12:20:49 PM2/18/11
to
And Quadibloc writes:
> Break down multiplication into primitives?

You betcha. Has nothing to do with flat-out single-unit FLOP performance
(sharing the same multiplier anyways) and more to do with algorithmic
flexibility, bandwidth, and power (mostly spent on memory transfer).

The FLOP/memory op ratio already is absurdly large and only grows. We
can use that imbalance to use fewer resources when possible, provide
more accuracy when necessary, support different arithmetics (binary,
decimal, other roundings, that all share the same core pieces)...

Breaking multiplication into pieces permits stitching together
double-precision multiplication in only twice the time of single, quad
in four times the time of single using the same unit. Addition's a
possible gotcha, but the error-free transformation route is quite
interesting.
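
For concreteness, a minimal C sketch of one standard error-free
transformation (the FMA-based exact product); this is textbook material
offered as illustration, not part of the proposal itself:

  #include <math.h>

  /* Error-free product: hi + lo == a*b exactly, barring over/underflow.
     fma(a, b, -hi) computes a*b - hi with a single rounding, which here
     is precisely the low-order part lost when forming hi.  Chaining
     such pieces is how doubled (and wider) precisions are stitched
     together out of ordinary hardware operations. */
  static void two_prod(double a, double b, double *hi, double *lo)
  {
      *hi = a * b;
      *lo = fma(a, b, -*hi);
  }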

When we can maintain telescoping precisions with somewhat evenly
increasing costs, mixed-precision computations can adapt to the data at
hand. Using mixed precision whenever possible is widely considered
necessary for exaflop-scale systems. For some problems (including the
quite-important Ax=b), you transfer less data and use fewer
computational resources whenever possible while maintaining result
quality and asymptotic complexity. Since Ax=b still drives that part of
the computing space, this will happen and may benefit everyone.

That said, I haven't pushed through what primitives are necessary. I'd
work down from multiplied precisions (doubled-double, quaded-double,
etc.) and Rump's "error-free transformations"[1] while trying to handle
an extended exponent range. Perhaps there's a blocker somewhere that
renders this whole idea pointless, but it's worth a serious look.

Jason

[1] Long-known techniques for multiplied precisions, but his phrasing
and extension to faithfully rounded sums is quite elegant.

MitchAlsup

unread,
Feb 18, 2011, 12:30:51 PM2/18/11
to
On Feb 18, 4:32 am, torb...@diku.dk (Torben Ægidius Mogensen) wrote:
> So I would, like Hennessy & Patterson, prefer to have primitives that
> can be combined in various ways to take advantage of special cases.

The attribution should go to "Bell and Newell" or "Bell, Grason, and
Newell" (about a decade and a half earlier than H&P.)

"Primitives nor solutions".

Mitch

Andy "Krazy" Glew

unread,
Feb 18, 2011, 12:32:39 PM2/18/11
to
On 2/18/2011 8:10 AM, nm...@cam.ac.uk wrote:
> In article<ef3ea9ff-69c6-4375...@t16g2000vbi.googlegroups.com>,
> Quadibloc<jsa...@ecn.ab.ca> wrote:

> No - the dark horse is China. Japan COULD have rocked the boat, but
> has shown itself too timid. Let's say that China produces a design
> that costs half as much, uses a quarter of the power and delivers
> three times the aggregate performance, with an open specification.
> That is very possible, especially if it were done by collaboration
> with India and parts of Europe.

I like Godson / Loongson.

MIPS ISA, x86 emulation, vectors, OOO.

I wish I could work on it. I used to speak Chinese, back when I spent a
semester in Beijing.

Paul A. Clayton

unread,
Feb 18, 2011, 12:38:50 PM2/18/11
to
On Feb 18, 8:08 am, Anders.Monto...@kapsi.spam.stop.fi.invalid wrote:
[snip]

> No, the load/store multiple instructions are still there. The biggest
> users of the Thumb-2 ISA are the Cortex-M microcontrollers, so the
> bandwidth advantage is still there as well.

Thank you for the quick correction! (I should have looked it up
before posting.)

MitchAlsup

unread,
Feb 18, 2011, 12:39:54 PM2/18/11
to
On Feb 18, 5:00 am, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> On Feb 17, 5:21 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> > On Feb 17, 3:14 pm, "Paul A. Clayton" <paaronclay...@gmail.com>
> > wrote:
>
> > > Adding the hardware aspects for ease of pipelining (which
> > > discourages complex instructions and complex encodings) and the
> > > memory hierarchy (register count was also largely influenced
> > > by seeking to make register allocation easier for the
> > > compiler), pushes the design philosophy toward RISC.
>
> > I have to disagree with you here, with limits.
>
> > Once a pipeline has been designed, that paipeline is capable of
> > dealing with fairly complex activities in a single "fell swoop". So,
> > for example, if one were to design a pipeline that had a stage to read
> > the data cache and a later stage to write the data cache, it is fairly
> > easy to make that pipeline support read-modify-write instruction
>
> Wouldn't this add 'significant' depth to early pipelines and so made
> branches more expensive?  

Memrefs become less expensive, branches become incrementally more
expensive. Since one typically has 2 memrefs in the 5 instruction
leading to a branch and only 1 branch, overall it pays off. {And that
is part of the "with limits" I mentioned.}

> > failed managerially when all (7) design teams "bit off more than they
> > could chew" in the second or third iteration of their architectures.
> > Brooks mentioined this in his famous bok as the "second system
> > syndrome".
>
> I thought the MIPS R4000 (after R2000 and R3000) was decent--were
> you referring to the R8000 (third significantly different
> microarchitecture)?

R3000 was a "slice and dice" of the R2000, it was not a new
microarchitecture.
R4000 was a "change the flip-flops into a super-pipelined register",
it was not a new microarchitecture.

> Was the overreaching Alpha the 21164 (second) or 21264 (third)?

The 21164 was a "slice and dice" of the 21064, and was not a new
microarchitecture.

> You have mention some of the aggressiveness of the Motorola
> 88120 (third implementation).

Here, internal politics killed the calf.

HP and Sun did better managerially than the rest (of us).

Mitch

MitchAlsup

unread,
Feb 18, 2011, 12:46:25 PM2/18/11
to
On Feb 18, 11:20 am, Jason Riedy <ja...@lovesgoodfood.com> wrote:
> And Quadibloc writes:
> > Break down multiplication into primitives?
>
> You betcha.

In what machine resources are you going to put the 1037 bits* needed
in the pipeline stage between the top of the multiplier tree and the
bottom of the multiplier tree? {(*)Athlon FP multiplier staging
register size in stage 2}

You can ONLY break an instruction down into primitives when the
intermediate data fits (conveniently) in an already existing machine
resource (register).

Mitch

nm...@cam.ac.uk

unread,
Feb 18, 2011, 12:20:46 PM2/18/11
to
In article <7decc39a-88c0-49c0...@c10g2000vbv.googlegroups.com>,

Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>> Floating-point is another.
>
>Break down multiplication into primitives?
>
>When your small processor does it by adding and shifting, and your big
>processor has a Wallace tree, exactly *which* primitives are you
>going to break it down into?

The basic split is on how many bits are needed, and at which end,
with extra steps of fixing up the sign handling and overflow (for
integer) and rounding and normalisation (for floating-point).
But, more importantly, the range of precisions needed is often
quite large.

The performance of many systems is such that there is little gain from
(say) a 64x64 multiplication over a fully pipelined 32x32 one
(or double those for modern pancake-sized chips). And it isn't
hard to design primitives for building 64x64 out of 32x32 - the
need for branches is almost entirely because things that are
trivial in hardware are evil in most ISAs.
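
A minimal C sketch of that composition, using a widening 32x32->64
multiply as the stand-in primitive; everything else is shifts and adds,
and there are no data-dependent branches:

  #include <stdint.h>

  /* 64x64 -> 128-bit product built from four 32x32 -> 64 primitives.
     On return, hi:lo hold the full product of a and b. */
  static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
  {
      uint64_t a0 = (uint32_t)a, a1 = a >> 32;
      uint64_t b0 = (uint32_t)b, b1 = b >> 32;

      uint64_t p00 = a0 * b0;            /* the four partial products */
      uint64_t p01 = a0 * b1;
      uint64_t p10 = a1 * b0;
      uint64_t p11 = a1 * b1;

      uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
      *lo = (mid << 32) | (uint32_t)p00;
      *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
  }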

The reason that I say that is that the chip area saved could be used
to save power, increase parallelism or improve other aspects. It is
rare outside benchmarketing and cryptography for multiplication to
be a bottleneck. And, before you bring in RSA, my approach would
NOT slow it down, because the much more suitable primitives would
help as much as the smaller basic multiply loses.

>And it's not as if different operating systems or different languages
>might want to do different parts of multiplication or division. (The
>Itanium actually did division in software.)

Don't bet on it. They do. It relates to the precision, overflow
and sign handling, and rounding (think Cobol).

More seriously, if there was a sane set of primitives provided,
Newton-Raphson iteration plus a couple of fixup primitives would
deliver the same performance as many systems do in hardware!
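
A minimal C sketch of that iteration; the initial seed (normally a
small table lookup) and the final IEEE rounding fixup - the "fixup
primitives" in question - are deliberately left out:

  /* Newton-Raphson refinement of a reciprocal: each step roughly
     doubles the number of correct bits, so a crude seed plus a few
     steps reaches full working precision.  A correctly rounded a/b
     still needs a final fixup step, not shown here. */
  static double recip_nr(double d, double seed)
  {
      double x = seed;                /* e.g. from a small lookup table */
      for (int i = 0; i < 3; i++)
          x = x * (2.0 - d * x);      /* x' = x * (2 - d*x) */
      return x;
  }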

I have pretty often written multiplication and division code in
software that could beat the hardware, if it were not for the
extra grobble that was imposed by having to use unsuitable
primitives.


Regards,
Nick Maclaren.

Quadibloc

unread,
Feb 18, 2011, 1:05:08 PM2/18/11
to
On Feb 18, 10:20 am, Jason Riedy <ja...@lovesgoodfood.com> wrote:

> When we can maintain telescoping precisions with somewhat evenly
> increasing costs, mixed-precision computations can adapt to the data at
> hand.

I would have assumed separate multiplier units for each supported
precision - one for double-precision, one for single-precision, in
order to maximize "flat-out FLOPS". Large caches to make the slowness
of off-chip memory as irrelevant as possible.

And perhaps external circuitry to bring some calculations closer to
the memory.

John Savard

nm...@cam.ac.uk

unread,
Feb 18, 2011, 12:27:57 PM2/18/11
to
In article <8a163da5-395e-4aca...@n18g2000vbq.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>
>You can ONLY break an instruction down into primitives when the
>intermediate data fits (conveniently) in an already existing machine
>resource (register).

Why on earth shouldn't there be hidden 'registers' and appropriate
sequencing requirements? Dammit, many RISCs had lots of the latter!

Yes, it produces a slightly messy ISA, but not one that is hard to
handle. And, on modern systems, you can trivially solve that by
having an aggregate instruction for the most common requirements
that is expanded by the hardware instruction parser.

Neither are new technologies, after all, and you don't HAVE to
make a pig's ear of an ISA design :-)


Regards,
Nick Maclaren.

Jason Riedy

unread,
Feb 18, 2011, 1:43:55 PM2/18/11
to
And MitchAlsup writes:
> In what machine resources are you going to put the 1037 bits* needed
> in the pipeline stage between the top of the multiplier tree and the
> bottom of the multiplier tree? {(*)Athlon FP multiplier staging
> register size in stage 2}

You're assuming one multiplication design. Consider also shift
and add using known results for adding a fixed quantity of
floating-point numbers with correct rounding.

I'm not saying this is perfect or an appropriate for current
typical desktops, but it's a feasible design that I'd love to
play with from the software side. The exascale folks very much
are looking in this direction, too. Anything to move the least
data necessary.

But talking about which hardware units are to be split is pretty
silly (imho) until someone provides a worked-out software design
as a starting point. It's not as straight-forward as splitting
what's already in the hardware.

Jason

Jason Riedy

unread,
Feb 18, 2011, 1:52:30 PM2/18/11
to
And Quadibloc writes:
> I would have assumed separate multiplier units for each supported
> precision --- [...]

That defeats my purpose of supporting more precision
possibilities than the hard-wired ones with a steady (and
analyzable) degradation of in-cpu performance.

Think I'm coming up with a funding pitch through all of
this... hm.

Jason

Stefan Monnier

unread,
Feb 18, 2011, 2:03:39 PM2/18/11
to
> has shown itself too timid. Let's say that China produces a design
> that costs half as much, uses a quarter of the power and delivers
> three times the aggregate performance, with an open specification.

I don't think anyone on this earth has the faintest idea how we could
deliver three times the performance at a quarter of the power (i.e. an
order of magnitude improvement of performance-per-watt).
For particular applications, maybe, but for general computing I'm skeptical.


Stefan

nm...@cam.ac.uk

unread,
Feb 18, 2011, 1:26:39 PM2/18/11
to
In article <87bp29t...@NaN.sparse.dyndns.org>,

Jason Riedy <ja...@lovesgoodfood.com> wrote:
>
>But talking about which hardware units are to be split is pretty
>silly (imho) until someone provides a worked-out software design
>as a starting point. It's not as straight-forward as splitting
>what's already in the hardware.

In the case of multiplication, it would be an insane starting point.
I agree with your statement.

The key to getting a benefit here is to work out what the software
needs to do, its requirements, constraints and objectives, in
detail - and then split THOSE up, according to how the hardware
people think that they could best deliver them.

For example, one very important requirement that is completely
denied by most 'computer scientists' (including most of the RISC
fanatics) is multiplication as an addressing operation. What
does it need and not need?

It needs to be handled as part of the integer/addressing
pipeline, if floating-point is separate. PA-RISC and others got
that one badly wrong.

It needs to be moderately efficient, but not hugely so, as
it is almost always followed by a memory access, and dominates
only a few applications (nowadays).

It does not need to handle any result larger than the address
size, and can reasonably be optimised for one of the operands
being fairly small.

Ideally, it should detect overflow, but need only do so as
part of the whole calculation, and efficiency of trapping that is
not a concern.

Right - that's ONE requirement (and not the most important).
Another is the support of RSA and its analogues. Another is
floating-point multiplication, with its different precisions
and rounding. And the last is ordinary integer multiplication,
with its sign and overflow issues.

The way that they are split up at present is dictated by inertia,
and not much else. Even when I started (40+ years ago), the
current split was dominant, but it hadn't yet reached the state
of being automatically assumed to be the ONLY way to do the job.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Feb 18, 2011, 1:34:38 PM2/18/11
to
In article <jwv8vxdtbfx.fsf-...@gnu.org>,

Read a bit more carefully - I didn't say serial performance; I said
aggregate performance. Almost everybody knows how to do a lot
better than that - I was being conservative. If I recall correctly,
Sicortex delivered more than that relative to the mainstream clusters
of the time!

What they don't know is how to make use of it in currently serial
workloads. That's a separate matter.


Regards,
Nick Maclaren.

Robert Myers

unread,
Feb 18, 2011, 3:16:24 PM2/18/11
to
I'll respond to you because you seem the least likely to come back
with a smug answer.

Why do people even talk about fiddling with floating point hardware,
when there is no way to feed perfectly feasible floating point
hardware that already exists (for example, GPU type hardware in
situations where repeatedly hammering on data already in cache isn't
feasible)?

For the problems I know about, the need for precision is not
uniformly spread across momentum space, so one can contemplate
modest (factor of 4-8) gains in effective bandwidth by localizing
precision in k-space. Problem is, the need for precision is spread
all over x-space. You *may* be able to beat the problem of varying
needs for precision by breaking the problem up into artificially
separated components of field variables (like the standard u+u' trick
of turbulence calculations), but it's not as if you could get orders
of magnitude out of such trickery.

As to bringing the processor closer to memory, I don't get it at all.
Sure, lots of things to do with "Ax=b." (nice way to avoid further
propagandizing linpack. thanks for the suggestion). For my problem
though, some data is going to be in one memory bank and the data it
needs to be FMAC'd with may be on the other side of the cluster.
What's the point of PIM tricks? If you have to move the data,
anyway, you might just as well drag it into cache. I have yet to hear
anyone propose anything that reduces data movement except in those (to
me, artificial) cases where the data are *already* localized.

Lots of dumb questions. Pessimistic, as usual.

Robert.

nm...@cam.ac.uk

unread,
Feb 18, 2011, 2:45:11 PM2/18/11
to
In article <a30fdc16-8946-4265...@8g2000prt.googlegroups.com>,
Robert Myers <rbmye...@gmail.com> wrote:

>On Feb 18, 1:43 pm, Jason Riedy <ja...@lovesgoodfood.com> wrote:
>
>Why do people even talk about fiddling with floating point hardware,
>when there is no way to feed perfectly feasible floating point
>hardware that already exists (for example, GPU type hardware in
>situations where repeatedly hammering on data already in cache isn't
>feasible)?

I can't answer for Jason Riedy, but can for myself. Because:

1) One can reduce the real estate and power requirements
significantly.

2) One can increase the extensibility considerably, either
to more precision or to modes not currently supported.

And the last includes reliable exception detection, which is NOT
what IEEE 754 specifies, but which is an almost trivial tweak.

Extra performance? As you say, how would you use it?


Regards,
Nick Maclaren.

Stefan Monnier

unread,
Feb 18, 2011, 5:02:58 PM2/18/11
to
>> For particular applications, maybe, but for general computing I'm skeptical.
[...]

> What they don't know is how to make use of it in currently serial
> workloads. That's a separate matter.

I see we agree.


Stefan

Jason Riedy

unread,
Feb 18, 2011, 5:15:03 PM2/18/11
to
And Robert Myers writes:
> I'll respond to you because you seem the least likely to come back
> with a smug answer.

I'm a young'un. ;)

> Why do people even talk about fiddling with floating point hardware,
> when there is no way to feed perfectly feasible floating point
> hardware that already exists (for example, GPU type hardware in
> situations where repeatedly hammering on data already in cache isn't
> feasible)?

What both Nick and I are proposing may well slow down the
computation of a single result.

A potential benefit (as I and a few exascale folks see it) is in
using lower precision whenever feasible, reducing the amount of
data transferred and the associated power usage. David Keyes has some good
slides on the memory problems at exascale, but I don't know a
good link. (Video at http://hdl.handle.net/1853/36948, but it's
an hour long.)

Another potential benefit is in blowing a little extra time to
provide numerically *better* results without adding to CPU
complexity. With Ax=b, you can achieve componentwise accuracy of
about 1 ulp with a little doubled precision unless your problem is
quite hairy, and then you can detect failure. (With a chance of
failure of the whole shebang. It's not a verified / interval
method.)
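
A minimal C sketch of the mixed-precision idea (an illustration, not
Jason's code): keep A, x, b in single precision to cut data movement,
but accumulate the refinement residual r = b - A*x in double. The
correction solve and the x update are omitted; roughly speaking, the
extra-precision residual is what lets refinement recover small
componentwise errors.

#include <stddef.h>

/* Residual for Ax=b with single-precision data, double accumulation.
   A is n x n, row-major. */
void residual_sp_dp(size_t n, const float *A, const float *x,
                    const float *b, double *r)
{
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;                     /* double-precision accumulation */
        for (size_t j = 0; j < n; j++)
            acc += (double)A[i*n + j] * (double)x[j];
        r[i] = (double)b[i] - acc;
    }
}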

And a factor of 2-4 less power used in memory would be welcome by
the folks building massive systems. ;) That's an area where each
small percentage improvement translates into serious cost
savings, just as a 5% improvement can shave a day off of a computing
job. Whether these ideas are useful on a small machine, I don't
know.

> As to bringing the processor closer to memory, I don't get it at all.
> Sure, lots of things to do with "Ax=b." (nice way to avoid further
> propagandizing linpack. thanks for the suggestion).

It's more than that. I also mean sparse systems, iterative
methods, etc. LINPACK kinda sucks the air from those, too.

> For my problem though, some data is going to be in one memory
> bank and the data it needs to be FMAC'd with may be on the
> other side of the cluster. What's the point of PIM tricks? If
> you have to move the data, anyway, you might just as well drag
> it into cache.

This isn't at all a dumb observation. If you're going to *re-use*
the data, you should pull it into some local memory. The Cell
made programmers' lives hell by insisting on manual control of
what's pulled, but used relatively little power. Caches blow power
by guessing but are easy to use.

For my problems, unstructured graph analysis that isn't easily
expressible in terms of a sparse matrix-sparse vector product
(BFS), caches almost always guess wrong, I guess worse, and I
don't *re-use* much of the large data.

This isn't related to the arithmetic tricks above; this is a
separate topic.

The "PIM trick" isn't being close to one memory but rather having
many memory pipes available, more than you can fit out of a
single chip. If the P part of that is built to tolerate latency,
you can use some fraction of the aggregate bandwidth of all the
pipes. It's a cluster on a board / in a rack, more or less, with
the assumption (imho) that all memory is globally addressable and
transferred in relatively small chunks.

(Limited by various bottlenecks, like network bisection
bandwidth, contention... Those shouldn't be ignored in real life,
but smaller scale (<1024 node) systems can reduce the impact of
some potential bottlenecks.)

> I have yet to hear anyone propose anything that reduces data
> movement except in those (to me, artificial) cases where the
> data are *already* localized.

For many graph problems, if you know a good way to reduce the
movement, you pretty much have the result you're trying to
compute. I'd love to phrase that formally and prove it...

Jason

MitchAlsup

unread,
Feb 18, 2011, 5:44:22 PM2/18/11
to
On Feb 18, 12:26 pm, n...@cam.ac.uk wrote:
> For example, one very important requirement that is completely
> denied by most 'computer scientists' (including most of the RISC
> fanatics) is multiplication as an addressing operation.  What
> does it need and not need?

Note, integer multiplication on the 88K was: a) 3 cycles of latency,
b) 32*32->32-bits, c) no overflow checking...

You could check for overflow (signed or unsigned) with an 11-
instruction 12 cycle latency string of instructions--which was on par
for the MIPS integer multiplication hardware.
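
For readers following along, a hedged C sketch of the kind of software
check being described - not the 88K or MIPS sequence, just the idea of
widening the product and testing whether it survives truncation:

#include <stdint.h>

/* Illustrative only: detect signed 32x32 overflow via a widening multiply. */
int mul32_checked(int32_t a, int32_t b, int32_t *prod)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    *prod = (int32_t)wide;
    return wide == (int64_t)*prod;   /* 1 = no overflow, 0 = overflowed */
}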

>     It needs to be handled as part of the integer/addressing
> pipeline, if floating-point is separate.  PA-RISC and others got
> that one badly wrong.

It was not part of the main pipeline, but a separate pipeline (more
like a CDC 6600 computation unit).

>     It needs to be moderately efficient, but not hugely so, as
> it is almost always followed by a memory access, and dominates
> only a few applications (nowadays).

3 cycles fast enough for you? Could be done in 2 cycles today--so AGen
would take 3 total cycles when multiplication was performed.

>     It does not need to handle any result larger than the address
> size, and can reasonably be optimised for one of the operands
> being fairly small.

It's the sub-1% multiprecision stuff that causes multiplication and
division to be so "out of kilter" with respect to data path design.

>     Ideally, it should detect overflow, but need only do so as
> part of the whole calculation, and efficiency of trapping that is
> not a concern.

Why not just detect this with page faults? Or project the
multiplication into a 33-bit address space so page faults WILL happen!
(since only 32-bits would/could be translated.)

> And the last is ordinary integer multiplication,
> with its sign and overflow issues.

What languages require integer multiplication to have sign and
overflow detection?

Mitch

Quadibloc

unread,
Feb 18, 2011, 8:50:54 PM2/18/11
to
On Feb 18, 11:26 am, n...@cam.ac.uk wrote:

> For example, one very important requirement that is completely
> denied by most 'computer scientists' (including most of the RISC
> fanatics) is multiplication as an addressing operation.  What
> does it need and not need?

Generally speaking, datatypes on computers today tend to have power-of-
two lengths.

Thus, multiplication as such is used in addressing only in cases where
you have, say, things like...

DIMENSION A(31,73)
CHARACTER*11 B(15)

Unless one is designing a machine to handle a very high-level
instruction set, one doesn't have a way to just dump the subscripts
into index registers in a case like this. Even if, like the 68000, you
_can_ say "shift this index over 3 bits when you use it, because it
points to one of many double-precision floats".

Thus, in one sense, "multiplication as an addressing operation"
doesn't quite make sense to me.

However, you probably didn't mean it in that sense. In the sense that
you _did_ mean it, which allows for explicit multiplication
instructions in the code, yes, I agree that a fast fixed-point
multiply is necessary.
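
A hedged C rendering of why the multiply is there at all (illustrative
names only): for DIMENSION A(31,73) and CHARACTER*11 B(15), column-major
and 1-based, the element offsets need multiplies by the non-power-of-two
extents 31 and 11, which no shift can replace.

#include <stddef.h>

/* Offset (in elements) of A(i,j) and (in bytes) of B(k). */
size_t offset_A(size_t i, size_t j) { return (i - 1) + (j - 1) * 31; }
size_t offset_B(size_t k)           { return (k - 1) * 11; }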

John Savard

Quadibloc

unread,
Feb 18, 2011, 8:53:02 PM2/18/11
to
On Feb 18, 11:52 am, Jason Riedy <ja...@lovesgoodfood.com> wrote:
> And Quadibloc writes:
> > I would have assumed separate multiplier units for each supported
> > precision --- [...]
>
> That defeats my purpose of supporting more precision
> possibilities than the hard-wired ones with a steady (and
> analyzable) degradation of in-cpu performance.

That may be. I have a different purpose - supporting the hard-wired
ones with the maximum performance attainable in silicon. (Or Indium
Phosphide if I can get it.) I'm assuming physics is more of a priority
than number theory.

John Savard

Andy "Krazy" Glew

unread,
Feb 18, 2011, 9:21:12 PM2/18/11
to
On 2/18/2011 9:20 AM, nm...@cam.ac.uk wrote:
> In article<7decc39a-88c0-49c0...@c10g2000vbv.googlegroups.com>,
> Quadibloc<jsa...@ecn.ab.ca> wrote:

> The performance of many systems is that there is little gain from
> (say) a 64x64 multiplication over a fully pipelined 32x32 one
> (or double those for modern pancake-sized chips). And it isn't
> hard to design primitives for building 64x64 out of 32x32 - the
> need for branches is almost entirely because things that are
> trivial in hardware are evil in most ISAs.

Huh?

Wrong.

Andy "Krazy" Glew

unread,
Feb 18, 2011, 9:22:41 PM2/18/11
to

The same pipestage that Microunity and I want to use for the thousand
or so bits of a superaccumulator.

Hard to do OOO, though.

Andy "Krazy" Glew

unread,
Feb 18, 2011, 9:27:37 PM2/18/11
to

Because really funky large irregular hidden registers are hard to handle
in an OOO system.

Easy enough in-order. Hard OOO.

Not impossible. However, I am not aware of any modern OOO machine that
does a really good job of handling irregular register sizes. 64 bit
int, 128 or 256 bit SIMD packed vector. A tip of the hat to segment
register renaming.

Heck, most Intel OOO processors to date have placed integer and SIMD in
the same PRF, "wasting" 50% or 75% of the PRF space. Ditto store
buffers. Sandybridge?

AMD seems to have had separate PRFs for int and FP.

---

One of the reasons I got into RISC was that having a nice uniform
physical register file, the kind Mitch likes in the 88K, makes it just
plain easier to do OOO.

Andy "Krazy" Glew

unread,
Feb 18, 2011, 9:30:29 PM2/18/11
to
On 2/18/2011 5:50 PM, Quadibloc wrote:
> On Feb 18, 11:26 am, n...@cam.ac.uk wrote:
>
>> For example, one very important requirement that is completely
>> denied by most 'computer scientists' (including most of the RISC
>> fanatics) is multiplication as an addressing operation. What
>> does it need and not need?
>
> Generally speaking, datatypes on computers today tend to have power-of-
> two lengths.


Except... MIPS, ARM, most other "embedded" processors and DSPs, have
"accumulators" that are, e.g. 24, 40, 48, 72, 80, etc. bits in width.

x86 has resisted this trend.

This trend was most emphatic a few years ago. Now, as people switch to
using FP, it seems to be dead-ending.

Andy "Krazy" Glew

unread,
Feb 18, 2011, 10:15:16 PM2/18/11
to

Actually, David Shaw et al have shown that you should NOT just drag it
into cache. They have shown that the best place to do such calculations
is roughly the centroid of the data locations and code locations, with
some weight functions on the arcs.

They showed this best for their molecular dynamics machine - which I
think has as big a global communications problem as you do, Robert.

I think the result will apply to all computations.

Jason Riedy

unread,
Feb 18, 2011, 10:19:00 PM2/18/11
to
And Andy Glew writes:
> Because really funky large irregular hidden registers are hard
> to handle in an OOO system.

Exposing my utter (software perspective) ignorance: Don't OOO
systems rely on many hidden (renaming) registers? What is the
difference between register classes?

(Which isn't to detract from the pure software exercise of
considering the primitives from which linguistic arithmetic
operations may be constructed. Dammit. I want my quad precision
FP at no more than 4x worse than double precision performance,
being generous.)

Jason

Quadibloc

unread,
Feb 18, 2011, 10:28:20 PM2/18/11
to
On Feb 18, 10:20 am, n...@cam.ac.uk wrote:

> Don't bet on it.  They do.  It relates to the precision, overflow
> and sign handling, and rounding (think Cobol).

It is true that the decimal multiplication unit is likely to behave
differently from the binary multiplication unit. Since decimal
multiplication normally takes place in programs that are input-output
intensive, and not arithmetic intensive, in the same computer where a
full-blown Wallace tree is used for binary multiplication, decimal
arithmetic might be handled by a two-digit wide ALU.

Of course, since transistors are cheaper now than they were back in
the System/360 days, it's entirely possible that a somewhat higher-
performance decimal arithmetic unit will be what we will see in
microprocessors... when they start getting decimal arithmetic units.
(That is, mainstream ones. IBM makes single-chip or even multicore
processors with decimal arithmetic already, but they go into IBM's own
computers.)

Paying a severe performance penalty for flexibility in the arithmetic
unit you're not doing computational fluid dynamics with isn't
something I'd particularly object to. But as decimal arithmetic units
start to grow bigger, and higher performance is expected of them,
retaining flexibility in respect of precision is going to become
increasingly awkward for them too.

I'm not going to claim I know what the right decision will be in that
area.

John Savard

Jason Riedy

unread,
Feb 18, 2011, 10:55:11 PM2/18/11
to
And Quadibloc writes:
> It is true that the decimal multiplication unit is likely to behave
> differently from the binary multiplication unit.

No. No, it isn't.

But find the funding folks willing to risk this....

Jason

Andy "Krazy" Glew

unread,
Feb 18, 2011, 11:41:46 PM2/18/11
to
On 2/18/2011 9:04 AM, Quadibloc wrote:
> On Feb 18, 4:46 am, n...@cam.ac.uk wrote:
>
>> Multiplication is a clear case of something that can be broken down
>> into primitives, usefully and efficiently, but only if the actual
>> engineering requirements are paramount. The benchmarketeers demand
>> that the detailed operations they need are completely in hardware,
>> and the RISC fanatics were not prepared to provide the functionality
>> needed. And, yes, the primitives are at a fairly high level.
>>
>> Floating-point is another.

>
> Break down multiplication into primitives?


Such micro-optimizations are one of my favorite topics, although I look
on them suspiciously. Nevertheless, I was moved to write a wiki page,
and I hope that you folks will add more examples for the wiki. (I wish
I had security sufficient to allow public access, but having been
spammed twice, I am reluctant.)

http://semipublic.comp-arch.net/wiki/Micro-optimization_primitive_instructions


By [[micro-optimization primitive instructions]] I mean instructions
that expose the internal steps and partial results of what are otherwise
considered to be primitive instructions.

One of the best academic and published examples is in the Dally paper
referenced at the bottom:
[[Micro-optimization of floating-point operations]].
However, other examples have been found in actual instruction sets,
particularly before transistor counts increased to the point where
fully pipelined floating point [[FADD]] and [[FMUL]] instructions were
common.

= Examples of Floating-Point Micro-Optimization Instructions (from Dally
paper) =

Examples from Dally paper: (transcribed not quite exactly)

;Automatic Block Exponent
:Identify the largest exponent in a cascaded addition, align all
mantissae to it, and add without intermediate normalization. (A small
C sketch of this idea follows the instruction list below.)
* Saves: power of intermediate normalizations and roundings, unnecessary
shifts
* Cost: precision, unless extra mantissa width. See also [[block
floating-point]], [[superaccumulator]].

;Shift Combining
:An alternative to automatic block exponent, used in cases where order
of operations must be preserved: avoids repeatedly shifting left for
normalization and then shifting right for alignment.
:I.e. it combines the normalization and alignment shifts.
* Saves: power in redundant shifts.
* Cost: precision, sometimes

;Post Multiply Normalization
:Maintain extra guard bits to the right of the mantissa, to avoid the
normalization shift, since multiplication of normalized numbers can
denormalize by at most one bit position. (Denorm*norm can produce
greater shifts.)

;Conventional Optimizations
:E.g.
(A + B) * (A - B)
can [[CSE (common sub-expression)]] the alignment of A and B.
* Saves: power
* Cost: the exposed micro-optimization, possibly requiring a few extra
bits of mantissa and exponent.

;Scheduling
:Exposing primitive operations permits more effective optimization by
compiler (or, for that matter, by a dynamic or [[OOO]] scheduler).

Dally suggests a fairly horizontal instruction format for these
microoptimizations, consisting of
* Exponent Ops
** E+, E- - exponent add/subtract (EC <- EA op EB)
** FF1 - returns shift required to normalize mantissa MA (EC <- FF1 MA)
** LDE, STE - load or store exponent as integer
* Mantissa Ops
** M+, M-, M* - mantissa add, subtract, multiply (MC <- MA op MB)
** SHR, SHL - mantissa shift right or left; supports [[shifting by a
negative count]] in either direction
** ABS, NEG - zeroes and complements the mantissa sign bit
** LDM, STM - loads and stores mantissa as integer
** LDF, STF - load and store mantissa and exponent together as standard
floating point number
* Branch Ops
** BNEG - [[branch on exponent negative]]
** Bcond - [[branch on exponent and mantissa compare]] (EA,MA) relop
(EB,MB)
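
A hedged C sketch of the automatic-block-exponent entry above (an
illustration, not Dally's ISA): pick the largest exponent, align every
mantissa to it, accumulate in fixed point, and normalize once at the
end. Real hardware saves the per-add normalize/round; plain C can only
mimic the data flow and the precision trade-off.

#include <stdint.h>
#include <math.h>

/* Cascaded sum with one shared (block) exponent.  32 fraction bits of
   headroom below the common exponent; assumes n is modest, so the
   64-bit fixed-point accumulator does not overflow. */
double block_exponent_sum(const float *x, int n)
{
    int emax = -10000, e;                        /* sentinel: all zeros */
    for (int i = 0; i < n; i++) {                /* find the block exponent */
        frexpf(x[i], &e);
        if (x[i] != 0.0f && e > emax) emax = e;
    }
    if (emax == -10000) return 0.0;
    int64_t acc = 0;
    for (int i = 0; i < n; i++)                  /* align and add, no per-add rounding */
        acc += (int64_t)llrintf(ldexpf(x[i], 32 - emax));
    return ldexp((double)acc, emax - 32);        /* single normalization at the end */
}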

= Other Possible Micro-optimizations =

== Booth Encoding ==

Many processors use Booth encoding and multiplier arrays,
for both integer and floating point.

Booth encoding involves

(a) preparing multiples of one operand. Unfortunately, the actual
multiples depend on how advanced the multiplier is - {-1,0,+1},
{-2,-1,0,+1,+2}, 3X, 4X, etc.

(b) using the second operand to select which multiples of the first
operand are to be fed into the array for each digit group.

We can certainly imagine [[CSE]]ing the work in preparing such
multiples, for either operand.
Unfortunately, the multiples needed change with the exact multiplier
used - one microarchitecture might require -1X and 3X multiples
(1X and 2X are "free", being shifts), while another might require 5X, etc.
Therefore, the number of bits needed might change from chip to chip.

We can imagine storing these large intermediate results in a SIMD packed
vector, via an instruction that looks like:
vreg256b <- booth_encode(64b)

Heck - that might be a useful operation to provide even if not doing
microoptimization. It has uses.

The details of the Booth encoding used might be hidden.
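
A hedged C sketch of what such a booth_encode might produce, assuming
the common radix-4 recoding (digits in {-2..+2}); real designs differ
in radix and in which hard multiples (3X, 5X, ...) they keep:

#include <stdint.h>

/* Radix-4 Booth recoding of a 64-bit multiplier into 33 signed digits,
   i.e. the multiple-select values a multiplier array would consume. */
void booth4_encode(uint64_t y, int8_t digit[33])
{
    int prev = 0;                                /* implicit y[-1] = 0 */
    for (int i = 0; i < 33; i++) {
        int b0 = (i*2     < 64) ? (int)((y >> (i*2))     & 1) : 0;
        int b1 = (i*2 + 1 < 64) ? (int)((y >> (i*2 + 1)) & 1) : 0;
        digit[i] = (int8_t)(b0 + prev - 2*b1);   /* digit in {-2..+2} */
        prev = b1;
    }
}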

However, perhaps an easier approach might be to do this
microoptimization in microarchitecture:
e.g. have a [[dynamic scheduler]] recognize several multiplications that
share a common first operand,
and then decide to skip the [[Booth encoding]] pipestage.
The skip might shorten the pipeline - or it might just be used to save
power.

This amounts to [[memoizing]] the Booth encoding.

== Virtual Memory Translations ==

There are many, many, repeated translations of nearby virtual addresses
that result in the same physical page address.

; Caching Translations with Base Registers
:E.g. Austin and Sohi suggested caching the translation next to the base
register, thereby avoiding the TLB lookup and supporting more
translations on a low-ported TLB.
<UL>
Todd M. Austin and Gurindar S. Sohi. 1996. High-bandwidth address
translation for multiple-issue processors. In Proceedings of the 23rd
annual international symposium on Computer architecture (ISCA '96). ACM,
New York, NY, USA, 158-167. DOI=10.1145/232973.232990
http://doi.acm.org/10.1145/232973.232990
</UL>
Equivalently, power can be saved.


; Instructions to Save Translations
We can imagine that this might be exposed to software as an instruction
that looks like
treg := [[translate_virtual_address]]( virtual address )
dreg := load_using_physical_address( treg + offset )
store_using_physical_address( treg + offset)

The first instruction saves the translation, and any permissions bits
required, in treg.
treg might be an ordinary address sized integer register, or a special
register to hold such a translation.

Note: this instruction, [[translate_virtual_address]], is often
requested even outside the context of micro-optimization.
See [[uses of an instruction to save virtual to physical address
translation]].

The load and store operations use a saved translation. I would imagine
that they would trap if the memory location specified by offset and
size fell outside the saved translation.

NOTE: in a [[virtual machine]] environment, a Guest OS executing this
instruction would get at most a partial benefit.

;User Mode Instructions to Save Translations
Instructions such as [[load_using_physical_address]] can only be used by
privileged code - if software can construct arbitrary physical addresses.

But if the saved translation lives in a special treg, translation
register, that can only be written by certain instructions,
then unprivileged code could employ this instruction.
Equivalently, this could be a [[privilege tagged capability register]] -
possibly an ordinary register, with an extra bit that
gets set by the translate instruction.
The physical memory accesses would require that bit to be set.
Other operations on such a register might clear the tag bit.

This is how [[capability registers]] work for other capabilities;
physical addresses are just an extra capability.

Except... physical addresses can change, e.g. during [[virtual memory]]
operations such as [[page fault]]s and [[COW]].

(1) we could expose the [[translation registers]] to the OS, which can
treat them as an extended TLB, changing them as necessary.
: This is not unlike what old OSes needed to do when [[multitasking
using base registers]].

(2) However, we might then need to re-translate. This could be
accomplished by storing the original virtual address inside the treg
[[translation register]].

I.e. we have circled back to Austin and Sohi - except that instead of
the physical address being a cache, it is more exposed.

Not completely exposed - it is a [[covert channel security hole]] to
expose physical addresses.
But it is an explicitly acknowledged, albeit opaque, data quantity.


== Division Step, etc. ==

Any "step" operation, such as divide-step, multiply-step, sqrt-step,
can be considered as a micro-optimization primitive.

With the concomitant problem that the algorithm may vary between
versions of a processor.

== TBD: other micro-optimizations ==

TBD

= References =
W. J. Dally. 1989. [[Micro-optimization of floating-point operations]].
SIGARCH Comput. Archit. News 17, 2 (April 1989), 283-289.
DOI=10.1145/68182.68208 http://doi.acm.org/10.1145/68182.68208

W. J. Dally. 1989. Micro-optimization of floating-point operations. In
Proceedings of the third international conference on Architectural
support for programming languages and operating systems (ASPLOS-III).
ACM, New York, NY, USA, 283-289. DOI=10.1145/70082.68208
http://doi.acm.org/10.1145/70082.68208

Andy "Krazy" Glew

unread,
Feb 19, 2011, 2:03:00 AM2/19/11
to
On 2/18/2011 9:20 AM, Jason Riedy wrote:

> And Quadibloc writes:
>> Break down multiplication into primitives?
>
> You betcha. Has nothing to do with flat-out single-unit FLOP performance
> (sharing the same multiplier anyways) and more to do with algorithmic
> flexibility, bandwidth, and power (mostly spent on memory transfer).
>
> The FLOP/memory op ratio already is absurdly large and only grows. We
> can use that imbalance to use fewer resources when possible, provide
> more accuracy when necessary, support different arithmetics (binary,
> decimal, other roundings, that all share the same core pieces)...
>
> Breaking multiplication into pieces permits stitching together
> double-precision multiplication in only twice the time of single, quad
> in four times the time of single using the same unit. Addition's a
> possible gotcha, but the error-free transformation route is quite
> interesting.

>
> When we can maintain telescoping precisions with somewhat evenly
> increasing costs, mixed-precision computations can adapt to the data at
> hand. Using mixed precision whenever possible widely is considered
> necessary for exaflop-scale systems. For some problems (including the
> quite-important Ax=b), you transfer less data and use fewer
> computational resources whenever possible while maintaining result
> quality and asymptotic complexity. Since Ax=b still drives that part of
> the computing space, this will happen and may benefit everyone.
>
> That said, I haven't pushed through what primitives are necessary. I'd
> work down from multiplied precisions (doubled-double, quaded-double,
> etc.) and Rump's "error-free transformations"[1] while trying to handle
> an extended exponent range. Perhaps there's a blocker somewhere that
> renders this whole idea pointless, but it's worth a serious look.
>
> Jason
>
> [1] Long-known techniques for multiplied precisions, but his phrasing
> and extension to faithfully rounded sums is quite elegant.


Using the smallest numbers you can get away with is a good thing. E.g.
FP16 rather than FP32; Intel GenX GPU even has an FP8 data type.

Smaller precisions reduce memory bandwidth.

LRB only proposed to load and store FP16, converting into FP32
internally. Methinks you should probably be highly interested in
operating on FP16 internally. Or, maybe not FP16
- but perhaps you want accumulators.

FP32 += FP16

FP32 += FP16*FP16.

In integer, it is easier to see what you need for such mixed precisions:

I_n+x += In, for some number x of bits that gives sufficient protection
against overflow.

I2n := In * In

I_2n+x := In*In + In

I_2n+x += In*In

Similarly for floating point.

I think we should have such instructions/operations.

At least in vectors. Possibly in scalars, if we can turn off unused bits.

Mike Haertel suggested it was simpler to always just say I4n += In*In.
Yes, it is simple - but it seems overkill to say I128 += I32*I32.
I80 += I32*I32 is much more reasonable.
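
A hedged C sketch of the I_2n+x += In*In pattern (names illustrative;
__int128 is a GCC/Clang extension standing in for a hypothetical wide
accumulator register such as the I80 above):

#include <stdint.h>

/* Accumulate 32x32-bit products into a wider-than-64-bit accumulator,
   so no per-product overflow check or rounding is needed. */
__int128 dot_i32(const int32_t *a, const int32_t *b, int n)
{
    __int128 acc = 0;                    /* I_2n+x, with x = 64 extra bits here */
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];     /* I2n := In * In, then accumulate */
    return acc;
}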

---

Although I have long been intrigued by micro-optimizations such as
exposing the parts of a multiplier, I am not sure that is the way to go.
Eliminating the redundant shifts involved in aligning and
normalizing seems sensible - but this is why I am interested in
superaccumulators.


---

If you go far along this path, you want to have reconfigurable logic, so
you can use precisely the size of data your algorithm requires:
3 bits here, 17 there, etc.

Andy "Krazy" Glew

unread,
Feb 19, 2011, 2:05:13 AM2/19/11
to

I suppose I am interested in creating hardwired functional units that
support more precision possibilities, and the least precision possible.

Certainly conversions.

Probably more flavors of precision, including mixed precision
operations, that can fit in the opcode of a 32-bit instruction.

nm...@cam.ac.uk

unread,
Feb 19, 2011, 3:24:57 AM2/19/11
to
In article <5a330e16-5fd8-438b...@n1g2000yqm.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>
>> For example, one very important requirement that is completely
>> denied by most 'computer scientists' (including most of the RISC
>> fanatics) is multiplication as an addressing operation. What
>> does it need and not need?
>
>Note, integer multiplication on the 88K was: a) 3 cycles of latency,
>b) 32*32->32-bits, c) no overflow checking...
>
>You could check for overflow (signed or unsigned) with an 11-
>instruction 12 cycle latency string of instructions--which was on par
>for the MIPS integer multiplication hardware.
>
>> It needs to be handled as part of the integer/addressing
>> pipeline, if floating-point is separate. PA-RISC and others got
>> that one badly wrong.
>
>It was not part of the main pipeline, but a separate pipeline (more
>like a CDC 6600 computation unit).

And that's a killer for a few applications. The dogma is that
it isn't needed because compilers can do strength reduction, but
there are some algorithms and applications where that doesn't
work, and which use a LOT of multi-dimensional array indexing.
One user came to me with such a problem and I said "Sorry, but
this is PA-RISC" - in more detail, of course.

>> It needs to be moderately efficient, but not hugely so, as
>> it is almost always followed by a memory access, and dominates
>> only a few applications (nowadays).
>
>3 cycles fast enough for you? Could be done in 2 cycles today--so AGen
>would take 3 total cycles when multiplication was performed.

Yup. No problem. It's the 10+ or the pipeline drains that are the
problems.

>> It does not need to handle any result larger than the address
>> size, and can reasonably be optimised for one of the operands
>> being fairly small.
>
>Its the sub-1% multiprecision stuff that causes multiplication and
>divisiion to be so "out of kilter" with respect to data path design.

I am not surprised.

>> Ideally, it should detect overflow, but need only do so as
>> part of the whole calculation, and efficiency of trapping that is
>> not a concern.
>
>Why not just detect this with page faults? or project the
>multiplication into a 33-bit address space so page faults WILL happen!
>(since only 32-bits would/could be translated.)

Because that doesn't work! Overflow often causes an index in one
dimension to overflow another or, worse, to trash a neighbouring
or even remote object. And experience from the 1970s is that some
50% of overwriting bugs in array-based languages were originally
due to untrapped integer overflow.

Of course, in languages like C (and even C++), the pointer errors
dominate, but those are rare in array-based languages.

>> And the last is ordinary integer multiplication,
>> with its sign and overflow issues.
>
>What languages require integer multiplication to have sign and
>overflow detection?

All require sign detection, but it can be ignored for multiplication
(not division) if using two's complement.

Overflow: Cobol and any language that takes RAS seriously. Damn
few nowadays, I agree, but one would hope that architectures would
not actively handicap the provision of known important RAS features.
One would be disappointed :-(

Note that there is NO reason that it should slow things down. Many
years of experience has shown that trap-diagnose-and-terminate is
all that is needed for 99% of applications.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Feb 19, 2011, 3:34:40 AM2/19/11
to
In article <2d8a6429-4f42-41b3...@a28g2000prb.googlegroups.com>,

Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>> For example, one very important requirement that is completely
>> denied by most 'computer scientists' (including most of the RISC
>> fanatics) is multiplication as an addressing operation.  What

>> does it need and not need?
>
>Generally speaking, datatypes on computers today tend to have power-of-
>two lengths.

No, not at all. There are all of the following to consider:

1) As soon as you have derived types (Fortran), structures (C)
or classes (C++), you have other sizes - and most modern codes do.

2) In languages that have decent array support (mainly Fortran,
but Bjarne's add-on for C++ is acceptable), a compiler can't assume
that and so must either handle the general case or generate many
different variants. And even three 3-D arguments gives 512
possibilities!

3) Powers of two array sizes for multi-dimensional arrays are
doubleplus ungood as far as caching and TLBs are concerned. Quite
a lot of programs deliberately pad them to other sizes to improve
performance.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Feb 19, 2011, 3:52:45 AM2/19/11
to
In article <_4adnbRGm4-HtMLQ...@giganews.com>,

Andy \"Krazy\" Glew <an...@SPAM.comp-arch.net> wrote:
>
>> The performance of many systems is that there is little gain from
>> (say) a 64x64 multiplication over a fully pipelined 32x32 one
>> (or double those for modern pancake-sized chips). And it isn't
>> hard to design primitives for building 64x64 out of 32x32 - the
>> need for branches is almost entirely because things that are
>> trivial in hardware are evil in most ISAs.
>
>Huh?
>
>Wrong.

Which assertion are you claiming that is wrong?

I have done such analyses and have even beaten the built-in
multiplication on several systems (Terje has, too) - and note that
I said "many" not "all" or even "most". Also, I said to compare
64x64 versus 128x128 on modern pancake-sized chips.

As far as primitives are concerned, consider the following:

32x32 => top half
32x32 => bottom half
Accumulate 32 into 64/128 at arbitrary location mod 32

Bottom half 64x64 then becomes 8 instructions, with very weak
ordering constraints (the accumulations associate and commute).
128x128 is 32 instructions.

Yes, that's not a complete design, especially not for floating-
point, but is enough to justify my statement.


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Feb 19, 2011, 9:44:45 AM2/19/11
to

It is indeed trivial to construct wider multiplies if you have both
N*N->2N and ADD/ADC.

Things become harder without both of these.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
Feb 19, 2011, 9:56:44 AM2/19/11
to
Quadibloc wrote:
> On Feb 18, 10:20 am, n...@cam.ac.uk wrote:
>
>> Don't bet on it. They do. It relates to the precision, overflow
>> and sign handling, and rounding (think Cobol).
>
> It is true that the decimal multiplication unit is likely to behave
> differently from the binary multiplication unit. Since decimal
> multiplication normally takes place in programs that are input-output
> intensive, and not arithmetic intensive, in the same computer where a
> full-blown Wallace tree is used for binary multiplication, decimal
> arithmetic might be handled by a two-digit wide ALU.

Pure (integer/fixed-point) decimal arithmetic (possibly not true for FP
decimal?) should _always_ be handled in 9 (32-bit) or 19 (64-bit) digit
chunks:

The time to convert to/from such chunks to binary is so low that simply
using the normal binary ALUs is cheaper/faster than anything else except
full custom HW.

AMD used to quote in their optimization manuals (unfortunately without
attribution :-() my 32-bit binary-to-decimal conversion code which needs
30-50 cycles to convert a full 32-bit unsigned value to 10 decimal digits.

Converting in the other direction is of course both easier and faster.
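
For reference, the naive division-loop version of that conversion
(illustrative only - the 30-50 cycle routine Terje mentions presumably
replaces the divides with multiplications by reciprocals and other
tricks):

#include <stdint.h>

/* Convert a 32-bit unsigned value to 10 decimal digits, zero-padded. */
void u32_to_dec(uint32_t v, char out[11])
{
    for (int i = 9; i >= 0; i--) {
        out[i] = '0' + (v % 10);         /* peel off the least significant digit */
        v /= 10;
    }
    out[10] = '\0';
}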

nm...@cam.ac.uk

unread,
Feb 19, 2011, 10:10:59 AM2/19/11
to
In article <qj1538-...@ntp6.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>It is indeed trivial to construct wider multiplies if you have both
>N*N->2N and ADD/ADC.
>
>Things become harder without both of these.

Yes. I would prefer separate bottom and top half results, possibly
with tweaks, for reasons that you will know. That's a little more
complicated, but not much (after all, the latter is what floating-
point multiplication does!)

In my experience, most of the speed advantage of the built-in large
multiplication comes from not having to insert the extra grobble
to test for overflow (i.e. carry). But that would be trivial for
the hardware to provide in the ISA, and that is precisely what I am
proposing.

But, for some reason, far too many architectures make it unnecessarily
hard to do wider multiplication - and even more, iterative division,
in software. I think that it's insane ....

Andy "Krazy" Glew

unread,
Feb 19, 2011, 12:13:01 PM2/19/11
to


OOO relies on renaming registers, or, rather, registers to be renamed to
- which I'll call a PRF. At least one pool of renaming registers.

You can have multiple pools, multiple PRFs. But each such pool has
overhead. And, moreover, each such pool is a potential bottleneck.

So, for example, the original P6 had a PRF of whatever the architectural
+ microcode register file was (the RRF), plus 40 rename registers (the
ROB). Both integer and FP used the same ROB. ROB entries were
data-ful, 80+ bits for x87 FP, and therefore wasted 80-32 bits for
integers (pre 64 bits), and up to 80-8 bits for byte operations.

By the way: back on P6 we knew about the pros and cons of both. Our
original design had separate int/FP. But that consumed more area, so we
backed off to a unified int/FP design.

More modern machines have separate integer and SIMD packed vector PRFs.
(And they have data-less reorder buffers - which is a separate issue.)
But then you have to maintain a separate freelist for each. More
overhead.
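
A hedged C sketch of that per-pool bookkeeping (sizes and names
illustrative): each physically separate PRF carries its own free list,
and the renamer stalls when that particular pool runs dry, independently
of how much slack the other pools have.

#include <stdint.h>
#include <stdbool.h>

#define NPHYS 160                       /* physical registers in this one pool */

typedef struct {
    uint16_t free[NPHYS];
    int      top;                       /* number of free physical regs */
} freelist_t;

static void init_freelist(freelist_t *fl)
{
    for (int i = 0; i < NPHYS; i++) fl->free[i] = (uint16_t)i;
    fl->top = NPHYS;
}

static bool alloc_preg(freelist_t *fl, uint16_t *preg)
{
    if (fl->top == 0) return false;     /* pool exhausted: rename stalls */
    *preg = fl->free[--fl->top];
    return true;
}

static void release_preg(freelist_t *fl, uint16_t preg)
{
    fl->free[fl->top++] = preg;         /* previous mapping retired */
}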

The real cost is the bottlenecks. How many do you want of each PRF
type? 64 int + 128 FP? Maybe 32 int + 140 FP? (BTW, I keep saying
FP when I should be saying "SIMD packed vector" or "graphics" or "wide".)

Whatever you decide, you will penalize some workloads. Whereas, if you
can figure out a way to squish the two together, e.g. by saying that 64
int + 128 FP is approximately the same as 160 int/FP regs in the same
PRF (assuming 64 bit ints and 128 bit FP packed SIMD), you get a design
that can adapt to workloads at either extreme.

This is not so bad for ints and SIMD packed vectors - although different
in size, they are not that different. 32/80. 64/128. 64/256 anyone?

But say that you want a really big or unusual register type. Say a 1024
bit temp register inside a multiplier. Or one of my superaccumulators.

You almost definitely do not want to make all 64-bit integers occupy a
1024 bit wide register, wasting so much space. So, to rename such a
wide register you might want to give it its own PRF pool. But, how many
renaming copies are you going to give it? 2 total (1 extra)? 4? ....

It actually isn't really a size problem - it is a problem for any PRFs
that want to be physically separate and isolated from each other. Last
night I mentioned micro-optimizations using Booth encoding - which
usually fit within a 128 or 256 bit register. But if you want to avoid
the wires to send that all the way to the main PRF, if you want to have
a few saved Booth values inside the multiplier for the compiler to use,
then it is a separate pool. How many do you have? 2? 4? 8? Pretty
soon we are talking cost. And the wires to send it to a homogeneous PRF
cost, too.

---

Segment register renaming on x86 has occasionally been handled this way.
The first Intel implementations had 1 renaming copy per segment
register that could be renamed - SS and SS', DS and DS', etc. This
allowed you to write to the register once and then use it for a while,
but if you did something like

// use old value0 of DS
DS := value1
mov eax := DS:some address
DS := value2
...

You were guaranteed to stall. Yes, it was a bottleneck.

Even modern machines have a rather limited renaming pool for segment
registers.

---

So here's the bottom line: irregular hidden registers embedded in the
pipeline are bad for OOO. They are a side effect, additional dataflow.

You either pay the cost to route them to one of the main OOO PRFs.

Or you create a local PRF, and pay the cost to rename them. And then
pay the performance cost when this separate PRF pool overflows.

(Or don't rename, and stall on access. Which is overflow, but more
specific.)

So, if you want to do OOO, then your special hidden registers probably
better not be used all that often, or important to performance. Which
rather defeats the purpose.

At least, they should be read-mainly, because it is writing them that
causes stalls.

The ability to take advantage of such hidden registers embedded in the
pipeline is one of the big advantages of in-order versus out-of-order.

===


BTW Jason, given your interest in Exascale, you are probably okay on
limiting yourself to in-order.

Except.... everything I said about such special hidden registers and
OOO also applies to in-order multithreaded designs - like GPUs. If they
interleave threads in a lane every cycle, and take 40 cycles before they
can execute a dependent operation from the same thread, as some GPUs
do, and if every thread in the workload is using this hidden register,
then you need 40 copies of the hidden register. Times the number of lanes.

I can be stronger:

The ability to take advantage of such hidden registers embedded in the
pipeline is one of the big advantages of in-order single threaded
designs versus out-of-order and multithreaded in-order.

One famous computer architect once said to me "Once you have implemented
SMT or IMT, you are 90% of the way to OOO anyway."

Chris Gray

unread,
Feb 19, 2011, 1:37:09 PM2/19/11
to
MitchAlsup <Mitch...@aol.com> writes:

> What languages require integer multiplication to have sign and
> overflow detection?

My Zed language does. Same for add/subtract/divide. Remember, signed division
can overflow. I would love to be able to generate trapping instructions,
instead of having to do expensive explicit testing.

--
Experience should guide us, not rule us.

Chris Gray c...@GraySage.COM

nm...@cam.ac.uk

unread,
Feb 19, 2011, 1:13:45 PM2/19/11
to
In article <87r5b3v...@GraySage.com>, Chris Gray <c...@GraySage.com> wrote:
>MitchAlsup <Mitch...@aol.com> writes:
>
>> What languages require integer multiplication to have sign and
>> overflow detection?
>
>My Zed language does. Same for add/subtract/divide. Remember, signed division
>can overflow. I would love to be able to generate trapping instructions,
>instead of having to do expensive explicit testing.

In addition to Cobol, and almost certainly Ada and some Lisps,
you can add a lot of languages that are not compiled directly
into machine code: Python, SNOBOL, Spitbol, Mathematica, Perl
and others. You aren't alone.

A fair number of Fortran compilers used to, and I think that
several would, today, if modern architectures didn't actively
discourage integer overflow checking :-(


Regards,
Nick Maclaren.

Andy "Krazy" Glew

unread,
Feb 19, 2011, 2:01:57 PM2/19/11
to
On 2/19/2011 9:13 AM, Andy "Krazy" Glew wrote:
> On 2/18/2011 7:19 PM, Jason Riedy wrote:
>> And Andy Glew writes:
>>> Because really funky large irregular hidden registers are hard
>>> to handle in an OOO system.
>>
>> Exposing my utter (software perspective) ignorance: Don't OOO
>> systems rely on many hidden (renaming) registers? What is the
>> difference between register classes?
>>
>> (Which isn't to detract from the pure software exercise of
>> considering the primitives from which linguistic arithmetic
>> operations may be constructed. Dammit. I want my quad precision
>> FP at no more than 4x worse than double precision performance,
>> being generous.)
>>
>> Jason
>

> The ability to take advantage of such hidden registers embedded in the


> pipeline is one of the big advantages of in-order single threaded
> designs versus out-of-order and multithreaded in-order.
>
> One famous computer architect once said to me "Once you have implemented
> SMT or IMT, you are 90% of the way to OOO anyway."

This conversation helped me clarify, in my mind, the issue with special,
hidden, state not necessarily in a nice smooth register file.

It is not in-order versus out-of-order.

It is a question of how finely interleaved accesses are to such state -
of how many accesses may be pending at once.

Simple single threaded in-order machines need not have multiple
overlapped accesses.

OOO and fine grained MT in-order may have.

Quadibloc

unread,
Feb 19, 2011, 2:58:33 PM2/19/11
to
On Feb 19, 1:34 am, n...@cam.ac.uk wrote:

> No, not at all.  There are all of the following to consider:

Well, I did go on to note the first two of those. But since those are
user-created types that could be of arbitrary length, usually they're
not directly supported in hardware for indexing, the way the 68000
will shift an index to match the type of an operand.

And so I agreed that integer multiplication is necessary - to
facilitate handling array indexes in software. But it would not be
appropriate to have every memory-reference instruction have modes to
handle complicated arrays.

John Savard

Quadibloc

unread,
Feb 19, 2011, 3:03:26 PM2/19/11
to
On Feb 19, 7:44 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:

> It is indeed trivial to construct wider multiplies if you have both
> N*N->2N and ADD/ADC.

While, in general, the idea of splitting multiplication up into
smaller operations seems strange to me, integer multiplication with
double-length products is a normal type of instruction that should be
readily available. Even if the alternate kind is more commonly used in
generating code for higher-level languages.

John Savard

MitchAlsup

unread,
Feb 19, 2011, 3:10:58 PM2/19/11
to
On Feb 19, 2:24 am, n...@cam.ac.uk wrote:
> Note that there is NO reason that it should slow things down.  Many
> years of experience has shown that trap-diagnose-and-terminate is
> all that is needed for 99% of applications.

The reason and rationale for not putting overflow detection at the
instruction set level in the 88K time frame was that we would have run
out of instruction encoding bits. Such is the failure of blind-sided RISC
mantras.

Mitch

MitchAlsup

unread,
Feb 19, 2011, 3:20:03 PM2/19/11
to
On Feb 19, 9:10 am, n...@cam.ac.uk wrote:
> But, for some reason, far too many architectures make it unnecessarily
> hard to do wider multiplication - and even more, iterative division,
> in software.  I think that it's insane ....

Ok, what percentage of all applications running on all the powered-up
x86s worldwide this day, right now, this second, are being spent in
multiprecision anything?
A) 0.0001%
B) 0.001%
C) 0.0%
or
D) 0.1%

Mitch

Andy "Krazy" Glew

unread,
Feb 19, 2011, 3:23:56 PM2/19/11
to
On 2/19/2011 10:37 AM, Chris Gray wrote:
> MitchAlsup<Mitch...@aol.com> writes:
>
>> What languages require integer multiplication to have sign and
>> overflow detection?
>
> My Zed language does. Same for add/subtract/divide. Remember, signed division
> can overflow. I would love to be able to generate trapping instructions,
> instead of having to do expensive explicit testing.
>

This is the conclusion I have come to for most error detection:
trap, don't require checks.

Trap, as cheaply as possible, to a user level trap handler. Which
probably amounts to doing the equivalent of a CALL, with an instruction
pointer pointing to the uncompleted instruction that has trapped (or the
completed instruction, pointing after with pointer to before, or ...)

If necessary, make the default trap be disabled.

But trapping is the safe thing to do.

---

Following in the lines of the D Programming Language:
functions should throw an exception on an error, not return -1 or 0 or ...
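
As a present-day software stand-in for that, a hedged C sketch
(assuming GCC/Clang, whose __builtin_mul_overflow does the
widening-and-check; abort() here merely plays the role of the cheap
user-level trap handler):

#include <stdint.h>
#include <stdlib.h>

/* Checked multiply: the explicit software test being contrasted with a trap. */
static int32_t mul_or_die(int32_t a, int32_t b)
{
    int32_t r;
    if (__builtin_mul_overflow(a, b, &r))   /* overflow detected in software */
        abort();                            /* stand-in for a hardware trap  */
    return r;
}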

nm...@cam.ac.uk

unread,
Feb 19, 2011, 2:51:16 PM2/19/11
to
In article <f3e60361-3715-45ac...@o18g2000prh.googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>> No, not at all.  There are all of the following to consider:

>
>Well, I did go on to note the first two of those. But since those are
>user-created types that could be of arbitrary length, usually they're
>not directly supported in hardware for indexing, the way the 68000
>will shift an index to match the type of an operand.

Then we are at cross-purposes. I was saying that there needs to be
an integer multiplication that is part of the addressing pipeline,
not that it has to be an indexing mode. I.e. NOT the mistake made
by PA-RISC etc.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Feb 19, 2011, 2:59:16 PM2/19/11
to
In article <76cc1210-c540-427a...@o20g2000yqk.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>
>> But, for some reason, far too many architectures make it unnecessarily
>> hard to do wider multiplication - and even more, iterative division,
>> in software.  I think that it's insane ....

>
>Ok, What percentage of all applications running on all the powed up
>x86s worldwide this day, right now, this second, are being spent in
>multiprecision anything?
>A) 0.0001%
>B) 0.001%
>C) 0.0%
>or
>D) 0.1%

Quite likely 1%, with many applications spending above 10%, often
considerably above it. Public-key encryption and decryption and,
of course, attempts to break it :-)

Other than that, I agree that it is negligible, but RSA and similar
algorithms are not!

HOWEVER, if it were not so ridiculously expensive, more numerical
codes would use extended precisions than currently do, with the
consequent improvement in accuracy. That shouldn't be sniffed at,
as those codes are important out of all proportion to the number
of their users.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Feb 19, 2011, 3:07:45 PM2/19/11
to
In article <7_adnURKMfRCu_3Q...@giganews.com>,

Andy \"Krazy\" Glew <an...@SPAM.comp-arch.net> wrote:
>On 2/19/2011 10:37 AM, Chris Gray wrote:
>> MitchAlsup<Mitch...@aol.com> writes:
>>
>>> What languages require integer multiplication to have sing and
>>> overflow detection?
>>
>> My Zed language does. Same for add/subtract/divide. Remember, signed division
>> can overflow. I would love to be able to generate trapping instructions,
>> instead of having to do expensive explicit testing.
>
>This is the conclusion I have come to for most error detection:
>trap, don't require checks.
>
>Trap, as cheaply as possible, to a user level trap handler. Which
>probably amounts to doing the equivalent of a CALL, with an instruction
>pointer pointing to the uncompleted instruction that has trapped (or the
>completed instruction, pointing after with pointer to before, or ...)

Yes. There is absolutely no reason that such a mechanism should
be any slower than a mispredicted branch plus a function call.
Not cheap, but not catastrophic, and the operating system need
not be involved at all.

>If necessary, make the default trap be disabled.
>
>But trapping is the safething to do.

Agreed.


Regards,
Nick Maclaren.

Andrew Reilly

unread,
Feb 19, 2011, 5:27:30 PM2/19/11
to

The original question, as put, is a little unfair: probably >99% of the
applications running on powered-up (x86) computers are sitting in an idle
loop or power-save state, waiting for a key press or click or network
packet.

If you rephrase the question as the fraction of computers that are
actually doing anything that anyone actively cares about, then the result is
probably still fairly low: office applications and games probably don't
do a whole lot of multiprecision arithmetic. Not none, though: much of
the 64-bit arithmetic used for filesystem access is essentially extended
precision, on ia32 (and ARM). A fair bit of crypto uses large
arithmetic, but that's probably not going to be a significant fraction of
processor load unless it is in a system using an encrypted disk (and is
also busy.)

On the other hand, there are more, and more commonly used, programming
languages that automatically use extended precision arithmetic instead of
overflowing (silently or with a trap): python, perl, scheme, ruby, lisp.
So there might be more of it going on than you think.

Cheers,

--
Andrew

nm...@cam.ac.uk

unread,
Feb 19, 2011, 5:34:56 PM2/19/11
to
In article <8saueh...@mid.individual.net>,

Andrew Reilly <areil...@bigpond.net.au> wrote:
>
>The original question, as put, is a little unfair: probably >99% of the
>applications running on powered-up (x86) computers are sitting in an idle
>loop or power-save state, waiting for a key press or click or network
>packet.

Yes. I had assumed that Mitch wasn't playing that word game.

>If you rephrase the question as the fraction of computers are actually
>doing anything that anyone is actively caring about, then the result is
>probably still fairly low: office applications and games probably don't
>do a whole lot of multiprecision arithmetic. Not none, though: much of
>the 64-bit arithmetic used for filesystem access is essentially extended
>precision, on ia32 (and ARM). A fair bit of crypto uses large
>arithmetic, but that's probably not going to be a significant fraction of
>processor load unless it is in a system using an encrypted disk (and is
>also busy.)

Er, have you looked into what networking protocols use nowadays?
My assertion was assuming that an active computer spends a lot of
its time transferring data, and that a lot of data nowadays uses
some form of semi-secure or signable protocol. A huge amount of
video, television and other, does - though I don't know how much
of that uses RSA-like cryptography.


Regards,
Nick Maclaren.
