I hope the EPIC architecture has a whole new fresh and straightforward
instruction set and just resorts to compiling techniques for compatibility.
Joseph
>Just browsed through the Pentium III instruction set (see developer.intel.com).
>With the addition of the 70 new instructions, this set is just humongous!
>The new instructions are also way more complicated due to all sorts of
>special conditions that need to be taken care of. Is there anybody besides
>me that thinks this is a very bad architectural choice?
MMX was a clear misdesign. KNI/SSE is a direct descendant of the same
cleverness which led to the MMX design (and to AMD-3D). Obviously
enough room must be left for Merced to improve performance.
What startles me is the total lack of 3-address instructions. It may
be debatable whether traditional integer code with fairly little ILP
really requires 3-address instructions, but IMHO the kind of code
addressed by SSE (and MMX) cries out for them.
What kind of mind invents a compare instruction which puts its
result into one of its input operands, overwriting the compared value,
a value which almost certainly is still required afterwards? This
results in 2 instructions (move, compare) instead of one. And the
number of instructions to decode is a critical resource (while the
instruction size is not, at least not for this kind of code). It's
hard to add decoders, given a stream of variable-length instructions.
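Concretely (register choice is illustrative, but the SSE packed
compare really does overwrite its first operand with the mask):

    movaps  xmm2, xmm0      ; extra copy, since xmm0 is still needed
    cmpltps xmm2, xmm1      ; xmm2 = per-element mask of (xmm0 < xmm1)

A hypothetical 3-address form would need only something like
"cmpltps xmm2, xmm0, xmm1".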
While the x86 design is (almost) entirely 2-address, a deviation would
have been easy. The P6/P2 core already translates x86 2-address
instructions into 3-address pseudo-RISC-ops (as does the K6). And the
encoding would have been easy too: a single group of 8 0F-map opcodes
for a generic 3-address SSE instruction group (the irregular
instructions may have been left 2-address to ease decoding), with the
exact operation specified by a trailing immediate byte - like AMD did,
and like Intel did for the compares.
>and still retain compatibility? This instruction set probably adds so much
>complexity to hardware design and verification that any significant
>architectural advance must be next to impossible.
Wrt encoding, there is little new stuff. It follows the usual weird
x86 rules: prefixes (F3) are common stuff in x86 and the P2 core can
handle one prefix fairly well; the trailing opcode of the compares is
also found in a different role in the MMX shifts and the shuffle insn.
Concerning the operations themselves: The complexity of most
operations is completely local to the execution unit handling it, so
these units can be verified independently.
Probably most parts of the P2 engine were left completely unchanged
after all. Some earlier thread in this newsgroup makes me believe that
the internal buses still operate at 64 bits, folding a single 128-bit
insn into 2 64-bit insns.
The biggest complexity lies in the instruction set manual itself.
(1) Awfully slow to read with Acrobat Reader (Ghostscript is better
though). Maybe Adobe needs some SSE based acceleration ;-).
(2) The descriptions of the new stuff are spread all over the manual,
with no table of contents for the individual insns. Some contextual
grouping and some description of motivation would have been useful.
Vol1 and a white paper are missing here.
>I hope the EPIC architecture has a whole new fresh and straightforward
>instruction set and just resorts to compiling techniques for compatibility.
Maybe HP played a major role in this part of the IA64 design, as
Intel's history is not known for good instruction set designs (the
i960 being the only exception to this rule). If not, well, the
better for the competition. One of my nightmares is that someday we
are all stuck with some Intel IA64 with absolutely no equally ranked
competition.
Gruss, Andreas
It's not that bad, as far as I can tell; the new instructions are fairly
decently orthogonal and implement a rich set of SIMD operations. I can't
quite see why they added SS versions working on only one vector
element - presumably so that you could avoid using the FPU at all for
single-precision work.
>How on earth are
>they going to carry forward this legacy instruction set into new
>processors, and still retain compatibility?
Oh, I don't think it's too bad. And Intel could, without too much
difficulty, increase the vector register width to 8 (16 would make the
FXSTOR buffer too small) - SHUFPS stops being universally useful, but it
wouldn't break backwards compatibility (provided they introduced new
MOV32 instructions) - and stick in eight functional units of each sort
for a 4x performance increase which might even be usable in some real
code.
Tom
>As for overwriting one of the registers,
>that's being with us for eons - and it never prevented people from
>writing fast programs.
But it may have prevented people from writing faster programs.
To which Andreas Kaiser sagely replied:
> But it may have prevented people from writing faster programs.
Alberto followed up with this seeming irrelevancy:
> If I can handle a = a + b[i] or a = a + b[i+j*k], how does
> having a = b + c make my program any faster ?
If your program has expressions of the form a=b+c then a machine
with three operand instructions will be able to evaluate those
expressions twice as fast as a similar two operand machine and
your program will be faster proportionally to the occurrence of
such expressions in your algorithm.
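In instruction terms (register assignments illustrative):

    ; two-operand x86:
    mov  eax, ebx        ; a = b
    add  eax, ecx        ; a = a + c

    ; three-operand RISC, e.g. PowerPC:
    add  r3, r4, r5      ; a = b + c in a single instruction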
Clearly the presence of a=a+b[i] and a=a+b[i+j*k] does NOTHING
to alleviate the problem of coding an expression of the form
a=b+c as more than one instruction. Even though many of the
expressions used in most programs are of the form a=a+b there
are cases where the programmer legitimately wants to write a=b+c
and these cases WILL be slower on a two operand machine than on
a three operand machine. The reverse, however, is not the case:
the presence of three operand instructions does NOT slow down the
evaluation of a two operand operation.
- Jeff Dutky
[of KNI]
>What startles me is the total lack of 3-address instructions.
I quite like Cyrix's approach for their EMMX extensions (which almost
no-one seems to know about; have a look at page 15 of Cyrix document
AP106, "Software Customization for the 6x86 Family", for the
specification). They added a set of instructions of the form
PADDSIW mmregA, mmregB/mem64
which did a packed-add-signed-words of mmregA and mmregB, and put the
result in mmreg[A XOR 1].
Their rationale was that FFT code tended to include lots of things of
the form q=a+b;r=a-b; which required pathological numbers of reloads on
a two-operand architecture.
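For illustration, here is the butterfly on plain MMX versus EMMX
(register allocation is mine, and I'm assuming PSUBSIW, the EMMX
subtract, follows the same implied-destination rule as PADDSIW):

    ; plain MMX, a in mm0, b in mm1: destructive ops force a copy
    movq    mm2, mm0        ; save a
    paddsw  mm0, mm1        ; mm0 = a + b   (q)
    psubsw  mm2, mm1        ; mm2 = a - b   (r)

    ; EMMX, a in mm0, b in mm2: both inputs survive
    paddsiw mm0, mm2        ; mm1 = a + b   (q)
    psubsiw mm2, mm0        ; mm3 = b - a   (-r; fold the sign elsewhere)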
(they also added a set of average-packed-bytes and
absolute-difference-of-packed-bytes instructions, presumably to get good
MPEG performance out of their MediaGX system-on-a-chip; I thought
difference-of-packed-bytes was mostly used in MPEG *encoding*, and I'd
love to know if people got realish-time MPEG encoding on the 6x86MX).
Tom
Not true. This is a typical illogical conclusion from people from the
RISC community.
a = b + c
is equivalent to:
a = b
a += c
which, with the register forwarding that all modern x86 processors
support, happens together in one clock.
> Clearly the presence of a=a+b[i] and a=a+b[i+j*k] does NOTHING
> to alleviate the problem of coding an expression of the form
> a=b+c as more than one instruction.
x86's do not execute only one instruction per clock. (Incidentally, the
above construct is typically not necessary for most real world
algorithms.)
> [...] Even though many of the
> expressions used in most programs are of the form a=a+b there
> are cases where the programmer legitimately wants to write a=b+c
> and these cases WILL be slower on a two operand machine than on
> a three operand machine.
Well, my point is that it will not be slower on a modern x86 architecture
for any reason that you are citing.
> [...] The reverse, however, is not the case:
> the presence of three operand instructions does NOT slow down the
> evaluation of a two operand operation.
This is actually false. RISC processors that support general 3 operand
instructions require much more complicated register routings to the alus.
Not being a CPU designer I cannot be sure, but I believe that to sustain
x86-like architectural speed the registers must be ported one more time.
This will negatively affect the clock rate at which you can scale your
CPU. Ergo the PowerPC and its relatively slow Mhz curve.
--
Paul Hsieh
http://www.pobox.com/~qed/cpuwar.html
If they are not in registers, then memory load latency dominates and this
is a moot point.
> >x86's do not execute only one instruction per clock. (Incidentally, the
> >above construct is typically not necessary for most real world
> >algorithms.)
>
> The very fact that so many algorithms still are expressed in terms of
> very basic primops is a sure indicator that our computer architecture
> concepts have much to evolve.
>
> For example, take multiplying a vector by a matrix. If I write my
> algorithm like this:
>
> bench(float matrix a, x)
> {
> return x * a;
> }
>
> or, even better, in APL, a suitably configured parallel machine will
> do it in O(log n) time, and the task of converting this into
> appropriate machine code can be safely delegated to the compiler. Now,
> of course, you can go down to individual multiplies and adds, optimize
> your pipelines ad nauseam, tighten your loops, and what not; but all
> that jazz doesn't happen because the algorithm demands it, but
> because the architecture can't do anything smarter than a handful -
> at most - of operations at one time.
I must be missing your point. CPUs only have a handful of ALUs so what
more do you expect from them? To distribute this over multiple
processors, there exist compilers that will auto-decompose such
structures at which point this becomes a compiler question, not a CPU
question.
> Another example is a Fourier Transform. If I want to do one, I can for
> example resort to an FFT algorithm, and break everything into
> individual adds and multiplies - or I can take the parallel approach
> and do it the way it requires. Again, if I could express it like this,
> for example,
>
> return a fft b;
>
> Where the compiler knows what the "fft" operator is and handles it
> appropriately; for example, if you have a Reconfigurable Mesh you can
> efficiently handle it as a matrix by vector multiplication, no need
> for the fine grained algorithm.
>
> As hardware gets more complex, it makes more and more sense to move
> algorithms into hardware, and increase the granularity of the objects
> the language can handle, and burying the corresponding complexity away
> from the user. Sure, we can probably only achieve ultimate speed by
> doing things at a very fine granularity. But then, complexity has its
> price; like, where would we be if we had to design our graphics chips
> at transistor level, or even at gate level ? A Honda in hand is better
> than a Ferrari in the shop.
Uh ... I don't think we are anywhere near the point where real algorithms
(i.e., O(log n) time matrix multiplies and so on) have a place on CPUs.
Gimme a break, the level of sophistication of algorithms in CPUs is
currently using "SIMD" or "VLIW" which isn't going to impress anyone
algorithmically.
> >This is actually false. RISC processors that support general 3 operand
> >instructions require much more complicated register routings to the alus.
> >Not being a CPU designer I cannot be sure, but I believe that to sustain
> >x86-like architectural speed the registers must be ported one more time.
> >This will negatively affect the clock rate at which you can scale your
> >CPU. Ergo the PowerPC and its relatively slow Mhz curve.
>
> If I need two cycles to do a memory to register add on one
> architecture, and only one cycle on another, for that kind of
> operation the first machine is effectively twice as slow.
But if it takes two clocks, there is probably a reason for it. Unless
the designers are yutzes that reason is usually that the processor will
better scale to higher clock rates in which case their decision is the
right one.
> [...] One cannot just compare megahertz to megahertz, that may not
> necessarily be the way it works out.
I never said that. I am talking about Mhz *scaling*. This is a typical
statement from the RISC community ... especially among those that have
not yet learned that lesson of the necessity of Mhz scaling (PowerPC.)
But the three operand must redirect its ALU output to an additional
register which must be selected or encoded somehow back from the original
instruction. The ALUs in a sense need access to 3 registers at once, not
two.
> > This will negatively affect the clock rate at which you can scale your
> > CPU. Ergo the PowerPC and its relatively slow Mhz curve.
>
> Last time I looked, the Pentium II was shipping at 350, 400 and 450 MHz
> and Apple was shipping PowerMacs at 300, 350 and 400 MHz.
P-III's are at 500Mhz. The x86 architecture has its ridiculously archaic
instruction set decoding mechanism which is half handled by micro-code to
deal with. For an instruction set that was never designed for high clock
rates the x86 architecture has proven to be remarkably scalable in this
respect.
The PowerPC was designed much more recently (93? 94?) when scalability
with clock rate considerations should have been made very early.
> [...] Hardly any
> difference at all there -- 12% - 15% on clock rate and a heck of a lot
> less than that on speed (and the difference is in the other direction
> according to most benchmarks).
Later this year will see a huge difference in clock rate between x86s and
PowerPCs (according to Microprocessor Forum.) While the PowerPC G4 will
not exceed 500Mhz, other CPUs such as the K7, Jalapeno, and Intel's very
long in the tooth P6 core will end the year in the 600-700 Mhz range.
> I'm sure that next month Pentium III's will be at 500 MHz and PPC750's
> (or whatever the one with AltiVec will be called) will be at 450 or 500
> MHz as well.
AltiVecs available next month? I'll believe it when I see it. In any
event 500Mhz P-III's are available right now. Legally and otherwise.
> Other than a brief time when the Pentium Pro came out and was at 200 MHz
> while the PPC604 hadn't gotten there yet, I can't recall Pentium ever
> being much ahead of PPC on clock rates and the PPC603e got to 240 (and I
> think even 300) MHz *long* before any Pentium did.
That is because the current x86 architectures have been reaching the end
of their life recently. It's getting very painful even for Intel to crank
up its clock rate. The PPC750 on the other hand just came out last year,
and should be cranking like a Mofo.
> And that's not to mention a 3-address RISC called "Alpha" which is *way*
> ahead on clock rate.
Aha, but the Alpha is truly an architecture of a different color.
Everything on the Alpha is solved by incredibly deep pipelining (even the
three registeredness, I'm sure). There is no question that Alpha is the
right way to do a processor architecture up until today.
However, as a side effect of their deep pipelining they require a
massively huge number of rename registers (on top of their large number
of architectural registers.) As can be seen in the 21264's micro-
architecture, this requirement nearly brought the architecture to its
knees -- the registers are literally statically partitioned so that two
different halves of the chip can deal with smaller register files at a
time. There are penalties for cross register usage that compilers have
to take care of -- if they can. It's not clear that Alpha is correct for
tomorrow.
Even the Alpha is not perfect. (That's why the 21364 is nothing more
than a 21264 that's been SMPed.) Though given that DEC is no longer its
own company on a mission to create the fastest CPUs in the industry, we
probably will never see my claims really tested.
> > > This is actually false. RISC processors that support general 3 operand
> > > instructions require much more complicated register routings to the alus.
> > > Not being a CPU designer I cannot be sure, but I believe that to sustain
> > > x86-like architectural speed the registers must be ported one more time.
> >
> > How is this? Both have to fetch operands from two registers and write
> > results back to one.
>
> But the three operand must redirect its ALU output to an additional
> register which must be selected or encoded somehow back from the original
> instruction. The ALUs in a sense need access to 3 registers at once, not
> two.
I've done some experiments recently in the design of several CPUs (8- and
32-bit RISC as well as some CISC). The largest of these
was a mostly MIPS R3000 compatible CPU. And from that
I can say that there is NO difference in the register file between a
2 or 3 operand instruction.
In the case of the R3000, the register file is made from "triple port
RAM", where two ports are for reading and one port is for
writing.
The reason why there is no difference is simple in concept -- but
difficult to explain. I'll try to do my best...
If the CPU executed an instruction in a single clock cycle then
you're right. The register file could/would be slightly different
for 2 or 3 operand instructions. The reason for this is that
we would be reading and writing to the same register location.
But the CPU doesn't execute an instruction in a single clock.
The CPU is pipelined, so the register reads happen a clock
or two before the write.
Since the write happens after the read, and the processor is
pipelined, in a single clock cycle you actually have two register
reads (from one instruction) and a register write (from a
different instruction).
So, in any one clock cycle you have two reads and a completely
unrelated write. This means that the register file needs to have
three ports (2 reads, and 1 write). Once you have a register
file like this, the only penalty for having 3 operands over 2
is in the instruction decoding.
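One cycle, pictured (stage names simplified):

    instruction i    (in its decode/read stage):  reads  rs1, rs2  -> 2 read ports
    instruction i-2  (in its writeback stage):    writes rd        -> 1 write port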
David Kessner
dav...@peakaudio.com
: I can say that there is NO difference in the register file between a
: 2 or 3 operand instruction.
Ideally. I not-so-long ago had the, um, _learning opportunity_
to use a machine that apparently went directly from Napkin to first
silicon. This charming machine was two-address (apparently its designer
had an un-requited love-affair with the 68K :-), and it really
did have a _dual_ port register file: one read, one read/write.
When scheduling for this machine, one needed to be aware of the
exact number of cycles that a given instruction would take, and
arrange for the "destination" (actually, source-dest) of that instruction
to be used as a source of the correct later instruction, else
one would incur a one-cycle "bubble" in the pipe. Lovely. Ah,
well, as my old boss used to say "If the job was easy, we'd get
someone cheaper than you to do it" :-)
: Since the write happens after the read, and the processor is
: pipelined, in a single clock cycle you actually have two register
: reads (from one instruction) and a register write (from a
: different instruction).
If different instructions can take different numbers of
pipe-stages, you actually need either more write-ports or the
interlocks to insert the bubble mentioned above. CPU design,
done well, is an exacting discipline, not something for
armchair architects, unless they are just in it for amusement,
as seems often the case in this group.
And I guess we don't _even_ want to get into bypassing
and register-renaming, right? :-)
Mike
| alb...@agames.com, speaking only for myself
> Aha, but the Alpha is truly an architecture of a different color.
> Everything on the Alpha is solved by incredibly deep pipelining (even the
> three registeredness, I'm sure). There is no question that Alpha is the
> right way to do a processor architecture up until today.
>
I wouldn't really call the Alpha incredibly deeply pipelined.
Compared to most contemporary micros the various implementations of the
Alpha are at most average. Just take a look at the latencies in EV6.
> However, as a side effect of their deep pipelining they require a
> massively huge number of rename registers (on top of their large number
> of architectural registers.) As can be seen in the 21264's micro-
> architecture, this requirement nearly brought the architecture to its
> knees -- the registers are literally statically partitioned so that two
> different halves of the chip can deal with smaller register files at a
> time. There are penalties for cross register usage that compilers have
> to take care of -- if they can. It's not clear that Alpha is correct for
> tomorrow.
>
The huge number of registers and the clustering are pretty much
orthogonal to what you say. There are two very good reasons to go
with the clustering approach. First, it shortens path/wire length,
allowing you to continuously achieve higher clock speeds. Second, it
allows you to break up the register file into smaller, less ported
partitions. By having fewer ports the actual size goes down and the
access time goes down. Most of the stuff that was done in EV6 will
have to be done by just about everyone. Clustered functional units
seem to be a pretty bright idea that a lot of people will follow.
> Even the Alpha is not perfect. (That's why the 21364 is nothing more
> than a 21264 that's been SMPed.) Though given that DEC is no longer its
> own company on a mission to create the fastest CPUs in the industry, we
> probably will never see my claims really tested.
>
I think the Alpha design teams and anyone with a clue is still on a
mission to create the fastest computers on earth. I think the issue
that some people haven't yet grasped is that in the future that won't
mean just having the fastest core.
What's the usefulness of a 10GHz CPU if it's fed by a 50MHz/8-bit bus?
EV7 is the result of looking at where the current bottlenecks are and
applying fixes. This is all any computer architecture really is. All
we really do is fix last year's bottlenecks, and in turn create all new
bottlenecks to fix the next year.
Aaron Spink
not speaking for Compaq
All opinions are my own
> a = b + c
>
>is equivalent to:
>
> a = b
> a += c
>
>which, with register forwarding which all modern x86 processors support
>happens together in one clock.
No, they execute sequentially but pipelined, the forwarding just
avoids having to read the rename register for "a" in the 2nd
operation. The MOVE (actually a COPY, not a MOVE) instruction is
implemented as a pass-thru ALU operation using one of the available
ALU execution units, because the data won't appear on a result bus
(the source of forwarding) without someone placing it there. So the
value on which ADD depends (the rename register which is
associated with the 1st "a"), does not appear on this result bus
before the MOVE has completed ALU execution, just in time for the ADD
operation to get it from there to execute in the next clock cycle.
The execute phase of the MOVE operation effectively runs in parallel
with the operand fetch (K6) or instruction issue stage (P6) of the ADD
instruction. Just pipelining, no overlap.
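Roughly, P6-style (stages simplified):

    clock:           1        2         3
    mov eax, ebx     issue    execute             ; pass-thru ALU op puts "a" on a result bus
    add eax, ecx              issue     execute   ; forwarded value arrives just in time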
Of course, a complex instruction rename stage could fold both
instructions together to once again form a = b + c. But I have yet to
see a CPU which is clever enough to do this, because it is really hard
work to correlate the registers of different operations in such a way.
If Agner Fog is right, Intel's P2 is not even clever enough to see that
XOR EAX,EAX does not depend on some earlier operation writing to EAX,
EVEN THOUGH it detects that the result is zero (BTW: the K6 is, but
its job is easier).
Additionally, having two instructions instead of one requires one
additional decoder slot, and decoding variable length x86 instructions
in parallel is hard work with significant influence on the clock rate
limit. Decoders are a critical resource especially in the K6.
>(Incidentally, the above construct is typically not necessary for most
>real world algorithms.)
I initially referred to the compare instructions, which produce a bit
mask destroying one of the input operands. IMHO when a compare
instruction is used, both operands are likely to be still required
afterwards, so the additional move instruction is hardly avoidable.
>This is actually false. RISC processors that support general 3 operand
>instructions require much more complicated register routings to the alus.
No. Due to the mechanisms of out of order execution together with
register renaming, it is generally not possible to use one of the
input operand registers as writeback target in the out of order stages
(unless it is a temp register, e.g. in a load-execute insn), even
though it somehow maps to the same architecture register. The
retirement unit is the one to either copy the rename register to the
architecture register or to modify the rename table accordingly.
Actually, both K6 and P-II/III use a RISC-like 3-address instruction
format internally, so once the CISC instructions are decoded, there is
no difference to plain RISC anymore. In case of the K6, this is
documented in the "Anatomy" book.
>Not being a CPU designer I cannot be sure, but I believe that to sustain
>x86-like architectural speed the registers must be ported one more time.
>This will negatively affect the clock rate at which you can scale your
>CPU. Ergo the PowerPC and its relatively slow Mhz curve.
The PowerPCs have a fairly short pipeline even if you account for the
excess stages which may be required for CISC. There is no dedicated
rename stage (P2), there is no dedicated register read stage (P2, K6),
both appear to be folded into some other stages. So while the P2 is
sometimes called super-pipelined, the PowerPCs are not.
Indeed the PowerPCs may need additional ports, but for a different
reason: there are several instructions which have 3 register input
operands, and there are some which have 2 register output operands
(load with update, but here the operand's result bus is usually shared
with some other unit's result bus). So they actually use 4-address
ops. But this is a totally different story.
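For example (standard PowerPC mnemonics):

    fmadd  f1, f2, f3, f4   ; f1 = (f2 * f3) + f4: three register reads
    lwzu   r5, 8(r6)        ; r5 = mem[r6 + 8], r6 = r6 + 8: two register writes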
Gruss, Andreas
> This is actually false. RISC processors that support general 3 operand
> instructions require much more complicated register routings to the alus.
> Not being a CPU designer I cannot be sure, but I believe that to sustain
> x86-like architectural speed the registers must be ported one more time.
How is this? Both have to fetch operands from two registers and write
results back to one.
> This will negatively affect the clock rate at which you can scale your
> CPU. Ergo the PowerPC and its relatively slow Mhz curve.
Last time I looked, the Pentium II was shipping at 350, 400 and 450 MHz
and Apple was shipping PowerMacs at 300, 350 and 400 MHz. Hardly any
difference at all there -- 12% - 15% on clock rate and a heck of a lot
less than that on speed (and the difference is in the other direction
according to most benchmarks).
I'm sure that next month Pentium III's will be at 500 MHz and PPC750's
(or whatever the one with AltiVec will be called) will be at 450 or 500
MHz as well.
Other than a brief time when the Pentium Pro came out and was at 200 MHz
while the PPC604 hadn't gotten there yet, I can't recall Pentium ever
being much ahead of PPC on clock rates and the PPC603e got to 240 (and I
think even 300) MHz *long* before any Pentium did.
And that's not to mention a 3-address RISC called "Alpha" which is *way*
ahead on clock rate.
-- Bruce
On a RISC machine with instructions in 3-operand format, each execution
pipe has two source busses and a result bus. So, to support it, the register
file has to have two read ports and one write port.
On an x86 machine, each execution pipe has two source busses and a result
bus. To support it, the register file has two read ports and one write port.
So, it makes no difference. The fact that in a two-operand instruction
format, there appear to be only two registers involved doesn't change
the fact that we need to read two source operands and need to write one
destination operand, and each of those read or write operations requires
a port into the register file.
-- Norbert
Actually, the mechanics of folding moves into dependent register-to-register
instructions are not that complicated and can be accomplished by the rename
stage, as you suggest. Most modern x86 processors (including the K6-2) are
able to fold the FXCH instruction by "pointer swizzling" during the rename
stage. Similarly, you can fold moves by "pointer aliasing". While this
creates a few minor problems, they are solvable. I don't know of any
machine that does it, though.
The hard part is to provide this additional functionality without lengthening
the rename stage so much as to either require a decrease in the operating
frequency or require an additional pipe stage. It's also possible to fold
loads into a dependent register-to-register operation and this is indeed being
done by an existing CPU (well it's not a perfect fold, but it does eliminate
the latency of the load instruction that one would otherwise have to add to
the length of a dependency chain). Sorry, I can't reveal the CPU at this time.
> Additionally, having two instructions instead of one requires one
> additional decoder slot, and decoding variable length x86 instructions
> in parallel is hard work with significant influence on the clock rate
> limit. Decoders are a critical resource especially in the K6.
This is correct if you are talking about dual-decode machines. It pretty
much becomes a non-issue for triple-decode machines (especially if they
have three identical, powerful decoders).
> I initially referred to the compare instructions, which produce a bit
> mask destroying one of the input operands. IMHO when a compare
> instruction is used, both operands are likely to be still required
> afterwards, so the additional move instruction is hardly avoidable.
Yes, this is a real issue with the two-operand approach. But if you have
enough decoders, and enough execution units, the performance degradation
is minimal.
> Indeed the PowerPCs may need additional ports, but for a different
> reason: there are several instructions which have 3 register input
> operands,
Not that I'm aware of, except on the FP side (which is multi-cycle anyway).
There *were* some 3 input register rotate/mask instructions in Power, but
they were dropped for PowerPC. The PPC601 had them, but that chip was
never intended to push clock rates.
> and there are some which have 2 register output operands
> (load with update, but here the operand's result bus is usually shared
> with some other unit's result bus).
Once again, that only executed with no penalty on the (slow) PPC601. On
the 603/604/750 you can write back a sustained two integer results per
clock cycle and if you use a load with update then that instruction takes
up two of those slots which means you can't (on average) use the other
unit for anything at the same time. Thus using a load-with-update instead
of a simple load and an add doesn't lose you anything except scheduling
flexibility but it doesn't gain you anything (except slightly reduced
instruction fetch) either.
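In other words (standard PowerPC mnemonics):

    lwzu  r5, 8(r6)         ; one insn, two results to write back: r5 and r6
    ; versus the equivalent, independently schedulable pair:
    lwz   r5, 8(r6)         ; load
    addi  r6, r6, 8         ; explicit update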
So your dual-issue PowerPC (such as the 750 ("G3")) needs four integer
read ports and two write ports.
-- Bruce
> br...@hoult.actrix.gen.nz says...
> > q...@pobox.com (Paul Hsieh) wrote:
> >
> > > This is actually false. RISC processors that support general 3 operand
> > > instructions require much more complicated register routings to the
> > > alus.
> > > Not being a CPU designer I cannot be sure, but I believe that to sustain
> > > x86-like architectural speed the registers must be ported one more time.
> >
> > How is this? Both have to fetch operands from two registers and write
> > results back to one.
>
> But the three operand must redirect its ALU output to an additional
> register which must be selected or encoded somehow back from the original
> instruction. The ALUs in a sense need access to 3 registers at once, not
> two.
That doesn't make sense. All modern x86's use register renaming, which
means that within the execution engine they are 3-address just like the
RISCs. The *only* difference is that the RISCs have to find room to
encode three register numbers in the instructions, which means they have
to have fewer registers than they otherwise could for a given instruction
size.
> > > This will negatively affect the clock rate at which you can scale your
> > > CPU. Ergo the PowerPC and its relatively slow Mhz curve.
> >
> > Last time I looked, the Pentium II was shipping at 350, 400 and 450 MHz
> > and Apple was shipping PowerMacs at 300, 350 and 400 MHz.
>
> P-III's are at 500Mhz. The x86 architecture has its ridiculously archaic
> instruction set decoding mechanism which is half handled by micro-code to
> deal with. For an instruction set that was never designed for high clock
> rates the x86 architecture has proven to be remarkably scalable in this
> respect.
If you put in enough pipeline stages then you can clock as fast as you
like, no matter how awful the instruction set. Money to burn doesn't
hurt either.
> The PowerPC was designed much more recently (93? 94?) when scalability
> with clock rate considerations should have been made very early.
I wonder how the PPC would go with Intel's resources behind it?
> > [...] Hardly any
> > difference at all there -- 12% - 15% on clock rate and a heck of a lot
> > less than that on speed (and the difference is in the other direction
> > according to most benchmarks).
>
> Later this year will see a huge difference in clock rate between x86s and
> PowerPCs (according to Microprocessor Forum.) While the PowerPC G4 will
> not exceed 500Mhz, other CPUs such as the K7, Jalapeno, and Intel's very
> long in the tooth P6 core will end the year in the 600-700 Mhz range.
I think you'll see Apple shipping 600 MHz before year end as well. Time
will tell.
> > I'm sure that next month Pentium III's will be at 500 MHz and PPC750's
> > (or whatever the one with AltiVec will be called) will be at 450 or 500
> > MHz as well.
>
> AltiVecs available next month? I'll believe it when I see it. In any
> event 500Mhz P-III's are available right now. Legally and otherwise.
That's a metaphorical "next month". There's *always* something better
coming out "next month" and the leaders leapfrog each other all the time.
> > Other than a brief time when the Pentium Pro came out and was at 200 MHz
> > while the PPC604 hadn't gotten there yet, I can't recall Pentium ever
> > being much ahead of PPC on clock rates and the PPC603e got to 240 (and I
> > think even 300) MHz *long* before any Pentium did.
>
> That is because the current x86 architectures have been reaching the end
> of their life recently. It's getting very painful even for Intel to crank
> up its clock rate. The PPC750 on the other hand just came out last year,
> and should be cranking like a Mofo.
We will see.
-- Bruce
>> Indeed the PowerPCs may need additional ports, but for a different
>> reason: there are several instructions which have 3 register input
>> operands,
>
>Not that I'm aware of, except on the FP side (which is multi-cycle anyway).
No integer instructions, but FSEL and the FMADD/FMSUB group. While
these instructions are multicycle, they are capable of one instruction
per clock cycle throughput at least on some member(s) of the Power/PPC
family, so you still need 3 register read ports.
Some Altivec instructions also have 3 register input operands.
>Most modern x86 processors (including the K6-2) are
>able to fold the FXCH instruction by "pointer swizzling" during the rename
>stage.
On the K6, FXCH executes in 2 FPU clocks, just like most other simple
FPU operations, so if it uses renaming to achieve this, it doesn't pay
off.
>This is correct if you are talking about dual-decode machines. It pretty
>much becomes a non-issue for triple-decode machines (especially if they
>have three identical, powerful decoders).
What about the P3, with its 3 decoders and 2 execution units? This
beast is a little more complicated here. When looking at the visible
architecture level, all looks easy, reg-reg SSE instructions look as
if they were single microop insns and it has 2 SSE execution units to
do the work.
When looking more closely, things change a lot. From timings and the
mechanics of the SHUFPS insn, I'm pretty sure that this beast is pure
64-bit internally, having 16 independent 64-bit SSE registers. The
packed single insns decode into at least 2 microops each, requiring
either two execution units (MOVAPS) for one clock or one unit for two
clocks (the others). So even a simple looking MOVAPS can only be
decoded by the single complex decoder. It is anything but free.
A 3-address SSE instruction set would have saved the move at no cost
in terms of data paths or decoding resources. Decoding and executing
the move in parallel however needs additional decoder and execution
resources, even on the P3. Maybe we'll see a "real" SSE implementation
someday, but today it does make a difference.
Gruss, Andreas
> x86's do not execute only one instruction per clock. (Incidentally, the
> above construct is typically not necessary for most real world
> algorithms.)
> > expressions used in most programs are of the form a=a+b there
> > are cases where the programmer legitimately wants to write a=b+c
> > and these cases WILL be slower on a two operand machine than on
> > a three operand machine.
> Well, my point is that it will not be slower on a modern x86 architecture
> for any reason that you are citing.
Considering that the hyperpipelining (12+ stages) in the modern x86
processors is at least partially caused by all those register forwarding,
branch prediction, stack management, and data renaming tricks, one can
definitely say that it does slow the processor down.
If it didn't, why are they at 12 stages?
[snip]
> Not being a CPU designer I cannot be sure, but I believe that to sustain
> x86-like architectural speed the registers must be ported one more time.
> This will negatively affect the clock rate at which you can scale your
> CPU. Ergo the PowerPC and its relatively slow Mhz curve.
Why ergo the PowerPC and its relatively slow Mhz curve?
a) PowerPC has been ahead of x86 in terms of Mhz several times.
b) PowerPC currently has only 4 pipeline stages, not 12 like x86,
while still managing the same or almost the same Mhz
c) note that other RISCs like the R10K/12K & SPARC are behind PowerPC
d) the Mhz rating doesn't matter all that much.
> --
> Paul Hsieh
> http://www.pobox.com/~qed/cpuwar.html
--
Sander
There is no love, no good, no happiness and no future -
all these are just illusions.
> The PowerPC was designed much more recently (93? 94?) when scalability
> with clock rate considerations should have been made very early.
What's the use of making a high Mhz CPU when the memory system cannot
feed the processor.
> > [...] Hardly any
> > difference at all there -- 12% - 15% on clock rate and a heck of a lot
> > less than that on speed (and the difference is in the other direction
> > according to most benchmarks).
> Later this year will see a huge difference in clock rate between x86s and
> PowerPCs (according to Microprocessor Forum.) While the PowerPC G4 will
> not exceed 500Mhz, other CPUs such as the K7, Jalapeno, and Intel's very
> long in the tooth P6 core will end the year in the 600-700 Mhz range.
>
Yeah, right. I'll believe it when I see it.
> > I'm sure that next month Pentium III's will be at 500 MHz and PPC750's
> > (or whatever the one with AltiVec will be called) will be at 450 or 500
> > MHz as well.
> AltiVecs available next month? I'll believe it when I see it. In any
> event 500Mhz P-III's are available right now. Legally and otherwise.
>
Comparing P III with PPC750 that came out almost a year and a half ago?
Look at how long it took for P II to go from 233 Mhz to 400Mhz.
And notice that PPC604e was at 350Mhz in August 1997. When did Intel reach
350Mhz?
> > Other than a brief time when the Pentium Pro came out and was at 200 MHz
> > while the PPC604 hadn't gotten there yet, I can't recall Pentium ever
> > being much ahead of PPC on clock rates and the PPC603e got to 240 (and I
> > think even 300) MHz *long* before any Pentium did.
> That is because the current x86 architectures have been reaching the end
> of their life recently. It's getting very painful even for Intel to crank
> up its clock rate. The PPC750 on the other hand just came out last year,
> and should be cranking like a Mofo.
PPC750 is not an architecture, it is one processor in the family. And it didn't
come out last year.
[snip]
> However, as a side effect of their deep pipelining they require a
> massively huge number of rename registers (on top of their large number
> of architectural registers.) As can be seen in the 21264's micro-
> architecture, this requirement nearly brought the architecture to its
> knees -- the registers are literally statically partitioned so that two
> different halves of the chip can deal with smaller register files at a
> time. There are penalties for cross register usage that compilers have
> to take care of -- if they can. It's not clear that Alpha is correct for
> tomorrow.
Instead of the number of rename registers, look at the ratio of rename
registers to architectural registers.
[snip]
| In article <36e6670d...@news.lf.net>, a...@stuttgart.netsurf.de
| (Andreas Kaiser) wrote:
|
| > Indeed the PowerPCs may need additional ports, but for a different
| > reason: there are several instructions which have 3 register input
| > operands,
|
| Not that I'm aware of, except on the FP side (which is multi-cycle anyway).
|
| There *were* some 3 input register rotate/mask instructions in Power, but
| they were dropped for PowerPC. The PPC601 had them, but that chip was
| never intended to push clock rates.
The Store Indexed instructions have 3 register reads:
base register
index register
store data
In most designs this requires a single extra "store data" read port on the
register file.
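For example:

    stwx  r5, r6, r7        ; mem[r6 + r7] = r5: three register reads, no write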
--
-- Tim Olson
I'd say you got that backwards. Deeply pipelining a CPU allows you
to run at higher clock rates. The branch prediction, result bypassing,
out of order execution (and the associated register renaming) are
techniques used to minimize the impact of deeper pipelines on CPI.
So far the effect of increased clock rate is larger than the increase
in CPI so the net performance is higher. As far as marketing PCs to a
consumer market goes, a 500 MHz CPU that averages 1.5 clocks per instruction
will be far more attractive than a 250 MHz CPU that averages 0.75 clocks
per instruction. All the x86 vendors have realized this, especially Cyrix
and Centaur.
>
[snip]
> Why ergo the PowerPC and its relatively slow Mhz curve?
It is one of the biggest unanswered questions in the history of MPUs
in the 90's. IBM certainly has world class semi processes and is
no slouch at design. It is likely a combination of designing to the
least common denominator of IBM and Mot processes, poor leadership
at Somerset, and just plain poor execution. Any other suggestions?
All opinions strictly my own.
--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.
> I'd say you got that backwards. Deeply pipelining a CPU allows you
> to run at higher clock rates. The branch prediction, result bypassing,
> out of order execution (and the associated register renaming) are
> techniques used to minimize the impact of deeper pipelines on CPI.
Now consider the case where you can run with no such tricks at the same
frequency that otherwise requires the 12 pipeline stages. You have increased
instruction latency 3 times for nothing.
> So far the effect of increased clock rate is larger than the increase
> in CPI so the net performance is higher. As far as marketing PCs to a
> consumer market goes, a 500 MHz CPU that averages 1.5 clocks per instruction
> will be far more attractive than a 250 MHz CPU that averages 0.75 clocks
> per instruction. All the x86 vendors have realized this, especially Cyrix
> and Centaur.
Well, the Mhz figure sells a lot of CPUs, no matter how much or little
real work these attain.
> >
> [snip]
> > Why ergo the PowerPC and its relatively slow Mhz curve?
> It is one of the biggest unanswered questions in the history of MPUs
PowerPC hasn't been all that bad. It was at 350Mhz in 1997. The question
rather is, why did the PPC750 Mhz numbers remain so low for such a long time?
> in the 90's. IBM certainly has world class semi processes and are
> no slouches at design. It is likely a combination of designing to the
> least common denominator of IBM and Mot processes, poor leadership
> at Somerset, and just plain poor execution. Any other suggestions ?
The real unanswered question is why did it take just *SO LONG* until
PPC 750 reached 300Mhz & 400Mhz when another chip which was much more
complex in many ways ran at 350Mhz at the time PPC750 was introduced.
The PPC750 timeline:
August 1997: PPC750 @ 233 & 266 MHz
March 1998: PPC750 @ 300MHz
May 1998: AltiVec Presentation
July 1998: PPC750 @ 300, 333, 366MHz.
January 1999: PPC750 @ 400MHz
March 1999: PPC750 @ 450 & 466Mhz (just saw the announcement)
The last two lines represent copper based processors, on which IBM has been
slightly ahead of Motorola.
> All opinions strictly my own.
> --
> Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
> Kanata, Ontario proportions to put more PRECISION and POWER into
> dem...@mosaid.com architectures with MIPSed results but ALPHA's well
> pde...@igs.net that ends well.
--
Note that I said __K6-2__, not K6. You are correct about the K6. On the
K6-2 and K6-III, the FPU is improved to handle FXCH by pointer swizzling.
The FPU had register renaming all along, but back when it was designed
(1992), the speed of FXCHs wasn't an issue. BTW, this is essentially the
same FPU that already shipped with the Nx586-PF. Of course it has been
cleaned up a bit for Pentium compatibility before being glued to the
K6-family CPUs.
Of course, without a pipelined FPU, 0-cycle FXCHs don't buy you much. I
think the improvement was in the 0-5% range depending on the software.
> >This is correct if you are talking about dual-decode machines. It pretty
> >much becomes a non-issue for triple-decode machines (especially if they
> >have three identical, powerful decoders).
>
> What about the P3, with its 3 decoders and 2 execution units? This
> beast is a little more complicated here. When looking at the visible
> architecture level, all looks easy, reg-reg SSE instructions look as
> if they were single microop insns and it has 2 SSE execution units to
> do the work.
The SIMD SSE floating-point instructions actually emit 2 microops. Intel
was quite clear on that at the Intel Developer Forum. The scalar SSE FP
instructions are 1 microop each. There is a 2-way SIMD SSE adder hooked
up to issue port 1, and a 2-way SIMD SSE multiplier hooked up to issue
port 0. Latency of the SSE multiplier is 4 cycles, the SSE adder, 3 cycles.
> When looking more closely, things change a lot. From timings and the
> mechanics of the SHUFPS insn, I'm pretty sure that this beast is pure
> 64-bit internally, having 16 independent 64-bit SSE registers. The
> packed single insns decode into at least 2 microops each, requiring
> either two execution units (MOVAPS) for one clock or one unit for two
> clocks (the others). So even a simple looking MOVAPS can only be
> decoded by the single complex decoder. It is anything but free.
Although Intel wasn't clear on this, it's obvious that SSE registers
are actually pairs of 64-bit registers. This is how MOVLPS and MOVHPS
can be 1 microop, 1-cycle latency instructions, as no merging with the
other half of the register is required, as would be the case if the
registers were truly monolithic 128-bit registers.
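Consistent with that, the architectural behavior (addressing here is
illustrative):

    movlps xmm1, [esi]      ; replaces only the low  64 bits of xmm1
    movhps xmm1, [esi+8]    ; replaces only the high 64 bits of xmm1

Each half can be written independently, so no read-modify-write of a
monolithic 128-bit register is needed.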
There is only one shuffle unit, hooked up to port 1. You will note that
the SHUFPS instruction doesn't shuffle between the two halves of the
register. So SHUFPS is implemented as two microops both flowing down
port 1 in consecutive cycles.
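For reference, the architectural definition of the shuffle (each result
element is a 2-bit lane selection):

    ; SHUFPS xmm1, xmm2, imm8:
    ;   xmm1[0] = xmm1[ imm8[1:0] ]   \ low result half draws only from
    ;   xmm1[1] = xmm1[ imm8[3:2] ]   /  the destination register
    ;   xmm1[2] = xmm2[ imm8[5:4] ]   \ high result half draws only from
    ;   xmm1[3] = xmm2[ imm8[7:6] ]   /  the source register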
> A 3-address SSE instruction set would have saved the move at no cost
> in terms of data paths or decoding resources. Decoding and executing
> the move in parallel however needs additional decoder and execution
> resources, even on the P3. Maybe we'll see a "real" SSE implementation
> someday, but today it does make a difference.
As has been pointed out before 3-address operations are principally
incompatible with the standard x86 opcode encodings. Of course, you
could invent some new encoding scheme in addition to the existing one,
but this would make decoding instructions even more difficult and thus
slower. Not a good idea.
In the PC world, MHz is what sells processors. Readers of comp.arch know
that performance has two components, MHz and architectural performance.
Joe Sixpack who buys his machine at CompUSA, Fry's or similar places doesn't
know that, and a certain company selling x86 processors has done its best
to make him believe that MHz is the all-important metric. This is why the
PR rating schemes invented by companies whose CPUs have higher architectural
performance but lower MHz have failed.
One way to reach higher MHz is to move to processes with decreasing feature
sizes (0.35u->0.25u->0.18u). The other is to more deeply pipeline the
processor, thus splitting the work over more stages so each stage can have
a shorter delay. This is the logic behind AMD's decision to basically double
the pipeline stages going from K6-family processors to K7.
This isn't quite correct. All pipelines are made up from two components:
1. combinational logic
2. storage elements
As you crank up the frequency, you will find that the storage elements
don't scale quite as well as the combinational logic, the result
being that with increasing MHz, the proportion of time allotted to the
storage elements increases somewhat ("flopping overhead"), leaving less
time for "real work".
Also, longer pipelines mean higher branch misprediction cost. Branch
misprediction is one of the bigger performance destroyers for integer
code (number-crunching often has loops, where the branches are nicely
predictable with almost 100% accuracy).
And running your core faster does nothing for your memory subsystem
(external caches, DRAM), which means you now need increased load/store
buffers to cover the increased latency of loads and stores. But the
bigger the buffers, the harder it becomes to run them at full speed,
as bigger structures (like load/store queues, caches, ROMs) etc. are
always somewhat slower than smaller ones.
The end effect is that when increasing the number of pipe stages, total
performance initially goes up, but eventually the curve flattens and
eventually falls, i.e. as you pipeline deeper overall performance
_decreases_. There is no universal limit as to where this crossover
occurs, as it depends on many factors. With current technology it's
probably beyond 20 stages or so. Reportedly, Intel's upcoming Willamette
chip will sport an insanely deep pipeline. No numbers have been thrown
around though.
> > The PowerPC was designed much more recently (93? 94?) when scalability
> > with clock rate considerations should have been made very early.
>
> I wonder how the PPC would go with Intel's resources behind it?
Who has more money to spend: IBM or AMD? By summer, AMD will have what
looks like the highest MHz x86 processor, and it seems that it will be
faster than any PowerPC processor at the time (in MHz, potentially also
in SPECint _and_ SPECfp).
The amount of available resources definitely plays a role, but there
are many other factors. IBM has traditionally been in the brainiac
camp, not the speed demon camp, as evidenced by the Power2 and Power3
implementations. This design philosophy probably explains to some degree
the low MHz numbers of PowerPC processors.
I will also point out that the PowerPC architecture is quite baroque
(or Rubenesque, if you will) as far as RISC architectures go. Alpha on
the other hand is one of the cleanest RISC architectures around.
>The SIMD SSE floating-point instructions actually emit 2 microops. Intel
>was quite clear on that at the Intel Developer Forum.
I wasn't there, so I had to use Intel docs (architecture manual,
optimization guide). Which consistently avoid telling this, up to the
point that all other instructions are classified into <n> microops or
microcode - except for the ISSE instructions which are not.
>There is only one shuffle unit, hooked up to port 1. You will note that
>the SHUFPS instruction doesn't shuffle between the two halves of the
>register.
AFAIK it's a little different: the upper part of the result comes from
one input operand and the lower part from the other, so the 1st
microop gets both halves of one operand and the 2nd both halves of the
other. However I can imagine better ways to document it in a readable
manner than Intel does in its pseudocode.
This instruction joins the otherwise independent parallel data paths
of 128-bit instructions, so it may add one cycle of effective latency
to the previous operation, a cycle which is normally hidden.
However I already did recognize this peculiarity. It was sort of a
final proof of my assumption, because IMHO no one would have designed
it this way without being forced to.
>As has been pointed out before 3-address operations are principally
>incompatible with the standard x86 opcode encodings.
I object! What if ISSE operations matching the generic pattern
xmmreg1 = xmmreg2 <op> xmmreg3/mem
were encoded as
00001111 00111ddd mmsssrrr [index/disp] yyyyyyyy
ddd = xmmreg1
sss = xmmreg2
mm/rrr = xmmreg3 or memory operand
yyyyyyyy = ISSE operation code
IMHO this is compatible with the x86 instruction encoding rules. The
destination register field is also found at the same location in BSWAP
and the trailing opcode is found in 3DNow! and CMPxS (swap ddd/sss if
you like). The ISSE operation code affects neither register types,
transfer directions nor instruction length, so it isn't needed until
the operations are issued to the reservation stations. Of course,
instructions which do not fit into this pattern, such as SHUFPS and those
using both SSE and MMX or integer regs, would still have to be encoded
2-address.
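To make the pattern concrete, a hypothetical 3-address ADDPS (the
register numbers and the trailing operation code are invented):

    ; ADDPS xmm1, xmm2, xmm3  (xmm1 = xmm2 + xmm3):
    ;   0F 39 D3 yy
    ; 0F = escape, 39 = 00111|001 (ddd = xmm1),
    ; D3 = 11|010|011 (mm = reg, sss = xmm2, rrr = xmm3),
    ; yy = the operation code assigned to ADDPS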
There is just one drawback, besides being one byte longer: An
instruction with index byte and 32bit displacement (e.g. ESP relative)
may exceed the limit of 8 bytes for parallel decoding.
Gruss, Andreas
>Note that I said __K6-2__, not K6. You are correct about the K6. On the
>K6-2 and K6-III, the FPU is improved to handle FXCH by pointer swizzling.
All revisions of the K6-2, or just the enhanced revision (the one with
write combining)? I still get 2 clocks for FXCH on my plain old K6-2,
independent of context.
BTW: The Analysis book "predicted" an L2 TLB for what it calls the
K6+3D. IMHO it would have been about time, especially for the direct mapped
ITLB (in my GCC traces, the ITLB miss penalty is about the same as the
I-cache miss penalty). Do you have any idea what happened to it? It is
not mentioned in the docs of the K6-III.
Gruss, Andreas
The cost and power budgets for Apple's product line tend to be tighter than
for say PA-RISC, Alpha, or MIPS. The PowerPC architecture is slightly more
complex than any of those as well. This is shown by the compiler guide
warning one off certain features like load/store multiple and load/store
update. (Which for load/store update is boneheaded. Either throw hardware
at it or jettison them.)
Probably most notable is that PowerPC falls in between the high-dollar,
low-volume workstation/server market and the high-volume x86-compatible
market. The implementations follow the economics and so they may
not be pushing the envelope on performance as other architectures are. Note
that IBM has until recently been doing POWER stuff for their high-end market
rather than PowerPC so that doesn't really count. (And they've been more
floating-point code focused than any other vendor. This is changing due to
the AS/400 being PowerPC based now.)
-Z-
> However, as a side effect of their deep pipelining they require a
> massively huge number of rename registers (on top of their large number
> of architectural registers.)
I'd say this is more an effect of going 4-way superscalar and out-of-order.
If you have up to 80 instructions in flight - mostly to cover memory latency -
you better have enough (are 144 enough?) registers.
> As can be seen in the 21264's micro-
> architecture, this requirement nearly brought the architecture to its
> knees -- the registers are literally statically partitioned so that two
> different halves of the chip can deal with smaller register files at a
> time. There are penalties for cross register usage that compilers have
> to take care of -- if it can.
The first is definitely incorrect - there are two copies of the whole register
file which are kept up-to-date automatically, with a penalty of one cycle to
transfer data between the two copies. AFAIK, there is no way a compiler could
specify to which block an instruction goes, except for those instructions, of
course, that are handled only in one of the blocks. Also, I remember reading
that the issue logic tries to issue dependent instructions to the same block
of functional units if it can. The point was that, while the clustered
registers cost a few percent of performance compared to the unclustered
design, the clustered one allows a higher clock target, more than making up for
the overhead.
> Even the Alpha is not perfect. (That's why the 21364 is nothing more
> than a 21264 that's been SMPed.)
Nonsense. Have you actually looked at the additions to the '264 core in the
'364?
Jan
> Ideally. I not-so-long ago had the, um, _learning opportunity_
> to use a machine that apparently went directly from Napkin to first
> silicon. This charming machine was two-address (apparently its designer
> had an un-requited love-affair with the 68K :-), and it really
> did have a _dual_ port register file: one read, one read/write.
> When scheduling for this machine, one needed to be aware of the
> exact number of cycles that a given instruction would take, and
> arrange for the "destination" (actually, source-dest) of that instruction
> to be used as a source of the correct later instruction, else
> one would incur a one-cycle "bubble" in the pipe. Lovely. Ah,
> well, as my old boss used to say "If the job was easy, we'd get
> someone cheaper than you to do it" :-)
Yick! That doesn't sound like a pleasure to write code or compilers for.
> If different instructions can take different numbers of
> pipe-stages, you actually need either more write-ports or the
> interlocks to insert the bubble mentioned above. CPU design,
> done well, is an exacting discipline, not something for
> armchair architects, unless they are just in it for amusement,
> as seems often the case in this group.
>
> And I guess we don't _even_ want to get into bypassing
> and register-renaming, right? :-)
Right. I was assuming a "classic RISC" architecture where all
instructions take the same amount of time to execute, with none
of the more exotic schemes to make things faster.
While I think that I'm beyond the "armchair architect", I'm only
just beyond. I'm beyond enough to know that CPU design is
easy but correct CPU design is extremely difficult. It seems
like everything new I learn exposes 10 things I don't know.
CPU design is not for the faint of heart (or, as Christopher Robin
would say, "Not for a bear of very little brain.").
David Kessner
dav...@peakaudio.com
>On Sat, 06 Mar 1999 15:47:47 +1300, br...@hoult.actrix.gen.nz (Bruce
>Hoult) wrote:
>
>>> Indeed the PowerPCs may need additional ports, but for a different
>>> reason: there are several instructions which have 3 register input
>>> operands,
>>
>>Not that I'm aware of, except on the FP side (which is multi-cycle anyway).
>
>No integer instructions, but FSEL and the FMADD/FMSUB group. While
>these instructions are multicycle, they are capable of one instruction
>per clock cycle throughput at least on some member(s) of the Power/PPC
>family, so you still need 3 register read ports.
But this is not as hard/interesting because there's only one FPU.
>Some Altivec instructions also have 3 register input operands.
This is, of course, more interesting and harder because there are two
Altivec units (though they do different things, one is ALU, one is
permute) in the current implementation, and one can sustain two Altivec
ops per cycle.
Maynard
Actually, correct isn't all that hard. Correct AND fast is
hard, just like with software, and for many of the same reasons.
(Wetware makes the same mistakes in different media.)
-andy
> David Kessner (dav...@peakaudio.com) wrote:
>
> : I can say that there is NO difference in the register file between a
> : 2- or 3-operand instruction.
>
> Ideally. I not-so-long ago had the, um, _learning opportunity_
> to use a machine that..... was two-address...., and it really
> did have a _dual_ port register file: one read, one read/write.
> When scheduling for this machine, one needed to be aware of the
> exact number of cycles that a given instruction would take, and
> arrange for the "destination" (actually, source-dest) of that instruction
> to be used as a source of the correct later instruction, else
> one would incur a one-cycle "bubble" in the pipe.
How about the original i960? The original implementation was 3-address,
but the reg. file was 1 Rd / 1 Wt! It did have result bypassing; the
result of this was that operations that needed two sources had a bubble
to get the second operand UNLESS it was a dependent operation that
could forward one of the operands from the result bus.
Sounds like a cool way to reduce read ports. Especially nowadays with wide OOO
processors where most operands are forwarded (at least I assume so. Does
anybody have actual numbers?). Now how about eliminating write ports for
writes whose results are never read back from the register file (but only
via forwarding)?
Stefan
As an interesting note. In the (admittedly small) sample I have taken MHz
is just about proportional to performance (at least to SPEC).
see http://www.cse.msu.edu/~brehob/IA32/interviewtalk/img034.gif
and http://www.cse.msu.edu/~brehob/IA32/interviewtalk/img035.gif
Mark
--
~~~~~~~~~~~~~~~~~~~~~~~bre...@cps.msu.edu~~~~~~~~~~~~~~~~~~~~~~~~~
| | The reports of SIMD's death have been greatly exaggerated | |
| -=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- |
~~~~~~Mark Brehob: Ultimate Player, Gamer, Computer Geek~~~~~~~~~~
Guess which processor group designed the P6?
Oops.
This should say:
In Intel's P5 and P6 family of processors MHz and spec rating appear to
be fairly proportional.
Hmmm. Need more sleep.
Mark
: As an interesting note. In the (admittedly small) sample I have taken MHz
If the architecture were not out-of-order, they would not need any more
than (#arch.registers) + (#execution.unit.latency.clocks) registers,
which would probably be fairly well under control, even for
the Alpha.
Out-of-order needs a boat-load (I can't think of a useful mathematical
bound at the moment; however, I am sure some processor designer here can
whip one out of his/her head). Basically the same architectural register
can be the start of several dependency chains that could be in flight -
this is only possible with O-O-O.
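For what it's worth, one crude upper bound is (#arch.registers) +
(#max.in-flight), since each in-flight instruction can hold at most one
freshly renamed destination. A back-of-envelope sketch in C, using the
80-in-flight and 144-register figures quoted earlier in the thread (the
latency sum is my own assumption, not a vendor figure):

    #include <stdio.h>

    int main(void)
    {
        int arch_regs    = 32;  /* Alpha integer architectural registers     */
        int latency_sum  = 8;   /* assumed total execution-unit latency      */
        int max_inflight = 80;  /* in-flight figure quoted above             */

        /* in-order: only instructions still in the pipe hold extra values */
        printf("in-order bound:     %d regs\n", arch_regs + latency_sum);

        /* out-of-order: every in-flight insn may hold a renamed dest */
        printf("out-of-order bound: %d regs\n", arch_regs + max_inflight);
        return 0;
    }

That prints 40 and 112; 112 sits comfortably under the 144 asked about
above, so the answer there would seem to be yes.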
> > As can be seen in the 21264's micro-architecture, this requirement
> > nearly brought the architecture to its knees -- the registers are
> > literally statically partitioned so that two different halves of the
> > chip can deal with smaller register files at a time. There are
> > penalties for cross register usage that compilers have to take care
> > of -- if it can.
>
> The first is definitely incorrect - there are two copies of the whole
> register file which are kept up-to-date automatically, with a penalty
> of one cycle to transfer data between the two copies.
Aha. Sorry, I misremembered this. I knew there were two copies of the
register file. Hmmm ... then this must be motivated by the fact that
there are too many ALUs and not all of them could reach all the registers
at once, unless there were two copies.
> [...] AFAIK, there is no way a compiler could specify to which block
> an instruction goes, except for those instructions, of course, that
> are handled only in one of the blocks.
Well, something must guide the processor's choice about which register
file to use. Probably which ALU is being used(?) In which case the
compiler can track this in its back end code generator.
> [...] Also, I remember reading that the issue logic tries to issue
> dependent instructions to the same block of functional units if it can.
So there must be a heck of a number of duplicated units. It almost makes
you think that they could have designed the chip into two totally
symmetrical halves toggled by the start of a dependency chain (a load or
non-renamed register copy after a branch(?) - just a thought).
> [...] The point was that, while the clustered registers cost a few
> percent of performance compared to the unclustered design, the
> clustered one allows a higher clock target, more than making up for the
> overhead.
>
> > Even the Alpha is not perfect. (That's why the 21364 is nothing more
> > than a 21264 that's been SMPed.)
>
> Nonsense. Have you actually looked at the additions to the '264 core
> in the '364?
No -- I just read the MicroProcessor report article that said that the
364 is basically a single chip containing two 264's. (Or at least I
didn't remember reading anything that indicated that there would be any
real architectural improvements to the core.) What are the actual
changes?
> No -- I just read the MicroProcessor report article that said that the
> 364 is basically a single chip containing two 264's. (Or at least I
> didn't remember reading anything that indicated that there would be any
> real architectural improvements to the core.) What are the actual
> changes?
This isn't what the Compaq presentation at the uP Forum said, can
someone clear this up? I thought 364 was "just" 264 + memory (RAMBUS)
controller + network interface + huge gobs of cache?
464 is 364+hardware multithreading AFAIK (yikes!).
Chris
--
Chris Morgan <mihalis at ix.netcom.com http://mihalis.net
"We're going to start selling Linux to single-party users very
soon. Q: It's going to be on the menu? A: Yes. You'll go to Dell,
pull down "operating system," and click "Linux." - Michael Dell
> If the architecture were not out-of-order, they would not need any more
> than (#arch.registers) + (#execution.unit.latency.clocks) number of
> registers which would probably be fairly well under control, even for
> the Alpha.
>
> Out-of-order needs a boat-load (I can't think of a useful mathematical
> bound at the moment, however, I am sure some processor designer here can
> whip one out of his/her head.) Basically the same architectural register
> can be the start of several dependency chains that could be in flight -
> this is only possible with O-O-O.
>
> Aha. Sorry, I misremembered this. I knew there were two copies of the
> register file. Hmmm ... then this must be motivated by the fact that
> there are too many alus and not all of them could reach all the registers
> at once, unless there were two copies.
>
I believe that is the explanation typically given in papers and such.
The use of clustering is solely a response to both the increased
loading (the more ALUs, the more ports and loading) and the increased
wire lengths. Basically, regardless of whether something like EV6 was
out of order or not, it would be a good idea to cluster if you cared
about clock speed.
> No -- I just read the MicroProcessor report article that said that the
> 364 is basically a single chip containing two 264's. (Or at least I
> didn't remember reading anything that indicated that there would be any
> real architectural improvements to the core.) What are the actual
> changes?
>
According to the presentation at http://www.digital.com/alphaoem/present/index.htm:
Going from 8 to 16 L1 victim buffers and 16 miss buffers.
16 L2 misses.
There is only one core.
There are a couple of ways to read this. One is that the core is
fully capable of handling everything and is not the bottleneck;
the other is that the cost to change the core wasn't worth the benefit.
Personally (and I'm starting to sound like a broken record), if people
keep concentrating on the core as much as they still seem to be doing,
and don't look at the overall system and design that, then we are
going to have computers that do great on benchmarks with small
footprints and are slower than molasses at anything even remotely
intensive or complex.
Until we get main memory down to < 50 cycles there will be a lot more
work to do on the system than on the core. Take a look at the
performance difference going from 4 to 8 issue on a realistic machine.
Aaron Spink
Most definitely not speaking for Compaq
> This isn't what the Compaq presentation at the uP Forum said, can
> someone clear this up? I thought 364 was "just" 264 + memory (RAMBUS)
> controller + network interface + huge gobs of cache?
>
That is basically what the presentation says. It's a bandwidth engine
with a "small" CPU stuck on.
Aaron Spink
not speaking for Compaq
Cool!
Err, I mean, yes I can see how you might feel others are wrongly
obsessing on the core only these days. Since we haven't been treated
to how the Elbrus wonderchip solves these problems, I suppose their
plan for leading edge bandwidth just has "and then a miracle happens
here".
I don't know where this "two core" idea for the 21364 comes from, but
that's twice it's come up here. I suppose one could say it's a CPU
core and a bandwidth core glued together, and then someone else could
misinterpret, but it still seems odd people keep saying "two 264
cores".
Until we get CPUs down to < 4 MHz that will continue to be true, so
people might as well learn to deal with it.
--
Joe Keane, amateur mathematician
I do not know of any current implementations,
but this idea has been presented in several papers.
Overall, I find the continual harping in this newsgroup about register
file ports rather amusing/annoying. Various tricks, such as
register writeback port combining, register read port combining,
and bypassing of the register file completely, mean that the number
of register file ports need not be large. If a chip design uses a large
number of register file ports, it is not for lack of microarchitectural
techniques to reduce the number of ports, but only because the designers
feel they would rather put up with the ports than deal with the
alternatives.
Me, I'd rather eliminate the register file ports.
> Now how about eliminating write ports for writes whose
> results are never read back from the register file (but only via forwarding) ?
I haven't been able to stop wondering about a related question lately
(he says, idly imagining himself in that waaaay impressive armchair).
It seems to me that most results should never be written back at all.
Assuming ooo with a reorder buffer, from what I understand the writeback
ports are a major bottleneck? If there are multiple results queued for
a register, both/all committed for writeback before any potential
interrupt processing, why not just toss the earlier one[s]?
Tight loops could easily wind up never writing back the iterators at
all, pace interrupts: the retire logic just accepts "superseded" as
permission to avoid writeback.
Would this work? Is it working now and I just haven't tripped over the
implementations in my limited experience? If it doesn't pay, I'd be
glad to learn why.
Or is this just another way of saying exactly the same thing Stefan did?
Jim
I'd fix the instruction set.
The machine is just following what the code says.
I think that a large part of the `register' stuff is really just a
contorted way to send values between operations.
The compiler has some connections between operations, and it mainly
wants to send the values from operations in one cycle to operations in the
next cycle or two. So it has to find some unused registers and tell the
machine to put the values there, and get them back right away - except
it's really before it put them there.
Then the machine has to analyze the instruction stream and look at all
the register specifiers to figure out how to chain up the operations,
which is of course what the compiler wanted to do in the first place.
This is not optional, it needs to do this, because following the
standard pipeline it has to send the values back in time.
Now with a mess of comparators, the machine can figure out that the
value is used soon, and set up the proper chaining. So far it's just a
waste of transistors to get back to what you really wanted. But what it
can't see is that the values are never used later, although the compiler
knows this, so it still has to do the write-back anyway, unless the next
use of that register is so close that it sees it can kill it.
The instruction set is mixing up chaining operations with actually
storing values for some decent amount of time.
Also 32 GPRs is really pathetic.
Agreed.
I am, however, suspicious of proposals to make the chaining explicit,
such as target instruction specification or source instruction specification,
about whose problems - relating to fan-out, fan-in, and crossing of
control-flow convergence and divergence points - I have posted long diatribes.
I have begun to suspect that the level of indirection afforded by registers
and memory - which requires the "mess of comparators" - is necessary for
any general solution. Sure, a specific instruction set might make discovery
of the dependencies easier for many codes, but it will always have limitations
that require it to fall back to memory.
Stack instructions are one variety that make it particularly easy to recognize
dependencies. But any serious work with stack instruction sets quickly shows
that you spend a whole lot of time moving data to and from top of stack.
Multiple stack processors, such as Bernd Paysan's 4-stack processor,
or the recently posted 3-stack processor, are one way to reduce the stack
manipulation, but there still remains a lot of stack manipulation and data
movement. Sure, we can do "move elimination" using renaming tricks.
Is that any better than "the mess of comparators"?
I suspect that any direct dependence instruction set will have to have some
notion of "atomic units" - so that the processor is forbidden to stop
half way along, with some data items still in flight along the paths between
the execution units. Or, if you have precise interrupts at any boundary, then
you have exposed these paths between execution units, these chains.
You may have avoided naming them in all instructions, but you will have
to name them somehow.
In the meantime, there are some simple things that can and should be done:
[0] 32 registers is not enough. IA64's mess of registers - is that 128 or 256,
I never remember which - may not be enough. You can always spill to memory,
but the problem with memory is that memory addresses need to be computed,
and hence may not be known at the time of instruction decode. Memory
renaming or registerification always needs to handle such dynamic
disambiguation.
Perhaps it would be a good idea to have an auxiliary register set, rather like Cray's
S and T registers, that is not accessed as an instruction operand directly, and thus
does not require opcode bits, but which is accessed via MOVE TO/FROM operations.
It would be fairly easy to rename such an auxiliary register set.
Or, equivalently, a region of memory that can be treated as registers. What you need
is a set of storage where address matches can be determined at decode time.
I started out to say "no computed addresses", but that is wrong - what you want is
a set of storage that may have computed addresses, but where the computed addresses
are always computed exactly the same way, using the same physical register numbers
and offsets. With provision for save/restore.
Do this, and you are halfway to renaming and registerifying memory anyways.
[1] Instruction sets should have some way of explicitly killing a register or
memory value, so that the processor knows it need not write it back. E.g.
for memory the "forget cache line" instruction has been proposed to avoid
unnecessary writebacks of temporary operations. Similarly, there should
be a "forget register" or "kill register" operation - perhaps as a separate
instruction, perhaps as a bit on the last use.
The problem with such operations is "What should the state of memory or
the register be after the kill operation?" Theoretically nobody will ever look
at the value again --- but, of course, compilers and human assembly coders
have bugs. It would be somewhat awkward if the value was non-deterministic,
easily giving rise to Heisenbugs that disappear under the debugger, when
different timing is enabled.
I would suggest that KILL_REGISTER always zero the register. Zeroing is simple
enough that it can probably be implemented by a renaming trick in an out-of-order
processor. Zeroing might be more awkward for a simple in-order processor,
but still would be doable.
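To make the motivation concrete, a trivial C fragment (my own
illustration; the kill annotation itself is hypothetical) where the
compiler provably knows a value is dead:

    /* t is consumed exactly once, immediately; it could live its whole
       life in the bypass network.  A kill bit on its last use would let
       the hardware skip the register writeback (and zero the register,
       per the suggestion above). */
    int f(int a, int b)
    {
        int t = a * b;   /* sole definition of t             */
        return t + 1;    /* last (and only) use: kill t here */
    }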
Zeroing during "FORGET CACHE LINE" would be more problematic. Probably
not a good idea, since it amounts to a write. Therefore, non-determinism will
probably occur with any true "FORGET CACHE LINE" operation. I'll talk
more about this below.
Note that "FORGET CACHE LINE (ZEROING IT)" would not be the same as
"ZERO ALLOCATE CACHE LINE" - the ZALLOC instruction I proposed for
P6, similar operations for Power PC. ZALLOC allocates in a cache line,
and zeros the line in the cache. "FORGETTING" is intended to get rid of
a line if it happens to be in the cache, without doing a bus write. Y'know,
I might almost prefer to remap a forgotten page to a page of zeroes in a COW
manner.
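For reference, the PowerPC relative mentioned above is dcbz (Data Cache
Block set to Zero); a minimal GCC inline-assembly sketch, assuming a
PowerPC target and a cache-block-aligned pointer:

    /* Establish the cache block containing p and zero it, without a
       read-for-ownership of the old memory contents.  Sketch only. */
    static void zalloc_block(void *p)
    {
        __asm__ volatile ("dcbz 0, %0" : : "r" (p) : "memory");
    }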
---
Non-determinism: usually, uniprocessor programs are deterministic.
Theoretically, you could provide all possible inputs to a uniprocessor program,
and be guaranteed that you would always get the same answer back (except
for hardware bugs). This makes debugging easier.
Multiprocessor programs are frequently non-deterministic.
However, recently several processor features have been proposed that
could make a uniprocessor program non-deterministic. The "forget cache line"
feature I described above could cause such non-determinism: if you erroneously
forgot a memory location, and then read it and branched depending on what you
had read, which path you took might depend on exactly how far back the
cache line had been forgotten.
I've been guilty here too. The ZALLOC instruction I proposed for P6 could and should
be implemented deterministically, but I was willing to make an implementation
compromise that would have made it slightly non-deterministic. I was willing to
live with this, most assembly coders were willing to live with it, and the compiler
folks were willing to live with it, but we were forestalled by an architect with much
experience debugging systems in the real world. Here, I would like to say
"You were right, Dan L."
Similarly, I have proposed operations that query microarchitectural features,
such as
IF data is in cache THEN use code scheduled for cache hits
ELSE use code scheduled for cache misses.
This would cause similar non-determinism.
It may not be possible to eliminate all Heisenbugs.
It probably is wise to not introduce any more than is necessary.
Or, am I being too conservative here? Are people willing to trust compilers
to always generate code that will mask all such non-determinisms in the
simplest of programs?
In what way is this different from the Merced notion of 'rotating
register sets', i.e. treating a block of registers as an indirectly
addressed resource (via the loop counter)?
IMHO the main part of the Merced regs (96? out of 128) which can rotate is a
halfway point between registers and memory.
[snip]
> Similarly, I have proposed operations that query microarchitectural features,
> such as
> IF data is in cache THEN use code scheduled for cache hits
> ELSE use code scheduled for cache misses.
> This would cause similar non-determinism.
>
> It may not be possible to eliminate all Heisenbugs.
> It probably is wise to not introduce any more than is necessary.
>
> Or, am I being too conservative here? Are people willing to trust compilers
> to always generate code that will mask all such non-determinisms in the
> simplest of programs?
This is a tough question: I would gladly accept non-determinism in
faulty programs, as long as correct programs would behave
deterministically.
However, I would expect a substantial payback for this. What kind of
speedups did you hope for when you made your suggestion during the P6
process?
In my real-life programming, however, I almost always do some kind of
network operation or request, which of course means that my program is
effectively multi-processing.
It is almost impossible to get two identical runs of the program in such
an environment.
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
None at all. The non-determinism I was willing to accept for ZALLOC
was to eliminate the need for a 32-byte granule comparison - yet another
address match condition. Probably to save about 200 gates, and a number
of validation conditions.
If there had been willingness to spend the gates, no performance would have been
lost.
What about in my cache-query example? Would you expect to be able to write
significantly faster code if you had it?
glew> [ ... ] I have begun to suspect that the level of indirection
glew> afforded by registers and memory - which requires the "mess of
glew> comparators" - is necessary for any general solution. Sure, a
glew> specific instruction set might make discovery of the dependencies
glew> easier for many codes, but it will always have limitations that
glew> require it to fall back to memory.
glew> Stack instructions are one variety that make it particularly easy
glew> to recognize dependencies. But any serious work with stack
glew> instruction sets quickly shows that you spend a whole lot of time
glew> moving data to and from top of stack.
Ahhh, but this seemingly reasonable qualitative analysis does not quite
address the tradeoff: "a lot" is not quantified in terms of cost, and anyhow
is an estimate that cannot simply be accepted for the sake of argument.
What in effect matters is not how much stack thrashing happens, but how
much of it happens in hot spots, just as for register file architectures
it does not matter how much spilling happens, but how much of it happens
in hot spots.
With a sufficient number of stacks or a sufficient number of registers
neither stack thrashing nor spilling need happen, and some people,
especially those fond of register files, have wished into existence lots
of such resources (I remember a paper that found that no program
examined by the authors made full use of the 56-wide Pyramid register
file at the time the paper appeared).
However, the question is really one of appreciating the diminishing
returns: too many stacks or too many registers in the architecture
have a visible cost in the implementation.
Now, what matters then is code characteristics, and from those one can
derive how many stacks one needs, and how deep they need to be. A register
file architecture is the special case where all stacks have depth one,
by the way.
Then the question arises of how to organise the CPU scratchpad: at one
extreme is the possibility of one stack N deep, at the other N stacks
one deep.
The tradeoff here is that each stack top (couple) is an accumulator, and
thus requires ports to all functional units (assuming that stacks are
not specialised), while every other stack element does not have ports to
the functional units, and thus if it must participate in an operation it
must be brought to the top, requiring wasteful stack rearranging.
Having N stacks of depth 1 means that stack rearranging need never
occur, at the price of multiporting N registers; having 1 stack of depth
N means that almost all of it need not be multiported, at the price of
rearranging the stack whenever an operand further down the stack needs
to be brought to the top for an operation.
The N stacks of depth 1 is optimal when most code stretches involve N
thoroughly mixed computations; the 1 stack of depth N is optimal when
most code stretches involve one computation made of N perfectly nested
subcomputations.
It seems however clear that most general purpose codes do not correspond
to _either_ extreme:
glew> Multiple stack processors, such as Bernd Paysan's 4-stack
glew> processor, or the recently posted 3-stack processor, are one way
glew> to reduce the stack manipulation, but there still remains a lot of
glew> stack manipulation and data movement. [ ... ]
Without repeating once again a number of discussions on this subject
that have appeared in 'comp.arch' over the years, there is some evidence
that for ``general purpose'' programs (excluding thus vector/matrix
oriented ones) one only needs about four temporary ``registers'' per
independent ``evaluation thread'', and that so far it seems that one can
exploit up to four such independent ``evaluation threads'' (i.e. that
the sustainable degree of superscalarity is four, in general purpose
codes).
This suggests that for architectures designed for superscalar execution
of general purpose code, around 16 ``registers'' are sufficient to
minimize spilling/thrashing in the most frequently executed code
bits, organised as 4 stacks with 4 CPU-based scratchpad ``registers''
each. Very little stack rearranging can be expected here, because of the
evidence (some of it, rather compelling, present in various studies
on the efficacy of graph coloring) that most hot spots in general
purpose code involve at most four distinct subcomputations, each of which
involves four nested operands. All a compiler's got to do is to identify
which subcomputations nest and which do not, and assign them appropriately
to the separate stacks.
Also, one rather interesting advantage of stack architectures is that
the tradeoff is not really between N scratchpad registers organised as
M < N stacks and N scratchpad registers organised as N stacks (register
files), other things being equal, because other things need _not_ be
equal.
Having M < N saves (potentially significant) implementation resources,
which can be expended to either add more scratchpad registers
(e.g. deepening the stacks, which in general will leave the architecture
unchanged, simply delaying stack-to-memory spills), or to optimise other
bits of the CPU implementation.
Thus, with a better match of the architecture to the characteristics of hot
spots of typical general purpose code (a small number of independent
computations, each composed of a small number of nested subcomputations),
one not only gets, as you mention above, better dependency management,
but also implementation savings that can result in further improvements;
and stack thrashing is, almost by definition, minimised, or at least is
not any worse than spilling in an equivalently-resourced N-by-1
architecture.
Note the various vital qualifications above, in particular "general
purpose code": codes that have specific operand reference patterns, and
in particular vector/matrix oriented codes, really require specific
architectural support.
I have voiced many times in the past years in 'comp.arch' my opinion
that perverting the implementation of "general purpose"
(scalar-oriented) architectures so that they sort-of stream (on
vector-oriented codes) is an abomination, and it seems that this has
become a more widespread opinion; see the current popularity of SIMD-like
architectural extensions, and the provision of streaming palliatives
like fetch-ahead instructions. A bit too little, too late, but welcome
nonetheless.
glew> In the meantime, there are some simple things that can and should
glew> be done:
glew> [0] 32 registers is not enough. IA64's mess of registers - is
glew> that 128 or 256, I never remember which - may not be enough.
That's actually rather incorrect either way: for general purpose codes
it is too many, with too much multiporting, and for vector oriented
codes it may be too little, with too much multiporting too. EPIC is an
attempt to do vector-like without going all the way towards proper
streaming oriented architectures; the cost it implies is excessive
generality in the implementation. OK, architecture is dead, as
implementation silicon budgets are almost infinite, but this is no good
reason to waste them.
glew> You can always spill to memory, but the problem with memory is
glew> that memory addresses need to be computed, and hence may not be
glew> known at the time of instruction decode. Memory renaming or
glew> registerification always needs to handle such dynamic
glew> disambiguation.
But spilling off the top of a stack of CPU scratchpad registers can
rather easily be made known at instruction decode time...
glew> Perhaps it would be a good idea to have an auxiliary register set,
glew> rather like Cray's S and T registers, that is not accessed as an
glew> instruction operand directly, and thus does not require opcode
glew> bits, but which is accessed via MOVE TO/FROM operations. [ ... ]
Well, a similar technique is apparently used in the new Sony Emotion
Engine vector units, which are reported in Sony's rather terse official
datasheet as having a few KB of scratchpad memory each.
Yet indeed: lots of scratchpads are really needed for ``buffering''
streaming operands, i.e. simulating vector registers; then it is just
simply cleaner to have vector registers directly.
After all, the history of killer-micro hw and sw over the past 20 years
is that they have progressively imitated more and more closely the
characteristics of ``big iron'' systems: currently a Linux x86 box is
not that different from a Tenex/Multics box of 25 years ago (rather
faster though, so that we can play Quake2 :->), and it is just to be
expected that in revisiting the history of ``big iron'' systems killer
micros will acquire vector oriented architecture extensions, for the
same reasons (and much for the same applications).
> A register file architecture is the special case where all
> stacks have depth one
A *damn* good point, and one I haven't seen pointed out before!
-- Bruce
Why would you need this? I would (perhaps naively?) expect this
operation to simply truncate the address to the nearest cache line
boundary. Did you intend that the address would have had to be valid as
the beginning of a cache line?
> address match condition. Probably to save about 200 gates, and a number
> of validation conditions.
>
> If there had been willingness to spend the gates, no performance would have been
> lost.
>
> What about in my cache-query example? Would you expect to be able to write
> significantly faster code if you had it?
Not really, if I instead had a way to promise the hardware that I would
write all the bytes in a cache line, i.e. no need for a
read-for-ownership bus transaction.
It sure sounds like this is exactly what the MOVNTQ (MOVe Non-Temporal
Quadword) opcode is supposed to do: if the target is already in the
cache, update the relevant line and mark it as dirty; if not, buffer
the data (in case more follows) and remember which bytes have been
written to.
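The intrinsic spelling of MOVNTQ is _mm_stream_pi; a minimal sketch of
a streaming fill (assuming an SSE-capable compiler and an 8-byte-aligned
destination; the function and its parameters are my own illustration):

    #include <mmintrin.h>
    #include <xmmintrin.h>

    /* Stream `value' into nquads 8-byte slots of dst; the stores go
       through the write-combining buffers, with no read-for-ownership. */
    void stream_fill(__m64 *dst, __m64 value, int nquads)
    {
        int i;
        for (i = 0; i < nquads; i++)
            _mm_stream_pi(&dst[i], value);  /* compiles to MOVNTQ      */
        _mm_sfence();                       /* drain the WC buffers    */
        _mm_empty();                        /* EMMS before any FP code */
    }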
> >But what it can't see is that the values are never used later, although
> >the compiler knows this [...]
Many architectures nowadays have a "read 0, write black hole" register that
should serve the purpose. Or are you talking about values that might be used,
but - e.g., at the end of a loop - end up not being used in some cases? I
wouldn't think these cases to be suitable for optimization at the
architectural level.
I have the impression, for instance from the discussion of the 21264's
features, that the number of register ports is usually determined on a
worst-case basis - i.e., a piece of code that can actually use all of the
functional units simultaneously - using a sort of Kirchhoff's principle:
so and so many instructions can be dispatched for execution per cycle;
this implies so and so many read ports and (because the same number of
instructions from a previous cycle are still in flight) a certain number
of write ports. The real difficulties lie in the cases that aren't so
regular - for instance, the implicit or explicit stall cycles that are
required when a long-latency functional unit (integer divide, anyone?)
decides it wants to write back its results.
> Or, equivalently, a region of memory that can be treated as registers.
This is the T9000's workspace cache, implemented in that machine as a sliding
window.
> Non-determinism: usually, uniprocessor programs are deterministic.
> Multiprocessor programs are frequently non-deterministic.
In the sense of execution order etc., yes. Concerning final results (apart
from rounding/numerical stability)? Haven't seen that one yet (apart from
the case where you're using a distributed PRNG).
> Similarly, I have proposed operations that query microarchitectural features,
> such as
> IF data is in cache THEN use code scheduled for cache hits
> ELSE use code scheduled for cache misses.
> This would cause similar non-determinism.
> Or, am I being too conservative here? Are people willing to trust compilers
> to always generate code that will mask all such non-determinisms in the
> simplest of programs?
A widely used package, in its authors' opinion (and our experience) probably
the best in its field, already exposes such indeterminism at the algorithmic
level: FFTW. It times different variants of the FFT algorithm on the fly (if
so desired) and determines the currently optimal combination for the problem
at hand. We have seen slight run-to-run variations resulting from that. A
similar thing happens in other cases, where depending on actual input values
different subparts of the code are executed or not.
Thus, I'd say you _are_ too conservative. What with imprecise exceptions,
unpredictable memory hierarchies, and so on, we're gonna have to live with
non-determinism of various sorts.
Jan
SPARC register windows are one such example that are hardly recent. You
can't really know what value the memory "backing" the window holds since it
depends on when the last spill happened and those are interrupt driven.
(And yeah, some of us consider this enough of a hassle to avoid the
architecture whenever possible.)
-Z-
> SPARC register windows are one such example that are hardly recent. You
> can't really know what value the memory "backing" the window holds since it
> depends on when the last spill happened and those are interrupt driven.
Why would you ever want to look at them? They aren't part of the
architecturally visible state.
Jan
I understand how this can occur.
However, it seems to me that, from the point of view of a programmer
writing a uniprocessor program, the non-determinism creeps in only
when the register file save/restore code has bugs.
Or, it could creep in if you do something like

    #include <stdio.h>

    void foo(void)
    {
        int uninitialized_local;
        /* ... */
        printf("%d\n", uninitialized_local);  /* reads an indeterminate value */
    }
But that bites regular programs anyway. Although, on a non-threaded,
non-signal driven program, such funkiness is still deterministic.
Q: how does SPARC introduce non-determinism
if the register stack spill code is correct?
----
Related, does anyone else remember how nice C was compared to FORTRAN,
since C required that all of the data segment be cleared on startup, whereas
in FORTRAN you could often inherit arbitrary garbage from other programs?
Apart from the security concerns, I'll contend that this default initialization
made C programs that much more consistent in behaviour. Buggy and incorrect,
perhaps, if they relied on initialization, but consistent.
Methinks sometimes we want a way to make sure that all stack locals are zero initialized.
(Course, it would be even better if hardware supported undefined values for all data
types. Sigh.)
The example is new, but the general principle is old. Consider the "store into
the following instruction" trick on a processor that does prefetching. Many
architectures require standing the hardware on its head to catch this, but
as far back as 1960 the Manchester/Ferranti/ICL Atlas would execute the old
instruction, unless an interrupt was taken immediately after the store, in which
case you'd refetch and execute the new instruction when you got back. So
you didn't code that way any more.
--
--------
Sarr Blumson sa...@umich.edu
voice: +1 734 764 0253 home: +1 734 665 9591
ITD, University of Michigan http://www-personal.umich.edu/~sarr/
: > SPARC register windows are one such example that are hardly recent. You
: > can't really know what value the memory "backing" the window holds since it
: > depends on when the last spill happened and those are interrupt driven.
: Why would you ever want to look at them? They aren't part of the
: architecturally visible state.
Obviously we do not understand each other.
In the standard SPARC calling convention, there is a register in each
window which says where in memory that window gets spilled. The memory is
definitely architecturally visible and you can easily have a pointer to it.
When a spill happens, the values at those addresses can change. Spills are
basically asynchronous with respect to program execution, e.g. those
caused by process context switches. If you have a bug in your program
that accesses a pointer to a stale stack frame value, you can get
non-deterministic behavior. The painful part of this is that single-stepping
in the debugger generally ensures that the entire set of your
process' register windows is spilled, so the bug can go away in the
debugger.
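A minimal (deliberately buggy) C illustration of that failure mode - the
names are mine, purely for the example:

    #include <stdio.h>

    static int *stale(void)
    {
        int local = 42;
        return &local;   /* the bug: local's frame is dead after return */
    }

    int main(void)
    {
        int *p = stale();
        /* On SPARC, whether this still prints 42 depends on whether a
           register-window spill has rewritten the old frame meanwhile -
           under a single-stepping debugger it almost always has. */
        printf("%d\n", *p);
        return 0;
    }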
This is exactly the sort of thing Andy was talking about. You'll note that
program behavior could also change depending on OS implementation and the
number of register windows the processor implements. (I.e. a buggy but
tested binary could break on a new processor implementation via this
mechanism.) How seriously one takes this is a matter of architectural
philosophy. (Not all register window implementations have this problem,
though the ones that don't tend to context switch the register windows into
dedicated storage just like a flat register file system would. That is an
efficiency hit I imagine.)
-Z-
And just why do you have a pointer to (memory used to back a) register
variable?
Unless you are a debugger, of course.
Cheers,
Jeremy
> : Why would you ever want to look at them? They aren't part of the
> : architecturally visible state.
>
> Obviously we do not understand each other.
[description of window-spill form of Heisenbug deleted]
I'm quite familiar with this, but I'd argue that this form of indeterminacy
is caused by an incorrect program. I understood Andy Glew as discussing
indeterminate behaviour in correct programs.
I would argue that with regard to the problem you mention, the processor and
OS should in fact try to be as indeterminate as possible. That would aid in
finding such bugs - assuming the programmer cares at all, which I wouldn't
say is a done thing - as early as possible. For instance, the OS idle thread
that is spending its time zero-filling released pages could be using a PRNG
to fill them instead.
Jan
Not necessarily.
I'm concerned that things like
if( caches_are_frequently_missing ) {
use algo1
} else {
use algo2
}
would give rise to Heisenbugs if
a) there was a compiler bug in algo1
b) some path through algo1 was never exercised, because the
caches infrequently miss during testing.
But where the bug is uncovered during real production runs,
because the workload has changed, causing more context
switches and, hence, cache misses.
I'm concerned that, in the face of such behaviour, even
exhaustively testing all possible inputs is insufficient.
Exhaustive testing would require testing all inputs, times
all possible paths through such sources of non-determinacy -
with k independent two-way sources, that is 2^k combinations per input -
an exponential increase in the testing space of the program.
I think that some sources of non-determinacy are relatively
constrained - for example, the SPARC stack/pointer bug.
The code that would have to be tested to ensure correctness
is relatively small - but it is still annoying.
I am afraid that other sources of non-determinacy are less constrained,
and cause a significantly larger explosion of the state space, which I
assume overall corresponds to a larger likelihood of bugs.
> Methinks sometimes we want a way to make sure that all stack locals are zero initialized.
>
> (Course, it would be even better if hardware supported undefined values for all data
> types. Sigh.)
I once wrote an extension to the Acorn Archimedes/RiscPC's operating
system RISC OS which allowed threads to execute with their own
user-mode integer register set and floating-point registers
(incl. FPSR) - nothing more than a pre-emptive dispatcher (RISC OS's
built-in multitasking environment is co-operative only).
The float registers were loaded with a NaN (derived by an untrapped
division of 0.0 by 0.0) and the integers except the stack pointer and PC
were loaded with $DEADBEEF.
Here is what Andy wrote (emphasis mine):
[
However, recently several processor features have been proposed that could
make a uniprocessor program non-deterministic. The "forget cache line"
feature I described above could cause such non-determinism: if you
*erroneously* forgot a memory location, and then read it and branched
depending on what you had read, which path you took might depend on exactly
how far back the cache line had been forgotten.
]
I read this as being about hardware features that work fine for correct
programs but behave non-deterministically for incorrect programs. More
importantly, the non-determinism will not be obvious. It will result in a
program that fails, but fails rarely in an unrepeatable fashion. And
perhaps even a program which worked fine on one CPU/OS combination but now
fails on a new one.
I believe this is analogous to the SPARC situation. Nothing to bother
thinking about so long as you always write (and use) correct code.
-Z-
Because there is a bug in your program.
-Z-
> I'm concerned that things like
>
> if( caches_are_frequently_missing ) {
> use algo1
> } else {
> use algo2
> }
>
> would give rise to Heisenbugs if
> a) there was a compiler bug in algo1
> b) some path through algo1 was never exercised, because the
> caches infrequently miss during testing.
This I would call a test coverage problem - tcov or similar to the rescue.
It "just" introduces yet another problem of how to actually prod your
program into the place where it will exercise a certain code path.
Jan
As you point out, due to security concerns, on modern multi-user OSes the
Fortran programmer often gets a cleared data segment even though the language
doesn't require it. As a result, a Fortran implementation which statically
allocates local variables will zero-initialize them too.
I think you're right: at this late date, default initialization of locals would
contribute a lot to productivity and reliability in C++ (IMHO more than most
of the ISO committee's inventions).
--
Steven Correll == 1931 Palm Ave, San Mateo, CA 94403 == s...@netcom.com
It is a test coverage problem.
The problem is that the test space size grows exponentially
in the number of such "non-determinism sources" in your program.
The first rule of test coverage is to try to reduce the test space size.