
split register sets


mac

Dec 29, 2010, 6:16:11 PM
During the discussion of floating-point array indexes, there was a
mention that partitioning registers into integer and floating-point sets
makes for simpler register access. I remember that the 68K had registers
partitioned into address and data (the 68Ks I used didn't have floating
point), which was generally seen as a bad thing.

As I recall, the Intel IXP 2800 network processor cores had two large
register sets (128 registers each?). An instruction could only read one
value from each set. The assembler did a good job of assigning registers
to files.

Is there some advantage to this? Is it easier to have two register sets
with only one read and one write port each than to have one file with
two read ports and one write port? Is this approach used anywhere else?

robert...@yahoo.com

Dec 29, 2010, 7:59:52 PM


In most cases, modern processors have considerably more than two ports
on the register file.

If the memory cells are true multiported, their area scales roughly
with the square of the number of ports. Cells with more ports are
slower too, and tend to use rather more power. So splitting the
physical register file (say into GPRs and FPRs) can allow you to
implement the same number of physical registers with fewer ports, and
thus use less area, be faster, and use less power, *if* you can keep
the workloads separate. Alternatively you can increase the number of
physical registers (allowing more renames) for the same budget.
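The scaling argument above can be put in rough numbers. This is a toy model with illustrative register and port counts, not figures from any real design:

```python
# Toy model of the port-scaling argument above: true multiported cell
# area grows roughly with the square of the port count. The register
# and port counts below are illustrative, not from any real design.

def rf_area(registers: int, ports: int) -> int:
    """Relative area of a register file whose cells scale ~ports**2."""
    return registers * ports ** 2

# One unified file: 128 physical registers with 4R + 2W = 6 ports.
unified = rf_area(128, ports=6)

# Split into two 64-entry files, each needing only 2R + 1W = 3 ports
# *if* the integer and FP workloads really stay separate.
split = 2 * rf_area(64, ports=3)

print(unified, split, unified / split)   # 4608 1152 4.0
```

Under this (crude) model the split files buy a 4x area saving for the same register count, which is the budget that could instead be spent on more physical registers for renaming.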

The tradeoff is complex, and examples exist of processors with splits
in the register file that don't correspond to the architecture (for
example, some Alphas with duplicated register files), and ones where
architecturally divided register files (say separate GPRs and FPRs)
are implemented in a single unified physical register file.

jacko

Dec 29, 2010, 10:45:30 PM
He's right you know.

But I think the way to go is a very small register file with a single
combined read/write port, massive duplication of this simple core along
with very small caches local to each core, and register state exchange
between cores to move the process to the memory address, with each core
holding, say, size/cores of the memory. All integrated into the DDR chips.

Call it PLM (process localized memory).. Prior art blah, blah...

MitchAlsup

Dec 30, 2010, 4:25:42 PM
On Dec 29, 5:16 pm, mac <al...@theworld.com> wrote:
> Is there some advantage to this? Is it easier to have two register sets
> only one read and one write port each, than to have one file with two
> read and one write ports? Is this approach used anywhere else?

There are a number of issues involved here, with conflicting forces
at play.

A) the porting of the register file(s)
B) the number of bits in the instruction dedicated to denoting
registers
C) the number of result busses
D) compiler issues

Register Files:

The height of the register file (except in rare cases) is set by the
height of the data path and rarely by topological arguments of the
register file. The AMD 29000 is an exception to this rule and paid a
significant area penalty routing the register file bits down to the
height of the data path.

Register files with 3 or more word (select) lines can be made to be
wire bound in the select-line direction. Register file decoders are
simple enough to be manipulated twice per cycle (SPARC, AMD 29000, and
ITanic excepted).

Register files with 6 or fewer ports can be made to fit an 18-wire
data path height and run true-complement bit lines and be wire bound
in both directions. Such files can be operated twice per cycle when
the cycle time is 18 gates per cycle or longer (maybe a tad faster
depending on the performance of the sense amplifiers and some
cleverness of the circuit designers and layout engineers). {I have
personally laid out and circuit designed these critters.}

Register "files" have a property that the important parameter is the
number of 'writes' per cycle. One can replicate the register file
building block to double or triple the number of read ports depending
on architectural requirements. Thus, for machines of up to 18 reads
per cycle and up to 6 writes per cycle, one has straightforward
implementation techniques available. Consideration of where the
register data is to be bussed is paramount in this expansion of
register porting.

The height of the register file is no more a problem than the height
of the data path. In addition, the register file height may be
staggered across height (clocking) boundaries as in Pentium 4. Thus,
supporting registers of 64-bits or 128-bits is simple straightforward
circuit design and layout, with a minimum of interaction with
clocking.

Instruction set design issues:

The number of bits used to denote a register in an instruction is of
great concern to the instruction set designer. There are a number of
philosophies that all sort of work fairly well; everything from stack
based designs where registers are implicit (i.e. take no bits) to the
general RISC designs where all registers (and in fact all state
changes) are explicitly denoted in the instruction. In the middle
there are instruction sets that imply some registers and give the
compiler/assembly language mechanisms to override the lack of bits in
an instruction.

If the designer chooses a small number of registers and chooses a
destructive (2-operand) instruction style, the typical instruction can
be compressed into 16 bits (PDP-11, ...). If the designers choose either
a 3-operand form or more than 8 registers, the instruction expands to
32 bits (with rare exceptions). This expansion may or may not cause
performance problems depending on the instruction buffering techniques
and instruction caching techniques employed. Once one has a 32-bit
instruction, the number of registers in a file can grow as great as
256 (at a 2-3 gate cycle time penalty).
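The bit-budget arithmetic above can be checked mechanically. The opcode field widths here are illustrative, chosen only to show how the totals land on 16 and 32 bits:

```python
# Instruction-width bookkeeping for the encoding argument above.
# Opcode widths are assumptions for illustration, not from any real ISA.
from math import ceil, log2

def reg_field_bits(num_regs: int) -> int:
    """Bits needed to name one of num_regs registers."""
    return ceil(log2(num_regs))

def instr_bits(opcode_bits: int, operands: int, num_regs: int) -> int:
    """Total bits: opcode plus one register field per operand."""
    return opcode_bits + operands * reg_field_bits(num_regs)

# PDP-11-style: 2-operand destructive form, 8 registers -> 16 bits
print(instr_bits(opcode_bits=10, operands=2, num_regs=8))    # 16
# RISC-style: 3-operand form, 32 registers -> 32 bits
print(instr_bits(opcode_bits=17, operands=3, num_regs=32))   # 32
# 256 registers in 3-operand form eats 24 bits of a 32-bit word,
# leaving only 8 bits of opcode space.
print(instr_bits(opcode_bits=8, operands=3, num_regs=256))   # 32
```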

If the designers want symmetry in construction of constants and
immediates, the number of registers in a file is typically capped at
32 entries per file. 32 entries is enough for almost any calculation
(such as the Livermore Loops) if the registers can each contain a DP FP
value, or an address pointer, or an unrestricted integer value. But not
universally so. In any event, compiler writers ALWAYS want more
registers. For an architecture in design this very day, it would be
sad to supply it with fewer than 32 registers of anything less than
64 bits, with 128 bits preferred.

Operand and Result Busses

Each register file port has at least as many wires as the size of the
operand, and often twice as many to support true-complement read/write
ports of adequate speed. In addition a very short distance away are
the actual operand busses, where the typical instruction has two
operand busses and one result bus. In between there is logic that
connects the dots, which architects call the forwarding circuit/logic.
This piece may have 18 wires per bit-height on both sides of the logic
and a rat's nest of multiplexers connecting the dots. As long as the
layout designer can connect the dots, there is little pressure to do
anything other than let him do his job.

When one has register porting problems, or forwarding rat's-nest
routing problems, the architects may be forced to (or simply desire
to) utilize multiple different architectural register "files", and
deal with the instruction decode problems that come from this design
technique.

The CDC 6600 (still worth understanding for ANY budding architect)
used 3 register files of 8 entries each.
The A-file has 2 read ports and 1 write port. The read ports can be
sent to the increment unit(s) or to central memory when bundled with a
B register on an eXchange jump. The write port can come from the
increment unit or from central memory with a B-register on an eXchange
jump.
The B-file has 4 read ports and 2 write ports. The read ports can go
to the central memory on an eXchange jump, increment units, or the
shift unit. The write ports come from central memory on an eXchange
jump, the increment unit or the shift unit.
The X-register file has 6 read ports and 4 write ports. These ports go
almost everywhere.

The CDC 6600 is a well designed partitioned register file machine. One
file for addresses under active processing, one file for bookkeeping,
and one file for crunching data.

The DEC VAX could have been a well designed single file architecture
(if one left out all the decimal, formatting, call/return, and other
baggage instructions). Unfortunately the VAX proved that 16 registers
was simply too few, even when backed up with the illusion that inbound
accesses from memory were inexpensive (addressing modes).

The ubiquitous IBM 360-370-308x-309x-Z contains 16 integer registers
and 4 FP registers. IBM had to go to great lengths in order to make
these registers work as well as the 24 registers of the CDC machines.
Along the way IBM invented the whole modern notion of Out-of-Order
processing with Tomasulo-architected Reservation Stations--for which
we are eternally grateful.

Summarizing: 24 well-partitioned registers are "almost" enough, 16
registers are definitely too few, and with heroic engineering none of
this matters, because the OoO processors can pretty much extract
whatever parallelism exists and synthesize the number of registers the
algorithms need, whether slightly beyond or significantly beyond what
the architecture supplies.

Compiler issues

I must confess I have always been a fan of integrated registers--that
is, a single file of registers, of sufficient quantity, and of
sufficient size*. Given that there are enough of them, the life of the
compiler is vastly simplified. An algorithm that needs 80% of the
registers for addressing data and bookkeeping is happy, and another
algorithm that needs 75% FP remains happy. What is more, what we used
to call Argument Punning (now called VARARGS), where the caller sends
an integer or a floating point data item and the called procedure uses
it as a floating point or integer type, is greatly simplified. Whether
this is legal in the language or not, getting this part of the
compiler right goes from hard to simple.

The key issue for the compiler is simply "are there enough registers"
to enable whole-procedure register coloring (or not)--this avoids
spilling and filling of registers, and makes for tighter, higher
performing code. My mistake in the Mc88xxx architecture was not
making the original 88100 have 64-bit registers. This led to a host
of minor issues and annoyances.

Bottom Line

Computer Architecture is ART not Science, and requires the balancing
of forces (sometimes unseen).
Computer microArchitecture is Science, exploiting the heroic efforts
of previous designers to see farther into the abyss.

Mitch

Paul A. Clayton

Dec 30, 2010, 5:43:48 PM
On Dec 30, 4:25 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip VERY nice post]
What do you think of registerification of a limited
subset of memory operations (particularly in load+op
instructions) to allow a single Architecture to have
a broader range of implementation (certain stack and
global/TLS [like a Knapsack cache] offsets would be
obvious targets)? A compiler could still do
significant 'register' allocation optimizations, it
seems.

You also simplified the instruction encoding issue
for a fixed power of two number of bits per
instruction. (This seems odd given your defense of
variable length encoding.)

I am a fan of clustering and register specialization
because such simplifies some microarchitectural
optimizations. I am doubtful, though, that common
compilers would be able to exploit such features
well. (Also note that specialization does not
necessarily mean exclusive dedication. E.g.,
a count register could still be mapped and used as
a GPR. An address/data register distinction might
be limited to short instructions [or there might be
overlap, e.g., 16 dedicated of each type and 16
that are shared/aliased].) I thought the private
and shared register sets of Sun's MAJC were
interesting. (The common FPR/GPR clustering might
be transformed [and is somewhat with SSE providing
integer and FP operations] into something more
like highly parallel/latency tolerant cluster and
less latency-tolerant cluster--or so it seems to
this mere technophile.)

While a microarchitecture could detect (and exploit)
certain register usage patterns, it seems that there
would likely be some benefit to software communication
of some usage information.

> Bottom Line
>
> Computer Architecture is ART not Science, and requires the balancing
> of forces (sometimes unseen).
> Computer microArchitecture is Science, exploiting the heroic efforts
> of previous designers to see farther into the abyss.

Yet the Architecture interacts with microarchitecture
as well as the software layer (and, of course,
microarchitecture seeks to exploit the common cases
and criticality of the expected software workload).

Paul A. Clayton
just a technophile--and appreciative of such teaching

MitchAlsup

Dec 30, 2010, 9:44:11 PM
On Dec 30, 4:43 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> On Dec 30, 4:25 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> [snip VERY nice post]
> What do you think of registerification of a limited
> subset of memory operations (particularly in load+op
> instructions)

In general I am a fan of 360-like instruction sets where inbound
memory references are made to appear inexpensive. This naturally
partitions the pipeline into several parts::

a) Fetch-Decode-AddressGen
b) Cache-LoadAlign
c) integer execute
d) FP execute
e) StoreAlign-writeback

And each partition can have its own operation queue using a simple
register interlock. This gives partially ordered instruction execution
performance without the cost of reservation stations (or scoreboards)
at the lower end of the performance spectrum and in no way hinders the
higher performing end of the performance spectrum.
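The partitioned-queue idea can be sketched as a toy simulator. This is an assumption-laden model, not any real pipeline: each partition is an in-order queue, and its head issues only when a simple interlock finds no outstanding write to its source or destination registers.

```python
from collections import deque

# Toy model of per-partition operation queues with a register
# interlock. Ops are (dest, [sources], latency); queue names,
# latencies, and register names are all invented for illustration.

def run(partitioned_ops):
    queues = {name: deque(ops) for name, ops in partitioned_ops.items()}
    pending = set()      # registers with an outstanding write
    in_flight = []       # (cycle the write completes, dest register)
    issued = []
    cycle = 0
    while any(queues.values()) or in_flight:
        # retire writes completing this cycle
        for entry in list(in_flight):
            if entry[0] <= cycle:
                in_flight.remove(entry)
                pending.discard(entry[1])
        # each partition may issue from its head, in program order
        for q in queues.values():
            if q:
                dest, srcs, latency = q[0]
                if not ((set(srcs) | {dest}) & pending):  # interlock
                    q.popleft()
                    pending.add(dest)
                    in_flight.append((cycle + latency, dest))
                    issued.append(dest)
        cycle += 1
    return issued

# The FP op slips past the stalled dependent integer op even though
# its queue is serviced second: execution is only partially ordered.
print(run({"int": [("r1", [], 2), ("r2", ["r1"], 1)],
           "fp":  [("f1", [], 1)]}))   # ['r1', 'f1', 'r2']
```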

Many of the minicomputers modeled after the 360 were similar. The only
thing "bad" about the x86 model in this regard is having to decode the
instruction to decide which bits are used to access the address
generation register file (modR/M and SIB), but having isolated the
problem to two successive instruction bytes makes the problem
manageable. The 360 could always read the address gen register file
from two ports of 4 bits each with fixed bit positions in the
instruction (saving a pipe stage).

Mitch

Andy "Krazy" Glew

Dec 31, 2010, 5:13:18 PM

I've been meaning to write an article for the http://comp-arch.net wiki
on RF structure.

We could go back a long way: "In the beginning there was an
accumulator. And a memory address register." But anything that
pretends to go back to "the beginning" of computer architecture is
likely to be ... scooped? prior-arted? E.g. I don't know off the top
of my head what Zuse's microarchitecture was, nor Babbage's.

What I can say accurately is that at one time it was common to have a
single uniform register file. E.g. the PDP-11.

The Motorola 68000 was an exception: it had separate address and data
registers. I conjecture that this was motivated not by speed of
register file access, but by instruction encoding density: having
separate address and data register files effectively saved a bit in the
register encoding.
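The conjectured encoding saving can be checked with quick arithmetic (a sketch; it assumes simple binary register fields, with the file selected by the opcode):

```python
# Encoding arithmetic for the 68000 conjecture above: 16 unified
# registers need 4-bit operand fields; split 8 data + 8 address
# registers, with the file implied by the opcode, need only 3 bits.
from math import ceil, log2

def operand_bits(regs_per_file: int) -> int:
    return ceil(log2(regs_per_file))

two_operand_unified = 2 * operand_bits(16)   # 8 bits of register fields
two_operand_split   = 2 * operand_bits(8)    # 6 bits of register fields

print(two_operand_unified - two_operand_split)   # 2
```

So a typical two-operand instruction saves two bits, not one, before accounting for the opcode space lost to duplicated and cross-file operations.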

Shortly thereafter floating point started becoming common.

With the early generations of floating point coprocessors for
microprocessors, such as the Intel 8087 for the 8086 family, and the
I-forget-it's-name for the Motorola 680x0 family, the floating point
register file was of necessity separate, since the coprocessor was a
separate chip. While you *could* have a single register file, it would
add complexity and/or slow you down with a separate coprocessor.

Around the same time DSPs started becoming common, really a specialized
microprocessor. They often had XY memory, and XY register files - i.e.
separate register files for the "X" registers and the "Y" registers.
Which were otherwise identical. Sometimes there was a separate "A"
address register file, sometimes not.

I conjecture (with fairly high confidence) that the main reason for this
was, or is, once again, instruction encoding. The usage patterns often
allowed coefficients to be used in only one operand position.

I suspect that the Intel iXP's separate register files have a similar
motivation: primarily instruction encoding and/or trying to implement a
larger register set than fits in the natural instruction register width.

---

The poster asked:


> Is there some advantage to this? Is it easier to have two register
> sets only one read and one write port each, than to have one file
> with two read and one write ports? Is this approach used
> anywhere else?

Yes, it is easier to have two separate 1R/1W RFs of size N, than to have
a single RF of size 2N with 2R/1W ports. In isolation.

However, on a typical superscalar bit-interleaved datapath, it is not as
true - because the RFs may be stacked on top of each other.

However, if you have a non-bit-interleaved datapath, then the separate
RFs have a bigger advantage.

---

Continuing:

Separate integer and floating point RFs are quite easy to justify when
integers and FP have different data sizes: 32 bit ints versus 32/64/80
bit floats. And this was probably the state of the art in scalar
microprocessors for a while, with occasional exceptions that had unified
int/FP datapaths.

But, as SIMD packed vector instruction sets became common, it became
more common to have integer packed vectors and FP packed vectors in the
same register file: 16x8b-int, 4x32b-fp, etc.

(It may be hard to believe now, but one of the hardest things to sell
about Intel MMX was that you could have integer and FP share the same
RF. Now, sharing the x87 RF may not have been so good an idea, mea
culpa, but sharing 128 XMM, and so on for GPUs, is by now ubiquitous.)

It is not clear what the BKM for a new ISA would be nowadays. Almost
undoubtedly it would have separate scalar and SIMD packed vector
register files - the latter 128b or wider. But would the scalar part
support both int and FP, or would there be separate int and FP scalars?

---

A "side-track" of computer architecture has been accumulator or extended
precision register files - 40b or 48b integers, 20b and 80 bit FP, etc.

Paul A. Clayton

Jan 1, 2011, 9:42:30 PM
On Dec 30 2010, 9:44 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip]

> In general I am a fan of 360-like instruction sets where inbound
> memory references are made to appear inexpensive. This naturally
> partitions the pipeline into several parts::
>
> a) Fetch-Decode-AddressGen
> b) Cache-LoadAlign
> c) integer execute
> d) FP execute
> e) StoreAlign-writeback
>
> And each partition can have its own operation queue using a simple
> register interlock. This gives partially ordered instructon execution
> performance without the cost of reservation stations (or scroeboards)
> at the lower end of the performance spectum and in no way hinders the
> higher performing end of the performance spectrum.

While I like load+op from the perspective of code density,
instruction count, and liveness information, I also find
the operation scheduling aspect of traditional RISC very
attractive. (I also dislike delaying branch resolution.)
For communicating single-use information, one could use
a form of explicit forwarding (e.g., the load instruction
designating the destination as the Nth implicit operand),
though control flow operations would make such more
complex. (Obviously, hoisting a load ahead of a branch
has issues--not so much for correctness, poison bits are
an obvious method to address such, but because one is not
communicating to the hardware that the only use of the
value is in a particular path. Predication might ease
this, but it might not be practical to generate the
condition early [hoisting a load by predicating on a
potentially future condition seems likely to add more
complexity than benefit].)

(To effectively buffer instructions--assuming most stalls
are from loads and not long-latency computations--it seems
one would want greater fetch bandwidth than execute
bandwidth, is that correct? How difficult would well-
optimized instruction selection/scheduling be for such an
Architecture? It seems the compiler's scheduler would
have something like the MIPS load-delay-slot scheduling,
but scheduling independent operations before the load
[to fill a buffer] rather than after the load [to fill a
delay slot]. I am not convinced that load+op should be
provided, but using such buffers--and decoupling such
parts of the processor does seem desirable--does give me
something to think about.)

(By the way, for a relatively simple processor, would
integrating the scoreboard with the register file [to
avoid redundant row decode] make sense? My wild guess
would be that such would be less power efficient--even
if one used a CAM-like mechanism to conditionally
disable the wordline, the bitlines would still be
activated. [A simple is_valid check would be simple,
but even something like is_expected_version might not
be horrifically expensive.])

Were there any Architectures that provided a load-and-
preserve+op? This _might_ make some sense if the
value was preserved in a larger, less expensive per
entry, storage--otherwise one would just load the
value into a register. (Stack and Knapsack caches
allow for direct indexing--just like registers--, avoid
or prohibit interference--like registers, and are
'prefetch' friendly. [It seems that there are values
like auxiliary stack pointers that may not be used
frequently but for which there would be benefit to a
'guarantee' of cheap access.] Such features might also
benefit from memory-memory transfers.)

Paul A. Clayton

Jan 1, 2011, 10:16:00 PM
On Dec 31 2010, 5:13 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-
arch.net> wrote:
[snip]

> The Motorola 68000 was an exception: it had separate address and data
> registers.  I conjecture that this was motivated not by speed of
> register file access, but by instruction encoding density: having
> separate address and data register files effectively saved a bit in the
> register encoding.

Specialization can also offer optimization opportunities.
(I am guessing that, as with early FP coprocessors, IBM's
POWER had registers that could be located in a front-end
chip of a multi-chip implementation. The important aspect
may be the communication of intent rather than strictly
the separation of storage, but intent can influence
optimal storage. Addresses [particularly indexing bits]
and conditions tend to have greater criticality than
computational values.)

BTW, the 68K might have been saving more than one bit
since two register operands are frequently specified.
(It loses a fractional bit from redundant opcodes for
address and data registers [ISTR that addition is
supported for both] and opcodes for cross- and mixed-
file operations.)


Paul A. Clayton
just a technophile

> Shortly thereafter floating point started becoming common.

Rob Warnock

Jan 2, 2011, 12:58:54 AM
Andy \"Krazy\" Glew <an...@SPAM.comp-arch.net> wrote:
+---------------

| What I can say accurately is that at one time it was common to have a
| single uniform register file. E.g. the PDP-11.
+---------------

Mmm... Not quite uniform: don't forget that in the PDP-11 "r7"
*was* the PC [with all the clever coding tricks that implied]...

The PDP-10's register file was closer to being completely uniform,
except that you couldn't index on "AC0". [You could still branch
to it or indirect through it, though.]


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Andrew Reilly

Jan 2, 2011, 2:37:39 AM
On Fri, 31 Dec 2010 14:13:18 -0800, Andy \"Krazy\" Glew wrote:

> Around the same time DSPs started becoming common, really a specialized
> microprocessor. They often had XY memory, and XY register files - i.e.
> separate register files for the "X" registers and the "Y" registers.
> Which were otherwise identical. Sometimes there was a separate "A"
> address register file, sometimes not.
>
> I conjecture (with fairly high confidence) that the main reason for this
> was, or is, once again, instruction encoding. The usage patterns often
> allowed coefficients to be used in only one operand position.

I suspect that your conjecture might only be part of the story. DSP
processors (at least the early ones) are basically a bit of sequencing
and branching logic wrapped around a multiply-accumulate engine. The
entire instruction set and memory spaces were geared around making either
two-reads-and-a-mac or read-mac-write (with associated address
manipulation as well) run back to back. Putting the two reads into
separate address spaces was, I suspect, easier to do than some implicit
way to allow two parallel memory accesses.

In your discussion of off-chip FPUs, don't forget things like the Weitek
FPUs that occupied a chunk of (peripheral) address space: essentially
adding a "mov-only" instruction set: registers and op-codes were encoded
in address patterns. Only one read/write at a time, but they were the
king of the hill for workstation FPU performance for a (little) while...

Cheers,

--
Andrew

Terje Mathisen

Jan 2, 2011, 4:47:31 AM
Paul A. Clayton wrote:
> Specialization can also offer optimization opportunities.
> (I am guessing that, as with early FP coprocessors, IBM's
> POWER had registers that could be located in a front-end
> chip of a multi-chip implementation. The important aspect

The origin of POWER was probably the 3-chip version; I believe one of
the last issues of BYTE had one of their great architecture articles
about that cpu (RS6000?).

Anyway, the important part imho was the separate branch (& opcode fetch?)
unit, which meant that you had a comparatively long delay between an
integer compare and the corresponding branch.

AFAIR you were advised to hoist such compares 3-5 cycles in front of the
branch to avoid bubbles.

At the time, that was an eternity, with modern OoO you would feel very
lucky if you could get the delay that small. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Andy "Krazy" Glew

Jan 2, 2011, 3:10:55 PM
On 12/30/2010 1:25 PM, MitchAlsup wrote:

> Register Files:
>
> The height of the register file (except in rare cases) is set by the
> height of the data path and rarely by topological arguments of the
> register file. The AMD 29000 is an exception to this rule and paid a
> significant area penalty routing the register file bits down to the
> height of the data path.

Terminology check:

I think that by "height" you mean the number of wires across the data path.

E.g. a datapath with a single instruction, 2 input and 1 output, each 32
bits, is 3*32=96 bits high, in your terminology.

Right?


(I betray something of my age and intellectual ancestry because I call
this "width". But then again I usually draw datapaths vertically. That
style was typical at Intel in the P6 era, 1991 to 1995, but sometime
between then and 2000 it got flipped on its side.)

Andy "Krazy" Glew

Jan 2, 2011, 5:43:45 PM


I knew when I posted this that somebody would jump on PC=R7.

:-)

MitchAlsup

Jan 2, 2011, 7:01:37 PM
On Jan 2, 4:43 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 1/1/2011 9:58 PM, Rob Warnock wrote:
>
>
>
>
>
> > Andy \"Krazy\" Glew<a...@SPAM.comp-arch.net>  wrote:

> > +---------------
> > | What I can say accurately is that at one time it was common to have a
> > | single uniform register file.  E.g. the PDP-11.
> > +---------------
>
> > Mmm... Not quite uniform: don't forget that in the PDP-11 "r7"
> > *was* the PC [with all the clever coding tricks that implied]...
>
> > The PDP-10's register file was closer to being completely uniform,
> > except that you couldn't index on "AC0". [You could still branch
> > to it or indirect through it, though.]
>
> > -Rob

> I knew when I posted this that somebody would jump on PC=R7.
>

History records that PC in the GPRs is "overly uniform."

Mitch

MitchAlsup

Jan 2, 2011, 8:54:26 PM
On Jan 2, 2:10 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 12/30/2010 1:25 PM, MitchAlsup wrote:
>
> > Register Files:
>
> > The height of the register file (except in rare cases) is set by the
> > height of the data path and rarely by topological arguments of the
> > register file. The AMD 29000 is an exception to this rule and paid a
> > significant area penalty routing the register file bits down to the
> > height of the data path.
>
> Terminology check:
>
> I think that by "height" you mean the number of wires across the data path.
>
> E.g. a datapath with a single instruction, 2 input and 1 output, each 32
> bits, is 3*32=96 bits high, in your terminology.
>
> Right?

Height (register file) is in line with the select (word) lines (wires)
and perpendicular to the bit lines (wires). So, for example, say one has
a register file of 32 registers, each register containing 64 bits. The
64 bits are in the height dimension and the 32 registers are in the
width direction.

> (I betray something of my age and intellectual ancestry because I call
> this "width".  But then again I usually draw datapaths vertically.  That
> style was typical at Intel in the P6 era, 1991 to 1995, but sometime
> between then and 2000 it got flipped on its side.)

I've seen data paths (and pipelines) drawn vertically down, left-to-
right, and right-to-left, but never vertically upwards.

The style I became accustomed to allowed one to write the
datapath around a room on the white boards. There was always more
"stuff" needing to be drawn in the data path direction (perpendicular
to the select lines), and always more white board in the horizontal
direction. The two features just seemed to go with each other,
although any of the four methods is acceptable in practice.

Note: Even when drawing the register file in a vertically oriented
data path direction, we always considered the height to be across the
data path and not 'with' the data path. I have seen similar
nomenclature differences in RAM designs over the last 35 years, also.
Consider a 4 KByte SRAM organized as 128 words of 256 bits. I would
consider the 256 bits to be in the height direction and the 128 words
to be in the width direction. You are free to use whatever
nomenclature your company espouses.

Mitch

Brett Davis

Jan 3, 2011, 12:09:40 AM
In article <xNWdneOJqpyHWILQ...@giganews.com>,

"Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> wrote:
> It is not clear what the BKM for a new ISA would be nowadays. Almost
> undoubtedly it would have separate scalar and SIMD packed vector
> register files - the latter 128b or wider. But would the scalar part
> support both int and FP, or would there be separate int and FP scalar?

The Renesas RX supports floats in one unified register file.

Embedded CPUs have such difficulties getting IPC above ~1.5
that I see no reason to have a separate vector unit.
Just make the registers wider like MIPS did.

That would make task switching more complex: you have to know how wide
the data is in each of your registers to optimize the data save/restore.

> A "side-track" of computer architecture has been accumulator or extended
> precision register files - 40b or 48b integers, 20b and 80 bit FP, etc.

The Renesas RX also includes a 48 bit accumulator for DSP work.

Torben Ægidius Mogensen

Jan 4, 2011, 5:38:01 AM
mac <al...@theworld.com> writes:

To summarize the discussion so far, there are two advantages of split
register files: You need fewer r/w ports per file to support the same
total number of reads and writes per cycle and you can save instruction
bits by letting the register file be implicit in the instruction. The
disadvantage is that it can be harder to exploit the full number of
potential reads and writes: If you split into integer and FP register
files, the FP register file may be idle a lot of the time, and if you
split into address and data registers, the address registers may be idle
in some periods while the data registers are idle in others.

The savings in instruction bits and in ports are really separate: You can
imply register-file bits even if there is only one physical register file,
and you can split the physical register file even if there is only one
logical register file. But you get the most benefit if you do both: You can
tie each register file to its own data paths and ALUs (which is typical
for an integer/FP split) or you can force instructions to use one
argument from each file (like the 2800). The latter allows separate
data paths to the ALU (left argument from one file, right argument from
the other) but shares the ALU.
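As a back-of-the-envelope sketch of the instruction-bit saving (the register counts here are made up for illustration): with 32 registers in one unified file, three operand fields cost 15 bits, while splitting the same 32 registers into two files of 16, with the file implied by the opcode, cuts that to 12.

```python
import math

def operand_bits(regs_per_file, operands=3):
    """Bits spent naming `operands` registers when each field only
    indexes within one file (the file itself is implied by the opcode)."""
    return operands * math.ceil(math.log2(regs_per_file))

print(operand_bits(32))  # one unified file of 32 regs: 3 x 5 = 15 bits
print(operand_bits(16))  # two implied files of 16:     3 x 4 = 12 bits
```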

Some months ago, I suggested that you could split a single logical
register file into one bank for the odd registers and another for the
even registers. Each ALU instruction would specify a full register
number for the result register, which would also be the implied first
argument, but n-1 bits for the second argument, with the last bit
implied to be the negation of the last bit of the first argument. Like
the 2800, this would force a read from each register file. Asymmetric
ALU operations (such as subtract) would exist in both versions, so
x:=x-y and x:=y-x could both be encoded.

Such a scheme gives almost the same cost and instruction-bit saving as,
say, separate address and data registers, but there is not the same risk
of starving one register file. Sure, it adds a burden on the register
allocator to reduce the number of transfers between banks, but I don't
believe this to be a major issue.
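A minimal sketch of that odd/even encoding (the 5-bit register numbers and field layout are my own choices, not from the post): the destination is stored in full and doubles as the implied first source, while the second source drops its low bit, which decode recovers as the complement of the destination's low bit.

```python
def encode(dest, src2, nbits=5):
    # dest is stored in full (it is also the implied first source);
    # src2's low bit must differ from dest's, so only its upper
    # nbits-1 bits need to be stored
    assert (src2 ^ dest) & 1, "operands must come from opposite banks"
    return (dest << (nbits - 1)) | (src2 >> 1)

def decode(word, nbits=5):
    dest = word >> (nbits - 1)
    src2_hi = word & ((1 << (nbits - 1)) - 1)
    src2 = (src2_hi << 1) | ((dest & 1) ^ 1)  # implied low bit
    return dest, src2

print(decode(encode(6, 9)))   # even dest, odd second source
print(decode(encode(3, 8)))   # odd dest, even second source
```

Both subtract orderings (x:=x-y and x:=y-x) would then be distinct opcodes, as the post notes, since the encoding itself fixes which bank each operand comes from.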

Torben

nm...@cam.ac.uk

unread,
Jan 4, 2011, 5:27:52 AM1/4/11
to
In article <7zipy5x...@ask.diku.dk>,

Torben Ægidius Mogensen <tor...@diku.dk> wrote:
>mac <al...@theworld.com> writes:
>
>> During the discussion of floating-point array indexes, there was a
>> mention that partitioning registers into integer and floating-point sets
>> makes for simpler register access. ...

>
>To summarize the discussion so far, there are two advantages of split
>register files: You need fewer r/w ports per file to support the same
>total number of reads and writes per cycle and you can save instruction
>bits by letting the register file be implicit in the instruction. The
>disadvantage is that it can be harder to exploit the full number of
>potential reads and writes: ...

>
>The saving in instruction bits and ports are really separate: You can
>imply register bits even if there is only one physical register file and
>you can split the physical register file even if there is only one
>logical register file. ...

Yes.

A long time ago I proposed a trivial alternative, that would have
a lot of advantages (and, as usual, was an idea similar to those
used in some older machines).

A single register file - or, as you propose, dual ones, with four
extra operations. Those could be implicit (CISC) or explicit (RISC):

Decompose register as floating-point
Compose floating-point as register
Decompose register as address
Compose address as register

The decomposed forms would be 'hidden' and it would be up to the
implementation when to use them and when not to, though it would
be an error to use a register in the wrong state.

Floating-point comparison and category checking would be allowed
on the raw registers, but nowt else. There would also be 'load
mantissa' and similar register-register operations.
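A toy model of that hidden-state discipline (the state names and method signatures here are my invention, purely to show "it would be an error to use a register in the wrong state"):

```python
class Reg:
    RAW, FP, ADDR = "raw", "fp", "addr"   # hypothetical hidden states

    def __init__(self, value=0.0):
        self.value, self.state = value, Reg.RAW

    def decompose_as_fp(self):
        assert self.state == Reg.RAW, "register not in raw state"
        self.state = Reg.FP  # an impl. might now hold sign/exp/mantissa split

    def compose_as_register(self):
        assert self.state in (Reg.FP, Reg.ADDR)
        self.state = Reg.RAW

    def fp_add(self, other):
        # arithmetic is only legal on the decomposed FP form
        assert self.state == Reg.FP and other.state == Reg.FP
        self.value += other.value

r, s = Reg(1.5), Reg(2.0)
r.decompose_as_fp(); s.decompose_as_fp()
r.fp_add(s)               # legal: both in FP state
r.compose_as_register()
print(r.value)
```

In real hardware the tag would be invisible microarchitectural state; the asserts stand in for whatever trap a wrong-state use would raise.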

The reason for having address forms is RAS - if anyone gave a
damn nowadays :-( - but they would also give significant performance
gains where TLB thrashing is an issue.


Regards,
Nick Maclaren.

MitchAlsup

unread,
Jan 4, 2011, 6:00:01 PM1/4/11
to
On Dec 31 2010, 4:13 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-

Up through the 68030, the data paths for the A and D sections were
separate parts of the data path (i.e. not bit interleaved). This had
to do with 1-metal (plus silicide) FAB technology not allowing the
now-standard bit interleaving.

In addition, the register file sense amplifiers were attached to the
whole register bus, that is over the computation elements as well as
over the register files. This greatly shrank the drivers necessary to
put results back on the register busses. And the use of the SA twice
per instruction went a long way to making these CPUs process no more
than one instruction every 2 clocks.

When data was transferred between A and D sections, 2 n-channel pass
gates were asserted and the sense amp of the receiving side 'fired' to
receive the data.

The heroics one had to perform just to get a 1-metal processor working
are not understood by today's designers with 6, 8, 10, or 12 layers of
metal. There were places where we used two transistors as a latch, an
AND gate, and a gate driving 32 loads. This particular gate had a total
delay of less than a std 2-input NAND gate when the std NAND gate was
driving a unit load. Amazing stuff.

Mitch

Anton Ertl

unread,
Jan 6, 2011, 10:12:00 AM1/6/11
to
"Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> writes:
>We could go back a long way: "In the beginning there was an
>accumulator. And a memory address register."
...

>The Motorola 68000 was an exception: it had separate address and data
>registers.

It was not really an exception. It had 8 accumulators and 8 memory
address registers; this was an evolution from the 6800 with 2
accumulators and one index register, and the 6809 with a bit more.

>It is not clear what the BKM for a new ISA would be nowadays. Almost
>undoubtedly it would have separate scalar and SIMD packed vector
>register files - the latter 128b or wider. But would the scalar part
>support both int and FP, or would there be separate int and FP scalar?

Looking at how AMD64 is used, I would say that there would be no
separate FP scalar (AMD64 has it, but it is not used); instead, scalar
FP would be performed using the SIMD register file. In this way we
are back at an architecture with accumulators (the SIMD registers) and
index/address registers (the "GPR"s), but with a useful set of
operations on the index/address registers (the 68k was too restricted
there).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Jason Riedy

unread,
Jan 6, 2011, 12:09:14 PM1/6/11
to
And Anton Ertl writes:
> Looking at how AMD64 is used, I would say that there would be no
> separate FP scalar (AMD64 has it, but it is not used); [...]

*I* use it. Eighty bits can be quite handy, and doubled-80-bit
arithmetic removes the need to think about many issues while testing
floating-point algorithms. sigh.

Jason

Andy "Krazy" Glew

unread,
Jan 7, 2011, 12:26:48 AM1/7/11
to
On 1/6/2011 7:12 AM, Anton Ertl wrote:
> "Andy \"Krazy\" Glew"<an...@SPAM.comp-arch.net> writes:
>> We could go back a long way: "In the beginning there was an
>> accumulator. And a memory address register."
> ....

>> The Motorola 68000 was an exception: it had separate address and data
>> registers.
>
> It was not really an exception. It had 8 accumulators and 8 memory
> address registers; this was an evolution from the 6800 with 2
> accumulators and one index registers and the 6809 with a bit more.
>
>> It is not clear what the BKM for a new ISA would be nowadays. Almost
>> undoubtedly it would have separate scalar and SIMD packed vector
>> register files - the latter 128b or wider. But would the scalar part
>> support both int and FP, or would there be separate int and FP scalar?
>
> Looking at how AMD64 is used, I would say that there would be no
> separate FP scalar (AMD64 has it, but it is not used); instead, scalar
> FP would be performed using the SIMD register file. In this way we
> are back at an architecture with accumulators (the SIMD registers) and
> index/address registers (the "GPR"s), but with a useful set of
> operations on the index/address registers (the 68k was too restricted
> there).
>
> - anton


I reserve the term "accumulators" for operations that have extra width,
as in adding 32-bit numbers to a 64-bit accumulator.
