
Bringing Back 36-bit Floating-Point Numbers


Quadibloc

Jun 9, 2012, 12:06:17 PM
On my page at

http://www.quadibloc.com/arch/ar0504.htm#Fast_Intermediate

I have recently made some changes, adding to my scheme of dividing 256
bits of memory into one of the following:

four 64-bit double-precision floats

one 64-bit double-precision float and
four 48-bit intermediate-precision floats

one 64-bit double-precision float and
one 48-bit intermediate-precision float and
four 36-bit single-precision floats

to provide for a few other lengths of number. I had originally
included 40-bit floats as an option, but now one can also combine two
40-bit floats into an 80-bit float, or two 36-bit floats into a
72-bit float.

A more difficult modification was adding the ability to divide 256
bits of memory into

one 64-bit double-precision float and
two 60-bit (alternate) double-precision floats and
two 36-bit single-precision floats.

Right now, though, except for the use of "secondary formats", one can
set up only _one_ scheme of memory subdivision at a time for a single
program (context switching to a different PSW changes the scheme),
instead of subdividing different parts of memory in different ways so
as to match the mix of floating-point precisions required for a given
program.

So I may need to rethink this; perhaps packaging the subdivision
schemes instead of specifying them "by hand", and allowing a mode in
which the top three bits of a (64-bit!) base register select the
subdivision scheme, for example.

Of course, the top three bits of a regular 32-bit base register can't
be spared that way.

Incidentally, if what I'm up to in this part of my architecture is not
clear, the basic principle is more fully spelled out on this page:

http://www.quadibloc.com/arch/per05.htm

and the ultimate goal is explained here:

http://www.quadibloc.com/arch/perint.htm

- to have a 36-bit float which combines the exponent range of IBM 360
single precision floats with the precision of IBM 7090 single-
precision floats (thanks to the IEEE 754 hidden bit), and

- to have a 48-bit float which provides 11-digit precision, thus
matching a pocket calculator without providing the excessive precision
that double-precision often provides.

The idea is, of course, that one builds a whole separate Wallace tree
for each precision, because almost nothing is as urgent, and as worth
spending transistor count on, as shaving every last cycle off the
latency of floating-point computations.

This assumes that somehow - presumably through shrewd marketing and
fancy video games - the mass market is made to demand processor
chips that are designed like the old-fashioned supercomputers, hence
eliminating the big gap in favor of commodity chips versus custom
scientific computing designs, because the scientific computing designs
_are_ the commodity chips.

John Savard

MitchAlsup

Jun 9, 2012, 4:34:27 PM
On Saturday, June 9, 2012 11:06:17 AM UTC-5, Quadibloc wrote:
> This assumes that somehow - presumably through shrewd marketing and
> fancy video games - the mass market is made to demand processor
> chips that are designed like the old-fashioned supercomputers,

Old fashioned supercomputers had instruction sets designed to perform
double precision directly and to assist the programmer in producing
quad precision and expandable precision numeric codes. This is what
the Add Double, Multiply Double and Normalize instructions were used
for in the CDC 6600. They directly assist in computing exact sums and
exact products. As in:

{double, double} TwoSum( double a, double b )
{
    double x, y;
    x = a + b;                /* rounded sum */
    y = ADD_DOUBLE( a, b );   /* low-order bits of the exact sum */
    return {x, y};
}

{double, double} TwoProd( double a, double b )
{
    double x, y;
    x = a * b;                /* rounded product */
    y = MUL_DOUBLE( a, b );   /* low-order bits of the exact product */
    return {x, y};
}

where x + y == a op b exactly,
x == a op b as rounded,
and y is all the bits that did not make it into x.

Mitch

Quadibloc

Jun 10, 2012, 11:54:33 AM
On Jun 9, 2:34 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> Old fashioned supercomputers had instruction sets designed to perform
> double precision directly and to assist the programmer in producing
> quad precision and expandable precision numeric codes.

Yes, and most mainframes did as well; only a few early mainframes did
double-precision in software. Today's microchips do double-precision
directly, but they don't do single-precision directly on a lower-
latency arithmetic unit.

I have made some corrections and updates to the page announced here;
the relevant section has been split out to its own page, at

http://www.quadibloc.com/arch/ar050402.htm

now. 60-bit precision has been changed from "intermediate" to
"extended" (in the sense of which instructions are used to access it;
it's still a form of double precision for other purposes, such as its
exponent size and which arithmetic unit handles it).

I've included a large, colorful diagram at the bottom of the page,
which should help clarify how the various modes and settings work
together in offering a selection of numeric precisions to the current
program, with flexibility comparable to that of the conventional
floating-point hierarchy, although not quite as good, since when sizes
don't match, there's left-over space that one may or may not be able
to use.

John Savard

Del Cecchi

Jun 10, 2012, 8:59:28 PM
I believe the likelihood of the return of 36 bit floating point is
described by the old acronym.... ANFW



Terje Mathisen

Jun 11, 2012, 1:58:22 AM
Del Cecchi wrote:
>
> I believe the likelihood of the return of 36 bit floating point is
> described by the old acronym.... ANFW

<bg>

I'm not quite so sure: It is _possible_ that some graphics and/or
DSP-like chip will use an internal "extended precision" format when
evaluating single/float numbers.

One or two more exponent bits, plus two or three more mantissa bits,
could avoid some extra checks for underflow/overflow.

Having that format user/programmer-visible anywhere? As you said, ANFW.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


nm...@cam.ac.uk

Jun 11, 2012, 3:54:18 AM
In article <11ega9-...@ntp6.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Del Cecchi wrote:
>>
>> I believe the likelihood of the return of 36 bit floating point is
>> described by the old acronym.... ANFW
>
>I'm not quite so sure: It is _possible_ that some graphics and/or
>DSP-like chip will use an internal "extended precision" format when
>evaluating single/float numbers.
>
>One or two more exponent bits, three or two more mantissa bits could
>avoid some extra checks for underflow/overflow.
>
>Having that format user/programmer-visible anywhere? As you said, ANFW.

I am not sure. Implausible in the near future, yes, but I have
been surprised at the number of such invariants that have fairly
suddenly disappeared. If it happens, it will come from outfield,
in the wake of an even more revolutionary change.


Regards,
Nick Maclaren.

Michael S

Jun 11, 2012, 10:40:37 AM
On Jun 11, 8:58 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> Del Cecchi wrote:
>
> > I believe the likelihood of the return of 36 bit floating point is
> > described by the old acronym.... ANFW
>
> <bg>
>
> I'm not quite so sure: It is _possible_ that some graphics and/or
> DSP-like chip will use an internal "extended precision" format when
> evaluating single/float numbers.
>

I don't believe in 36-bit FP in DSP.
Floating-point DSPs traditionally used 40-bit FP registers, holding
either an extended-precision IEEE 40-bit format (1+8+31) or a
proprietary format with a 32-bit two's-complement mantissa and an
8-bit exponent. One obvious advantage of these formats over 36 bits is
the ability to represent any 32-bit integer exactly.
Those extended formats were fully visible to programmers and could be
relatively easily saved/restored to/from memory.

I just took a look at the ADI web site and to my amusement found out
that at least one of those traditional DSPs, the ADSP-21489, has
product status "Recommended for New Designs"!

The newer TigerSHARC DSP still supports the 40-bit format, but there
it is not free in terms of throughput.

Quadibloc

Jun 11, 2012, 1:13:10 PM
On Jun 10, 6:59 pm, Del Cecchi <delcec...@gmail.com> wrote:

> I believe the likelihood of the return of 36 bit floating point is
> described by the old acronym.... ANFW

I agree that it is extremely unlikely.

Given, however, the following facts:

- When the 7090 was succeeded by the 360, a lot of programs had to
switch all the way up to double precision, so the precision provided
by 704 single precision floating-point was useful, and

- The exponent range of IEEE 754 32-bit floats is significantly
smaller than that of IBM 360 single precision, leading to an
incompatibility with a major chunk of software,

the fact that it's possible to construct a 36-bit number format based
on IEEE 754 that has the exponent range of System/360 floats and the
precision of 7090 single-precision floats (thanks to that one extra
hidden bit) makes 36-bit floating-point numbers attractive.

So, although I don't think anything will happen any time soon, I still
think it useful to illustrate a method by which one could implement 36-
bit floats in 256-bit wide memory...

avoiding wasted space,

avoiding having to do a divide operation to locate the address of the
N-th element of an array, and

avoiding having to cross word boundaries during an operand fetch (and
thus having to fetch twice).

This general technique could be used (and in fact, some subsets have
already been used) in things like graphics cards or other specialized
equipment. Showing how the principle could be more fully extended
increases people's understanding of how computers can work.

John Savard

Quadibloc

Jun 11, 2012, 2:45:39 PM
Further extensions to the feature have been made on the page

http://www.quadibloc.com/arch/ar050402.htm

which allow the extended precision instructions to be used to access
an additional floating-point number in either single precision or
intermediate precision within a 256-bit memory line, thus allowing a
more optimal allocation of memory.

John Savard

MitchAlsup

Jun 11, 2012, 5:06:14 PM
On Monday, June 11, 2012 12:13:10 PM UTC-5, Quadibloc wrote:
> On Jun 10, 6:59 pm, Del Cecchi <delcec...@gmail.com> wrote:
>
> > I believe the likelihood of the return of 36 bit floating point is
> > described by the old acronym.... ANFW
>
> I agree that it is extremely unlikely.
>
> Given, however, the following facts:
>
> - When the 7090 was succeeded by the 360, a lot of programs had to
> switch all the way up to double precision, so the precision provided
> by 704 single precision floating-point was useful, and
>
> - The exponent range of IEEE 754 32-bit floats is significantly
> smaller than that of IBM 360 single precision, leading to an
> incompatibility with a major chunk of software,
>
> the fact that it's possible to construct a 36-bit number format based
> on IEEE 754 that has the exponent range of System/360 floats and the
> precision of 7090 single-precision floats, thanks to that one extra
> hidden bit makes 36-bit floating point numbers attractive.

We are now in an era where programmers are starting to leave 32 bits behind,
with ints and pointers becoming 64 bits in size. In addition, few processors
are usefully slower in double precision than in single precision, so the
question turns around into:

Why not make 64-bit FP the 'single precision' of the 201x-and-onward? Then
support even larger FP formats with the ability to perform exact numerics.

Mitch

Paul A. Clayton

Jun 11, 2012, 7:06:36 PM
On Jun 11, 5:06 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip]
> We are now in an era where programmers are starting to leave 32 bits behind
> with ints and pointers becoming 64 bits in size. In addition, few processors
> are usefully slower in double precision than in single precision,

Except for SIMD/SIMT operation (and, I think, division).
Higher precision also has cache capacity and memory
bandwidth (and communication energy?) implications.

(For computationally dense applications, the substantially
lower energy cost of half-precision multiplication might
be significant.)

> so, the
> question turns around into:
>
> Why not make 64-bit FP the 'single precision' of the 201x-and-onward? Then
> support even larger FP formats with the ability to perform exact numerics.

With ARM (like SSE x86) using 128-bit SIMD registers for
FP, supporting 128-bit FP makes some sense. For x86,
one might support 80-bit FP in SSE/AVX registers at full
speed while potentially implementing slower 128-bit FP
operations. Such might support the deprecation of x87.

(I have thought there might be a place for 80-bit FP
multiplication with 128-bit FP addition, but I suspect
that such would not be worth the bother.)

(Some Power implementations might also map FP to SIMD
registers that are at least 128-bits wide. I am too
lazy to look this up at the moment. IIRC, POWER/PowerPC
was odd in expanding single precision values into
double precision register storage.)

Piotr Wyderski

Jun 12, 2012, 7:45:23 AM
MitchAlsup wrote:

> Why not make 64-bit FP the 'single precision' of the 201x-and-onward? Then
> support even larger FP formats with the ability to perform exact numerics.

But continue to support the 32-bit ones, as they are crucial for
the inexact, but high-performance codes, e.g. in raytracing.

Best regards, Piotr

Quadibloc

Jun 12, 2012, 2:36:32 PM
For _many_ applications, the most powerful computer in existence will
never be powerful enough. So if it's possible to design chips in such
a way that, with a given technology, with a given feature size and
gate delay, a few nanoseconds can be shaved off the latency of a
floating-point computation of the minimum precision required for a
given problem... it is *highly* desirable to do so.

I make no apologies for suggesting something which I admit is
complicated for the purpose of addressing the needs of researchers
working at the frontiers of science.

Even if my current suggestion may be viewed by some as one of the
arrows in their backs rather than something of positive value.

John Savard

Quadibloc

Jun 12, 2012, 6:47:11 PM
Oh, incidentally: I've modified the page at

http://www.quadibloc.com/arch/ar050402.htm

again, to add a large diagram at the bottom of the page which
explicitly shows how floating-point instruction operands will map to
operands of alternate lengths in the final mode, which is the most
complicated one.

John Savard

nm...@cam.ac.uk

Jun 13, 2012, 5:18:42 AM
In article <c0e0044b-1dcb-48a5...@ki5g2000pbb.googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>For _many_ applications, the most powerful computer in existence will
>never be powerful enough. So if it's possible to design chips in such
>a way that, with a given technology, with a given feature size and
>gate delay, a few nanoseconds can be shaved off the latency of a
>floating-point computation of the minimum precision required for a
>given problem... it is *highly* desirable to do so.

That is not true, though commonly believed. The vast majority
(probably all) of such problems have highly non-linear complexity,
and a small factor of speed increase gains essentially nothing.
Indeed, a slower machine may actually be faster, if it provides
mechanisms that enable the programmer to use and debug a somewhat
more advanced algorithm, where they are impractical on the faster
machine.


Regards,
Nick Maclaren.

Piotr Wyderski

Jun 13, 2012, 7:19:43 AM
nm...@cam.ac.uk wrote:

> That is not true, though commonly believed. The vast majority
> (probably all) of such problems have highly non-linear complexity,
> and a small factor of speed increase gains essentially nothing.

In the discussed case it is all about SIMD. An SSE register
is as wide as it is and it won't be any wider. So it's only
a matter of the programmer's desired partitioning. You can
have either two doubles or four floats at a time. The relative
performance (if any) between double and float is irrelevant
in this context -- if you can double your bandwidth and the
algorithm is fine with the resulting reduced precision, using
floats is an obvious choice.

Anyway, I agree with the spirit of Mitch's post: the double
type should be the default, not an extended precision FP type.
But the reduced precision types are needed too, FP32 and even
FP16.

> Indeed, a slower machine may actually be faster, if it provides
> mechanisms that enable the programmer to use and debug a somewhat
> more advanced algorithm, where they are impractical on the faster
> machine.

The sad reality is that hardly anybody understands advanced
numeric algorithms. Even sadder is that nobody (subtract a finite
number of counterexamples) cares about that. :-(

Best regards, Piotr

Quadibloc

Jun 13, 2012, 11:24:56 AM
Further updates have taken place:

- it has been made explicit when Data Memory Width control is
overridden for the Extended Precision instructions,

- it has been noted what happens when Subdivided and Fast floating-
point forms are both selected, and

- the large diagram at the bottom of the page has been modified so
that it is not as wide, and thus it is no longer needed to scroll to
the right to see the most important part of it on most screens.

John Savard

MitchAlsup

Jun 13, 2012, 1:02:22 PM
On Wednesday, June 13, 2012 6:19:43 AM UTC-5, Piotr Wyderski wrote:
> Anyway, I agree with the spirit of Mitch's post: the double
> type should be the default, not an extended precision FP type.
> But the reduced precision types are needed too, FP32 and even
> FP16.
>

I think you missed my point! My point was that the C compiler
should take the declarative statement:

float a;

and create a 64-bit floating point variable! (similarly for the
ftn compiler.) After all, we have pointers at 64-bits, integers
at 64-bits, memory references at 64-bits, why not satisfy the
base FP precision at 64-bits?

Yes there are space/speed reasons for keeping 32-bit FP, but most of
the speed reasons exist over in the short vector side of computations.

I can ALSO see the need for smaller FP being packed up in different
ways. However, here, I think a robust bit swizzler that unpacks or
repacks the swizzled data for memory reference purposes is sufficient,
leaving the base types the HW computation units deal with fixed.

> > Indeed, a slower machine may actually be faster, if it provides
> > mechanisms that enable the programmer to use and debug a somewhat
> > more advanced algorithm, where they are impractical on the faster
> > machine.
>
> The sad reality is that hardly anybody understands advanced
> numeric algorithms. Even sadder is that nobody (subtract a finite
> number of counterexamples) cares about that. :-(

I was reading up on exact floating-point numerics in the last couple of
weeks. In one paper they mentioned that their base-level algorithms
use 6-15 times as many FP instructions as the more direct forms,
but only run 2.5 times slower. The author speculated that all of those
sequentially dependent FP instructions become independent "on the next
loop", harming speed by way less than what is obvious. I suspect they
are correct in their suspicion.

What I would like to see is all of those 6-15 FP instruction exact
procedures be expressible in 2 FP instructions, which would dramatically
reduce the cost of extended-precision activities, probably to the point
numericalists could afford to use them.

But I would not say I am anywhere close to understanding the subtleties
of those algorithms--just to the point I know what to add to an ISA.

Mitch

nm...@cam.ac.uk

Jun 13, 2012, 1:22:12 PM
In article <ace225b5-f767-4fae...@googlegroups.com>,
MitchAlsup <Mitch...@aol.com> wrote:
>On Wednesday, June 13, 2012 6:19:43 AM UTC-5, Piotr Wyderski wrote:
>
>> > Indeed, a slower machine may actually be faster, if it provides
>> > mechanisms that enable the programmer to use and debug a somewhat
>> > more advanced algorithm, where they are impractical on the faster
>> > machine.
>>
>> The sad reality is that hardly anybody understands advanced
>> numeric algorithms. Even sadder is that nobody (subtract a finite
>> number of counterexamples) cares about that. :-(
>
>I was reading up on exact floating point numerics in the last couple of
>weeks. In one paper they mentioned that their base level algorithms
>use 6-15 times as many FP instructions as the more direct forms,
>but only run 2.5 times slower. The author speculated that all of those
>sequentially dependent FP instructions become independent "on the next
>loop", harming speed by way less that what is obvious. I suspect they
>are correct on their suspicion.

I should be extremely surprised if they ARE all sequentially
dependent! None of the multi-precision or Newton-Raphson code I
have written has had less than about 2 ILP, and some have got up
to much larger values. Also, my experience is that most ISAs are
so hostile that 50-75% of the complexity is in control instructions.

>What I would like to see is all of those 6-15 FP instruction exact
>procedures be expressible in 2 FP instructions, which would dramatically
>reduce the cost of extended precision activities, probably to the point
>numericalists could afford to use them.

I think that is a bad idea, because you have to fix the extensions
in hardware, and don't allow variants. Indeed, I would go so far
as saying that there is a strong case for providing only enabling
primitives for both floating-point and extended precision, which
would provide 80% of the gain at a fraction of the cost, AND
provide flexibility that is currently lacking.

Yes, I know - I am trying to establish the Campaign for Real RISC
Hardware! And benchmarketing rules :-(


Regards,
Nick Maclaren.

Quadibloc

Jun 13, 2012, 2:10:14 PM
On Jun 13, 11:22 am, n...@cam.ac.uk wrote:
> Indeed, I would go so far
> as saying that there is a strong case for providing only enabling
> primitives for both floating-point and extended precision, which
> would provide 80% of the gain at a fraction of the cost, AND
> provide flexibility that is currently lacking.

That does have its points, but the net result would presumably be a
kind of floating-point that is considerably simpler than IEEE 754
floating-point. So we would lose some important gains.

But it is true that enabling primitives supply flexibility, and so
they could be present in addition to packaged floating-point
instructions. Do you mean things like the normalize instruction
provided by the PDP-8 EAE?

I don't believe in fetching and decoding more instructions than I have
to, as is apparent from my strange idea of a good ISA.

John Savard

MitchAlsup

Jun 13, 2012, 4:23:59 PM
On Wednesday, June 13, 2012 12:22:12 PM UTC-5, (unknown) wrote:
> >What I would like to see is all of those 6-15 FP instruction exact
> >procedures be expressible in 2 FP instructions, which would dramaticaly
> >reduce the cost of extended precision activities, probably to the point
> >numericalists could afford to use them.
>
> I think that is a bad idea, because you have to fix the extensions
> in hardware, and don't allow variants. Indeed, I would go so far
> as saying that there is a strong case for providing only enabling
> primitives for both floating-point and extended precision, which
> would provide 80% of the gain at a fraction of the cost, AND
> provide flexibility that is currently lacking.

Well, for example, one can code TwoProduct as:

{double, double} TwoProd( double a, double b )
{
    double x = a*b;
    double y = FMAC( a, b, -x );   /* fused, so a*b - x is exact */
    return {x, y};
}

Yet, nobody proclaims that FMAC is useless outside of
exact FP numerics; quite the contrary. Of the two other
instructions I support for this endeavour, one has
great utility, similar to FMAC; the other one simply
makes the renormalization process for extendeds fast--
something you cannot do within IEEE rounding (5 cases)--
yet it requires only about 15 gates to implement, given
you already have IEEE rounding modes and an FADD.

> Yes, I know - I am trying to establish the Campaign for Real RISC
> Hardware! And benchmarketing rules :-(

Einstein would say: "make things as simple as possible,
but no simpler!"

After studying this issue, I think my goal remains sound.
And I would claim that the typical RISC FP instruction set
is, at best, lacking (or CISC for that matter). CDC had the
essence of this in 1963, and compiled SINGLE PRECISION as
a 60-bit variable.

Mitch

MitchAlsup

Jun 13, 2012, 4:29:20 PM
On Wednesday, June 13, 2012 1:10:14 PM UTC-5, Quadibloc wrote:
> On Jun 13, 11:22 am, n...@cam.ac.uk wrote:
> > Indeed, I would go so far
> > as saying that there is a strong case for providing only enabling
> > primitives for both floating-point and extended precision, which
> > would provide 80% of the gain at a fraction of the cost, AND
> > provide flexibility that is currently lacking.
>
> That does have its points, but the net result would presumably be a
> kind of floating-point that is considerably simpler than IEEE 754
> floating-point. So we would lose some important gains.

Just a note: My instruction additions are completely within the
letter and spirit of IEEE 754-2008.

> But it is true that enabling primitives supply flexibility, and so
> they could be present in addition to packaged floating-point
> instructions. Do you mean things like the normalize instruction
> provided by the PDP-8 EAE?

No, I mean that additive cousin to FMAC, and a FP add instruction
suitable for the renormalization of extended precision FP numerics.

> I don't believe in fetching and decoding more instructions than I have
> to, as is apparent from my strange idea of a good ISA.

Mine, too. I have a nearly pure RISC instruction set that processes
large (32-bit and 64-bit) immediates and displacements from the
instruction stream--no need to load constants, or paste bits together.

Mitch

nm...@cam.ac.uk

Jun 13, 2012, 4:42:02 PM
In article <c4579324-a923-4b77...@vs10g2000pbc.googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>>
>> Indeed, I would go so far
>> as saying that there is a strong case for providing only enabling
>> primitives for both floating-point and extended precision, which
>> would provide 80% of the gain at a fraction of the cost, AND
>> provide flexibility that is currently lacking.
>
>That does have its points, but the net result would presumably be a
>kind of floating-point that is considerably simpler than IEEE 754
>floating-point. So we would lose some important gains.

Firstly, that does not follow. Secondly, quite a lot of the
brass knobs of IEEE 754 are extremely dubious in value, and some
probably have negative value.

>But it is true that enabling primitives supply flexibility, and so
>they could be present in addition to packaged floating-point
>instructions. Do you mean things like the normalize instruction
>provided by the PDP-8 EAE?

I don't know that architecture, but possibly.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

Jun 13, 2012, 4:46:08 PM
In article <fe223d05-100a-4c7a...@googlegroups.com>,
MitchAlsup <Mitch...@aol.com> wrote:
>>
>> >What I would like to see is all of those 6-15 FP instruction exact
>> >procedures be expressible in 2 FP instructions, which would dramatically
>> >reduce the cost of extended precision activities, probably to the point
>> >numericalists could afford to use them.
>>
>> I think that is a bad idea, because you have to fix the extensions
>> in hardware, and don't allow variants. Indeed, I would go so far
>> as saying that there is a strong case for providing only enabling
>> primitives for both floating-point and extended precision, which
>> would provide 80% of the gain at a fraction of the cost, AND
>> provide flexibility that is currently lacking.
>
>Well, for example, one can code TwoProduct as:
>
> {double, double} TwoProd( double a, double b )
> {
>     double x = a*b;
>     double y = FMAC( a, b, -x );
>     return {x, y};
> }
>
>Yet, nobody proclaims that FMAC is useless outside of
>exact FP numerics, quite the contrary. The two other
>instructions I support for this endeavour, one has
>great utility, similar to FMAC, the other one simply
>makes the renormalization process for extendeds fast--
>something you cannot do within IEEE rounding (5 cases);
>yet requires only about 15 gates to implement, given
>you already have IEEE rounding modes and an FADD.


Ah. Cross-purposes. We are on similar wavelengths, but I think
that you need more than 2. Most of them are trivial in hardware
and real beggars in software.

>> Yes, I know - I am trying to establish the Campaign for Real RISC
>> Hardware! And benchmarketing rules :-(
>
>Einstein would say: "make things as simple as possible,
> but no simpler!"
>
>After studying this issue, I think my goal remains sound.
>And I would claim that the typical RISC FP instruction set
>is, at best, lacking (or CISC for that mater). CDC had the
>essence of this in 1963, and compiled SINGLE PRECISION as
>a 60-bit variable.

Yes. I agree with both, though I would somewhat dispute the
merits of the CDC design!


Regards,
Nick Maclaren.

Quadibloc

Jun 13, 2012, 5:50:45 PM
On Jun 13, 2:42 pm, n...@cam.ac.uk wrote:
quoting me,
> >Do you mean things like the normalize instruction
> >provided by the PDP-8 EAE?

> I don't know that architecture, but possibly.

Then, I will elucidate. That instruction shifted the fixed-point
accumulator left as far as needed to put the most significant 1 in the
first bit position, and returned a count of the shifts in another
register.

John Savard

Uncle Steve

Jun 13, 2012, 8:46:00 PM
On Sun, Jun 10, 2012 at 08:54:33AM -0700, Quadibloc wrote:
> On Jun 9, 2:34 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > Old fashioned supercomputers had instruction sets designed to perform
> > double precision directly and to assist the programmer in producing
> > quad precision and expandable precision numeric codes.
>
> Yes, and most mainframes did as well; only a few early mainframes did
> double-precision in software. Today's microchips do double-precision
> directly, but they don't do single-precision directly on a lower-
> latency arithmetic unit.
>
> I have made some corrections and updates to the page announced here;
> the relevant section has been split out to its own page, at
>
> http://www.quadibloc.com/arch/ar050402.htm

Reading up a bit in that specification, you suggest that the added
complexity of the processor is intended to accommodate programmer
preferences, but I think you're heading in slightly the wrong
direction. I think it will be more a matter of selecting a different
microcode load for the processor to select things like floating-point
addressing arrangements, rather than etching it directly into Si, or
whatever.

As to the constraints described in the referenced page, you could do a
search of 256-bit memory configurations sorted by least number of
types, and compare it with the kinds of numbers used in common
numerical calculations to see how useful these accessing modes might
be. Admittedly, I haven't yet seen how it would be meaningfully
useful to have a distribution of available FP numbers such as 64, 48,
36, 36, 36, 36, although some of the other arrangements look slightly
interesting at first glance. OTOH, I wrote a few years ago about
addressing extensions to allow operations on arrays, which would make
memcpy a single instruction and would potentially simplify matrix
calculations.



Regards,

Uncle Steve

--
"Giving money and power to the government is like giving whiskey and
car keys to teenaged boys" - P. J. O'Rourke

Rick Jones

Jun 13, 2012, 8:59:26 PM
MitchAlsup <Mitch...@aol.com> wrote:
> After all, we have pointers at 64-bits, integers at 64-bits, memory
> references at 64-bits, why not satisfy the base FP precision at
> 64-bits?

IIRC we have ILP32 most everywhere, LP64 for *nix "64-bit" and P64 for
Windows "64-bit."

rick jones
--
oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Terje Mathisen

Jun 14, 2012, 2:48:10 AM
nm...@cam.ac.uk wrote:
> In article <ace225b5-f767-4fae...@googlegroups.com>,
> MitchAlsup <Mitch...@aol.com> wrote:
>> I was reading up on exact floating point numerics in the last couple of
> weeks. In one paper they mentioned that their base level algorithms
>> use 6-15 times as many FP instructions as the more direct forms,
>> but only run 2.5 times slower. The author speculated that all of those
>> sequentially dependent FP instructions become independent "on the next
> loop", harming speed by way less than what is obvious. I suspect they
>> are correct on their suspicion.
>
> I should be extremely surprised if they ARE all sequentially
> dependent! None of the multi-precision or Newton-Raphson code I
> have written has had less than about 2 ILP, and some have got up
> to much larger values. Also, my experience is that most ISAs are
> so hostile that 50-75% of the complexity is in control instructions.

From the little asm code I wrote for Itanium, I'd say that you can do
extended/arbitrary precision with close to zero sequential dependency,
as long as you can afford to work in a redundant format like carry-save.

I.e. pairs of values for each block of bits, and each basic operation
does a 3-way (or 4-way) Add_with_carry instead of a regular add,
bringing in the carries from the previous operation on the next lower
significant block.

At this point you can do all of the block operations in parallel.
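
A rough Python model of the scheme Terje describes (the 64-bit limb width and the exact representation are my assumptions): each number is held as per-limb (sum, carry) pairs, so adding two numbers touches every limb independently and parks each carry-out in the next limb's carry word instead of rippling it.

```python
BITS = 64
MASK = (1 << BITS) - 1

def to_cs(x, nlimbs):
    """Split an integer into carry-save limbs (sum, carry), carries zero."""
    s = [(x >> (i * BITS)) & MASK for i in range(nlimbs)]
    return s, [0] * nlimbs

def from_cs(s, c):
    """Collapse the redundant form back into a plain integer."""
    return sum((s[i] + c[i]) << (i * BITS) for i in range(len(s)))

def cs_add(a, b):
    """Add two carry-save numbers. Each limb is handled on its own:
    the carry out of limb i lands in the carry word of limb i+1,
    so in hardware all limbs could proceed in parallel."""
    (sa, ca), (sb, cb) = a, b
    n = len(sa)
    s, c = [0] * (n + 1), [0] * (n + 1)
    for i in range(n):  # no loop-carried dependency between iterations
        t = sa[i] + ca[i] + sb[i] + cb[i]
        s[i] = t & MASK
        c[i + 1] = t >> BITS
    return s, c
```

Only the final collapse back to non-redundant form pays for a full carry propagation.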

>> What I would like to see is all of those 6-15 FP instruction exact
> procedures be expressible in 2 FP instructions, which would dramatically
>> reduce the cost of extended precision activities, probably to the point
>> numericalists could afford to use them.
>
> I think that is a bad idea, because you have to fix the extensions
> in hardware, and don't allow variants. Indeed, I would go so far
> as saying that there is a strong case for providing only enabling
> primitives for both floating-point and extended precision, which
> would provide 80% of the gain at a fraction of the cost, AND
> provide flexibility that is currently lacking.

It would allow code like I outlined above to run 128-bit operations at
up to 50% of the speed of 64-bit, but more probably 25%.
>
> Yes, I know - I am trying to establish the Campaign for Real RISC
> Hardware! And benchmarketing rules :-(

You won't get rid of the current IEEE fp operations, but you might get
access to lower-level primitives...

nm...@cam.ac.uk

Jun 14, 2012, 5:13:06 AM
In article <b2eoa9-...@ntp6.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
> From the little asm code I wrote for Itanium, I'd say that you can do
>extended/arbitrary precision with close to zero sequential dependency,
>as long as you can afford to work in a redundant format like carry-save.

About O(1/N), where N is the number of words you are using, in the
code of that form that I have written.


Regards,
Nick Maclaren.

Quadibloc

Jun 14, 2012, 12:56:13 PM
On Jun 13, 6:46 pm, Uncle Steve <stevet...@gmail.com> wrote:
> I think it will be more of a matter of selecting a
> different microcode load for the processor to select things like
> floating-point addressing arrangements, rather than etching it
> directly into Si, or whatever.

The trouble with that is that using microcode instead of hardwired
logic has a performance penalty, and so the performance gain of using
optimum-sized floats is more than taken away.

The intent isn't simply to accommodate a programmer preference,
although the ability to run legacy code with minimal conversion is one
of my goals. (Data Memory Width Control, an alternative method which
is more elegant than the one discussed here, but which only works well
on smaller problems that fit in cache, is best adapted to that task.)
Rather, the aim is to achieve maximum machine efficiency by using an
ALU designed to match the precision of the numbers needed.

Basically, I see the architecture as having four kinds of floating-
point ALU:

single-precision, for floats up to 36 bits long,
intermediate-precision, for floats up to 48 bits long,
double-precision, for floats up to 72 bits long, and
extended-precision, for floats up to 128 bits long.

In a large scale implementation, there would be this many of each on
the chip:

64 + 8 + 1 (73) single-precision ALUs,
64 + 1 (65) intermediate-precision ALUs,
64 + 4 + 1 (69) double-precision ALUs,
2 + 1 (3) extended-precision ALUs.

There would be 64 each of single, intermediate, and double-precision
ALUs to permit a Cray-style vector operation to be initiated *in each
cycle* provided all operands are on the chip's 8 megabyte L2 cache.

There would also be a 256-bit short vector unit, similar to AltiVec or
the successors to MMX, including two extended-precision ALUs, four
double-precision ALUs, and eight single-precision ALUs. (A smaller
scale implementation would have the short vector unit, but handle
larger vectors strictly through pipelining in a single ALU, like the
Cray did.)

And, of course, there would be one of each type of ALU for scalar
operations.

Fixed-point ALUs would have a similar distribution.

Fully parallel decimal ALUs with advanced features like Wallace Trees,
on the other hand, would also be present but _not_ in multiples for
vector operations.

John Savard

Stephen Fuld

Jun 14, 2012, 1:35:07 PM
On 6/14/2012 9:56 AM, Quadibloc wrote:

snip

> Basically, I see the architecture as having four kinds of floating-
> point ALU:
>
> single-precision, for floats up to 36 bits long,
> intermediate-precision, for floats up to 48 bits long,
> double-precision, for floats up to 72 bits long, and
> extended-precision, for floats up to 128 bits long.
>
> In a large scale implementation, there would be this many of each on
> the chip:
>
> 64 + 8 + 1 (73) single-precision ALUs,
> 64 + 1 (65) intermediate-precision ALUs,
> 64 + 4 + 1 (69) double-precision ALUs,
> 2 + 1 (3) extended-precision ALUs.
>
> There would be 64 each of single, intermediate, and double-precision
> ALUs to permit a Cray-style vector operation to be initiated *in each
> cycle* provided all operands are on the chip's 8 megabyte L2 cache.

Huh? Is this a typo? Do you mean L1 cache? If you have single cycle
access to the L2, why have an L1?

Also, do you comprehend what bandwidth is required for this? If you
need 64*64 = 4096 bits per clock of bandwidth from the L1, you seem to
require a "more than heroic" memory design. For even a 1 GHz clock,
that is 4 Tb/sec!!! As several people have pointed out before, the
"magic" of the Cray style design is not so much in the CPU, but in the
memory system's ability to deliver the required bandwidth.
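
The arithmetic behind those figures, for a single operand stream (a vector operation needing two source streams and one result stream would multiply this again):

```python
lanes = 64               # double-precision ALUs started per cycle
bits_per_operand = 64
clock_hz = 1e9           # the conservative 1 GHz example above

bits_per_cycle = lanes * bits_per_operand           # 4096 bits per clock
bandwidth_tbps = bits_per_cycle * clock_hz / 1e12   # ~4.1 terabits/sec
print(bits_per_cycle, bandwidth_tbps)
```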



--
- Stephen Fuld
(e-mail address disguised to prevent spam)


Quadibloc

Jun 14, 2012, 2:18:53 PM
On Jun 14, 11:35 am, Stephen Fuld <SF...@alumni.cmu.edu.invalid>
wrote:

> Huh?  Is this a typo?  Do you mean L1 cache?  If you have single cycle
> access to the L2, why have an L1?

Basically, each functional unit, of which the computer has scads, has
its own L1 cache. These L1 caches are envisaged as rather small, some
small multiple of 4K data items.

The idea is that the functional units will work from their own L1
caches whenever possible, so that the one bus to the L2 cache will be
available for those functional units that need it.

> Also, do you comprehend what bandwidth is required for this?  If you
> need 64*64 = 4096 bits per clock of bandwidth from the L1, you seem to
> require a "more than heroic" memory design.  For even a 1 GHz clock,
> that is 4 Tb/sec!!!   As several people have pointed out before, the
> "magic" of the Cray style design is not so much in the CPU, but in the
> memory system's ability to deliver the required bandwidth.

The bus coming from the L2 cache is indeed 4096 bits wide. Thus, a
vector operation may use the entire L2 cache bus, so if the character
and decimal and fixed-point and short vector functional units are to
initiate operations on the _same cycle_, they will need to get *their*
operands from their own L1 caches.

Note that a sizable chunk of an L1 cache can get loaded from the L2
cache over such a wide bus in a single cycle.

The chip might work at 4 GHz internally, but buses to external memory
apparently can't be more than 1 GHz with today's technology, and I
envisage the chip's data bus as being wide but not ridiculous - 256
bits.

So when things don't fit well in cache, the chip's performance will be
compromised, there's no two ways about that.

A general diagram of the structure of the proposed large-scale
implementation is on the page at

http://www.quadibloc.com/arch/ar05.htm

As noted, though, the architecture can be implemented in a number of
ways, right down to lots of microcode and an 8-bit wide ALU.

John Savard

Quadibloc

Jun 14, 2012, 2:26:53 PM
On Jun 14, 12:18 pm, Quadibloc <jsav...@ecn.ab.ca> wrote:
> On Jun 14, 11:35 am, Stephen Fuld <SF...@alumni.cmu.edu.invalid>
> wrote:

> > Also, do you comprehend what bandwidth is required for this?  If you
> > need 64*64 = 4096 bits per clock of bandwidth from the L1, you seem to
> > require a "more than heroic" memory design.

> So when things don't fit well in cache, the chip's performance will be
> compromised, there's no two ways about that.

So, basically, I'm envisaging a relatively plain memory design - not
the heroic bus to memory of the SX-6 chip with a 2,048-bit wide bus to
main memory and no cache, but simply a wide 256-bit bus - and a large,
but not exotic cache.

Instead, I just accept that sometimes I'll be waiting for memory. But
when things are in cache, since the chip has not only many
conventional ALUs in parallel, but also other functional units for
things like decimal and character operands that can run in parallel, I
think that the biggest practical obstacle to such a design - now that
it actually _is_ possible to put enough transistors on a chip to
realize it - is heat dissipation.

At times of peak performance, I will have a _lot_ of those transistors
all doing work, and thus potentially switching, at the same time.

So perhaps it's just as well the chip will be spending some of its
time waiting for memory. It will need that time to cool off in any
event.

John Savard

Stephen Fuld

Jun 14, 2012, 4:09:13 PM
On 6/14/2012 11:18 AM, Quadibloc wrote:
> On Jun 14, 11:35 am, Stephen Fuld <SF...@alumni.cmu.edu.invalid>
> wrote:
>
>> Huh? Is this a typo? Do you mean L1 cache? If you have single cycle
>> access to the L2, why have an L1?
>
> Basically, each functional unit, of which the computer has scads, has
> its own L1 cache. These L1 caches are envisaged as rather small, some
> small multiple of 4K data items.
>
> The idea is that the functional units will work from their own L1
> caches whenever possible, so that the one bus to the L2 cache will be
> available for those functional units that need it.
>
>> Also, do you comprehend what bandwidth is required for this? If you
>> need 64*64 = 4096 bits per clock of bandwidth from the L1, you seem to
>> require a "more than heroic" memory design. For even a 1 GHz clock,
>> that is 4 Tb/sec!!! As several people have pointed out before, the
>> "magic" of the Cray style design is not so much in the CPU, but in the
>> memory system's ability to deliver the required bandwidth.
>
> The bus coming from the L2 cache is indeed 4096 bits wide. Thus, a
> vector operation may use the entire L2 cache bus, so if the character
> and decimal and fixed-point and short vector functional units are to
> initiate operations on the _same cycle_, they will need to get *their*
> operands from their own L1 caches.

I hope that someone who is a real circuit designer can chime in here,
but I get the impression that trying to do an 8 MB cache that can operate
in a single cycle at 4GHz is at best heroic, and with today's
technology, basically impossible. This is even with a large cache line
size of 512 bytes (4096 bits).

snip


> A general diagram of the structure of the proposed large-scale
> implementation is on the page at
>
> http://www.quadibloc.com/arch/ar05.htm

Thanks for the pointer. I get the impression that, in your design, a
chip would only contain one of these cores. If more than one, then is
the L2 shared or is there some shared L3 or no shared cache at all?

Quadibloc

Jun 14, 2012, 6:05:06 PM
On Jun 14, 2:09 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:

> Thanks for the pointer. I get the impression that, in your design, a
> chip would only contain one of these cores.

Yes, that's right. Since the one core contains 64 sets of fixed and
floating ALUs, which may execute independent programs as well as being
linked in vector operations (the former mode being something like the
Cell processor, but the capabilities are far more limited)

http://www.quadibloc.com/arch/ar0104.htm

one core is about all one could expect to put on a chip.

That doesn't preclude having multiple cores on a chip where the
individual cores are more conventional implementations of the
architecture, omitting support for the more extravagant vector
features. In that case, the cache would be shared between the cores
following conventional practice.

Note, too, that the design is intended as being very aggressively
multi-threaded. One core, in a large-scale implementation, is
envisaged as having the capacity of running from 64 to 256 threads
simultaneously at maximum.

John Savard

MitchAlsup

Jun 14, 2012, 9:25:10 PM
to SF...@alumni.cmu.edu.invalid
On Thursday, June 14, 2012 3:09:13 PM UTC-5, Stephen Fuld wrote:
> On 6/14/2012 11:18 AM, Quadibloc wrote:
> > The bus coming from the L2 cache is indeed 4096 bits wide. Thus, a
> > vector operation may use the entire L2 cache bus, so if the character
> > and decimal and fixed-point and short vector functional units are to
> > initiate operations on the _same cycle_, they will need to get *their*
> > operands from their own L1 caches.
>
> I hope that someone who is a real circuit designer can chime in here,
> but I get the impression that trying to do an 8 MB cache that can operate
> in a single cycle at 4GHz is at best heroic, and with today's
> technology, basically impossible. This is even with a large cache line
> size of 512 bytes (4096 bits).

The typical SRAM macro (the thing that is stepped and repeated) will have
128 or 256 bits (if you strip off the y-multiplexor). So to get 4096 bits,
you are going to be operating 16 macros at the same time. Since these macros
come with 1KB or 2KB of data, you are, in essence, accessing 32KB of data
per access. Thus power might end up being a concern.

It is entirely possible to make an 8 MB L2 that can sustain 1 access per
cycle at 4-ish GHz, but you are going to see at least 5-7 cycles of wire delay
into the array of macros, and another 5-7 cycles of wire delay coming back
out of the array of macros--best case. So, you better count on 11-15 cycles
minimum access time (but I predict actual access delays of 17-20 due to
"stuff happens when there are that many long wires".) Add cycles every time
accesses come from several places and a routing decision has to be made.

Some random data: An SRAM macro is about 1 clock wide on a single wire at
M6-M7 and the ideal repeater distance is somewhat shorter than the SRAM
macro is tall. Thus, buffering the wires over the macros is going to be
challenging.

Overall, when I was doing these kinds of things, we would prefer having an
L2 have at least one access begin every cycle and no beating of data (d0,
d1, d2, d3,...). Thus multiple banks (say 32-64) are desired and can start
a new access every cycle, complete an access every cycle, but the banks
themselves are not pipelined but serially reusable resources. One design
we did had the SRAM holding the tags be fully pipelined and the SRAMs
holding the data not pipelined at all. We saved a lot of power by
doing this, operating the bank of SRAM macros with the timing from the
top of a column.

We also typically preferred that the L1 caches be optimized for processor
access widths (128-bits) so when 4096 bits showed up from the
L2 it would take a number of cycles to cram it into the L1. This causes
avoidable latency for ongoing and newly issued requests.

Mitch

Quadibloc

Jun 15, 2012, 11:45:29 AM
On Jun 14, 7:25 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> It is entirely possible to make an 8 MB L2 that can sustain 1 access per
> cycle at 4-ish GHz, but you are going to see at least 5-7 cycles of wire delay
> > into the array of macros, and another 5-7 cycles of wire delay coming back
> out of the array of macros--best case. So, you better count on 11-15 cycles
> minimum access time (but I predict actual access delays of 17-20 due to
> "stuff happens when there are that many long wires".) Add cycles every time
> accesses come from several places and a routing decision has to be made.

Not being a real circuit designer, I just thought that, oh, if I just
use six transistors per cell instead of four transistors per cell,
then the L2 cache will be just as fast as the register file and only
50% larger than usual...

and wire delays? Light travels a foot in a nanosecond; electricity
travels down wires at half the speed of light; and, of course, you
want the rise time of a pulse to be less than 50% of the width of the
pulse.

So the chip is less than 3 inches on a side, and you're good for 1
GHz; 3/4 inch, and 4 GHz. But that is a narrow squeak, now that you
mention it.
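
The back-of-the-envelope budget being used here, counting only wavefront propagation at c/2 with half a cycle allowed for it (on-chip RC delay is deliberately ignored):

```python
C_IN_PER_NS = 11.8             # light covers ~11.8 inches per nanosecond
wire_speed = C_IN_PER_NS / 2   # assumed c/2 propagation in a wire

def max_side_inches(clock_ghz, budget=0.5):
    """Distance a wavefront covers in the fraction of a cycle
    budgeted for propagation (50% here, per the rise-time rule)."""
    cycle_ns = 1.0 / clock_ghz
    return wire_speed * cycle_ns * budget

print(max_side_inches(1.0))  # just under 3 inches at 1 GHz
print(max_side_inches(4.0))  # roughly 3/4 inch at 4 GHz
```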

Since my design also involves the bits from the L2 cache going through
a rather elaborate switching network if Data Memory Width Control is
in effect, this becomes even more of an issue. That one has long
traces.

> We also typically prefered that the L1 caches be optimized for processor
> access widths (128-bits) so when 4096 bits showed up from the
> L2 it would take a number of cycles to cram it into the L1. This causes
> avoidable latency for ongoing and newly issued requests.

The idea here was that the L1 cache would have some sort of special
design so as to have both a 4096-bit wide port for transfers to and
from L2, and a port with a conventional width like 64 bits to
communicate with the ALU to which it belongs.

John Savard

MitchAlsup

Jun 15, 2012, 12:08:52 PM
On Friday, June 15, 2012 10:45:29 AM UTC-5, Quadibloc wrote:
> On Jun 14, 7:25 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > It is entirely possible to make an 8 MB L2 that can sustain 1 access per
> > cycle at 4-ish GHz, but you are going to see at least 5-7 cycles of wire delay
> > into the array of macros, and another 5-7 cycles of wire delay coming back
> > out of the array of macros--best case. So, you better count on 11-15 cycles
> > minimum access time (but I predict actual access delays of 17-20 due to
> > "stuff happens when there are that many long wires".) Add cycles every time
> > accesses come from several places and a routing decision has to be made.
>
> Not being a real circuit designer, I just thought that, oh, if I just
> use six transistors per cell instead of four transistors per cell,
> then the L2 cache will be just as fast as the register file and only
> 50% larger than usual...

You have a lot to learn, Grasshopper.....

> and wire delays? Light travels a foot in a nanosecond; electricity
> travels down wires at half the speed of light; and, of course, you
> want the rise time of a pulse to be less than 50% of the width of the
> pulse.

While I agree that the speed of the electromagnetic wavefront moves
at c/2 in a wire, you have a wire that has a medium R and a fairly
large C which creates a rather insurmountable RC component. Wires
in a chip are not like wires you can touch and feel. Much of the
capacitance is to the wire just a wire's width away to the side and
a wire's height away above. In 1 mm the wire may have mutual inductance
with tens of thousands of wires!

> So the chip is less than 3 inches on a side, and you're good for 1
> GHz; 3/4 inch, and 4 GHz. But that is a narrow squeak, now that you
> mention it.

In Opteron, at 16 gates of delay per clock cycle (plus flop jitter and
skew), it took 1 cycle to get over the integer and data sections of the
chip and arrive at the FP instruction buffers, which was about 70 microns.

> Since my design also involves the bits from the L2 cache going through
> a rather elaborate switching network if Data Memory Width Control is
> in effect, this becomes even more of an issue. That one has long
> traces.

Better block out some room and time for this switch/router/swizzler.

> > We also typically prefered that the L1 caches be optimized for processor
> > access widths (128-bits) so when 4096 bits showed up from the
> > L2 it would take a number of cycles to cram it into the L1. This causes
> > avoidable latency for ongoing and newly issued requests.
>
> The idea here was that the L1 cache would have some sort of special
> design so as to have both a 4096-bit wide port for transfers to and
> from L2, and a port with a conventional width like 64 bits to
> communicate with the ALU to which it belongs.

You did not catch the "SRAMs have 128 or 256 bits per macro" part. That
is to say, the width of the SRAM macro is 512 wires for the 256 bits * 2
wires per bit (plus about 64 wires for the width of the decoder and word
line driver). In order to dump 4096 bits into a cache in a single cycle,
you would need 16 SRAM macros--or about 32-64 KBytes per one of these L1s.
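
The arithmetic Mitch is pointing at, taking his macro widths and 2 KB of data per macro as givens:

```python
bus_bits = 4096
# Each (width, data size) pair is one of the SRAM macro options above.
for bits_per_macro, kbytes_per_macro in ((256, 2), (128, 2)):
    macros = bus_bits // bits_per_macro
    print(macros, "macros ->", macros * kbytes_per_macro, "KB per L1")
```

With 256-bit macros that is 16 macros (32 KB); with 128-bit macros, 32 macros (64 KB) — hence the 32-64 KB range.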

Mitch

Joe keane

Jun 15, 2012, 3:37:06 PM
In article <jrd7cb$t09$1...@dont-email.me>,
Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>Huh? Is this a typo? Do you mean L1 cache? If you have single cycle
>access to the L2, why have an L1?

vector operation

inverse-single-cycle bandwidth, not single-cycle latency

Quadibloc

Jun 15, 2012, 4:11:11 PM
On Jun 15, 1:37 pm, j...@panix.com (Joe keane) wrote:
> In article <jrd7cb$t0...@dont-email.me>,
Oh, that _too_, although I had not thought in terms of the delays
likely in a real design.

The ISA does provide for "repeat" instructions that vectorize other
instructions, and it is explicitly noted that vector instructions can
be vectorized again in that fashion.

So a vector instruction that handles 64 numbers at once - all 64 of
which start their voyage of calculation on the same cycle, in 64 ALUs
- can be followed, in *classic* Cray fashion, by another vector
instruction that handles 64 numbers at once that starts on the same
cycle.

As I said, I think the *big* problem with the design is that if the
chip ever did manage to work efficiently, it would melt. So I didn't
sweat the small stuff.

John Savard

Stephen Fuld

Jun 15, 2012, 4:37:57 PM
On 6/14/2012 3:05 PM, Quadibloc wrote:

snip

> Note, too, that the design is intended as being very aggressively
> multi-threaded. One core, in a large-scale implementation, is
> envisaged as having the capacity of running from 64 to 256 threads
> simultaneously at maximum.

I looked at the architecture diagram again, but did not see any "control
structure" much less multiple instruction pointers and the mechanism of
keeping track of all of those simultaneous threads.

MitchAlsup

Jun 15, 2012, 4:36:46 PM
On Friday, June 15, 2012 3:11:11 PM UTC-5, Quadibloc wrote:

> So a vector instruction that handles 64 numbers at once - all 64 of
> which start their voyage of calculation on the same cycle, in 64 ALUs
> - can be followed, in *classic* Cray fashion, by another vector
> instruction that handles 64 numbers at once that starts on the same
> cycle.

Err, not quite. {I hate to be picky, but}

A vector is a list of operations to a list of operands that pass
through a single computation unit one at a time.

An array is a list of operations to a list of operands that pass
through an array of computation units starting at the same time.

The CRAYs had a single FP adder, a single FP multiplier, a single
FP Reciprocator,... that were fully pipelined. Memory references
were pipelined out into the interconnect but processed array style
in the memory banks themselves.
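
A toy timing model of that distinction (the 6-cycle latency and the element counts are invented for illustration):

```python
def vector_cycles(n, latency):
    """One fully pipelined unit, Cray style: issue one element per
    cycle; the last result emerges `latency` cycles after it issues."""
    return (n - 1) + latency

def array_cycles(n, latency, units):
    """An array of units: elements start together in waves of `units`,
    and each wave takes the full latency."""
    waves = -(-n // units)  # ceiling division
    return waves * latency

print(vector_cycles(64, 6))     # 69 cycles through one pipelined unit
print(array_cycles(64, 6, 64))  # 6 cycles across 64 units at once
```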

> As I said, I think the *big* problem with the design is that if the
> chip ever did manage to work efficiently, it would melt. So I didn't
> sweat the small stuff.

Perhaps a heat magnet would be appropriate <he says in jest>!

Mitch

Stephen Fuld

Jun 15, 2012, 4:43:17 PM
Yes, but . . .

As John indicated elsewhere in this thread, he did not understand the
latency issues of the large cache. The architecture diagram doesn't
show any instruction scheduling mechanism, re-order buffer, etc. so it
appears to be an in-order, albeit superscalar, design. There, such
large latency matters.

Quadibloc

Jun 16, 2012, 1:19:41 PM
On Jun 15, 2:43 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:
> The architecture diagram doesn't
> show any instruction scheduling mechanism, re-order buffer, etc. so it
> appears to be an in-order, albeit superscalar, design.  There, such
> large latency matters.

I am envisaging that the design will be rather simple. Instead of
using out-of-order execution, it works by having a lot of threads
running at the same time. (The possibility of switching to an o-o-o
mode if there aren't many threads, though, is something I've
considered.)

There is an instruction scheduling mechanism, it's just illustrated
elsewhere:

http://www.quadibloc.com/arch/ar0501.htm

with a pretty color picture on the very next page.

John Savard

Quadibloc

Jun 16, 2012, 1:21:58 PM
On Jun 15, 2:37 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:

> I looked at the architecture diagram again, but did not see any "control
> structure" much less multiple instruction pointers and the mechanism of
> keeping track of all of those simultaneous threads.

Yes, the architecture diagram does not illustrate instruction decode
or related matters at all, only data and computation.

The next page notes that the machine is superscalar, and shows how
microops are issued for later execution in parallel by the several
functional units. But instruction decode, and whether anything like
out-of-order issue is attempted, are not really discussed.

John Savard

Andy (Super) Glew

Jun 17, 2012, 4:56:50 PM
to MitchAlsup
On 6/11/2012 2:06 PM, MitchAlsup wrote:
> Why not make 64-bit FP the 'single precision' of the 201x-and-onward? Then
> support even larger FP formats with the ability to perform exact numerics.

Of course FP16 is also pretty popular - although AFAIK FP16 is only ever
used to represent numbers in memory, and is converted to/from FP32 on
load and store. I do not know anyone who has used FP16 for actual
computation.
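
For reference, a small Python decoder for the IEEE 754 binary16 layout (1 sign, 5 exponent, 10 fraction bits) that such load-time conversion to FP32 has to implement:

```python
def fp16_to_float(h):
    """Decode an IEEE 754 binary16 bit pattern into a Python float."""
    sign = -1.0 if (h >> 15) & 1 else 1.0
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0:                 # subnormal (or signed zero)
        return sign * frac * 2.0 ** -24
    if exp == 0x1F:              # infinity or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    return sign * (1 + frac / 1024.0) * 2.0 ** (exp - 15)

print(fp16_to_float(0x3C00))  # 1.0
print(fp16_to_float(0x7BFF))  # 65504.0, the largest finite FP16 value
```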

(Blatant troll: if there is such, I want to update
http://semipublic.comp-arch.net/wiki/FP16_(16-bit_floating_point_format)
)

Niels Jørgen Kruse

Jun 18, 2012, 10:09:43 AM
Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:

> (Blatant troll: if there is such, I want to update
> http://semipublic.comp-arch.net/wiki/FP16_(16-bit_floating_point_format)
> )

The usual word is a plug, I believe.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Tim McCaffrey

Jun 18, 2012, 10:40:11 AM
In article <4FDE4492...@SPAM.comp-arch.net>, an...@SPAM.comp-arch.net
says...
The closest I know of was an article in Dr. Dobb's Journal circa 1978 for a
full floating point library for the 8080 by C.B. Falconer (and he called it
the Falconer Floating Point Library). It was not IEEE, however.

- Tim

Stephen Fuld

Jun 18, 2012, 5:28:52 PM
On 6/16/2012 10:19 AM, Quadibloc wrote:
> On Jun 15, 2:43 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:
>> The architecture diagram doesn't
>> show any instruction scheduling mechanism, re-order buffer, etc. so it
>> appears to be an in-order, albeit superscalar, design. There, such
>> large latency matters.
>
> I am envisaging that the design will be rather simple. Instead of
> using out-of-order execution, it works by having a lot of threads
> running at the same time. (The possibility of switching to an o-o-o
> mode if there aren't many threads, though, is something I've
> considered.)

OK, I understand. But consider this. I looked at the architectural
diagram again. I couldn't tell for sure how many registers, and how
wide, were in a user's context. Let's just say for argument that there
are 512 bytes of registers. Now remember, these have to be duplicated
for each thread. So given that the document says you will have 256
threads, you need a 256*512 = 128K byte register file. If you support
the execution of two instructions at once, it needs six ports (four
read and two write). Now some of this is alleviated by the fact that
you have separate register files for vector operations, but again,
ISTM that this will be hard to do. Again, perhaps Mitch or another
circuit designer can comment, but having this big a register file
accessed in a single cycle seems to me to be "hard".

Quadibloc

Jun 18, 2012, 8:00:42 PM
On Jun 18, 3:28 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:

> OK, I understand.  But consider this.  I looked at the architectural
> diagram again.  I couldn't tell for sure, how many registers and how
> wide were in a user's context.

That information is in other diagrams on the page.

And, yes, the register file that is envisaged is rather extravagant.
There might be only two processors that have the full complement of
registers of the architecture - since most processes will be ordinary
scalar ones that don't use the vector hardware.

However, I've since realized that although I thought that 64 to 256
threads would be enough, it looks like I need the ability to have 512
threads to keep the architecture busy. I'm envisaging that the L1
caches will typically have 16k words of whatever size of data in them,
and each process will have a 4k window. The idea being that having
four threads to rotate between within each ALU will deal with data
dependencies, and allow the ALUs to be kept busy.

4 times 128 (64 fixed-point and 64 floating-point) is 512. That's a
few threads short, since there's the primary ALU and the decimal ALUs
and so on, but I envisage that the vector ALU banks need not be kept
100% full with scalar processes.

John Savard

Stephen Fuld

Jun 19, 2012, 12:05:13 AM
On 6/18/2012 5:00 PM, Quadibloc wrote:
> On Jun 18, 3:28 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:
>
>> OK, I understand. But consider this. I looked at the architectural
>> diagram again. I couldn't tell for sure, how many registers and how
>> wide were in a user's context.
>
> That information is in other diagrams on the page.
>
> And, yes, the register file that is envisaged is rather extravagant.
> There might be only two processors that have the full complement of
> registers of the architecture - since most processes will be ordinary
> scalar ones that don't use the vector hardware.
>
> However, I've since realized that although I thought that 64 to 256
> threads would be enough, it looks like I need the ability to have 512
> threads to keep the architecture busy.

So, since each thread needs its own set of registers within the CPU,
you need to duplicate at least the fixed and floating point registers
512 times within the CPU. It isn't the size of the cache I am talking
about here, but the size of the register file.

Quadibloc

Jun 19, 2012, 2:14:38 PM
On Jun 18, 10:05 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid>
wrote:

> So, since each thread needs its own set of registers within the CPU,
> you need to duplicate at least the fixed and floating point registers
> 512 times within the CPU.  It isn't the size of the cache I am talking
> about here, but the size of the register file.

Yes, and just below the illustration showing how the register file
works, I note that it would be (as a reasonable _minimum_) 384
kilobytes in size.

Processes that don't use the supercomputer-style vector facility only
require 8 32-bit arithmetic/index registers, 8 128-bit floating-point
registers, 32 64-bit base registers, and 16 256-bit short vector
registers; that is 928 bytes per process.
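As a quick arithmetic check of the 928-byte figure, re-adding the register widths quoted above:

```python
# Illustrative check of the per-process register state quoted above:
# 8 x 32-bit, 8 x 128-bit, 32 x 64-bit, 16 x 256-bit.
register_sets = [
    (8,  32),   # arithmetic/index registers
    (8, 128),   # floating-point registers
    (32, 64),   # base registers
    (16, 256),  # short vector registers
]
bytes_per_process = sum(count * bits // 8 for count, bits in register_sets)
print(bytes_per_process)  # -> 928
```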

So that was a large enough register file when there was a maximum of
256 processes.

John Savard

Stephen Fuld

Jun 19, 2012, 5:10:02 PM
On 6/19/2012 11:14 AM, Quadibloc wrote:
> On Jun 18, 10:05 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid>
> wrote:
>
>> So, since each thread needs its own set of registers within the CPU,
>> you need to duplicate at least the fixed and floating point registers
>> 512 times within the CPU. It isn't the size of the cache I am talking
>> about here, but the size of the register file.
>
> Yes, and just below the illustration showing how the register file
> works, I note that it would be (as a reasonable _minimum_) 384
> kilobytes in size.

OK. The point I am trying to make is that you probably can't achieve
single cycle access (required for registers) to a register file that
large, especially as you need six ported accesses to it. If my
supposition is right, it makes having so many processes available for
immediate execution not feasible.

Again, a real circuit designer please confirm or refute my supposition.

Quadibloc

Jun 20, 2012, 2:20:33 PM
On Jun 19, 3:10 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:

> OK.  The point I am trying to make is that you probably can't achieve
> single cycle access (required for registers) to a register file that
> large, especially as you need six ported accesses to it.  If my
> supposition is right, it makes having so many processes available for
> immediate execution not feasible.

That could well be true, but, of course, there _is_ a simple solution.

If I bring things out of L2 cache to L1 cache... I could also *cache
the register file* so that I only lose cycles when I switch from one
functional unit to another.

It wouldn't work like conventional caching, since data would be pushed
by the central control unit, not pulled by the individual ALU. So the
floating-point register contents would be sent to the local floating
registers of the floating ALU being used, and the fixed-point
registers to the local fixed-point registers of the fixed-point ALU
being used.

Instructions that convert from floating to fixed, or from fixed to
decimal, would involve some extra delays in this model.

So _of course_ I can have "that many" processes - they just won't be
as tightly coupled as I might have hoped, and the advantage over such
a thing as Knight's Corner decreases.

John Savard

Quadibloc

Jun 26, 2012, 3:38:16 PM
On Jun 20, 12:20 pm, Quadibloc <jsav...@ecn.ab.ca> wrote:

> So _of course_ I can have "that many" processes - they just won't be
> as tightly coupled as I might have hoped, and the advantage over such
> a thing as Knight's Corner decreases.

Incidentally, in real life, the Cray Threadstorm processors from the
Cray XMT, which fit into Opteron sockets, have 128 threads. However,
apparently they're *barrel* processors, while my design issues
instructions on a more _ad hoc_ basis. (Since the machine is also
superscalar, and has functional units of radically different types,
such as decimal units, this is required for full efficiency. Instead
of having 128 threads cycling on one ALU, so that by the time we get
back to the same thread there are no pipeline dependencies, I'm
assuming we have 512 threads each approaching the issue of one
instruction per clock, and so dependencies *do* cause delays -
although there's plenty of other stuff to do while waiting: threads
with dependencies execute more slowly, but no resources are wasted as
a result.)

John Savard

Jean-Marc Bourguet

Jun 26, 2012, 4:06:07 PM
You probably are wasting the register files used to hold your threads
state. The trick is in balancing all the factors, not minimizing one of
them.

Powerwise, I wonder how long superscalar designs would be able to sustain a
100% load factor on all execution units.

Yours,

--
Jean-Marc

Quadibloc

Jun 26, 2012, 8:15:03 PM
On Jun 26, 2:06 pm, Jean-Marc Bourguet <j...@bourguet.org> wrote:

> Powerwise, I wonder how long superscalar designs would be able to sustain a
> 100% load factor on all execution units.

Current real-world superscalar designs, I should hope, would not have
too much trouble.

I have noted, however, that my hypothetical design probably would have
heat management as its most severe problem, because I have proposed
one that is rather massively superscalar. Of course, if Intel can
manufacture Knight's Corner, that problem _is_ solvable - and, indeed,
it is solvable in a trivial way. Lower the clock rate enough.

But since the whole point of the design is to make a more powerful
chip, that is self-defeating. Even so, the idea of the architecture is
to provide a more tightly-coupled parallelism than multiple cores
would, so it may still be a good way to organize X transistors _on the
same chip_.

The real debate should be about whether we should continue to put more
transistors on the same chip if all we do with them is make more cores
- instead of making smaller chips that run cooler and have higher
yields. Packaging costs - and the _very_ extraneous factor of the way
Microsoft licenses Windows - become the issues here, perhaps.

Although cache coherency between cores is the legitimate rationale for
this...

John Savard

Quadibloc

Jul 30, 2012, 1:57:58 PM
On Jun 19, 3:10 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:

> OK.  The point I am trying to make is that you probably can't achieve
> single cycle access (required for registers) to a register file that
> large, especially as you need six ported accesses to it.

This has given me some inspiration.

Note, though, that the Cray-I had eight vector registers, each with 64
double-precision numbers - and the Cray X1 had thirty-two such
registers.

My imaginary architecture was simply offering sixty-four registers,
each with 64 extended-precision numbers. In addition to that factor of
four, the register file was designed to allow a maximum of four
processes which used the sets of sixty-four vector registers.

So it's only sixteen times larger than a register file already in use
in existing (and somewhat old!) hardware.

But registers are difficult, and cache is easier.

And modern designs have so much on-chip cache that - given that old-
fashioned computers were able to do a lot of work with only, say, 32K
words of memory - it may be useful to include in an architecture the
ability to access scratchpad memory on the chip, built from the same
memory cells as the L2 cache.

Thus, we could have behavior like that of the Control Data 6600, where
the off-chip DRAM is treated like bulk core, accessed only by block
move instructions. Although I think that single-word load-store
instructions, at least, would also be required, since sometimes one
will have large and sparse or randomly-accessed arrays.

John Savard

MitchAlsup

Jul 30, 2012, 3:46:23 PM
On Monday, July 30, 2012 12:57:58 PM UTC-5, Quadibloc wrote:
> This has given me some inspiration. Note, though, that the Cray-I had eight vector registers, each with 64 double-precision numbers

The 8 vector registers were accessed with 5 levels of 4:1 multiplexers (which would NOT fit in a single cycle of latency, but could be cycled at a single-cycle rate). The 6th level of multiplexers contained 2 inputs for the registers (even/odd or first-four/second-four) and 2 chaining inputs.
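A back-of-envelope check of the five-level figure, assuming a simple tree of 4:1 muxes selecting among all 8 x 64 vector register elements:

```python
import math

# Selecting 1 of 8 registers x 64 elements = 512 sources with a tree
# of 4:1 multiplexers needs ceil(log4(512)) = 5 levels.
def mux_levels(n_sources, fan_in=4):
    """Depth of a mux tree with the given fan-in over n_sources."""
    return math.ceil(math.log(n_sources, fan_in))

print(mux_levels(8 * 64))  # -> 5
```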

Vector register machines ruled while the entirety of memory latency could be absorbed by the number of registers in the vector. And then their generation passed.

> - and the Cray X1 had thirty-two such registers. My imaginary architecture was simply offering sixty-four registers, each with 64 extended-precision numbers.

The X1 was not a 10 gate machine like the CRAY-1s; it was more like a 16-20 gate machine, closer to the Opteron,....

Also note: machines pushing towards 10 gates per cycle will expend as much as 33% of all gates simply being the clock delivery mechanism, whereas 16 gate machines will expend 20%, 20 gate machines 12%-15%,.... short cycle times eat power. It is my current belief that machines of 20-odd gates per cycle are more power efficient and allow more scalarity than faster machines. With an instruction set architected for this design space, one might save 10% of the instruction count.

> In addition to that factor of four, the register file was designed to allow a maximum of four processes which used the sets of sixty-four vector registers. So it's only sixteen times larger than a register file already in use in existing (and somewhat old!) hardware. But registers are difficult, and cache is easier.

Err, not quite.... Several multithreaded and VLSI machines access their set of registers over several cycles, keeping the port count down (size and power). Caches cannot be so partitioned--at least not to the extent I have seen.

> And modern designs have so much on-chip cache that it seems to me that given that old-fashioned computers were able to do a lot of work with only, say, 32K words of memory, that it may be useful to include in an architecture the ability to access scratchpad memory on the chip, built from the same memory cells as the L2 cache. Thus, we could have behavior like that of the Control Data 6600, where the off-chip DRAM is treated like bulk core, accessed only by block move instructions.

There are lots of reasons to move back to an architecture where one has a large but not huge coherent main memory backed up by an even larger "Extended Core Store" accessed in page-sized transfers which is not cache coherent. These architectures have the advantage that one could perform I/O to (or only to) the ECS, saving the coherence cycles on things that should not be in someone's cache (while I/O is going on). These work for stadium-sized multiprocessors with fairly easy scaling.

> Although I think that single-word load-store instructions, at least, would also be required, since sometimes one will have large and sparse or randomly-accessed arrays.

Vector indirect (i.e., gather/scatter) is the appropriate technology to support random vector reads and writes, leaving the single-word reads and writes to support the scalar subset.

Mitch

Ivan Godard

Jul 30, 2012, 4:46:18 PM
On 7/30/2012 12:46 PM, MitchAlsup wrote:

> Also note: machines pushing towards 10 gates per cycle will expend as
> much as 33% of all gates simply being the clock delivery mechanism,
> +whereas 16 gate machines will expend 20%, 20 gate machines
> 12%-15%,.... short cycle times eat power. It is my current belief,
> that machines of 20-odd gates per cycle are more power efficient and
> allow more scalarity than faster machines. With an instruction set
> architected for this design space, one might save 10% of the
> instruction count.

That's our experience too. Shallow pipes and slow clocks not only save
power, they save in the complexity of fault handling, interrupts, and
stalls. 32 stages is ridiculous. Of course it's different for streaming
processors (signal processing, GPUs).

However... would you say the same if you were working in an unclocked
(asynchronous) process?

MitchAlsup

Jul 30, 2012, 5:43:44 PM
On Monday, July 30, 2012 3:46:18 PM UTC-5, Ivan Godard wrote:
> On 7/30/2012 12:46 PM, MitchAlsup wrote:
> Also note: machines pushing towards 10 gates per cycle will expend as
> much as 33% of all gates simply being the clock delivery mechanism,
> +whereas 16 gate machines will expend 20%, 20 gate machines
> 12%-15%,.... short cycle times eat power. It is my current belief,
> that machines of 20-odd gates per cycle are more power efficient and
> allow more scalarity than faster machines. With an instruction set
> architected for this design space, one might save 10% of the
> instruction count.

> That's our experience too. Shallow pipes and slow clocks not only save power, they save in the complexity of fault handling, interrupts, and stalls. 32 stages is ridiculous. Of course it's different for streaming processors (signal processing, GPUs).

I once did a 20 gate design for a 6-wide OoO superscalar RISC design point that was only 6 pipe stages deep (7 if you count a hidden pipe stage). When I tried to do a similar design at 12 gates, it ballooned out towards 24 stages. I developed a rule of thumb for this explosion::

For every pipe stage you chop into 2 stages, you end up with at least 2.5 stages in the final microarchitecture. So a 10 stage 20 gate design will end up as at least 25 stages at 10 gates.
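The rule of thumb, taken at face value, in a line of code (a sketch of the stated scaling only, not a pipeline model):

```python
# Mitch's stated rule of thumb: every time you chop each pipe stage
# in two (halving gates per cycle), the final microarchitecture ends
# up with at least 2.5x as many stages.
def min_stages(stages, halvings=1):
    """Lower bound on depth after `halvings` halvings of gates/cycle."""
    return stages * 2.5 ** halvings

print(min_stages(10))  # 10 stages at 20 gates -> at least 25.0 at 10 gates
```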

> However... would you say the same if you were working in an unclocked (asynchronous) process?

If you adequately account for all the fan-out buffering you aren't going to get below 8% of all circuitry being the clock. Here, in a Sutherland sense, the Muller C-gates are considered part of the clock.
 
The difficulty with asynchronous designs (Sutherland style) is the concordance problem--where signals from more than one pipe stage merge into a new calculation (like register operand forwarding). This gets "bad" with superscalar data paths.

Mitch

Andy (Super) Glew

Jul 30, 2012, 7:01:01 PM
I have a different diagnosis of a problem with asynchronous logic (apart
from the fact that all companies that I have encountered are afraid of
it, with both good and bad reasons).

Merging, what you call concordance, is annoying. Perhaps even painful.
Say that you are merging two asynchronous channels - I'll say
"channels", for want of a better word. If you don't always have to
merge - i.e. sometimes a "packet" arrives on channel 0, and gets sent
forward, sometimes on channel 1, and gets forwarded on, but sometimes a
packet arrives on channel 0 that must wait for something to arrive on
channel 1 - then there must be an indication that waiting for the other
channel is necessary, and some sort of identification of who to wait
for. (If you always wait for the other channel, not so bad.)

Basically an arbitration. With the arbitration time paid for at point
of use.

I think the real problem with asynchronous logic is pipelining. Later
operations in the pipeline are delayed by the worst case timing of
operations earlier in the pipeline. Of course, if the earlier
operations are fast, then no delay. But if they are not fast, then you
get delayed.

If you want to bypass out of the middle of such an asynchronous
pipeline, to allow some operations that do not need the slowed down
pipestages to complete, then you have the concordance problem.

One of the advantages of sequential circuits is that usually you know
exactly when you are going to write, e.g. to a shared register file
port. Basically you pre-arbitrate, in advance, so the arbitration time
is overlapped, and does not influence the rate of advance as much.

Now, I suppose that you could do a similar pre-arbitration for asynch.
Instead of matching the IDs, you could predict that, say, the bypasses
should be selected in the order 0-1-2-4-5-3-6-7-8-... Note the minor
reordering, which might correspond to the arbitration circuit allowing
0fast-1fast-2fast-4fast-5fast-3slow-6fast-7fast-8fast, where fast and
slow are the inputs from, say, an integer adder (latency 3?) and a
multiplier (latency 5?).

This pre-arbitration is faster than the general arbitration needed for
general merging. But it is still a bit more complex than the merging
that occurs in a synchronous circuit.

--

I guess that I am agreeing with you in one sense, Mitch: merging,
concordance, is a pain.

I am just responding to the people who say "a fast asynch pipeline
removes the need for out-of-order". (A conversation I have had with
Sutherland. Hi, Ivan! He sometimes reads comp.arch, and lives in
Portland.) Note: by pipeline here I mean what I heard somebody call, I
think incorrectly, a systolic pipeline - a simple 1D pipeline, not a
pipeline network.

A fast pipeline of any sort removes the need for out-of-order, if all
operations are equally fast, and if you never have a slow operation
delay a fast operation.

Ditto if the slow operations are rare.

But if the fast operations are not uncommon, then you want early out.
(Early out may mean in-order dispatch, but out-of-order completion.)

And early out requires merging.

And merging adds delay.

--

By the way: one of the reasons that I got into OOO was that I expected
it to be better able to take advantage of the data value dependent
variable latencies that asynchronous circuits provide.

I'm still sympathetic to asynch, and I like thinking about how to make
the asynch circuits competitive. I'm willing to handwave away the
validation and reliability issues, which asynch fans think are largely
solved. (By the way, a few years ago the state of the art in asynch
validation was - use simple linear pipelines only. Then there is
deterministic state evolution, just in sequence, not at particular
sample times.)

But the problem of arbitration, merging, concordance can affect
computation rates.

Tim McCaffrey

Jul 31, 2012, 11:13:44 AM
In article <5017122D...@SPAM.comp-arch.net>, an...@SPAM.comp-arch.net
says...
>
[snip]
>
>Merging, what you call concordance, is annoying. Perhaps even painful.
>Say that you are merging two asynchronous channels - I'll say
>"channels", for want of a better word. If you don't always have to
>merge - i.e. sometimes a "packet" arrives on channel 0, and gets sent
>forward, sometimes on channel 1, and gets forwarded on, but sometimes a
>packet arrives on channel 0 that must wait for something to arrive on
>channel 1 - then there must be an indication that waiting for the other
>channel is necessary, and some sort of identification of who to wait
>for. (If you always wait for the other channel, not so bad.)
>
>Basically an arbitration. With the arbitration time paid for at point
>of use.
>
[snip]
>
>By the way: one of the reasons that I got into OOO was that I expected
>it to be better able to take advantage of the data value dependent
>variable latencies that asynchronous circuits provide.
>

This is perhaps stating the obvious, but it does seem like dataflow and async
logic complement each other. Dataflow already has to solve the merging
problem.

- Tim

Quadibloc

Jul 31, 2012, 1:39:51 PM
On Jul 30, 2:46 pm, Ivan Godard <igod...@pacbell.net> wrote:
> On 7/30/2012 12:46 PM, MitchAlsup wrote:
>
>  > Also note: machines pushing towards 10 gates per cycle will expend as
>  > much as 33% of all gates simply being the clock delivery mechanism,
>  > +whereas 16 gate machines will expend 20%, 20 gate machines
>  > 12%-15%,.... short cycle times eat power. It is my current belief,
>  > that machines of 20-odd gates per cycle are more power efficient and
>  > allow more scalarity than faster machines. With an instruction set
>  > architected for this design space, one might save 10% of the
>  > instruction count.
>
> That's our experience too. Shallow pipes and slow clocks not only save
> power, they save in the complexity of fault handling, interrupts, and
> stalls. 32 stages is ridiculous. Of course it's different for streaming
> processors (signal processing, GPUs).

This explains the (partial) demise of the Pentium IV architecture...

The thing is, though, that if you change a 20-gate machine to a 10-
gate machine, yes, you've got more gates, but by some measures, what
you've got is as good - or *better*, because it's more tightly coupled
- than *two cores* of the 20-gate machine.

So, what, 23% extra gates is worse than 100% extra gates? Obviously
I've still got a problem here...

John Savard

Paul A. Clayton

Jul 31, 2012, 2:45:44 PM
On Tuesday, July 31, 2012 1:39:51 PM UTC-4, Quadibloc wrote:
[snip]
> The thing is, though, that if you change a 20-gate machine to a 10-
> gate machine, yes, you've got more gates, but by some measures, what
> you've got is as good - or *better*, because it's more tightly coupled
> - than *two cores* of the 20-gate machine.
>
> So, what, 23% extra gates is worse than 100% extra gates? Obviously
> I've still got a problem here...

The branch misprediction penalty will increase and the latency in
cycles of cache accesses will increase (operation latency reduces
execution parallelism which tends to increase the need for a
larger window to feed the same number of memory accesses/misses
[MLP]). Other 'loops' would also tend to be stretched with
negative consequences. (Establishing equal time pipeline
stages tends to become more difficult as the cycle time shrinks
[bin packing problem?], so the design cost would increase.
[There was an interesting paper which proposed using more
power-efficient and slower logic in some stages rather than
using a strict equal work per stage division. A simple
application of this might allow stages to be separated at
more natural divisions to simplify design.])

Power density would also be more of a problem. More gates
switched at twice the rate would increase power more than
two fold in an area less than twice as large. A speed
demon design might also use more fast (leaky) transistors,
so static power might also increase. If more aggressive
power saving mechanisms are not used, the short bursts of
idleness caused by slow caches et al. would tend to use
more power than for a design with a longer clock cycle.

With DVFS seeming to approach hard limits, cutting off
power to a separate (slower) core may be *much* more
power efficient than cutting performance in half in a
single fast core. Core size and complexity could also
influence the speed with which power saving changes can
be applied (even in absolute time, very likely in cycles).

Well, that is my 2-cent speculation.

MitchAlsup

Jul 31, 2012, 3:05:32 PM
On Tuesday, July 31, 2012 12:39:51 PM UTC-5, Quadibloc wrote:
> The thing is, though, that if you change a 20-gate machine to a 10- gate machine, yes, you've got more gates, but by some measures, what you've got is as good - or *better*, because it's more tightly coupled - than *two cores* of the 20-gate machine. So, what, 23% extra gates is worse than 100% extra gates? Obviously I've still got a problem here...

You have at minimum 100% more flip flops, and 3 handfuls more of logic gates. Plus any microarchitectural differences used to make up for the longer pipeline (<better or more kinds of> predictors, early outers,...)

For example, 1/2 of the gate count in the Opteron FP multiplier are FlipFlops (over 4200 flip flops). In order to pipe this one twice as fast, you end up with 12000 flip flops (a superlinear expansion). So, when we considered the implementation of the K9 as an 8 gate machine, it made sense to use two multiplier arrays at 1/2 frequency, saving both power and area. We altered the phase of the operand delivery, moved the flops by 1/2 cycle, and dropped the power and area (compared to Opteron) by substantial numbers (power by close to 33%, area by 15% or so).

Since around 1/2 of the power consumption of a 16 gate machine is in the clock and flip flops, making the clock go faster by adding a lot of flip flops is counterproductive to power efficiency. Going backwards down the gate path from 16 towards 20, one loses the least amount of frequency while gaining the most amount of IPC (my opinion). Most of this comes from a serious reduction in the number of pipe stages, combined with a small gain in IPC from a host of effects (like misaligned memory accesses costing essentially zero additional cycles).

There are some things you really don't want to pipe at 2X the current rate, and it makes a lot of sense to run a pair in even/odd parallel. Caches and memory fall into similar circumstances.
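The even/odd trick can be illustrated with a toy issue model (my sketch, not the actual K9 scheduler): two units, each able to accept a new operand only every other cycle, together sustain one operation per cycle when issue alternates between them.

```python
# Toy model (assumption, not the K9 design): n_units, each with a
# given initiation interval, fed round-robin.
def schedule(n_ops, n_units=2, initiation_interval=2):
    """Return the cycle after the last op issues (latency ignored)."""
    free_at = [0] * n_units
    cycle = 0
    for i in range(n_ops):
        unit = i % n_units
        cycle = max(cycle, free_at[unit])   # wait if that unit is busy
        free_at[unit] = cycle + initiation_interval
        cycle += 1
    return cycle

print(schedule(8))      # -> 8: two half-rate units, one op per cycle
print(schedule(8, 1))   # -> 15: one half-rate unit issues every other cycle
```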

Mitch

Quadibloc

Jul 31, 2012, 8:15:31 PM
On Jul 31, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> For example, 1/2 of the gate count in the Opteron FP multiplier are
> FlipFlops (over 4200 flip flops). In order to pipe this one twice as fast,
> you end up with 12000 flip flops (a superlinear expansion).

But you don't use flip-flops to pipeline with.

Because most logic circuits end up with an "or" combining the outputs
of several "and"s, it is possible to make a circuit that isolates the
stages of a pipeline without adding any delay.

I forget the name of it...

John Savard

MitchAlsup

Jul 31, 2012, 9:35:37 PM
On Tuesday, July 31, 2012 7:15:31 PM UTC-5, Quadibloc wrote:
>But you don't use flip-flops to pipeline with.

Flip flops (generic for latch or register hooked to clock) are exactly what one pipelines with!

clock-----V----------------V----------
logic->flipflop->logic->flipflop->logic

Flipflops provide the stable assertion of the input signals to the logic until the stable outputs of the logic can be captured by the next layer of flip flops.
 
Most CMOS designs use registers, which are 2 back-to-back latches (more with scan paths) set up to avoid race conditions at clock edges and create the necessary 2 gates of delay through the latch to eliminate race conditions. Flipflop is simply a generic term that can stand in for either latch (in all its forms) or register (in all its forms) as the timed unit of storage used to break computations into a series of logical calculations separated by those stable storage elements--that is:: pipelining!

>Because most logic circuits end up with an "or" combining the outputs of several "and"s, it is possible to make a circuit that isolates the stages of a pipeline without adding any delay.

It's called an Earle latch--used in the 360/91; it gets rid of one of the 3 necessary gate delays of the latch, at the cost of power, by combining the input logic with both clock-high and clock-low feedback paths to eliminate a gate of possible race condition. Pairs of these used in series can be used to pipeline static random logic. The latches used in the 21064 and 21164 were CMOS equivalents of the Earle latch, and if stories are accurate, caused a number of timing headaches.
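For reference, a truth-table sketch of the textbook Earle latch equation (my restatement; the post does not give the equation): q_next = d·c + d·q + q·(not c), where the redundant middle term covers the clock transition so no single-gate race can glitch the output.

```python
# Assumed standard textbook form of the Earle latch next-state logic.
def earle_latch(d, c, q):
    """d = data, c = clock, q = current state; returns next state."""
    return (d and c) or (d and q) or (q and (not c))

# Transparent while the clock is high, holding while it is low:
for d in (False, True):
    for q in (False, True):
        assert earle_latch(d, c=True,  q=q) == d   # follows data
        assert earle_latch(d, c=False, q=q) == q   # holds state
```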

Mitch

Paul A. Clayton

Aug 1, 2012, 6:59:05 AM
On Tuesday, July 31, 2012 9:35:37 PM UTC-4, MitchAlsup wrote:
[snip]
> Flip flops (generic for latch or register hooked to clock) is
> exactly what one pipelines with!

Side question: does wave pipelining become more practical with
longer pipeline stages (so that process, data, voltage/frequency,
and thermal [?] variation would still not lead to timing issues)?

It seems that wave pipelining might interact interestingly
with asynchronous design.

Quadibloc

Aug 1, 2012, 7:32:45 AM
On Jul 31, 7:35 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> Its called an Earle Latch--used in 360/91 and gets rid of one of the 3
> necessary gate delays of the latch, at the cost of power, where the
> input logic is combined with both clock high and clock low feedback
> paths to eliminate a gate of possible race condition.

Thanks. So it isn't as simple as I thought...

John Savard

MitchAlsup

Aug 1, 2012, 11:40:40 AM
Wave pipelining is completely dependent on <lack of> dispersion of the waves being pipelined. The fastest wave of the subsequent computation cannot catch up with the slowest wave of the current computation. And then there is the issue of catching the wave in a flipflop at the end.
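That dispersion condition can be written down directly (a standard wave-pipelining sketch, my formulation): with logic path delays between d_min and d_max, waves launched every T must satisfy roughly T > (d_max - d_min) plus sampling overhead, and about d_max / T waves are in flight at once.

```python
# Sketch of the dispersion constraint (assumed standard model, not
# from the post): wave k launched at k*T must never be overtaken.
def min_cycle(d_max, d_min, t_setup=0.0):
    """Smallest workable cycle time: spread of path delays + overhead."""
    return (d_max - d_min) + t_setup

def waves_in_flight(d_max, cycle):
    """Roughly how many waves coexist in the logic at once."""
    return int(d_max // cycle)

# Low dispersion is everything: a 10 ns worst path with only 1 ns of
# spread permits T just over 1 ns, i.e. ~10 waves in the logic at once.
print(min_cycle(10.0, 9.0))        # -> 1.0
print(waves_in_flight(10.0, 1.0))  # -> 10
```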

Mitch

Paul A. Clayton

Aug 1, 2012, 5:32:33 PM
On Wednesday, August 1, 2012 11:40:40 AM UTC-4, MitchAlsup wrote:
> On Wednesday, August 1, 2012 5:59:05 AM UTC-5, Paul A. Clayton wrote:
[snip]
>> Side question: does wave pipelining become more practical with
>> longer pipeline stages (so that process, data, voltage/frequency,
>> and thermal [?] variation would still not lead to timing issues)?
>>
>> It seems that wave pipelining might interact interestingly
>> with asynchronous design.
>
> Wave pipelining is completely dependent on <lack of> dispersion
> of the waves being pipelined. The fastest wave of the subsequent
> computation cannot catch up with the slowest wave of the current
> computation. And then there is the issue of catching the wave in
> a flipflop at the end.

That is why I speculated that a slower clock might help; it
seemed that a slower clock would reduce the difficulty of more
timing-dependent design. (I have read very little about the
technique and have only read of it being used for cache accesses
[by an HP PA-RISC implementation and, possibly--poor memory--,
by a MIPS implementation]. Wave pipelining seems like a
simpler, more limited variation on timing-based asynchronous
design.)

I could imagine a cascaded ALU or multiplier possibly being
able to exploit wave pipelining in a design (but with my
ignorance, my imagination is not constrained by most physical
limits much less practical technological or economic limits).
(I wonder how wave pipelining would fit with a width-pipelined
ALU. It would be kind of neat--I imagine--if a diagonal
wave was used, where the carry was the crest of the wave.
Such might almost be practical with a ripple carry adder--
which might be used in the ARM Cortex-M0.)

If substantial portions of a design had more aggressive
balancing of timing across bits (as urged by the use of
wave pipelining?), there might be some simplification of
asynchronous (or less synchronous) design, I am guessing.
Without aggressive balancing of timing, wave pipelining
would seem to interact with asynchronous design in the
less pleasant sense of "interestingly".

Your mentioning of latching made me wonder if there could
ever be a use for a switch rather than a latch. In a wave
pipelined design with bit-interleaved ALUs, I could imagine
a switch might be able to avoid some wave overlap (e.g., if
one ALU was a half-cycle out of phase with the other).

MitchAlsup

Aug 2, 2012, 12:34:44 PM
On Wednesday, August 1, 2012 4:32:33 PM UTC-5, Paul A. Clayton wrote:
> On Wednesday, August 1, 2012 11:40:40 AM UTC-4, MitchAlsup wrote:

>That is why I speculated that a slower clock might help; it seemed that a slower clock would reduce the difficulty of more timing-dependent design.

A slower clock allows MORE time for the fastest part of the subsequent wave to catch up with the slowest part of the previous wave, and thus increases the need for low dispersion in the circuits being wave pipelined.

Mitch

Andy (Super) Glew

Aug 1, 2012, 12:06:40 PM
On 7/31/2012 6:35 PM, MitchAlsup wrote:
> On Tuesday, July 31, 2012 7:15:31 PM UTC-5, Quadibloc wrote:
>> But you don't use flip-flops to pipeline with.
>
> Flip flops (generic for latch or register hooked to clock) is exactly what one pipelines with!
>
> clock-----V----------------V----------
> logic->flipflop->logic->flipflop->logic
>
> Flipflogs provide the stable assertion of the input signals to the logic until the stable outputs of the logic can be captured by the next layer of flip flops.
>
> Most CMOS designs use registers, which are 2 back to back latches (more with scan paths) set up to avoid race conditions at clock edges and create the necessary 2 gates of dealy through the latch to eliminate race conditions. Flipflop is simply a generaic term that can stand in for either latch (in all its forms) and register (in all its forms) as the timed unit of storage used to break compuations into series of logical calculations broken by those stable storeage elements--that is:: pipelineing!

This is one of those terminology morasses. Some designers are adamant
that they are not using flipflops, they are using latches. Or registers.
I know some folks talk about "register files" being seas of gates (and
latches or flipflops) whereas others are adamant that register files
really are only ever arrays of SRAM-like cells. Some say arrays are only
ever of SRAM cells; others say that arrays can be any regular structure.

Much of this is related to tools. If your CAD tools only support arrays
of SRAM cells, and otherwise you only get "sea of gates" synthesized
logic, then the names move in that direction.

Anyway, pipelines have storage between chunks of logic. The
distinction between latches and flipflops is mainly at the circuit
level. Transparent latches can be somewhat visible to architects,
particularly with 2-phase or multiphase clocking, where you try to
balance adjacent chunks of logic that compute in different phases.
(Note that I am carefully not saying pipestages: although I think it is
reasonable to call adjacent chunks of logic that compute on different
phases pipestages, most teams I know don't.)

Anyway: I tend to use flipflop to refer to the classic storage element
that has back-to-back cross-connected logic, like a D flipflop. And
"latch" to refer to a storage element that is more like a pass gate, a
capacitor, and another pass gate. I.e., flipflops are active whereas
latches are not. But there are all sorts of in-betweens, such as a latch
which is connected to an active storage cell, but only when the clock is
stopped.

Bottom line: make sure all parties mean the same thing for common words.
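The logic->flipflop->logic picture Mitch drew can be caricatured in a few lines of software (Python here purely as an illustration; the stage functions and register names are invented): each register holds its input stable for a full cycle, so a value emerges, transformed, two clock edges after it enters.

```python
# A toy cycle-accurate sketch of "logic -> flipflop -> logic -> flipflop".
# Edge-triggered registers capture all at once, so each logic stage sees
# stable inputs for the whole cycle. Stage functions are invented.
def clock_edge(regs, next_vals):
    """All registers capture simultaneously, like a global clock edge."""
    regs.update(next_vals)

def simulate(stream):
    regs = {"r1": 0, "r2": 0}      # pipeline registers (flipflops)
    out = []
    for x in stream + [0, 0]:      # feed zeroes at the end to drain the pipe
        stage1 = x + 1             # combinational logic before r1
        stage2 = regs["r1"] * 2    # combinational logic between r1 and r2
        out.append(regs["r2"])     # output after r2
        clock_edge(regs, {"r1": stage1, "r2": stage2})
    return out

print(simulate([3, 4, 5]))  # each input appears, transformed, 2 cycles later
```

Note that the two stages compute concurrently within a cycle; the registers are what keep one wave of data from racing into the next stage's inputs.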


>
>> Because most logic circuits end up with an "or" combining the outputs of several "and"s, it is possible to make a circuit that isolates the stages of a pipeline without adding any delay.
>
> It's called an Earle Latch--used in the 360/91--and it gets rid of one of the 3 necessary gate delays of the latch, at the cost of power: the input logic is combined with both the clock-high and clock-low feedback paths to eliminate a gate of possible race condition. Pairs of these used in series can be used to pipeline static random logic. The latches used in the 21064 and 21164 were CMOS equivalents of the Earle latch, and if stories are accurate, caused a number of timing headaches.
>
> Mitch
>


--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures, Intel, AMD, Motorola, and Gould), I
reveal this only so that the reader may account for any possible bias I
may have towards my employer's products. The statements I make here in
no way represent my employer's position on the issue, nor am I
authorized to speak on behalf of my employer.

Andy (Super) Glew

unread,
Aug 4, 2012, 4:35:13 PM8/4/12
to
On 7/31/2012 10:13 AM, Tim McCaffrey wrote:
> In article <5017122D...@SPAM.comp-arch.net>, an...@SPAM.comp-arch.net
> says...
>>
> [snip]
>>
>> Merging, what you call concordance, is annoying. [snip]
>>
>> By the way: one of the reasons that I got into OOO was that I expected
>> it to be better able to take advantage of the data value dependent
>> variable latencies that asynchronous circuits provide.
>>
>
This is perhaps stating the obvious, but it does seem like dataflow and async
logic complement each other. Dataflow already has to solve the merging
problem.

That's what I thought. Unfortunately, dataflow is even more sensitive
to the performance lossage due to asynch convergent paths than in-order is.

On an OOO micro-dataflow machine, the OOO circuitry has delay. Latency.
It would be bad if back to back dependent operations had to go all the
way through the OOO scheduling circuitry before we could start the
dependent instruction. So, we schedule speculatively, based on
assumptions about when the data is going to be ready - e.g. three cycles
to go through the scheduler, + 1 cycle of latency. But in asynch all of
the above may be variable. 3-4 cycles are typical for OOO. (Some of
the bigger designs that eliminate CAMs may want to hide 18+ cycles of
scheduler latency - at which point you are scheduling groups of
instructions not single instructions.)
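The speculative scheduling Andy describes can be sketched roughly as follows (a toy model, not any real scheduler; the op names, the 1-cycle latency assumption, and the replay policy are all invented): dependents are woken on an assumed fixed latency, and replayed when the producer was actually slower.

```python
# A toy sketch of speculative wakeup: the scheduler assumes every producer
# completes ASSUMED_LAT cycles after issue and wakes dependents on that
# schedule, replaying them when the data was not actually ready (as with a
# variable-latency asynch unit). Ops are issued in listed dependence order.
ASSUMED_LAT = 1

def run(ops, actual_lat):
    """ops: dict op -> list of source ops; actual_lat: op -> true latency."""
    issue_cycle, done_cycle, replays = {}, {}, 0
    cycle = 0
    for op in ops:
        srcs = ops[op]
        # speculative wakeup: assume sources complete ASSUMED_LAT after issue
        ready_guess = max((issue_cycle[s] + ASSUMED_LAT for s in srcs), default=0)
        cycle = max(cycle, ready_guess)
        truly_ready = max((done_cycle[s] for s in srcs), default=0)
        if cycle < truly_ready:      # speculation failed: data not ready yet
            replays += 1
            cycle = truly_ready      # replay at the true ready time
        issue_cycle[op] = cycle
        done_cycle[op] = cycle + actual_lat[op]
    return done_cycle, replays

done, replays = run({"a": [], "b": ["a"], "c": ["b"]}, {"a": 3, "b": 1, "c": 1})
print(done, replays)   # b replays once: a's real latency was 3, not 1
```

With fully asynchronous, data-dependent latencies, `actual_lat` is unknowable in advance, which is exactly why the fixed-latency wakeup assumption breaks down.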

Now, this is not all that different from in-order. Basically, the inner
core of OOO is an in-order machine designed to hide scheduler latency.

But when an asynch in-order machine stalls because of convergence, well,
that's just what you have. When an ooo machine stalls because of
convergence, that's an opportunity lost - you might have done better
scheduling something else.

Put this way, it's not really a problem, but an opportunity. But it's
still annoying.



Tim McCaffrey

unread,
Aug 8, 2012, 11:37:40 AM8/8/12
to
In article <501D8781...@SPAM.comp-arch.net>, an...@SPAM.comp-arch.net
says...
I have very limited hardware experience (but lots of software), and I always
envisioned a dataflow machine consisting of functional units that have a
queue of operands and a set of operations. When you get the necessary
operands for an operation (perhaps queue is the wrong word, as things can be
done in any order), you do it, and forward the result to the next functional
queue. I don't see scheduling as really necessary. If you need certain
operations (like memory access) to be done in a certain order, that is the
problem of the memory interface functional unit, which presumably would have
operations that specify ordering.
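The functional-unit picture sketched here might look like the following in software (a rough illustrative model; the tag scheme, two-operand limit, and routing table are invented): each unit collects operands in any order, fires when a full set has matched, and forwards the result to whatever consumes it next.

```python
# A rough model of dataflow functional units: fire an operation as soon as
# both of its operands have arrived, in whatever order they show up, then
# forward the result downstream. No central scheduler is involved.
from collections import defaultdict

results = {}   # final outputs land here

class FunctionalUnit:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.pending = defaultdict(dict)   # tag -> {slot: operand value}
        self.route = {}                    # tag -> [(dest_unit, dest_tag, slot)]

    def deliver(self, tag, slot, value):
        ops = self.pending[tag]
        ops[slot] = value
        if len(ops) == 2:                  # operands matched: execute
            result = self.fn(ops[0], ops[1])
            del self.pending[tag]
            for dest, dtag, dslot in self.route.get(tag, []):
                if dest is None:
                    results[dtag] = result # final output
                else:
                    dest.deliver(dtag, dslot, result)

# wire up (1 + 2) * 10 as pure dataflow
add = FunctionalUnit("add", lambda a, b: a + b)
mul = FunctionalUnit("mul", lambda a, b: a * b)
add.route["t0"] = [(mul, "t1", 0)]
mul.route["t1"] = [(None, "answer", 0)]
mul.deliver("t1", 1, 10)   # operands may arrive in any order
add.deliver("t0", 0, 1)
add.deliver("t0", 1, 2)
print(results)             # {'answer': 30}
```

The `route` tables stand in for the interconnect worried about below, and the operand matching is the "somewhat of a challenge" part: real machines need hardware tag match, not a Python dict.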

But, what the hey, it's not like I've actually built anything like that. So
I don't know how practical that is. Yes, I know matching operands to
operators is somewhat of a challenge (but probably pretty straightforward).
I would be more worried about interconnect between functional units...

- Tim

EricP

unread,
Aug 8, 2012, 2:18:59 PM8/8/12
to
If all you want is one or a set of output values for a given
set of inputs, then that would work. You init your input vars,
say "go" and everything calculates ASAP and out pops the results.

However if you include loops, feedback and/or time varying values
then you do need to consider value version tags and values queues.

C = A + B
L = I - J + C
Z = X * Y + C

If A and B are ready, then C(version 1) is produced and forwarded
to those waiting for C.
If A is updated to version 2, then C(v2) is produced and forwarded.

Managing versions becomes part of the problem.
Either A(v1) and A(v2) can both exist at once,
which implies potentially infinite resources,
or you need some handshake mechanism to ACK when all consumers
of A(v1) have used it, clear A's busy flag, and allow it to be reused,
which implies forward values flows and backward ACK flows.
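Eric's versioned-value example can be sketched as follows (an illustrative model only; the `Cell` abstraction and callback wiring are invented, and the bounded-buffer/ACK side of the problem is deliberately left out, which is the part that implies infinite resources):

```python
# A minimal sketch of versioned dataflow values for C = A + B feeding Z:
# each new version of an input produces a new tagged version of C, which
# is forwarded to every consumer, triggering recomputation downstream.
class Cell:
    def __init__(self, name):
        self.name, self.version, self.value = name, 0, None
        self.consumers = []            # callbacks fired on each new version

    def update(self, value):
        self.version += 1
        self.value = value
        for fire in self.consumers:    # forward the new version downstream
            fire()

def define(out, fn, *ins):
    """out = fn(*ins), recomputed whenever any input gets a new version."""
    def fire():
        if all(c.value is not None for c in ins):
            out.update(fn(*(c.value for c in ins)))
    for c in ins:
        c.consumers.append(fire)

A, B, C = Cell("A"), Cell("B"), Cell("C")
X, Y, Z = Cell("X"), Cell("Y"), Cell("Z")
define(C, lambda a, b: a + b, A, B)
define(Z, lambda x, y, c: x * y + c, X, Y, C)
X.update(2); Y.update(3)
A.update(1); B.update(4)      # C(v1) = 5 is produced and forwarded
print(Z.value, Z.version)     # Z(v1) = 2*3 + 5 = 11
A.update(10)                  # A(v2) -> C(v2) = 14 -> Z(v2) = 20
print(Z.value, Z.version)
```

Here each cell keeps only its latest version, which silently assumes all consumers of A(v1) fired before A(v2) arrived; relaxing that assumption is exactly where the ACK handshake or per-version storage comes in.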

Eric

Tim McCaffrey

unread,
Aug 8, 2012, 4:49:58 PM8/8/12
to
In article <a5yUr.45916$Zl3....@newsfe06.iad>,
ThatWould...@thevillage.com says...
Hmm, being a software guy I would take an approach more akin to SSA: If a
result needs to be used in multiple places, send it to a functional unit that
duplicates it and sends a copy to all the places it needs to go. No need for
busy counts, etc.

However, unless somebody comes up with something really tricky, it is likely
some kind of throttling mechanism would be needed.
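One conventional (not particularly tricky) throttling mechanism is credit-based flow control, sketched here as an illustration (the queue depth and class names are invented): a producer may only forward a value while it holds credits, and the consumer returns a credit each time it drains one, which is the backward ACK flow Eric mentioned.

```python
# Credit-based backpressure: the producer-side credit counter mirrors the
# free space in the consumer-side buffer, so a send can never overflow it.
from collections import deque

class Link:
    def __init__(self, depth=2):
        self.credits = depth       # producer-side credit counter
        self.queue = deque()       # consumer-side buffer

    def send(self, value):
        if self.credits == 0:
            return False           # throttled: producer must stall and retry
        self.credits -= 1
        self.queue.append(value)
        return True

    def consume(self):
        value = self.queue.popleft()
        self.credits += 1          # ACK flows backward as a credit return
        return value

link = Link(depth=2)
print(link.send(1), link.send(2), link.send(3))  # third send is throttled
link.consume()                                   # frees one credit
print(link.send(3))                              # now it goes through
```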

- Tim

Quadibloc

unread,
Sep 8, 2012, 7:18:06 AM9/8/12
to
Of course, since I posted this, I did a search and found new papers
about improved versions of the Earle latch which still claim that the
latches "work as advertised".

Meanwhile... as I wrote,

> - When the 7090 was succeeded by the 360, a lot of programs had to
> switch all the way up to double precision, so the precision provided
> by 704 single precision floating-point was useful, and
>
> - The exponent range of IEEE 754 32-bit floats is significantly
> smaller than that of IBM 360 single precision, leading to an
> incompatibility with a major chunk of software,
>
> the fact that it's possible to construct a 36-bit number format based
> on IEEE 754 that has the exponent range of System/360 floats and the
> precision of 7090 single-precision floats, thanks to that one extra
> hidden bit, makes 36-bit floating-point numbers attractive.

and now I've started a new thread, because I have finally come up with
a technique that allows memory to be used, with a very limited amount
of waste as overhead, for arrays of 36-bit numbers only, without the
need to mix in some 64-bit and 48-bit values to avoid wasting a lot of
memory... *and* without the need to employ division (even by a
constant) in address calculation, my previous stumbling block.
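The new technique itself isn't spelled out in this post, but the stumbling block is easy to illustrate (the code below is an illustration of the general problem, not Savard's scheme): packing seven 36-bit floats into each 256-bit line (252 bits used, 4 wasted) makes element i live in line i // 7, slot i % 7, and the classic division-free workaround is reciprocal multiplication by a magic constant -- the thing the new scheme reportedly avoids even needing.

```python
# Why division by a constant crops up when packing 36-bit floats seven to a
# 256-bit (32-byte) line, and the standard multiply-and-shift replacement.
MAGIC = (2**20 + 7 - 1) // 7      # ceil(2**20 / 7) = 149797

def addr_36bit(i, base=0):
    line = (i * MAGIC) >> 20      # i // 7 without a divide (exact for i < ~349000)
    slot = i - 7 * line           # i % 7, again with no division
    return base + 32 * line, 36 * slot   # (byte address of line, bit offset)

for i in range(100000):           # spot-check the magic-number identity
    assert (i * MAGIC) >> 20 == i // 7
print(addr_36bit(10))             # line 1 (byte 32), slot 3 (bit 108)
```

Even strength-reduced like this, the multiply sits on the address path of every array access, which is presumably why eliminating it entirely was worth a new thread.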

John Savard