
RISC != real-time control


Philip Koopman

Apr 25, 1988, 1:15:53 PM

One aspect of RISC processors for real time control that I
have not seen discussed is the conflict between
deadline scheduling and the statistical nature of
RISC performance figures.

Real-time control programs often have a situation where only
X microseconds are available to perform a task. Therefore,
the code to perform the task must be GUARANTEED to complete
within X microseconds. In real-time control, a late answer
is a wrong answer.

The problem with RISC designs is that they promise a performance
of Y MIPS in the average case over large sections of code and
relatively long periods of time. It seems to me that this
is not an applicable performance measure for real-time control.
What is more important is worst-case performance (maximum
possible cache misses for that program, branch-target buffer
misses, etc.) It may be the case that a slower processor
with uniform performance can be rated at a higher usable
MIPS rate than a RISC processor with inconsistent
instantaneous performance.

So, what is a real-time control designer to do?

-- De-rate the RISC MIPS ratings to assume 100% cache misses?

-- Use (probably) non-existent tools to compute worst-case
program execution time under all possible conditions?

-- Not use RISC in an environment with short deadline events?
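
For a concrete feel, the de-rating in the first option is just arithmetic.
Here is a minimal sketch; the clock rate, cycle counts, miss penalty, and
deadline are hypothetical, not figures for any real chip:

```python
# Hedged sketch of option 1: assume 100% cache misses and check the
# deadline against the resulting worst-case time (numbers hypothetical).
def worst_case_us(instructions, clock_mhz, cycles_per_instr, miss_penalty_cycles):
    """Microseconds taken if every instruction pays the full miss penalty."""
    cycles = instructions * (cycles_per_instr + miss_penalty_cycles)
    return cycles / clock_mhz   # clock_mhz = cycles per microsecond

# Hypothetical task: 500 instructions, 20 MHz, 1 cycle/instr, 4-cycle miss.
t = worst_case_us(500, 20.0, 1, 4)
print(t, t <= 200.0)   # 125.0 True - meets a 200 microsecond deadline
```

Without the de-rating the same task would claim 25 microseconds, which shows
how far an average-case MIPS figure can sit from the guaranteed bound.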


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Phil Koopman 5551 Beacon St. ~
~ Pittsburgh, PA 15217 ~
~ koo...@faraday.ece.cmu.edu (preferred address) ~
~ koo...@a.gp.cs.cmu.edu ~
~ ~
~ Disclaimer: I'm a PhD student at CMU, and I do some ~
~ work for WISC Technologies. ~
~ My opinions are my own, etc. ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Larry Weber

Apr 25, 1988, 9:57:49 PM
In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>
>One aspect of RISC processors for real time control that I
>have not seen discussed is the conflict between
>deadline scheduling and the statistical nature of
>RISC performance figures.
>
> ...

>So, what is a real-time control designer to do?
>
>-- De-rate the RISC MIPS ratings to assume 100% cache misses?
>
>-- Use (probably) non-existent tools to compute worst-case
> program execution time under all possible conditions?
>
>-- Not use RISC in an environment with short deadline events?
>
Cache effects can be present in any machine that has a cache: CISC or RISC.

Answer 1 will provide a general guideline of the effect only if you
know how YOUR application maps onto the MIPS rating. Even if your
program followed the MIPS rating in a number of trials, you still have to
know how the time is allocated between memory references and other operations
which do not have a statistical nature.

Answer 2 will give a worst case bound on the performance. The MIPS compilers
have tools that will inform you of the number of cycles, instructions and
memory references for a given run of the program. Computing worst
case times is really a matter of multiplication. This answer is
really overkill because not all applications require worst case times to be
used for every part of the problem. For example, assume
you had to accept a piece of data and queue it for processing while
interrupts were disabled. The critical time is how long interrupts are
disabled, because data could be lost in that period.
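
That "matter of multiplication" can be sketched directly: sum worst-case
cycle counts over the instructions executed while interrupts are off. The
instruction classes and counts below are hypothetical, not vendor figures:

```python
# Sketch: bound the interrupts-disabled window by summing worst-case
# per-instruction cycle counts on the critical path (counts hypothetical).
CYCLES = {"load": 2, "store": 2, "alu": 1, "branch": 3}

def critical_path_cycles(path):
    """Worst-case cycles for a straight-line critical section."""
    return sum(CYCLES[op] for op in path)

# Accept a word and enqueue it: load, alu, store, store, branch.
print(critical_path_cycles(["load", "alu", "store", "store", "branch"]))  # 10
```

The point is that this only works when each instruction has a fixed
worst-case count, which is exactly the property single-cycle RISCs provide.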

Answer 3 is like throwing out the baby with the bath water - this solution
should be generalized to any hardware that has a statistical nature.
This leaves out the 68020 and 030 too.
--
-Larry Weber DISCLAIMER: I speak only for myself, and I even deny that.
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!larry, DDD:408-720-1700, x214
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Henry Spencer

Apr 25, 1988, 10:47:24 PM
> So, what is a real-time control designer to do?

The same thing he does with a high-powered CISC: swear loudly, try to
estimate worst-case performance, and contemplate going back to the Z80.
At least RISC instruction times are more or less predictable, unlike those
of, say, the 68020.

More generally, there is a fundamental clash between trying to make the
performance simple and predictable and trying to maximize it by exploiting
regularities in the workload. If you want absolutely predictable speed,
then (for example) you will either have to live without caches or else
manage them very carefully so you know what they're doing. The same applies
to optimizing compilers, buffered I/O devices, asynchronous buses, etc etc.
--
"Noalias must go. This is | Henry Spencer @ U of Toronto Zoology
non-negotiable." --DMR | {ihnp4,decvax,uunet!mnetor}!utzoo!henry

Philip Koopman

Apr 26, 1988, 11:39:39 AM
In article <15...@pt.cs.cmu.edu>, koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
> One aspect of RISC processors for real time control that I
> have not seen discussed is the conflict between
> deadline scheduling and the statistical nature of
> RISC performance figures.
> [stuff deleted]

Thanks for the response so far.

I have received several replies of the form that any machine with
cache has problems with predictability of performance.
I agree, but that isn't the whole question/answer. I thought
that RISCs had a higher cache miss rate (in misses per second,
not miss ratio) since they need more instructions, or is this
solved with increased line size/prefetching?

A better question is: is it appropriate to be using a RISC
on embedded applications? What if you can't afford off-chip cache
memory -- doesn't the increased instruction bandwidth required
for a RISC cause problems? I get the feeling that cache helps a CISC
somewhat, but that a RISC simply dies without a lot of cache -- is
that really the case?

Another concern has to do with program size. Everything I've seen
says that RISCs have programs about twice as big as CISCs. What
does that do in an embedded environment -- NO, Memory is NOT cheap
when it costs power/weight/cooling/volume/dollars/chip count in a highly
constrained application!

Thanks for the feedback,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Phil Koopman 5551 Beacon St. ~
~ Pittsburgh, PA 15217 ~
~ koo...@faraday.ece.cmu.edu (preferred address) ~
~ koo...@a.gp.cs.cmu.edu ~
~ ~
~ Disclaimer: I'm a PhD student at CMU, and I do some ~
~ work for WISC Technologies. ~
~ (No one listens to me anyway!) ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ag...@urbsdc.urbana.gould.com

Apr 26, 1988, 11:43:00 AM

>So, what is a real-time control designer to do?
>
>-- De-rate the RISC MIPS ratings to assume 100% cache misses?

You have to do this for CISCs with caches, not just RISCs.

>-- Use (probably) non-existent tools to compute worst-case
> program execution time under all possible conditions?

In a hard real time environment you have to do this for CISCs
as well as RISCs. I don't know of any tools to do this *well*
in either camp, but building them should be considerably easier
for a RISC than a CISC, given the preponderance of short,
single cycle instructions, and explicitness of timing constraints.
On a CISC you never know what interlock is going to bite you.
In fact, wasn't this one of the original reasons for RISC -
simple instructions make performance of code sequences easier
to calculate, and hence easier to choose between in optimization?

>-- Not use RISC in an environment with short deadline events?

I rather think that the GE RPM-40 guys will disagree with you about
that...

ag...@gould.com

Donald Schmitz

Apr 26, 1988, 12:36:29 PM
In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:

>Real-time control programs often have a situation where only
>X microseconds are available to perform a task. Therefore,
>the code to perform the task must be GUARANTEED to complete
>within X microseconds. In real-time control, a late answer
>is a wrong answer.

This may be straying somewhat from the original point, but what sort of
applications really have such exact timing deadlines? I have done a little
real-time motion control, using a CPU to implement a discrete position
control law for robot axes, and in general a few percent deviation in cycle
time has next to no effect. As long as the deviation is small and well
distributed, i.e. delays of no more than 20% and occurring less than 10
sample periods in a row, I can't imagine a mechanical system reacting to the
error.

Don Schmitz (sch...@fas.ri.cmu.edu)

Joe Petolino

Apr 26, 1988, 12:53:04 PM
>One aspect of RISC processors for real time control that I
>have not seen discussed is the conflict between
>deadline scheduling and the statistical nature of
>RISC performance figures.
>
. . .

>So, what is a real-time control designer to do?

First (as others have pointed out) this problem has more to do with having a
cache than with using any particular type of processor. RISC processors
complicate this a little by providing opportunities for varying levels of
optimization for a given piece of code. However, once it's cast into machine
code, execution time (barring memory system effects) is quite predictable
for most processors (either CISC or RISC), and could be determined with a
good simulator.

You could attack the cache problem by clever system design. A former
employer of mine at one point contemplated building a RISC-based system aimed
at real-time applications. Our plan was to use a set-associative instruction
cache, and include a control bit in each cache set (writable by the operating
system) which could 'lock' one of the elements of the set into the cache: if
the bit was set, that cache block would never get swapped out of the cache
(the rest of the set was still available for 'non-critical' stuff, which
would suffer a higher miss rate due to the reduced cache size). If you
loaded your response-critical code into the cache, then locked it in, one big
variable went away. Unfortunately, this system never was built. Has anyone
else done something like this?
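
The locking scheme described might be modeled like this: a toy two-way
set-associative cache in which the operating system can pin one way of each
set. The structure is a hypothetical sketch, not the actual design:

```python
# Toy 2-way set-associative cache with a per-set lock bit, as described
# above: a locked block is never evicted; misses replace only the other way.
class LockableCache:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.ways = [[None, None] for _ in range(num_sets)]
        self.locked = [False] * num_sets

    def lock(self, addr):
        """Pin a block into way 0 of its set (done by the OS for critical code)."""
        s = addr % self.num_sets
        self.ways[s][0] = addr
        self.locked[s] = True

    def access(self, addr):
        """Return True on hit; on a miss, evict only the unlocked way."""
        s = addr % self.num_sets
        if addr in self.ways[s]:
            return True
        victim = 1 if self.locked[s] else 0
        self.ways[s][victim] = addr
        return False

cache = LockableCache(num_sets=4)
cache.lock(8)            # critical code at block 8 maps to set 0
cache.access(12)         # miss: block 12 (set 0) fills the unlocked way
cache.access(16)         # miss: block 16 (set 0) evicts block 12, not block 8
print(cache.access(8))   # True - the locked block is still resident
```

Non-critical code in a locked set sees an effectively direct-mapped cache,
which is the "higher miss rate due to the reduced cache size" noted above.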

-Joe

Brian Case

Apr 26, 1988, 2:37:40 PM
In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>One aspect of RISC processors for real time control that I
>have not seen discussed is the conflict between
>deadline scheduling and the statistical nature of
>RISC performance figures.

?????? And CISC (or whatever you consider an alternative to RISC) doesn't
have the so-called "statistical nature" of performance?!?!

>The problem with RISC designs is that they promise a performance
>of Y MIPS in the average case over large sections of code and
>relatively long periods of time.

?????? How do alternatives to RISC differ?

>What is more important is worst-case performance (maximum
>possible cache misses for that program, branch-target buffer
>misses, etc.)

Worst-case performance is always *most* important for real-time systems.
Because of fundamental limitations of technology (big DRAMs are slower
than small SRAMs), any processor that runs as fast as the technology will
allow will rely on caching to some degree (I claim). To the extent that
your real-time code can't depend on the cache(s) containing your working
set (probably can't depend on it at all), you may be better off, in terms
of cost, designing the hardware without caches. If the caches are on-chip,
then you have no choice of course. Now, it *is* possible that, in an
environment where the cache(s) is(are) always missing, cache(s) will actually
make the system run slower. However, it will be more and more difficult
to find any fast processor, CISC, RISC, or whatever-ISC, without on-chip
caches. In fact, many CISCs will soon be implemented with a very RISC-
like core. Oops, I guess I could have summarized this whole spiel by
simply saying "your problem isn't RISC, it's statistical techniques in
general. These techniques are universally used." Maybe a good-old 68000
is your best bet?

b...@pedsga.uucp

Apr 27, 1988, 8:19:19 AM
In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU.UUCP writes:
> { questioning the suitability of RISC processors for Real-Time use }
> ...

It seems to me that it is much *easier* to predict worst
case performance for RISC processors because

1) Most execute one instruction/clock. You don't have to figure
out how many cycles each instruction actually takes.
2) Most don't have interruptible instructions. Who knows how
long those take?

If you are really concerned about cache misses, you would design
your system so that all the memory was fast enough for the processor.
And you wouldn't do demand-paging either.

Just my opinion.
Bob Weiler.

Franklin Reynolds

Apr 27, 1988, 11:32:32 AM
Another similar question about RISC vs. realtime is whether the philosophy
of optimising for the general case instead of the exception is appropriate.

As I understand it, optimising for the general case is fundamental to most
RISC designs. Modern, sophisticated realtime systems that have to deal with
hard time constraints and overload conditions might be better served by
architectures that are optimized for various exceptional conditions.

You could imagine an architecture optimized for speedy interrupt handling,
context switching, process ordering, IPC, etc. This architecture might have
advantages for certain types of realtime applications over designs that
optimized for throughput in the general case.

Franklin Reynolds Kendall Square Research Corporation
f...@ksr.uucp Building 300 / Hampshire Street
ksr!f...@harvard.harvard.edu One Kendall Square
harvard!ksr!fdr Cambridge, Ma 02139

John Danskin

Apr 27, 1988, 1:24:48 PM

We have a leetle teeny ucode engine (read risc by Weitek) that needs
some things locked into cache (a real time constraint that involves
the bus hanging if we slip by even one cycle (our fault, not
weitek's)). Fortunately, our system uses direct mapped caches, so we
changed the linker so that modules which should be locked into
cache get unique addresses (modulo the cache size). This works
just fine, and since we have hardly any of this critical code, it caused
only a 2% overall code growth (because of all the little holes).
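
That linker trick can be sketched as follows: in a direct-mapped cache two
blocks collide exactly when their addresses are equal modulo the cache size,
so locked modules get non-overlapping residues. Sizes, the base address, and
the placement policy here are hypothetical illustrations:

```python
# Sketch of the linker trick above: place each must-stay-resident module
# at a unique offset (mod cache size) so no two alias in a direct-mapped
# cache (cache size and base address hypothetical).
CACHE_SIZE = 4096      # bytes, direct-mapped
BASE = 0x10000         # load address, a multiple of CACHE_SIZE

def place_locked_modules(sizes):
    """Return start addresses so locked modules never evict each other."""
    addrs, offset = [], 0
    for size in sizes:
        assert offset + size <= CACHE_SIZE, "locked code must fit in the cache"
        addrs.append(BASE + offset)
        offset += size
    return addrs

addrs = place_locked_modules([512, 256, 1024])
print([a % CACHE_SIZE for a in addrs])  # [0, 512, 768] - all distinct
```

Everything else links around those reserved residues, which is where the
"little holes" and the small code growth come from.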
--
John Danskin | decwrl!jmd
DEC Technology Development | (415) 853-6724
100 Hamilton Avenue | My comments are my own.
Palo Alto, CA 94306 | I do not speak for DEC.

David Schachter

Apr 27, 1988, 2:19:05 PM
In article <15...@pt.cs.cmu.edu> sch...@FAS.RI.CMU.EDU (Donald Schmitz) writes:
>In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>>Real-time control programs often have a situation where only
>>X microseconds are available to perform a task. Therefore,
>>the code to perform the task must be GUARANTEED to complete
>>within X microseconds. In real-time control, a late answer
>>is a wrong answer.
>
>This may be straying somewhat from the original point, but what sort of
>applications really have such exact timing deadlines?...
>[I]n general a few percent deviation in cycle

>time has next to no effect. As long as the deviation is small and well
>distributed, ie. delays of no more than 20% and occuring less than 10
>sample periods in a row, I can't imagine a mechanical system reacting to the
>error.

Not all real-time systems control mechanical objects.

I wrote code for a radio-controlled clock. The microcontroller takes a non-
maskable interrupt every millisecond. If the interrupt service routine ever
takes more than a millisecond to execute, the results are:

1) The stack may get trashed, or it may not.
2) The clock will lose a millisecond.
3) Certain I/O ports may not be completely updated.
4) The clock may lose an output character (sending time to the host)
5) The clock may lose input characters (receiving commands from the host.)

Depending on the customer's usage of the clock, the result could be as simple
as a traffic light "slipping" a millisecond, or as bad as a wide-area network
losing packets and not being able to restart after a network crash.

I put in code to reset the clock if nested NMIs occur, and I spent a lot of
time counting clocks and doing measurements with an oscilloscope, to ensure
that the interrupt service routine always takes less than a millisecond.
Worst case time: 900 microseconds. Usual case: 100 microseconds.

Before the work, the clock would often crash for no apparent reason. It turned
out the previous programmer (this was two years ago) was allowing the ISR to
take more than ten milliseconds (i.e. nesting NMIs ten levels deep!).

Disclaimer: this article was written by Schroedinger's cat, Bill.

Peter J Desnoyers

Apr 27, 1988, 3:41:10 PM
Problems like this have already cropped up in the modem field, where
you have RISC-like processors (e.g. TMS32020) which require very fast
memory running code which has to run every sample time, and then a lot
of random code to control the front panel, RS232, MNP, and other
random piddling stuff. The solution until now was to use an 8 bit
micro (sometimes a 68000) to do the piddling stuff that took up 80-90%
of the code volume, and a signal processing micro to do the fast
stuff, and give them each their own slow and fast memory,
respectively.

Things have changed. It is now possible to get at least one of these
chips (I think it's the 32020) to do wait states on memory, and
someone (I don't remember who) has now put their MNP implementation
and a few other things on this processor, in slow ROM, while their
signal processing code runs in fast (20ns?) RAM. It takes a lot more
ROM space than an eight bit micro (simple, fixed-length (32 bit?)
instructions, poor handling of anything but integer multiplies and
accumulates), but you still end up with fewer chips, lower cost, and
a negligible load added to the signal processor.

The interesting thing to notice is that there is no need for fast
memory to be used as a cache in an embedded application. Just load
your time-critical code into fast memory, and your random stuff into
slow memory. If the time-critical part of the code is huge, then a
cache wouldn't help anyway.


Peter Desnoyers
pe...@athena.mit.edu

David Keppel

Apr 28, 1988, 1:35:30 PM

I talked with our local real-time guru, Alan Shaw, who said something
to the effect of (not an exact quote, but I'll try to get the message
across):

Doing any kind of timing analysis is very hard. You can't
assume in your analysis that there's going to be bus contention
every memory cycle, or your estimated performance is going to
look much worse than it ever will in practice. What people
really do is come up with reasonable figures based on the
probability of there being N consecutive bus contention cycles,
and make your timing analysis based on some number of contention
cycles that will happen with a probability that is smaller than
the chance of other catastrophic failure.

Note that this analysis is independent of RISC/CISC or almost
anything else. The key point here is that you can measure and
estimate probabilistically, and in practice the failure rate
from other sources (e.g., hardware failures) will be the
dominant mode of failure.
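
The analysis described can be sketched numerically: pick the smallest run of
consecutive contention cycles whose probability falls below a threshold tied
to other failure modes. The contention probability and threshold below are
hypothetical, chosen only to make the arithmetic exact:

```python
# Sketch of the probabilistic timing budget above: find the smallest N
# with P(N consecutive contention cycles) below a failure threshold,
# assuming independent cycles (all numbers hypothetical).
def contention_budget(p_contention, threshold):
    n, prob = 0, 1.0
    while prob > threshold:
        n += 1
        prob *= p_contention   # P(n consecutive contentions) = p^n
    return n

# Even with contention on half of all cycles, budgeting for 30
# consecutive contentions gets the risk below one in a billion.
print(contention_budget(0.5, 1e-9))  # 30
```

Budgeting for 30 contention cycles instead of "contention on every cycle of
the task" is the difference between a usable bound and a uselessly
pessimistic one.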

;-D on ( Well it looked good when I closed my eyes ) Pardo

Jeff Collins

Apr 28, 1988, 9:16:00 PM

Here's an interesting question for the SPARC gurus of the world:

Given that the SPARC must use a virtual cache to get optimal
performance, how does one build a multiprocessor with a SPARC?

As far as I know, no one has solved the virtual cache coherency
problem yet...



Guy Harris

Apr 29, 1988, 4:53:55 AM
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?

Excuse me? Where is it "given that the SPARC must use a virtual cache to get
optimal performance?" It is the case that the SPARC requires some form of very
fast memory to get optimal performance, but that's true of a hell of a lot of
machines these days. The virtual cache permits you to bypass the MMU on a
cache hit, but I don't know that this is *required* for SPARC - or for any of a
number of other microprocessor architectures.

It may be the case that with the current SPARC implementations, with no on-chip
MMU, that it's easier to get high performance with a virtual cache than with a
physical cache, or even that you can't get optimal performance *for those
implementations* with a physical cache (although I suspect the latter is not
true). This certainly doesn't say that the SPARC *architecture* requires a
virtual cache.

Another way of putting this is that I have no particular reason to believe that
all high-end SPARC machines built by Sun will have virtual caches (it is
already the case that not all SPARC machines built by Sun have virtual caches;
the 4/110 SCRAM memory acts more like a physical cache than a virtual one).

Gert Slavenburg

Apr 29, 1988, 12:32:50 PM
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Coherency can be maintained in a virtual address cache in exactly the same
way as it is maintained in a multiprocessor: pick your favorite 'bus watch'
style multiprocessor consistency protocol. Now use this protocol at the
back (bus) end of a virtual address cache, where requests go out in terms
of physical addresses, applying it TO YOURSELF AS WELL AS TO OTHERS. For
example, if you have an ownership like protocol, acquiring ownership
over a main memory slot (the memory unit that goes into a cache line)
requires 'negotiation' with other caches that hold a copy of this slot,
including yourself if you hold such copies under different virtual address
names.

In order to implement this, the virtually addressed cache needs to be able
to 'observe' (and/or 'interact with' in some ownership schemes) bus
transactions that occur on those physical addresses of which this cache
holds copies. Again, this can be done in a variety of ways, either involving
a double set of tags or some form of reverse address translation (this may
be a one to many mapping).
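
A toy model of that "apply it to yourself as well" idea: a reverse map from
physical slot to the virtual line names holding copies, consulted when
ownership is acquired. The structure is a hypothetical sketch, not the VMP
design:

```python
# Toy sketch of coherency in a virtual-address cache: a reverse map from
# physical slot to virtual names lets the cache invalidate its own aliases
# when it acquires ownership of a slot (structure hypothetical).
class VirtualCache:
    def __init__(self):
        self.lines = {}     # virtual address -> data
        self.reverse = {}   # physical slot -> set of virtual addresses

    def fill(self, vaddr, paddr, data):
        self.lines[vaddr] = data
        self.reverse.setdefault(paddr, set()).add(vaddr)

    def acquire_ownership(self, paddr, keep_vaddr):
        """Invalidate every copy of this slot except the acquiring line -
        including our own copies held under other virtual names."""
        for v in self.reverse.get(paddr, set()) - {keep_vaddr}:
            del self.lines[v]
        self.reverse[paddr] = {keep_vaddr}

c = VirtualCache()
c.fill(0x1000, 7, "old")   # two virtual names for physical slot 7
c.fill(0x2000, 7, "old")
c.acquire_ownership(7, keep_vaddr=0x2000)
print(sorted(c.lines))     # only the 0x2000 alias survives
```

The reverse map plays the role of the double tag set or reverse translation
mentioned above: it is what lets the cache "observe" physical-address bus
traffic against its virtually named contents.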

The architecture of a multiprocessor with virtual address caches that applies
the above ideas in some unconventional ways was presented at the 13th
Symposium on Computer Architecture in Tokyo (1986) (`Software-controlled
caches in the VMP Multiprocessor', by D.R. Cheriton, G.A. Slavenburg and
P.D. Boyle). At this year's 15th symposium on Computer Architecture in Hawaii,
measurement results of the system, as in operation at Stanford University,
will be presented.

just thought you might like to know that this problem has been solved,

Gerrit A. Slavenburg

Walter Bays

Apr 29, 1988, 2:51:57 PM
In article <16...@alliant.Alliant.COM> je...@alliant.UUCP (Jeff Collins) writes:
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Clipper uses 'bus watch' to invalidate references to stale data, when
used in multiprocessor (including CPU-IOP) modes, and when using
'copy-back' (as opposed to write-through) cache modes. The newly
announced Motorola 88000 uses a similar scheme, called 'bus snoop'.

With SPARC, Clipper without CAMMU chips, or 88000 without CAMMU chips,
you implement your own cache, and can build whatever you choose.
--
------------------------------------------------------------------------------
Any similarities between my opinions and those of the
person who signs my paychecks is purely coincidental.
E-Mail route: ...!pyramid!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
Phone: (415) 852-2384
------------------------------------------------------------------------------

Jeff Collins

Apr 29, 1988, 3:23:09 PM
In article <51...@sun.uucp> g...@gorodish.Sun.COM (Guy Harris) writes:
>> Given that the SPARC must use a virtual cache to get optimal
>> performance, how does one build a multiprocessor with a SPARC?
>
>Excuse me? Where is it "given that the SPARC must use a virtual cache to get
>optimal performance?" It is the case that the SPARC requires some form of very
>fast memory to get optimal performance, but that's true of a hell of a lot of
>machines these days. The virtual cache permits you to bypass the MMU on a
>cache hit, but I don't know that this is *required* for SPARC - or for any of a
>number of other microprocessor architectures.
>
>It may be the case that with the current SPARC implementations, with no on-chip
>MMU, that it's easier to get high performance with a virtual cache than with a
>physical cache, or even that you can't get optimal performance *for those
>implementations* with a physical cache (although I suspect the latter is not
>true). This certainly doesn't say that the SPARC *architecture* requires a
>virtual cache.
>

Sorry, Guy, but you misunderstood the purpose of my posting. I was not
attempting to attack the SPARC. I am simply attempting to see if
anyone has given any thought to the problem of using the SPARC with a
virtual cache in a multiprocessor. It is my understanding (and I
admit that you are closer to the issue than I am) that it is not easy,
and may not be possible, to put a virtual to physical translation
before the SPARC's cache and still get no-wait-state performance.
(Please note that I didn't say that it was impossible to do this, only
that the performance would be degraded.)

If one looks at other current microprocessors (NS32532, Mot. 88000,
MIPS 3000, etc.) they all have standard MMUs available with the part
and have a fairly credible story of how to make them perform in a
multiprocessor. There is no standard MMU (to my knowledge) available
for the SPARC, and at least the current implementations seem to have a
major drawback when attempting to put them in a multiprocessor.

The question still remains - if you want to run the SPARC with a
virtual cache in a multiprocessor - how do you do it? Other questions
that get to the same issue are:

- How does one get no-wait-state performance in a multiprocessor
using the SPARC?
- Is a standard MMU going to be available, and how will it
work with a multiprocessor?
- Is there a solution to the virtual cache problems?
- Is there an announced version of the SPARC that allows time
between address-ready and data-ready to have an MMU before the
cache?

Mike Taylor

Apr 29, 1988, 5:49:55 PM

Amdahl machines starting with the 580 series in 1982 have used a coherent
virtual cache.
--
Mike Taylor ...!{ihnp4,hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's. ]

David Emberson

Apr 29, 1988, 8:37:31 PM

1) There is nothing inherent in the SPARC architecture that requires the use
of a virtual cache to obtain high performance.

2) It is possible to implement snoopy caches with caches that appear to the
processors as virtual caches. 'Nuff said.

3) We have a very good architectural solution to this problem for SPARC
which we are developing now. We are not yet prepared to go public with it
but it has been presented to some selected customers on a non-disclosure
basis. It is somewhat frustrating to us not to have an announced solution,
but we'd rather do it right than do something that does not scale well
(Yes, I did get my hands on the 88200 data sheet!). All I can say is that
your patience will be rewarded.

4) When the pieces are in place, you can bet that the technology will be
open (i.e. available to all) and will embrace the concept of
standardization that is so dear to us here at Sun. In other words, it is the
stated policy of the company that we will make the components available
so that if you want to build a SPARC machine--even one that competes with
ours--Sun will encourage you to do so. A license from Sun is not required
to use SPARC components.

I may be shot for talking about this stuff, but I think it is important to
the success of SPARC that Sun not be perceived as ignoring this important
technology. Support for multiprocessors has already been announced as a
deliverable item in Phase III of the AT&T-Sun Unix merge.

Dave Emberson (d...@sun.com)

Guy Harris

Apr 29, 1988, 11:33:03 PM
> Sorry, Guy but you misunderstood the purpose of my posting. I was not
> attempting to attack the SPARC.

No, I didn't. I merely pointed out that you were, as far as I could tell,
making an unwarranted assumption at the start of your discussion; this
assumption unnecessarily colored the rest of your discussion. David Emberson
of Sun has noted that there is nothing in the *architecture* that demands a
virtual cache. Current *implementations* may make a virtual cache the best, or
only, way to get maximum performance, but that's a different matter. If you
were, in fact, referring to the current chips, rather than the SPARC
architecture, I apologize.

> I am simply attempting to see if anyone has given any thought the
> the problem of using the SPARC with a virtual cache in a
> multiprocessor.

David Emberson has also already given an answer to this question; the
answer is "yes, Sun has". Unfortunately, as he indicated, he's not in a
position to discuss it in detail now.

Rick Richardson

Apr 30, 1988, 5:03:08 PM
In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>
>A better question is: is it appropriate to be using a RISC
>on embedded applications? What if you can't afford off-chip cache
>memory -- doesn't the increased instruction bandwidth required
>for a RISC cause problems? I get the feeling that cache helps a CISC
>somewhat, but that a RISC simply dies without a lot of cache -- is
>that really the case?
>

I'm still looking for the RISC that does ~4K (C language) Dhrystones,
has no cache, clocks around 4 Mhz, has a 16 bit bus, can address maybe 1MB,
is a power miser, can't do floating point, and costs no more than $15.

In HUGE quantities. Just think of the millions and millions of next
generation consumer products that could use the extra performance,
while still meeting EMI, power consumption, and cost requirements.

Come on guys, I know that there's a lot of prestige in
having the fastest micro-* around, but there's a LOT of HIGH VOLUME
applications out there that just can't use all that power.

You might sell 10K-100K of these super high performance chips.
Wouldn't you rather sell *tens of millions*?
--
Rick Richardson, President, PC Research, Inc.

(201) 542-3734 (voice, nights) OR (201) 834-1378 (voice, days)
uunet!pcrat!rick (UUCP) rick%pcrat...@uunet.uu.net (INTERNET)

Brian Case

Apr 30, 1988, 5:55:40 PM
In article <16...@alliant.Alliant.COM> je...@alliant.UUCP (Jeff Collins) writes:
>

Read about the SPUR project being done at Berkeley. They have a large
virtual cache, in-cache address translation (only done on misses), and
the system concept is a small (about 10) multiprocessor. Some neat
ideas.

Eugene D. Brooks III

Apr 30, 1988, 6:12:09 PM
In article <16...@alliant.Alliant.COM> je...@alliant.UUCP (Jeff Collins) writes:
> - Is there an announced version of the SPARC that allows time
> between address-ready and data-ready to have an MMU before the
> cache?
>
This is the key to bringing these RISC chips into the realm of real
supercomputing and will of course happen as it will be driven by market
pressures. You need to allow an "arbitrary" time between address-ready
and data-ready to allow successful use in a multiprocessor environment
where the latency of memory is more or less undetermined due to conflicts
in the shared memory subsystem. Basically, the address-ready lines
include a tag which identifies the request; when the response to the
request returns, the copy of the tag that arrives with it allows the cpu to
figure out what to do with the data. The number of tag bits limits the
number of outstanding requests for the cpu. One is likely to sequence the
tag bits in order for speed, so the number of outstanding requests will be
further limited by fluctuations in arrival order of the responses. If the
request gets satisfied by the cache it comes back with a low latency, but if
it goes to main memory (shared memory in a multiprocessor) it might have a
substantial latency. By allowing many requests to be pending at once one
can get "no wait state" performance in the same sense that internal pipelining
delivers "no wait state" performance for the internal cpu functions. The cpu
must be able to efficiently handle the fact that requests come back out of
order, which means that simple fifo's a la the WM machine won't do.
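
[A modern editorial sketch, in Python rather than hardware, of the tagged
split-transaction scheme described above. All names are illustrative and
not from any actual SPARC or bus specification.]

```python
# Toy model of a split-transaction memory bus: each request carries a
# tag; responses may return in any order and are matched back to their
# destination by that tag. The tag width bounds the requests in flight.
import random

class SplitTransactionBus:
    def __init__(self, tag_bits=2):
        self.max_outstanding = 1 << tag_bits   # tag width limits outstanding requests
        self.pending = {}                      # tag -> destination register name

    def issue(self, tag, addr, dest_reg):
        # Address-ready: send the address out with a tag attached.
        assert tag not in self.pending
        assert len(self.pending) < self.max_outstanding
        self.pending[tag] = dest_reg
        return (tag, addr)

    def respond(self, tag, data, regs):
        # Data-ready: the returning tag tells the CPU where the data
        # belongs, regardless of arrival order.
        dest = self.pending.pop(tag)
        regs[dest] = data

bus = SplitTransactionBus(tag_bits=2)
regs = {}
reqs = [bus.issue(t, 0x1000 + 4 * t, f"r{t}") for t in range(4)]
random.shuffle(reqs)                           # responses arrive out of order
for tag, addr in reqs:
    bus.respond(tag, addr * 10, regs)          # fake "memory data" derived from address
```

Even with responses shuffled, every register ends up holding the data for
the address it requested, which is the whole point of tagging.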


Rest assured that some future SPARC, MOT88000, Clipper, ..., implementation
will provide this capability by some means as it will be the only way to
further increase performance in the face of the memory latency of a shared
memory multiprocessor. Whether MIPS will pull this off is not clear: their
basic design principle (religion) of not having hardware interlocks would seem
orthogonal to doing it. In short, they will find that their basic design
principle was wrong when they try to pipeline cache misses in a shared memory
environment.

ag...@urbsdc.urbana.gould.com

unread,
Apr 30, 1988, 8:15:00 PM4/30/88
to

>As far as I know, no one has solved the virtual cache coherency
>problem yet...

There sure are a lot of folk who think they have, though not
commercially (yet). The virtual cache consistency problem is just
like the physical cache consistency problem, except that you
need a physical index for bus snooping.

[Knowing I'm gonna get flamed :-) ]: of course, Alliant doesn't
have too much to do with cache consistency - after all, the CEs
talk to the same cache, don't they, so don't have any consistency
problems? But how far can this scale? I suppose that the IPs
have to be kept coherent, and I believe that's writeback, but
the duty cycle doesn't have to be very high.

ag...@gould.com

Hank Dietz

unread,
May 1, 1988, 10:56:46 AM5/1/88
to
I hate to add to this pile of news, but why hasn't anyone talked about the
fact that processors like SPARC are not really designed for large-scale
multiprocessing, e.g., they have no provision for "hiding" big, stochastic,
memory reference delays across a log n stage interconnection network, etc.?
I think it's pretty uninteresting to talk about multi-processor systems
which are small enough that "snooping caches" work; as many of you have
pointed out, that's been essentially a non-problem for quite some years.
How about some discussion of what processors do when simple cache protocols
aren't good enough or are not implementable?

It seems we nearly got into such a discussion when WM was brought up, but WM
isn't really billed as being a processor design for large-scale
multiprocessors (read MIMDs), hence people didn't seem to notice that it was
addressing the stochastic delay memory reference problem so endemic to big
MIMDs. Consider all those other processors with microtasking or other
out-of-order or multiple-memory-pipeline structures. Anyone care to get a
discussion along these lines going?

-hankd

Mike Coffin

unread,
May 1, 1988, 3:37:56 PM5/1/88
to
From article <80...@pur-ee.UUCP>, by ha...@pur-ee.UUCP (Hank Dietz):

> I hate to add to this pile of news, but why hasn't anyone talked about the
> fact that processors like SPARC are not really designed for large-scale
> multiprocessing, e.g., they have no provision for "hiding" big, stochastic,
> memory reference delays across a log n stage interconnection network, etc.?
> [...]

> Anyone care to get a discussion along these lines going?
> -hankd

One machine that attacked this problem was the Denelcor HEP. Although
it's now defunct, my impression is that its demise had little to do
with technical merit.
--

Mike Coffin mi...@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci. {allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ 85721 (602)621-4252

Mike Butts

unread,
May 2, 1988, 12:56:19 PM5/2/88
to
From article <16...@alliant.Alliant.COM>, by je...@Alliant.COM (Jeff Collins):

>
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?
>
> As far as I know, no one has solved the virtual cache coherency
> problem yet...
>

Here's another log for the fire... Apollo says in their marketing booklet about
the new DN10000 architecture (which has 1 to 4 15-MIPS-RISC processors sharing
main memory):

"...the Series 10000's caches incorporate the best features of both fully physical
and fully virtual caches. The result is a new *virtually indexed, physically
tagged write-through cache* that lets cache RAM access proceed entirely overlapped
in time with any virtual-to-physical address translation. This overlap reduces
memory access pipeline depth, guaranteeing single-cycle execution and eliminating
the translation penalty typical with physically indexed designs.

The physical tags allow cache validation across processors. The addressing
scheme, based on the virtual address coupled with the physical tag, maintains
coherency and allows data to be shared by multiple processors."
--
Mike Butts, Research Engineer KC7IT 503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR mbu...@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.

Brian Case

unread,
May 2, 1988, 2:15:05 PM5/2/88
to
In article <4...@pcrat.UUCP> ri...@pcrat.UUCP (Rick Richardson) writes:
>I'm still looking for the RISC that does ~4K (C language) Dhrystones,
>has no cache, clocks around 4 Mhz, has a 16 bit bus, can address maybe 1MB,
>is a power miser, can't do floating point, and costs no more than $15.

Oh, that's easy! The Acorn RISC Machine (ARM). Yes, I know it has a
32-bit bus now, but just talk to VTI (they have the ARM and use it as a
cell, I think): if you are right about volumes, they'll make a mod to
give it a 16-bit bus. On every other account, the ARM is what you want.
I think you could even get it for around $10 instead of $15 (I think that
price is currently available for large quantities).

On second thought, with a 16-bit bus, it might slow down a lot. It seems
worth looking into though.

Jeff Collins

unread,
May 2, 1988, 6:51:18 PM5/2/88
to
In article <6...@garth.UUCP> wal...@garth.UUCP (Walter Bays) writes:
>In article <16...@alliant.Alliant.COM> je...@alliant.UUCP (Jeff Collins) writes:
>> Given that the SPARC must use a virtual cache to get optimal
>> performance, how does one build a multiprocessor with a SPARC?
>> As far as I know, no one has solved the virtual cache coherency
>> problem yet...
>
>Clipper uses 'bus watch' to invalidate references to stale data, when
>used in multiprocessor (including CPU-IOP) modes, and when using
>'copy-back' (as opposed to write-through) cache modes. The newly
>announced Motorola 88000 uses a similar scheme, called 'bus snoop'.
>
>With SPARC, Clipper without CAMMU chips, or 88000 without CAMMU chips,
>you implement your own cache, and can build whatever you choose.
>--

Actually I am familiar with the Clipper and the 88000. I know how they
support multiprocessing. The point here was that these chips put the
cache after the MMU. This means that the caches contain physical
address and the tag stores record physical addresses. When an address
goes across the bus it is not a big deal to "watch it" and determine
if the address is in the cache or not. Current implementations of the
SPARC, on the other hand, get optimal performance using a virtual
cache - ie. the cache is BEFORE the MMU. It is a much more
complicated procedure to "watch" these addresses. With a virtual
cache, each physical address on the bus must first be translated to
its corresponding virtual address through some sort of reverse TLB,
then the address can be provided to the bus watcher to see if the data
is in the cache.

This presents a number of problems that are not present in the case of
the 88000 or the Clipper (or any other micro that uses physical
caches). One of the problems is aliasing - this is when multiple
virtual addresses refer to the same physical address. In order to
properly snoop, all of the aliases must be checked for in the cache.
Another problem is simply the implementation of the reverse TLB -
what happens when the reverse TLB misses? Which set of page tables
should the TLB walk?
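
[A modern editorial sketch, with made-up page numbers, of the aliasing
problem just described: to act on one physical address seen on the bus,
a snooper in front of a virtual cache must find *every* virtual alias.]

```python
# Why snooping a virtual cache is hard: several virtual pages may map
# to one physical page, and a bus write to that physical page must
# invalidate all of the aliases in the cache.
page_table = {          # virtual page -> physical page
    0x10: 0x7,
    0x20: 0x7,          # alias: two virtual pages share physical page 0x7
    0x30: 0x9,
}

def reverse_translate(phys_page):
    # A "reverse TLB" must yield every virtual alias, not just one hit.
    return [v for v, p in page_table.items() if p == phys_page]

virtual_cache = {0x10: "data-A", 0x20: "data-A", 0x30: "data-B"}

def snoop_invalidate(phys_page):
    for vpage in reverse_translate(phys_page):
        virtual_cache.pop(vpage, None)

snoop_invalidate(0x7)   # a bus write to physical page 0x7 must kill both aliases
```

A physically tagged or physically indexed cache never faces this search,
since the bus address can be compared against the tags directly.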

Randell E. Jesup

unread,
May 3, 1988, 4:01:20 AM5/3/88
to
In article <4...@pcrat.UUCP> ri...@pcrat.UUCP (Rick Richardson) writes:
>I'm still looking for the RISC that does ~4K (C language) Dhrystones,
>has no cache, clocks around 4 Mhz, has a 16 bit bus, can address maybe 1MB,
>is a power miser, can't do floating point, and costs no more than $15.

Yeah, and what technology is this wonder-chip implemented in???
Whatever it is, I can think of dozens of Si companies that would give away
all their current facilites for that process. Oh, and I'm not even worrying
about cost.

Back to reality, it just can't be done, except MAYBE with a state of
the art chip optimized to NOTHING but fast dhrystones (which, by the way,
are a pretty poor predictor for most applications, due to string handling.)
4 Mhz is REAL slow. A 4Mhz rpm-40 would be equivalent to maybe a 14Mhz
68000 (note: not '020). At such slow speeds, CISC chips may well show
superiority due to wanting to maximize the usefulness of every bus cycle.

// Randell Jesup Lunge Software Development
// Dedicated Amiga Programmer 13 Frear Ave, Troy, NY 12180
\\// beowulf!lunge!je...@steinmetz.UUCP (518) 272-2942
\/ (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

Brian Case

unread,
May 3, 1988, 1:49:51 PM5/3/88
to
In article <8...@imagine.PAWL.RPI.EDU> je...@pawl18.pawl.rpi.edu (Randell E. Jesup) writes:
= Yeah, and what technology is this wonder-chip implemented in???
=Whatever it is, I can think of dozens of Si companies that would give away
=all their current facilites for that process. Oh, and I'm not even worrying
=about cost.
=
= Back to reality, it just can't be done, except MAYBE with a state of
=the art chip optimized to NOTHING but fast dhrystones (which, by the way,
=are a pretty poor predicter for most applications, due to string handling.)
=4 Mhz is REAL slow. A 4Mhz rpm-40 would be equivalent to maybe a 14Mhz
=68000 (note: not '020). At such slow speeds, CISC chips may well show
=superiority due to wanting to maximize the usefulness of every bus cycle.

On the contrary. Let me say it again: the ARM from VTI and ACORN. At low
clock rates (so that memory access time isn't an issue), the ARM gets about
1K dhrystones per MHz (using the rather decent ACORN C compiler). The
process is (was) junky 2 or 3 micron CMOS. Current price for the ARM
(VTI 86000 I think is the part number) is very low in quantity, < $15 I
think. The only problem for meeting the original poster's requirements is
the 32-bit bus of the ARM.

Allen J. Baum

unread,
May 3, 1988, 2:09:41 PM5/3/88
to
--------
[]

>In article <4...@pcrat.UUCP> ri...@pcrat.UUCP (Rick Richardson) writes:
>
>I'm still looking for the RISC that does ~4K (C language) Dhrystones,
>has no cache, clocks around 4 Mhz, has a 16 bit bus, can address maybe 1MB,
>is a power miser, can't do floating point, and costs no more than $15.
>

Except for the 16bit bus, the ARM chip seems to meet your qualifications.
It looks very good for controller kinds of applications. It's simple, small
(die size) and therefore, cheap. It does not require a cache, and knows how
to talk to DRAMs with page mode access cycles to get good performance with
no cache.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385

Walter Bays

unread,
May 3, 1988, 2:26:19 PM5/3/88
to
>>In article <16...@alliant.Alliant.COM> je...@alliant.UUCP (Jeff Collins) writes:
>>> Given that the SPARC must use a virtual cache to get optimal
>>> performance, how does one build a multiprocessor with a SPARC?
>>> As far as I know, no one has solved the virtual cache coherency
>>> problem yet...

>>[I replied about 'bus watch' citing Clipper and 88000.]

In article <16...@alliant.Alliant.COM> je...@alliant.UUCP (Jeff Collins) writes:
> Actually I am familiar with the Clipper and the 88000. I know how they
> support multiprocessing. The point here was that these chips put the

> cache after the MMU. [Good description of the problems in 'bus
> watching' with virtual caches.]

You're right. I missed your point. Clipper has a physical cache, while
the Sun 4 SPARC has a virtual cache.

gru...@convex.uucp

unread,
May 3, 1988, 6:58:00 PM5/3/88
to

>/* Written 5:51 pm May 2, 1988 by je...@alliant.Sun.Com
> ... It is a much more

> complicated procedure to "watch" these addresses. With a virtual
> cache, each physical address on the bus must first be translated to
> it's corresponding virtual address through some sort of reverse TLB,
> then the address can be provided to the bus watcher to see if the data
> is in the cache.
> ...In order to

> properly snoop, all of the aliases must be checked for in the cache.
> Another problem, is simply the implementation of the reverse TLB -
> what happens when the reverse TLB misses? Which set of page tables
> should the TLB walk?

Nah, it ain't that hard. One can have a virtually mapped, physically tagged
cache. As the virtual cache is filled with data, the translated physical
address from the MMU is written into the tag RAMs. The physical tags are
then used by the bus watcher to selectively invalidate the cache entries.

This is in fact how cache coherency is maintained in the Convex C-2 machines.
Each of the CPUs has a set of remote invalidation tag memories which watches
all of the other physical address buses (3 other CPUs and one I/O). When
there is a hit, the validity bits are cleared for that particular entry in
the virtually mapped cache.
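
[A modern editorial sketch of the scheme just described: lookups are by
virtual address, but each line carries the translated physical address
in its tag, so a bus snooper can invalidate without any reverse TLB.
Structure and names are illustrative, not the actual Convex design.]

```python
# Virtually indexed, physically tagged cache: fill writes the MMU's
# physical translation into the tag RAM; snoop clears validity bits by
# comparing the bus's physical address against those tags.
class VIPTCache:
    def __init__(self):
        self.lines = {}              # virtual index -> [physical_tag, data, valid]

    def fill(self, vaddr, paddr, data):
        self.lines[vaddr] = [paddr, data, True]

    def read(self, vaddr):
        line = self.lines.get(vaddr)
        return line[1] if line and line[2] else None

    def snoop(self, paddr):
        # The bus carries physical addresses, so this needs no translation.
        for line in self.lines.values():
            if line[0] == paddr:
                line[2] = False      # clear the validity bit

c = VIPTCache()
c.fill(0x1000, 0x7000, "old")        # virtual 0x1000 -> physical 0x7000
c.snoop(0x7000)                      # another CPU wrote physical 0x7000
```

After the snoop hit, the local read misses and refetches, which is how
coherency is maintained without translating bus addresses back to
virtual ones.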

Jeff Gruger
(ihnp4!convex!gruger)

Geoff Steckel

unread,
May 3, 1988, 7:27:16 PM5/3/88
to
In article <49...@bloom-beacon.MIT.EDU> pe...@athena.mit.edu (Peter J Desnoyers) writes:
>Things have changed. It is now possible to get at least one of these
>chips (I think it's the 32020) to do wait states on memory, and
>someone (I don't remember who) has now put their MNP implementation
>and a few other things on this processor, in slow ROM, while their
>signal processing code runs in fast (20ns?) RAM.

The scheme mentioned is very close to one with which I am currently working.
I recently surveyed all the DSP chips for which I could get documentation.
Only the TI 320xxx series have a 'memory access done' pin. All the other
chips (Moto, AD, NEC, OKI, ...) either have a programmable # wait states or
assume external program or data memory is sufficiently fast to work
synchronously.

This makes ganging of DSP chips using shared (peer-to-peer) global memory
difficult, and makes using mixed slow and fast program memory impossible.
The designers seem to assume:
1) All parts of the application must run equally fast.
2) Programs will be small.
3) Data will be small or only accessed a little at a time.
4) The DSP chip will own all resources to which it is connected.
5) Any resources the DSP chip does not own are:
a) connected via a serial port (a la Transputer, etc), or
b) sufficiently unimportant that polling a ready line is good enough, or
c) very fast, or
d) nonexistent

Can any of the DSP mavens comment on DSP architectures which
1) Can be connected to large (> 64K) shared memories, which the DSP may
use, but does not own (i.e. must request and be granted access)
and whose access time has an upper bound but is not deterministic
below that bound.
2) Can run 'background' tasks (servicing panels, SCSI, etc., etc.)
which require serious processing but much less than the 'foreground'
task does, preferably with the code in slow (> 70nS, cheap!) memory.
while doing 'foreground' classic DSP?

Right now only TI's 320xx chips seem to have some of the hardware support, with
the large advantage of an extremely narrow program memory path (16 bits!).
The corresponding disadvantage is an extremely baroque and asymmetrical
instruction set.

The chip described is very close to a general purpose RISC chip, but with
the following differences:
1) Onboard multiply must be very very fast (for convolutions, etc).
2) sub-wordsize (byte, etc.) performance not very important
DSP almost (ha) never does divides, but 1000000s of multiplies.
3) barrel shifter: very useful, verging on required
4) extended precision adder for multiply and accumulate vital
(e.g. if a * b yields 32 bits, at least 34 bits in the sum, preferably
more like 40!). You don't have time to check for overflow.
5) Floating point is **really** nice, but many applications can be
bludgeoned into fixed point. Painfully.
If you do put in floating point, make it FAST. Like 2-3 cycles.
6) Cheaper than the RISC chips are running. $100/ea in moderate quantity.
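
[Point 4 above can be put in numbers. An editorial sketch, in Python:
accumulating n full-width products without overflow checks needs roughly
ceil(log2(n)) guard bits above the product width.]

```python
# Guard bits for multiply-accumulate: summing n products, each up to
# product_bits wide, can grow the sum by up to ceil(log2(n)) bits.
import math

def accumulator_bits(product_bits, n_products):
    return product_bits + math.ceil(math.log2(n_products))
```

With 32-bit products, even 4 terms already need the 34 bits mentioned
above, and 256 terms need the "more like 40".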

geoff steckel (ste...@alliant.COM)

Przemyslaw Klosowski

unread,
May 3, 1988, 8:33:25 PM5/3/88
to
In article <16...@alliant.Alliant.COM> je...@alliant.UUCP (Jeff Collins) writes:
>
> As far as I know, no one has solved the virtual cache coherency
> problem yet...
>
See ``An in-cache address translation mechanism'' by D.A. Wood et al. in:
Proc. 13th annual ACM/IEEE symposium on comp. arch., Tokyo, June 1986.
They describe the SPUR solution to the problem. Basically it consists of forcing
the Global Virtual address (which is almost but not quite a virtual addr) to be the
same for all shared data.
Other solutions use reverse TLBs to retrieve the virtual address and then use
normal (i.e. bus snooping) techniques.
Last but not least, why not delegate responsibility to software? It is only
a question of finding a convenient paradigm.

prz...@psuvaxg.bitnet
psuvax1!gondor!przemek


Przemyslaw Klosowski

unread,
May 3, 1988, 8:39:23 PM5/3/88
to
In article <51...@sun.uucp> g...@gorodish.Sun.COM (Guy Harris) writes:
>> Given that the SPARC must use a virtual cache to get optimal
>> performance, how does one build a multiprocessor with a SPARC?
>
>Excuse me? Where is it "given that the SPARC must use a virtual cache to get
>optimal performance?"
IMHO the virtual cache ($) has serious advantages, one of which is that separate
TLB (which is nothing else but separate cache for translation data) is not
necessary. And of course convenient overlapping of $ access with V->R
translation.

prz...@psuvaxg.bitnet
psuvax1!gondor!przemek


John Hanley

unread,
May 4, 1988, 12:53:22 AM5/4/88
to
In article <80...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:
>...why hasn't anyone talked about the fact that processors like SPARC are not

>really designed for large-scale multiprocessing, e.g., they have no provision
>for "hiding" BIG, stochastic, memory reference delays across a log n stage

>interconnection network, etc.? I think it's pretty uninteresting to talk about
>multi-processor systems which are small enough that "snooping caches" work....

My favorite method of keeping the CPU busy while a memory-read request is
traversing the network is the one used by the Denelcor HEP: context-switch
on a cache miss. When a process requests that a disk block be read on a
time-sharing system, the scheduler tries to get useful work done during the
rotational latency by blocking the process and executing another one; when
a process requests a memory word on a HEP'ish system the CPU tries to get
useful work done during the network latency by executing another process.
This, of course, requires extremely light-weight processes so that time to
context-switch is comparable to time to execute any other instruction.
One way of doing this is to have a very memory-intensive architecture, with
almost no registers besides PC and PSW (and even the status word can be
dispensed with; c.f. recent comp.arch discussion). Another method is to
sacrifice single-process speed for parallel speedup, by putting a multiplexor
in front of every single register, so that a context switch is effected by
simply changing the index register that addresses the MUX. To prevent the
need for reloading the MMU's page-descriptors on every context-switch, it is
preferable for most switches to be between "threads" of the same program
(same virtual address space) rather than between processes running unrelated
programs.
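
[A modern editorial sketch of the HEP-style scheme above: each thread
has a private register bank selected by a MUX index, so a cache miss
triggers a switch to another ready thread instead of a stall. Purely
illustrative; no relation to actual HEP internals.]

```python
# HEP-style latency hiding: on a miss, rotate to the next thread rather
# than waiting for memory; the register-bank MUX makes the switch about
# as cheap as any other instruction.
class Thread:
    def __init__(self, name):
        self.name = name
        self.regs = [0] * 8          # private register bank (selected by MUX index)

def run(threads, hits):
    # hits: True = cache hit (keep running), False = miss (switch thread)
    trace = []
    current = 0
    for hit in hits:
        trace.append(threads[current].name)
        if not hit:                  # cache miss: context-switch
            current = (current + 1) % len(threads)
    return trace

trace = run([Thread("T0"), Thread("T1")], [True, False, True, False])
```

The processor never idles on a miss as long as some thread is ready,
which is exactly the bargain: single-thread speed traded for throughput.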

Another tack is to have a conventional ("context switches are expensive")
processor that rarely waits on cache misses. Writing to a memory location
is always fast because you don't have to wait around for it to finish. Reads
cause problems. Suppose you come to a code fragment that is about to do three
array references (low probability of being in the cache). Rather than saying
LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B,
PREFETCH C, LD A, LD B, LD C. If the time to execute the three non-blocking
prefetch instructions is comparable to the network latency, you win big, since
they execute during time that would have been spent idle anyway. Code density
is shot to hell, and the compiler has to be _very_ smart about cache-hit
likelihoods (or else runtime profiling has to be done, which is tricky because
adding and removing PREFETCH instructions is going to change the
pattern of what's in the cache when).
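
[A cycle-count sketch of the PREFETCH argument, with assumed numbers
(10-cycle memory latency, 1 cycle per issued instruction); the model is
editorial, not from the original post.]

```python
# Compare LD A, <wait>, LD B, <wait>, LD C, <wait> against
# PREFETCH A, PREFETCH B, PREFETCH C, LD A, LD B, LD C.
LATENCY = 10   # assumed cycles from request to data-ready

def blocking_loads(n):
    # Each load stalls for the full memory latency.
    return n * (1 + LATENCY)

def prefetched_loads(n):
    # n 1-cycle prefetches hide part of the latency; wait out the rest
    # once; then n 1-cycle loads that all hit the cache.
    hidden = min(LATENCY, n)
    return n + (LATENCY - hidden) + n
```

For three references the blocking sequence costs 33 cycles against 13
for the prefetched one; with enough independent prefetches the latency
is hidden completely, which is the "win big" case in the text.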

Something I haven't seen is the above PREFETCH instructions implemented in
hardware. Call it an intelligent look-ahead cache, or an aux. CPU.
Predictive memory requests are made not only on the instruction stream,
but also on the data stream, a few instructions ahead of time. Is this
impractical because the aux. CPU has to be nearly as complicated as the CPU
itself, so you'd get better elapsed times from dual processors that spend a
lot of time waiting, rather than a single processor that hardly ever waits
on a cache miss? (The aux. CPU could be on the same chip as the CPU -- do
any available processors do data prefetch as well as instruction prefetch?)
In some cases the address to prefetch simply can't be computed soon enough
(LD i, LD base_addr+4*i), but usually there's enough play in the data
dependencies that instructions can be rescheduled to allow predictions to be
made in time (LD i, do something else useful while simultaneously computing
base_addr+4*i, prefetch base_addr+4*i while doing some other useful things,
do the array reference (either by recalculating base_addr+4*i or by grabbing
the already computed result from the aux. CPU)).

Since all we're interested in is reducing the percentage of cache misses, it
is by no means necessary to make the aux. CPU as intelligent as the primary
CPU; the aux. is permitted to give up on a complicated address calculation
and say, "I don't know," incurring only the penalty of a few wasted cycles
on the primary. Is this a loose enough constraint to make the aux. CPU
practical, or is the idea economically infeasible (the dollars for extra
compute power would be better spent on another processor)?


--John Hanley
System Programmer, Manhattan College
..!cmcl2.nyu.edu!manhat!jh or han...@nyu.edu (CMCL2<=>NYU.EDU)

Ralph Hyre

unread,
May 4, 1988, 11:43:44 AM5/4/88
to
In article <93...@apple.Apple.Com> bc...@apple.UUCP (Brian Case) writes:
>In article <4...@pcrat.UUCP> ri...@pcrat.UUCP (Rick Richardson) writes:
>>I'm still looking for the RISC that does ~4K (C language) Dhrystones,
>>has no cache, clocks around 4 Mhz, has a 16 bit bus, can address maybe 1MB,
>>is a power miser, can't do floating point, and costs no more than $15.
>
>Oh, that's easy! The Acorn RISC Machine (ARM). Yes, I know it has a
>32-bit bus now, but just talk to VTI (they have the ARM and use it as a
>cell, I think):
Hmm, 2 people from Apple are talking about ARM. Wonder if that means
anything....

OK, where would I get a similarly cheap machine to do ARM development
on? (It should cost less than a Sun-3/60, maybe about what a Mac II costs,
and maybe somewhere in between what a '386 clone and IBM PS2/80 costs.)
It would be nice to get away with 200-250ns DRAMS, for example.
Judging from the glowing reviews in Byte, you'd think that somebody would be
hot to import some of these nifty U.K. developments. But then, whatever
happened to the Torch XXX PC, which looked like a serious Amiga 2000/Mac II
competitor if ever I saw (a review of) one.
--
- Ralph W. Hyre, Jr.

Internet: ral...@ius2.cs.cmu.edu Phone:(412)268-{2847,3275} CMU-{BUGS,DARK}
Amateur Packet Radio: N3FGW@W2XO, or c/o W3VC, CMU Radio Club, Pittsburgh, PA

Steven McGeady

unread,
May 4, 1988, 1:39:14 PM5/4/88
to

In article <8...@imagine.PAWL.RPI.EDU> je...@pawl18.pawl.rpi.edu (Randell E. Jesup) writes:
>In article <4...@pcrat.UUCP> ri...@pcrat.UUCP (Rick Richardson) writes:
>>I'm still looking for the RISC that does ~4K (C language) Dhrystones,
>>has no cache, clocks around 4 Mhz, has a 16 bit bus, can address maybe 1MB,
>>is a power miser, can't do floating point, and costs no more than $15.
>
> Yeah, and what technology is this wonder-chip implemented in???
>Whatever it is, I can think of dozens of Si companies that would give away
>all their current facilites for that process. Oh, and I'm not even worrying
>about cost.
>
> Back to reality, it just can't be done, except MAYBE with a state of
>the art chip optimized to NOTHING but fast dhrystones (which, by the way,
>are a pretty poor predicter for most applications, due to string handling.)
>4 Mhz is REAL slow. A 4Mhz rpm-40 would be equivalent to maybe a 14Mhz
>68000 (note: not '020). At such slow speeds, CISC chips may well show
>superiority due to wanting to maximize the usefulness of every bus cycle.
>

Mr. Jesup is unnecessarily negative about the prospects of such a machine.
With the introduction of the 80960, Intel has made a commitment to building
a range of price/performance solutions for real-time and embedded computing
systems. The current implementation, the 80960KA, runs at 7-10 MIPS (12-15
Kdhry) at 20 MHz, addresses 2^26 bits of physical memory, has no
floating-point, has a 32-bit multiplexed bus, and costs (at introduction, in
quantity 100) $174.

Now, this is three times the performance and 10x the cost of Mr. Richardson's
request, but look more closely. One could:

a) run the chip at 16Mhz (or 10, for that matter);
b) put 1Mb of relatively inexpensive 2 or 3 wait-state memory on
the bus;
c) buy the chips in lots of, say, 1,000,000, and expect a healthy
discount (I believe this was stated in Mr. Richardson's original
article);

This would reduce the cost (and performance) of the overall system to the
area that Mr. Richardson is investigating.

That is, if one were really going to buy chips in large lots. You still have
slightly higher support costs because of a 32-bit bus, rather than 16, but
that's not such a big deal.

As with any piece of silicon, one can expect the price to drop as the part
matures, and as quantities rise.

Chip price has almost *nothing* directly to do with architecture
complexity. It has everything to do with die size, wafer yield, and
marketing. Once a particular implementation is released, the die size
does not change except by shrinks (which have a desirable side-effect
of making the chip run faster, thus drawing a higher price), and then
only by 10-20% at a time. Yield is affected by experience with a
process and by volume, also not by chip complexity (except as reflected
in size). Marketing is marketing. The rate at which recently
announced parts will become "commoditized" is left as an exercise to
the reader. I suggest looking at the price curves for the 8086 and 68000
as examples, taking into account single- versus multi-source issues.


S. McGeady
Intel Corp

Scott Edwards

unread,
May 4, 1988, 6:03:05 PM5/4/88
to
From article <15...@pt.cs.cmu.edu>, by sch...@FAS.RI.CMU.EDU (Donald Schmitz):

> In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>
>>Real-time control programs often have a situation where only
>>X microseconds are available to perform a task. .....
>
> This may be straying somewhat from the original point, but what sort of
> applications really have such exact timing deadlines? I have done a little
> real-time motion control, ....

I worked on a project a while back that implemented a motion control servo
loop with a microprocessor, and every time the uP didn't make the deadline
the loop would go unstable and lose all control.

It was fun to watch! We finally had to change the time period so that the
processor always completed its job on time, even though in other modes it
was idle 60% of the time.

-- Scott

t...@alice.uucp

unread,
May 4, 1988, 7:24:22 PM5/4/88
to
As far as the "new virtually indexed, physically tagged write-through
cache" in the Apollo goes, there ain't anything *new*. The name implies
they use the virtual address to address the cache and then compare
the tags to the result of the translation, which will have occurred
simultaneously (remember: the TLB is a cache as well and thus has about
the same access time).
Since nothing comes for free (usually), there is a penalty using this
scheme: the cache set size is limited by the number of address bits
available before translation (i.e. the bits which aren't translated).
This usually means the cache set size is limited to the size of a page.
This number is usually around 4Kbytes these days. If you want a larger
cache, you have to make it set associative, which is a pain in discrete
logic (multiplexing etc...).
I think the virtual-index/physical-tag caches will become very attractive
when integrated on a chip, since on a chip, set associative caches are
easier to build and are anyway (more or less) required for speed.

Thorsten von Eicken
AT&T Bell Laboratories
research!tve
t...@research.att.com

ag...@urbsdc.urbana.gould.com

unread,
May 5, 1988, 12:04:00 AM5/5/88
to

> Another problem, is simply the implementation of the reverse TLB -
> what happens when the reverse TLB misses? Which set of page tables
> should the TLB walk?

If we are still talking about virtual caches, then you don't need a reverse TLB.
What you do is have two sets of cache directories, one physical, one virtual.
Use the virtual directory on local access to the cache.
Use the physical directory when snooping on the bus.
The bus carries physical addresses, not virtual.

The virtual/physical synonym problem can be solved in several ways.
(1) arrange the cache so all virtual addresses for the same physical location
map to the same set. Requires a bit of work when replacing synonyms in the
same set, but is better than having to search the entire cache.
(2) still use a virtual to physical translation, but take it off the critical
path. I.e. access the cache using the virtual address, assuming that there are
no synonyms, but at the same time initiate the virtual to physical
translation. Later (on the next cycle if your TLB is fast enough, or much later
if the TLB misses - you may have to stall the processor, but TLB misses are rare),
when you've got the physical address, see if there was already such a location
in cache - back up and repair if stale data was read.
(3) Arrange with the OS so that condition (1) is satisfied, or even to maintain
the invariant that there are no synonyms, or that synonyms are used in
carefully controlled situations.

Brian Case

unread,
May 5, 1988, 2:11:07 PM5/5/88
to
In article <15...@pt.cs.cmu.edu> ral...@ius3.ius.cs.cmu.edu (Ralph Hyre) writes:
>Hmm, 2 people from Apple are talking about ARM. Wonder if that means
>anything....

If it could mean something, I wouldn't have posted about it!

>OK, where would I get a similarly cheap machine to do ARM development
>on? (It should cost less than a Sun-3/60, maybe about what a Mac II costs,
>and maybe somewhere in between what a '386 clone and IBM PS2/80 costs.)
>It would be nice to get away with 200-250ns DRAMS, for example.
>Judging from the glowing reviews in Byte, you'd think that somebody would be
>hot to import some of these nifty U.K developments.

ACORN makes a development system (or someone does); it is cheap compared to
other development systems. I don't think you can get away with 200ns DRAM;
besides, I don't think anyone makes DRAM that slow anymore (at least in
256K-and-up densities). The glowing reviews are justified, but the ARM still
has deficiencies that make it less than great for fully general-purpose
systems.

Hank Dietz

unread,
May 5, 1988, 4:16:48 PM5/5/88
to
In article <3...@mancol.UUCP>, j...@mancol.UUCP (John Hanley) writes:
> In article <80...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:
> >...why hasn't anyone talked about the fact that processors like SPARC are not
> >really designed for large-scale multiprocessing, e.g., they have no provision
> >for "hiding" BIG, stochastic, memory reference delays across a log n stage
> >interconnection network, etc.? I think it's pretty uninteresting to talk about
> >multi-processor systems which are small enough that "snooping caches" work....
>
> My favorite method of keeping the CPU busy while a memory-read request is
> traversing the network is the one used by the Denelcor HEP: context-switch
> on a cache miss.... [or alternatively....]

> LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B,
> PREFETCH C, LD A, LD B, LD C. If the time to execute the three non-blocking
> prefetch instructions is comparable to the network latency, you win big, since
> they execute during time that would have been spent idle anyway....

> Something I haven't seen is the above PREFETCH instructions implemented in
> hardware. Call it an intelligent look-ahead cache, or an aux. CPU.
> Predictive memory requests are made not only on the instruction stream,
> but also on the data stream, a few instructions ahead of time....

Burton Smith, of HEP fame, is sort-of doing both in his latest machine; so
are we (CARP -- the Compiler-oriented Architecture Research group at Purdue).

I believe Burton's machine microtasks, a la HEP, but he also has a method
whereby many memory references (or other slow operations) can be initiated
without waiting for earlier ones to complete. (I still don't know how much
of his design is in the public domain, so I can't say much more about it.)

The CARP machine doesn't microtask, but we do some very sneaky interrupt
enabling... the CARP machine processor has provision for multiple delayed
operations to be initiated without waiting for earlier ones to complete, and
interrupts are only enabled when the processor has to wait LONGER than the
compiler expected. The interrupt latency is high this way (perhaps dozens
of instructions between interrupt accept states), but this isn't such a bad
problem when you consider a multiprocessor machine where ANY processor could
service ANY interrupt. There are actually two separate delayed operation
mechanisms in the CARP machine: one for compile-time known delays and one
for delays where only the expected delay is known at compile-time. For some
operations, the expected-delay-based mechanism is late targeting; i.e., the
destination register in register address space is not specified until the
item has arrived, hence the usable register address space is not reduced by
having multiple items pending (selection of a staging register is implicit
in the type of the delayed operation).

We look at it this way: if you want to get high speedup by multiprocessing,
since not everything can be parallelized, we don't want to slow the
sequential parts by microtasking. The result is that we implement machine
use priorities by dynamically changing the parallelism-width dedicated to
each task, and we concentrate on other mechanisms for hiding delays...
preferably mechanisms which do not "use-up" parallelism that we could have
used to achieve speedup through parallel execution.
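The win from initiating requests early, as in the PREFETCH example quoted above, can be put in numbers with a toy latency model (all figures hypothetical, not CARP's or the HEP's): blocking loads serialize the latencies, while issuing every request before consuming any overlaps them.

```c
#include <assert.h>

/* Toy model: each memory request takes LAT cycles to return, and one
 * request can be issued per cycle. */
#define LAT 10u

unsigned blocking_cycles(unsigned nreq)
{
    return nreq * LAT;          /* LD A, <wait>, LD B, <wait>, ... */
}

unsigned overlapped_cycles(unsigned nreq)
{
    return (nreq - 1u) + LAT;   /* last request issued at cycle nreq-1,
                                   returns LAT cycles later */
}
```

For three loads with a 10-cycle latency this is 30 cycles blocking versus 12 overlapped; the gap is exactly the latency that delayed-operation mechanisms try to hide.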

-hankd

Jeff Sewall

unread,
May 6, 1988, 2:36:46 PM5/6/88
to
In article <63900014@convex> gru...@convex.UUCP writes:
>
>>/* Written 5:51 pm May 2, 1988 by je...@alliant.Sun.Com
>> ...In order to
>> properly snoop, all of the aliases must be checked for in the cache.
>
>Nah, it ain't that hard. One can have a virtually mapped, physically tagged
>cache. As the virtual cache is filled with data, the translated physical
>address from the MMU is written into the tag RAMs. The physical tags are
>then used by the bus watcher to selectively invalidate the cache entries.
>
>This is in fact how cache coherency is maintained in the Convex C-2 machines.
>Each of the CPUs has a set of remote invalidation tag memories which watches
>all of the other physical address buses (3 other CPUs and one I/O). When
>there is a hit, the validity bits are cleared for that particular entry in
>the virtually mapped cache.
>
>Jeff Gruger

This only works for small caches. Once the size of each set of the cache
exceeds your page size, the location of an entry in a physically mapped
"remote invalidation tag memory" will be different than the location of
that entry in the virtually mapped cache. This is because the page index
alone is no longer sufficient to address the cache.

This problem can be solved by passing part of the virtual page address
along with the physical address on memory requests. Then the remote tag
can be addressed with the virtual address and guarantee mapping to the
same location. But this works only if synonyms are not allowed. BTW, this
is the approach taken in the SPUR architecture.

I think that the original poster's question is still valid. Is there a
good solution to snooping a virtually mapped cache with the following
constraints:
(1) Synonyms are allowed
(2) The cache is large enough that the size of a set exceeds the page size
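The mismatch in constraint (2) can be made concrete with hypothetical sizes: with 4 KB pages and a 16 KB direct-mapped cache of 32-byte lines, the cache index needs 9 bits but the page offset (the only bits identical in the virtual and physical address) supplies just 7 of them, so a purely physical snooper has 2 index bits it cannot know.

```c
#include <assert.h>

/* integer log2 for powers of two */
unsigned log2u(unsigned n)
{
    unsigned b = 0;
    while (n > 1u) { n >>= 1; b++; }
    return b;
}

/* How many cache-index bits lie above the page offset and therefore
 * differ between the virtual and physical address.  Sizes are inputs;
 * the 16K/4K/32 numbers used below are hypothetical. */
unsigned untranslated_index_bits(unsigned cache, unsigned page, unsigned line)
{
    return log2u(cache / line) - log2u(page / line);
}
```

When the result is zero (cache set no larger than a page), the page index alone suffices and the remote-tag scheme above works unmodified.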

David Collier-Brown

unread,
May 6, 1988, 2:47:12 PM5/6/88
to
In article <3...@mancol.UUCP> j...@mancol.UUCP (John Hanley) writes:
>Something I haven't seen is the above PREFETCH instructions implemented in
>hardware. Call it an intelligent look-ahead cache, or an aux. CPU.
>Predictive memory requests are made not only on the instruction stream,
>but also on the data stream, a few instructions ahead of time.

Well, not on a RISC machine... It is logically similar to (perhaps
identical to?) a pipeline on a CISC. My old 'bun used to look
"forward" in the instruction stream with wild abandon, because it
took so long to decode the instructions (:-)).

--dave (data, now, is quite a different matter) c-b
--
David Collier-Brown. {mnetor yunexus utgpu}!geac!daveb
Geac Computers International Inc., | Computer Science loses its
350 Steelcase Road,Markham, Ontario, | memory (if not its mind)
CANADA, L3R 1B3 (416) 475-0525 x3279 | every 6 months.

Joe Petolino

unread,
May 6, 1988, 8:04:57 PM5/6/88
to
>>> ...In order to
>>> properly snoop, all of the aliases must be checked for in the cache.
>>
>>Nah, it ain't that hard. One can have a virtually mapped, physically tagged
>>cache. As the virtual cache is filled with data, the translated physical
>>address from the MMU is written into the tag RAMs. The physical tags are
>>then used by the bus watcher to selectively invalidate the cache entries.
>
>This only works for small caches. Once the size of each set of the cache
>exceeds your page size, the location of an entry in a physically mapped
>"remote invalidation tag memory" will be different than the location of
>that entry in the virtually mapped cache. This is because the page index
>alone is no longer sufficient to address the cache.
>
>This problem can be solved by passing part of the virtual page address
>along with the physical address on memory requests. Then the remote tag
>can be addressed with the virtual address and guarantee mapping to the
>same location. But this works only if synonyms are not allowed.
>
> . . . Is there a

>good solution to snooping a virtually mapped cache with the following
>constraints:
>(1) Synonyms are allowed
>(2) The cache is large enough that the size of a set exceeds the page size
^^^
(you don't really mean 'set' here, do you :-) )

The problem of finding virtual aliases in a virtually-addressed cache is not
unique to multiprocessor systems. A cache deeper than the page size can
harbor more than one copy of the same line if unrestricted aliases are allowed.
That can cause problems even with a single processor. I've seen a few
solutions to this problem:

1) Use a high enough degree of cache associativity so that you get the desired
cache size with a one-page-deep cache. This was used on the Amdahl 470/V7
(eight-way set associative!). This violates constraint 2 (above).

2) Restrict virtual addresses so that no two aliases map into different cache
sets ('set' used differently here than above). This is used by Sun systems.
It technically violates constraint 1 (but not by much).

3) When doing the snooping, sequentially search all the possible places
where that data might be (this is done while waiting for memory to
respond). This is used in the Amdahl 5890, which has to look in four
2-element instruction-cache sets and eight 2-element data-cache sets on
each processor.

My first two examples are from single-processor machines.
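Solution 3 can be sketched as a snoop that enumerates every set a physical line could occupy under any alias. The sizes below are hypothetical, not the 5890's: 32-byte lines, 7 index bits supplied by the page offset, and 2 untranslated index bits, giving four probes per snoop.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BITS         5u   /* 32-byte lines (hypothetical) */
#define PAGE_INDEX_BITS   7u   /* index bits covered by the page offset */
#define UNTRANSLATED_BITS 2u   /* index bits above the page offset */

/* Fill out[] with every set index a line at physical address paddr
 * could occupy; returns the number of probes the snoop must make. */
unsigned candidate_sets(uint32_t paddr, unsigned out[])
{
    unsigned low = (paddr >> LINE_BITS) & ((1u << PAGE_INDEX_BITS) - 1u);
    unsigned n;
    for (n = 0; n < (1u << UNTRANSLATED_BITS); n++)
        out[n] = low | (n << PAGE_INDEX_BITS);  /* one probe per alias */
    return n;
}
```

The sequential search costs snoop bandwidth but, as noted, can proceed while waiting for memory to respond.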

-Joe

Chris Torek

unread,
May 7, 1988, 3:53:03 AM5/7/88
to
[Virtual caches vs bus snooping with physical addresses]

Another possibility---if perhaps somewhat unusual and maybe quite
difficult---would be to be able to spot invalid cache accesses and
fault the original instruction before it is allowed to complete.
That is, allow the instruction to execute using the cached data,
and if the cached data is stale, zap it before it is too late.

I imagine this would get nightmarish when debugging the hardware.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: ch...@mimsy.umd.edu Path: uunet!mimsy!chris

Peter da Silva

unread,
May 7, 1988, 1:36:48 PM5/7/88
to
In article <15...@pt.cs.cmu.edu>, sch...@FAS.RI.CMU.EDU.UUCP writes:
> In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman)
talks about hard realtime when he writes:

> >Real-time control programs often have a situation where only
> >X microseconds are available to perform a task.

> This may be straying somewhat from the original point, but what sort of
> applications really have such exact timing deadlines?

How about jet engine control systems in fighters? Or the software that
lands the space shuttle?
--
-- Peter da Silva `-_-' ...!hoptoad!academ!uhnix1!sugar!peter
-- "Have you hugged your U wolf today?" ...!bellcore!tness1!sugar!peter
-- Disclaimer: These aren't mere opinions, these are *values*.

Jack Bonn

unread,
May 8, 1988, 9:38:06 AM5/8/88
to
From article <15...@pt.cs.cmu.edu>, by sch...@FAS.RI.CMU.EDU (Donald Schmitz):
> In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>
>>Real-time control programs often have a situation where only
>>X microseconds are available to perform a task. .....

>
> This may be straying somewhat from the original point, but what sort of
> applications really have such exact timing deadlines? I have done a little
> real-time motion control, ....

The worst system for real time deadlines I ever worked on was one
that implemented the control functions for a bottle making machine.
This wasn't a bottler; it took molten glass and formed it into bottles.

We had a 2.5 MHz Z-80 and a periodic interrupt whose period was 1 msec.
Doesn't leave much time for background processing.

The worst case was if an output to the scoop was delayed. Rather than
catching the molten gob of glass in flight, it would fling it across the
plant floor. If it hit anyone, it would stick to their skin and most likely
result in an amputation.

Since I had previously worked on central office software, this gave me
a much clearer view of real time. I used to worry about what would
happen if a dial tone or compelled signaling tone was delayed. Ah, the
good old days.

-Jack
--
Jack Bonn, <> Software Labs, Ltd, Box 451, Easton CT 06612
uunet!swlabs!jack

Ed Nather

unread,
May 9, 1988, 11:06:51 AM5/9/88
to
In article <8...@swlabs.UUCP>, ja...@swlabs.UUCP (Jack Bonn) writes:
> From article <15...@pt.cs.cmu.edu>, by sch...@FAS.RI.CMU.EDU (Donald Schmitz):
> > In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
> >
> > This may be straying somewhat from the original point, but what sort of
> > applications really have such exact timing deadlines?
>
> We had a 2.5 MHz Z-80 and a periodic interrupt whose period was 1 msec.
> Doesn't leave much time for background processing.
>

Our data acquisition system for time-series analysis of variable stars also had
1 msec interrupts, imposed on a Nova minicomputer, ca. 5 usec add time reg to
reg. If your interrupt routine chews up 100 usec, you still have 90% of the
CPU left to do "background" processing (I always thought of it as "foreground,"
because it's what the user sees -- keyboard response, display, etc.) That
meant keeping the interrupt routine short in the worst case, and allowing ONLY
the timing interrupt -- all other I/O was polled or DMA. That allowed us to
specify the worst case condition -- when everything was active all at once --
and verify we'd never lose an interrupt. It was a disaster if we did: we'd get
data that looked fine but was actually wrong. Not as dramatic as slinging
molten glass at someone, of course, but still awful.

I suspect time-critical software design will become more and more common as
computers get faster, just because you can consider software control where
only hardware was fast enough before.
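The budget arithmetic described above generalizes to a simple worst-case check (figures below hypothetical):

```c
#include <assert.h>

/* A periodic interrupt every period_us whose handler takes at most
 * worst_isr_us leaves a guaranteed fraction of the CPU for foreground
 * work -- guaranteed because it is the worst case, not the average,
 * that decides whether an interrupt can ever be lost. */
unsigned foreground_percent(unsigned period_us, unsigned worst_isr_us)
{
    return 100u - (100u * worst_isr_us) / period_us;
}
```

A 100 us worst-case handler on a 1 ms tick guarantees 90% for the foreground, matching the figure above.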


--
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nat...@astro.AS.UTEXAS.EDU

mcdo...@uxe.cso.uiuc.edu

unread,
May 9, 1988, 3:19:00 PM5/9/88
to

>In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>>Real-time control programs often have a situation where only
>>X microseconds are available to perform a task. Therefore,
>>the code to perform the task must be GUARANTEED to complete
>>within X microseconds. In real-time control, a late answer
>>is a wrong answer.

>
>This may be straying somewhat from the original point, but what sort of
>applications really have such exact timing deadlines?...
>[I]n general a few percent deviation in cycle
>time has next to no effect. As long as the deviation is small and well
>distributed, i.e. delays of no more than 20% and occurring less than 10
>sample periods in a row, I can't imagine a mechanical system reacting to the
>error.

Sometimes microseconds can matter. Our most complicated real-time
system runs a scanning interferometer and a laser. The interferometer
is a mechanical plunger riding in a sleeve 0.00025 inch larger in
diameter than the moving part, at a temperature of -196 Celsius, on
a cushion of pressurized helium. The "wiggle" tolerance on the motion
is +- 0.000005 inch. This can only be achieved if the motion is smooth;
this part is taken care of by servo hardware. This hardware detects
the position of the mirror mounted on the plunger by counting
interference fringes of a laser. It sends signals to the computer every
100 microseconds. The computer converts several error signals from
the hardware and decides if they are within tolerance. If not, it skips
a data point. If they are OK it starts the complicated process of
firing the various parts of the laser so that the sixth anticipated
trigger signal will occur just at the time the laser is really ready to
go; the actual firing is by hardware. The computer again checks to see
if the collected data is OK or garbage. Then it can start over again.
The computer also checks on the "quality" of the servo loop inputs;
if they get weak the moving parts have been known to self-destruct
($5000) - there are hardware "stops" to prevent destruction, but
using them ruins the alignment and we have to warm up to room
temperature to fix it, a three day process. We are using a PDP-11/73,
with ALL interrupts disabled. The program was written in assembler,
checking the timing of every instruction -- we can see by its outputs
on a scope how much time we have to spare, and of course there are
variations due to the cache hit/not hit probability, but we know
FOR SURE that it won't overrun, as we give it 25% to spare, in the worst
case. The code was an absolute nightmare to write, but it is actually
rather simple, in fact only about 3000 lines.
I would consider this to be "real-time".
Doug McDonald

John Bartlett

unread,
May 10, 1988, 10:32:22 PM5/10/88
to
In article <63900014@convex> gru...@convex.UUCP writes:
>
>>/* Written 5:51 pm May 2, 1988 by je...@alliant.Sun.Com
>> ... It is a much more
>> complicated procedure to "watch" these addresses. With a virtual
>
>Nah, it ain't that hard. One can have a virtually mapped, physically tagged
>cache. As the virtual cache is filled with data, the translated physical
>address from the MMU is written into the tag RAMs. The physical tags are
>then used by the bus watcher to selectively invalidate the cache entries.


This sounds easy, but every time I have analyzed this I have come to the
conclusion that one of those tag stores has to be fully associative, to ensure
that the two tag stores will always have the same addresses allocated. Am I
missing something here?

In our systems, we can't afford the real estate for a fully associative tag
store for each processor cache.


John Bartlett {ihnp4,decvax,allegra,linus}!encore!bartlett
Encore Computer Corp.
257 Ceder Hill Street
Marlboro, Mass. 01752
(617) 460-0500

Opinions are not necessarily those of Encore Computer Corp.

gru...@convex.uucp

unread,
May 11, 1988, 3:04:00 PM5/11/88
to

>/* Written 9:32 pm May 10, 1988 by bart...@encore.Sun.COM in convex:comp.arch */

>
>This sounds easy, but every time I have analyzed this I have come to the
>conclusion that one of those tag stores has to be fully associative, to ensure
>that the two tag stores will always have the same addresses allocated. Am I
>missing something here?
>
>In our systems, we can't afford the real estate for a fully associative tag
>store for each processor cache.
>

Maybe there is some semantics problem here in the communication...
Which "two tag stores" are you referring to? By "fully associative"
I take it you mean a true content addressable memory for the entire tag
memory?? I believe you only need to have high associativity when
searching through multiple sets of your data cache.

Our cache structures consist of:
a) a virtually addressed data RAM
b) a virtually addressed validity RAM
c) a physically addressed tag RAM
The tag ram is only written as read data returns to the cache. The
tag ram is read only as remote processor writes occur, and if there
is a hit, the validity bits are cleared.

All these RAMs certainly take up a lot of space (although we manage
to pull a lot inside gate arrays). We also reached the conclusion
that we could not afford the real estate of multiple cache sets and
the increased complexity/cost/low-pay-back.

Prior responses have discussed the increasing complexity of the tag RAM
as your data cache gets deeper - you have to search more than just
one tag as your cache size increases beyond the page size. We have found
it quite effective in a _vector_processing_ machine to have a fairly
small cache equal to page size for scalar operands only. We bypass vector
operands around the cache and invalidate entries if a vector load/store
encounters one. There is NO performance improvement in running
vector data through a cache if you have enough basic bandwidth (which
not all parallel-vector machines have). Large caches best serve plain
old scalar machines that have to stuff entire data arrays into cache
in order to achieve performance.

Jeff Gruger

Philip Kos

unread,
May 12, 1988, 4:05:46 PM5/12/88
to
>In article <15...@pt.cs.cmu.edu> koo...@A.GP.CS.CMU.EDU (Philip Koopman) writes:
>This may be straying somewhat from the original point, but what sort of
>applications really have such exact timing deadlines?...

I worked on some real-time data acquisition applications at the University
of Illinois between 1980 and 1984, and if my program wasn't ready to read
that data word and put it someplace appropriate when it was ready to be
read (affectionaly known as "overrun"), we had to throw out the whole trial
and do it over again. Some of the experiments I assisted were simple
enough, but most were not easily reproducible (particularly the ones
dealing with muscle fatigue) and I never again want to suffer the wrath of
a grad student facing a grant or thesis deadline. Like the original
article said, if it's late, it might as well be wrong.

Phil Kos
Information Systems
...!uunet!pyrdc!osiris!phil The Johns Hopkins Hospital
Baltimore, MD

Mark Smotherman

unread,
May 13, 1988, 4:41:33 PM5/13/88
to

What type of work has been done on benchmarks for real-time systems?
The applications seem so specialized as to make most comparisons into
apples versus oranges. Are there any standard, "representative" tasks
that could be used to indicate the relative merit of a machine/OS?
In evaluating a machine, do you rely mainly on interrupt latency measures,
or on what?

Please email responses and I will post a summary. Thanks.

--
Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: ma...@hubcap.clemson.edu UUCP: gatech!hubcap!mark

Lars Aronsson

unread,
May 15, 1988, 4:24:18 PM5/15/88
to
>>>the code to perform the task must be GUARANTEED to complete
>>>within X microseconds. In real-time control, a late answer
>>>is a wrong answer.
>>
>>This may be straying somewhat from the original point, but what sort of
>>applications really have such exact timing deadlines?...
>
>Sometimes microseconds can matter. Our most complicated real-time
>system runs a scanning interferometer and a laser. The interferometer

Enough! Obviously, real-time applications do exist. No more
interferometers in this news group, please.

A few years ago, there was a discussion on why you wouldn't use UNIX
for real-time applications. This was because of the virtual memory
system. Today, we have UNIX clones which allow you to lock a process
in main memory, just like the UNIX kernel. Since a virtual memory
system is but a cache mechanism for the disk, the following thoughts
come naturally to me:

Before I start: This might turn out to be Today's Dumb Suggestion.
Maybe my ideas are already implemented on lots of systems or totally
useless. Please, let me know!

As far as I know, RISC instruction caches are a gain only when the
processor runs through loops. What about the ability to declare
cache-resident functions (procedures/subroutines)? This might not be
the solution to real-time applications, but seems potentially useful
in many other cases.

Things normally managed by super-CISC instructions (decimal
arithmetics, string instructions and the like) in such machines, would
then be done with neat library functions declared as "register". The
CISC equivalent to this would be to allow users to define new machine
instructions at run-time.

Of course, you would have to decide on what to do on a context switch.
Maybe the register functions should belong to a shared library and
be more or less permanently in the cache.

Perhaps, this kind of register functions would make the RISC vs CISC
debate fade a little.
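For what it is worth, the "register function" idea has a rough present-day sketch: give the routine its own link section and let the linker place that section in the fast memory. The `__attribute__((section(...)))` syntax below is a GCC-style assumption and the section name `.fast_text` is invented; a linker script would still have to locate that section in the fast (or cache-locked) RAM, and nothing here locks the cache itself.

```c
#include <assert.h>

/* Toy "decimal arithmetic" helper pinned into its own section; the
 * placement of .fast_text in fast memory is the linker's job and is
 * assumed, not shown. */
__attribute__((section(".fast_text")))
unsigned dec_digits(unsigned n)   /* number of decimal digits in n */
{
    unsigned d = 1;
    while (n >= 10u) { n /= 10u; d++; }
    return d;
}
```

This gives the compile-time half of the suggestion; the context-switch question (whether the section is per-process or shared) remains exactly as posed above.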

Bill O

unread,
May 17, 1988, 10:28:29 PM5/17/88
to
In article <19...@sics.se> aron...@sics.se (Lars Aronsson) writes:

>Before I start: This might turn out to be Today's Dumb Suggestion.
>Maybe my ideas are already implemented on lots of systems or totally
>useless. Please, let me know!

Yes, I think they have been to a certain extent. More in a bit...

>
>As far as I know, RISC instruction caches are a gain only when the
>processor runs through loops. What about the ability to declare
>cache-resident functions (procedures/subroutines)? This might not be
>the solution to real-time applications, but seems potentially useful
>in many other cases.
>
>Things normally managed by super-CISC instructions (decimal
>arithmetics, string instructions and the like) in such machines, would
>then be done with neat library functions declared as "register". The
>CISC equivalent to this would be to allow users to define new machine
>instructions at run-time.
>
>Of course, you would have to decide on what to do on a context switch.
>Maybe the register functions should belong to a shared library and
>be more or less permanently in the cache.

Actually, there is no need to use *associative* cache for this
purpose, because the "associative" part is really just a mechanism to
enable the computer to keep in fast memory a portion of the code which
it predicts will be referenced in the near future (the prediction is
usually based on past use). For functions declared as being "fast"
or, as suggested, "register", all you really need is good old
fashioned fast memory.

What follows are excerpts from a couple of recent (past few months)
postings relating to the way this sort of thing was done on the pdp 10
and 11 (the second excerpt gives new meaning to the declaration
"register")

[Dean W. Anneser, Pratt & Whitney Aircraft]
-We have 7 of these beasties [pdp-11/55], and they're still running
-strong. The memory configuration is 0-32kw bipolar, and 32-124kw MOS.
-We keep the time- critical code in the bipolar. DEC has never
-produced a faster PDP-11. We have benchmarked and are currently using
-the 11/73, 11/83, and 11/84, and the 11/55 will still run circles
-around them...

[Brian Utterback, Cray Research Inc.]
-Another advantage the PDP-10 had by mapping the registers to the
-memory space, other than indexing, was in execution. You could load a
-short loop into the registers and jump to them! The loop would run
-much faster, executing out of the registers.

Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
(bi...@cmx.npac.syr.edu)

John Bartlett

unread,
May 22, 1988, 10:02:44 PM5/22/88
to
In article <63900015@convex> gru...@convex.UUCP writes:
>Which "two tag stores" are you referring to?

In our system we have two separate tag stores, one for processor access
to the cache, one for bus invalidation. This is because the bus and
the processors run on different clocks.

>By "fully associative"
>I take it you mean a true content addressable memory for the entire tag
>memory?? I believe you only need to have high associativity when
>searching through multiple sets of your data cache.
>
>Our cache structures consist of:
> a) a virtually addressed data RAM
> b) a virtually addressed validity RAM
> c) a physically addressed tag RAM
>The tag ram is only written as read data returns to the cache. The
>tag ram is read only as remote processor writes occur, and if there
>is a hit, the validity bits are cleared.
>

You must have some trick for ensuring complete mapping between your virtual
index and your physical index. In order to ensure complete mapping, one
of them has to be fully associative (yes, CAM), does it not? If you limit
the combinations in some way to prevent this problem, don't you take a hit
on hit rate? (oh, sorry 'bout the bad pun)

gru...@convex.uucp

unread,
May 25, 1988, 10:12:00 AM5/25/88
to

>/* Written 9:02 pm May 22, 1988 by bart...@encore.Sun.COM
>You must have some trick for ensuring complete mapping between your virtual
>index and your physical index. In order to ensure complete mapping, one
>of them has to be fully associative (yes, CAM), does it not? If you limit
>the combinations in some way to prevent this problem, don't you take a hit
>on hit rate? (oh, sorry 'bout the bad pun)
>
>
>John Bartlett {ihnp4,decvax,allegra,linus}!encore!bartlett
>Encore Computer Corp.

It's not a very fancy trick. Physical index = virtual index. Yes, the
cache has to be smaller than you might like it to maximize hit rate.
However, when you work with 5nsec ECL RAMs it's difficult (impossible
today) to find any larger than 4K bits. The raw cycle time this permits
more than compensates for a small sacrifice of hit rate.
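The trick holds because when the cache is no larger than a page, every index bit lies inside the page offset, which translation passes through untouched. A sketch with hypothetical sizes (4 KB page, 4 KB direct-mapped cache):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE  4096u
#define CACHE_SIZE 4096u   /* cache size <= page size is the invariant */

unsigned virt_index(uint32_t vaddr)
{
    return vaddr % CACHE_SIZE;
}

unsigned phys_index(uint32_t vaddr, uint32_t phys_page)
{
    /* translation replaces the page number but keeps the offset */
    uint32_t paddr = phys_page * PAGE_SIZE + (vaddr % PAGE_SIZE);
    return paddr % CACHE_SIZE;
}
```

Whatever physical page the MMU picks, the two indexes agree, so the processor-side and bus-side tag stores can be plain RAMs addressed identically, with no CAM needed.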

Jeff Gruger
Convex Computer Corp. {ihnp4!convex!gruger}

Alan Beal

unread,
Jun 15, 1988, 9:25:16 AM6/15/88
to
In article <31...@polyslo.UUCP>, doro...@polyslo.UUCP (David O'Rourke) writes:
> At last count system 3.7 was somewhere in the neighborhood of 700-800
> thousand lines, the only reason Unisys was forced to implement libraries is
> because the MCP was getting so big the compiler couldn't treat it as one
> single program anymore.

I can't believe I am about to defend Unisys, but I would say that the
number of lines of code in the MCP has nothing to do with the implementation
of libraries. The DMSII access routines were one of the first versions of
libraries even though it wasn't called a library and DMSII has been around
since the 70's. I can not speak of Unisys's intent, but libraries offer the
modularity desired in large complex systems. Take COMS for example, the
majority of COMS code is implemented as libraries, as well as BNA and the
new print subsystem. Libraries eliminate the need for binding all those
code files together - I would call this a software engineering enhancement not
a solution to the number of lines in the MCP. If you take a good look at
libraries, aren't they implemented in a manner similar to those in OS/2? My
only complaint is that Unisys has not put a lot of effort in developing new
products using the newest features in the MCP, ie. libraries and port files.

> And trying to keep up with all of the different versions of Algol that
> Unisys has: Newp, DC-Algol, etc. No wonder no software gets written when
> ever you want to change something you have to work across at least three
> different versions of the same language, several different versions of the
> MCP, and you have that wonderful editor with which to look at all of this
> code.

Here I go again defending Unisys. )-: I am not sure that you understand
how the different versions of Algol are used. Algol, DCalgol, DMalgol, and
BDMSalgol are all compiled from a single source - symbol/algol. Here are
their capabilities:

1) Algol - normal application development capabilities
2) BDMSalgol - Algol plus DMSII capabilities
3) DCalgol - Algol plus data communications and system programming
capabilities. No DMSII capabilities.
4) DMalgol - DCalgol plus DMS accessing and development capabilities

Which version of algol is used is determined by the type of programming
involved and I never seem to get confused on which version to use.

Where are all these versions of the MCP? As far as I know we can only
purchase one version for the B7800. Again you are confusing the fact that
each machine series has a tailored MCP for that particular machine in order
to handle the way memory is managed and other gory details, but basically
the functionality is the same between MCPs.

> Even Unisys is moving towards knowing Machine code, they have a new
> piece of software called DumpAnalyser, they seemed to feel the need to spend
> three weeks teaching me how to use it {they do this for all new employees}.
> And if you think what it puts out is Algol code you are sadly mistaken, it's
> basically for "reading" the stack of a program, and if that's not machine code
> I don't know what is, they still don't have an assembler, but they're now
> allowing the programmers to "look" at the code produced by the compiler.

How many application programmers out there have used Dump Analyzer? How
many know what it is? Dump Analyzer is a tool for systems programmers to
analyze memory dumps, i.e. what programs were in the mix at the time of
the dump and where did they bomb off. Our staff spends a lot of time using
this tool and at times have looked at the machine code of the offending
program, but for the most part the machine code offers little insight into
the cause of the problem and usually the problem is passed on to Unisys to
solve. I would agree that knowledge of the stack architecture is very
helpful, especially in debugging programs. However, most people can debug
their programs without ever looking at machine code. Finally, the stack
is not machine code but an internal data structure for storing variables,
pointers to data, and recording the environment of the program. It would
be a mistake to say Unisys is moving towards machine code and assemblers
since the move from within is to get the Sperry side out of that mode of
operation.


Software is the name of the game for most companies now. IBM realizes this.
Unisys does not. The Burroughs side of Unisys has concentrated on further
enhancing its current software and has not made great efforts at developing
new products. For example, LINC was developed by a company in New Zealand
and was purchased by Burroughs as a 4GL. How many people like to use LINC?
It has nice syntax like 'MOVE ; FIELD1 FIELD2'. Would you call this an
end-user or programming language? Then there was GATEWAY developed by
Joseph and Cogan which was competing with COMS. Burroughs bought J & C and
now we have COMS. What happened to GATEWAY? And now we have SIM, a semantic
database system sitting on top of DMSII. Does it offer SQL or provide access
to other DBMS systems like DB2? No, of course it doesn't. And how about
BNA - it is a nice way to connect Burroughs machines together, but can I connect
our UNIX machine to it? Again, no. And speaking of BNA, it would be an
excellent vehicle around which to develop a distributed DBMS. Are there
any plans to do this in the future? You know the answer.

--
Alan Beal DLSC-ZBC Autovon 932-4160
Defense Logistics Services Center Commercial (616)961-4160
Battle Creek, MI 49015 FTS 552-4160
UUCP: {uunet!gould,cbosgd!osu-cis}!dsacg1!dlscg2!abeal

David O'Rourke

Jun 18, 1988, 5:27:21 PM6/18/88
to
In article <3...@dlscg1.UUCP> dlsc...@dlscg1.UUCP (Alan Beal) writes:
> I can't believe I am about to defend Unisys, but I would say that the
>number of lines of code in the MCP has nothing to do with the implementation
>of libraries.

Have you ever worked for Unisys? No, I don't think so. And yes, the current
implementation of libraries came about as a request from the MCP group {which
I worked with at the Mission Viejo/Lake Forest plant} to the compiler group
to implement libs. because the MCP was getting too large. Yes, Unisys has
always had libs., but not until recently were they used to a great extent
in internal/external production code; no one trusted them and would put the
code in-line rather than porting it out to a lib, and most of the time the
offending programmer would claim performance. Unisys's library implementation
is quite elegant and doesn't impose a performance hit after the 1st use.
Many programmers who have been with Unisys for MANY years don't see the
benefit of using libs., and you have to pull teeth to get them to use any of
the standard routines that are provided if they can write them themselves.
One programmer designated to teach me how to program the A-Series told me not to
make MCP calls because of the high overhead; he said to always do the simple
stuff so that your program runs faster. Well, if making MCP calls has such
a high overhead and most programmers don't call them, then what's the point
of having the MCP allow external calls? Many programmers were so worried about
the performance of their code that they would forsake compatibility {i.e.
doing it themselves rather than using the MCP, which would allow future compatibility}.
And if you think this is an isolated attitude, you are wrong!

>modularity desired in large complex systems. Take COMS for example, the
>majority of COMS code is implemented as libraries, as well as BNA and the
>new print subsystem. Libraries eliminate the need for binding all those
>code files together - I would call this a software engineering enhancement not
>a solution to the number of lines in the MCP.

Yes, but a great majority of programmers at Unisys don't see the benefit of
this enhancement. And as for COMS being modular, that's one of the reasons
libraries came about. Because COMS got to be sooooo large they couldn't fit
it in one program, so they too requested that libs. be implemented. And as far
as COMS being good code, well, you can chuck that idea: the slightest change in
the MCP normally breaks COMS. If you want to test your MCP patch, test it with
COMS and see if it still works. Why is this, you say? Well, because some
programmer did something himself rather than going thru the standard library
call. Yes, libs are implemented, but they are used as ways to break code into
smaller chunks for the compiler. They are rarely used to provide a standard
interface or provide future compatibility. Most programmers at Unisys that
I met simply didn't bother to make calls to other libraries if they could write
the code themselves. This causes lots of compatibility problems for future
upgrades and nullifies one of the benefits of libs.
And COMS is an interesting beast in itself. I spent 4 months going over that
code and talking to anyone I could find regarding information on COMS. The
mission plant has over 400 programmers working there, and I talked to at least
100 of them, asking a simple question: what does COMS do, and what does the
MCP do? No one to this day has answered that question with a straightforward
answer. MCP and COMS are so intertwined that no one can tell them apart. There
are many places in COMS that have subroutines identical to ones in the MCP,
because the original programmer didn't bother to look and see if the MCP did
it already; what this means is that when that part of the MCP is changed,
someone else has to go in and change the same routine in COMS. In fact, a
special flag in patch manager was implemented to flag changes to either of
the pieces of code and notify the appropriate department that they need to
change their code. Yep, Unisys has libraries alright; now could someone teach
them how to use them.

>If you take a good look at
>libraries, aren't they implemented in a manner similar to those in OS/2? My

Are you comparing the A-Series to OS/2? Please, the A-Series deserves better
than that. OS/2: 1/2 an operating system :-)

>only complaint is that Unisys has not put a lot of effort in developing new
>products using the newest features in the MCP, ie. libraries and port files.

Have you ever watched a system using port files? Not real fast! Many of
the newest features of the MCP aren't understood by the vast majority of
the programmers at Unisys, hence you don't get to see a lot of software that
uses them.

>> And trying to keep up with all of the different versions of Algol that
>> Unisys has: Newp, DC-Algol, etc. No wonder no software gets written;
>> whenever you want to change something you have to work across at least three
>> different versions of the same language, several different versions of the
>> MCP, and you have that wonderful editor with which to look at all of this
>> code.
>
> Here I go again defending Unisys. )-: I am not sure that you understand
>how the different versions of Algol are used. Algol, DCalgol, DMalgol, and
>BDMSalgol are all compiled from a single source - symbol/algol. Here are
>their capabilities:

I am quite aware of the different versions of Algol. If they are so
similar, then why does Unisys have a 600-1000 {or more} page manual describing
just the features of that language for each version? Each manual for each of
the Algols refers to the standard Algol 60 manual, which is the lowest common
denominator at Unisys, and then goes on to talk about the "differences" between
the "standard" and "this version" of Algol. These manuals are LARGE, technical,
and very scant in their descriptions of the various functions. They are
not the same, and many have major differences and feature-specific syntax
that isn't available in the other Algols. You should spend 2 or 3 weeks
going between Algol ---> NeWP ----> DCAlgol and back and forth to find an
obscure bug in one of the routines in the MCP. This code wasn't standard
Algol, and each part typically used the specialized features of that particular
language. Go off and do that, and then come back and tell me that all of these
languages are the same and that if you know one you know them all. Yeah, that's
true for the simple stuff, but not for the sort of stuff that Unisys typically
writes.

> Where are all these versions of the MCP? As far as I know we can only
>purchase one version for the B7800. Again you are confusing the fact that

If you will read the available postings you will find that I worked for the
A-Series group of people. They have released MCP 3.7 for the entire A-Series
line of computers. It is an upgrade from 3.6; it is not machine specific except
that it has to be run on an A-Series. When I left to finish school they were
already coding MCP 3.8 and planning MCP 3.9. Again, I'm confusing nothing!
You just don't understand.

>be a mistake to say Unisys is moving towards machine code and assemblers
>since the move from within is to get the Sperry side out of that mode of
>operation.

How would you know what the move from within is? And the version of
DumpAnalyser that I used did indeed allow you to look at both the stack and
the machine code that the different descriptors in the stack pointed to.

>BNA - it is a nice way to connect Burroughs machines together but can I connect
>our UNIX machine to it? Again, no. And speaking of BNA, it would be an
>excellent vehicle in which to develop a distributed DBMS around. Are there
>any plans to do this in the future? You know the answer.

Ahh, but when I asked my manager when we were going to implement a distributed
filing system, he said: "We already have a distributed file system." Well, in
fact they don't. The people at Unisys will tell you that they already have
a distributed DBMS, and they think they do, when in fact they are sadly
mistaken. The whole purpose of my original article was to indicate that
although the E-Mode architecture is quite nice, the software that Unisys is
running on top of it is quite old and wasn't very sophisticated when it was
new.
Unisys needs to make some radical changes if they are going to continue to
compete. There are people inside who are trying, but they have 20 years of
inertia to fight, so we'll see what happens. Stay tuned.

--
David M. O'Rourke

Disclaimer: I don't represent the school. All opinions are mine!

David O'Rourke

Jun 22, 1988, 1:29:49 PM6/22/88
to
In article <3...@dlscg1.UUCP> dlsc...@dlscg1.UUCP (Alan Beal) writes:
> While we are on the subject, how does Unisys compare to other vendors in the
>number of patches applied to its software releases? Currently we are release

I was personally dismayed at the instability of the MCP. For a 20-year-old
OS it really isn't all that bulletproof. We had production machines that
would halt-load every day just to clean themselves up. I don't know of
too many other OSes that get so messed up that they have to re-boot to clean
everything out.
Judging from this I'd say the Unisys A-Series MCP tends to have more bug fixes
than other equivalent OSes. BUT I DON'T REALLY KNOW; THIS IS OPINION BASED
ON MY EXPERIENCE.

>3.6 and are up to the 9th patch cycle. The problem with most of the patches
>is that they usually cause more problems than they fix; most of the time it
>seems that no one must have tested the patches. It is such a problem that
>the so called Unisys experts never seem to be able to remember what things
>were changed at what release. As standard operating procedure we always
>extensively test any new patch release before permanently installing it on
>the system. And it always seems that we are returning back to earlier, less
>patched versions of the same MCP release. It can become a nightmare.

The major problem seems to be that by the time the bugs are found, the
programmers have: A) quit due to frustration with Unisys, B) moved to another
section, or C) moved to another project. After the MCP group turns its
software in to Product Assurance they don't see it again for about 3 - 6
months, and by that time they've moved onto something else. Normally the
person fixing the bug isn't the one who wrote it. This also demonstrates
the problem with the monolithic structure of the MCP: a change in one area
of the program might have side effects in other areas. Another problem
stems from what I mentioned in an earlier article: nobody reuses code!!
Everyone writes it themselves, rather than trying to make central calls.
Well, if there is a problem with one part of the code in the MCP, then there
is a VERY VERY VERY high probability that there is some similar code in the
MCP somewhere that also needs to be fixed, but typically this isn't found
until after the release. It seems to be a problem with the software engineering
aspect of the MCP rather than bad programming. The arch. of the MCP is forced on
newer programmers even though they know better, and it just continues to
grow unchecked.
An internal estimate from a couple of people in the group that I used
to work for was that if you were to re-implement the MCP -- not add one
single feature, just freeze it and rewrite it from scratch with today's
software techniques -- then they could probably get it down to about 300-400
thousand lines, almost a 50% code reduction, and it would be MUCH MUCH more
extensible and easier to maintain. The problem is they estimated the project
to take 50-60 man-years, and management didn't seem to like the idea of
spending all of those resources just to get what they "already" have. I
personally think it needs to be done, or else the monster of the MCP is going
to eat the A-Series people alive. Things are already grinding to a slow
crawl when any new feature is added to the MCP; this will soon become almost
infinite in the amount of resources required to make even the slightest
change.

Thank you for your comments. This has been an interesting discussion.
I've learned a lot and I hope some people have found my comments useful.
I'm willing to continue it, but I've also been requested to move it to
another group. Perhaps someone could recommend a group if they want
to continue this discussion.
