> Actually, I think it was called an 11/74, which was indeed a multi-headed
> PDP-11/70. There was code to support it in RSX-11M-Plus, which I think
> I still have a listing of that I prop my feet on when I'm working at home.
>
> At least one such beastie was built, and was (at least in 1983) connected
> to DEC's E-net as nodes CASTOR and POLLUX (I think, it's been a long time).
The code is still there, although the RSX Development Group got rid of the
11/74s last year. (I used to be system manager for them.) The name Castor is
now used for another system, but at least two of the 11/74s are running as a
multiprocessing system elsewhere in the network, under another name.
The multiprocessing was truly symmetric; the only master/slave feature was the
clock update, which was performed by the boot processor. (Unless the boot
processor was taken offline, in which case the clock update would move -- not
just peripherals, but even memory and processors could be taken offline and
online dynamically.) I am appending a note written by Brian McCarthy about
RSX's multiprocessing.
One thing I liked was that the system-build job would use more CPU time than
elapsed time. When I noticed it, I thought it was a bug until I remembered
what system I was working on.
-- edp (Eric Postpischil)
"Always mount a scratch monkey."
postp...@alien.enet.dec.com
This reply details the implementation of symmetric multiprocessing on
the PDP-11/RSX.
The RSX implementation is truly symmetric. The only facet of system
operation that is master/slavish is that only one processor (typically the
first one booted) gets to update the calendar clock (we poor 16-bitters
don't get time-of-year clocks, you know).
I've broken the discussion up into these basic parts:
Terms
Hardware
Locking mechanisms - Spin locks
Locking mechanisms - Wait locks
Cache handling basics (interaction with locks)
Overview of RSX synchronization
Distributed I/O
Handling the cache on behalf of the user
Terms
-----
This section is sort of an RSX-to-VMS dictionary.
TASK - read as "process", you'll do okay.
INSTALLED task
INSTALL a task - An installed task isn't quite the same as anything in
VMS that I know of. It's sort of a combination of a process that's
dormant but already has all of its data structures, and an image
that doesn't require a directory lookup.
APR - An APR is an active page register. 11s do virtual memory translation
in eight 4KW pages (a rough C sketch of the translation follows this list).
Read as "page table entry" and I think it works.
UMR - A UMR is a UNIBUS map register, mapping 4K of UNIBUS physical space
to 4K of physical memory. You guys have enough maps that you ought to be
able to figure this one out.
Hardware
--------
The 11/74 hardware was an almost standard 11/70. It had:
MKA11 four-port shared memory. Similar to the MA780, but all of the
system's memory was in there. 1-8 boxes could be configured, and
each box could hold any or all of the 4 MB physical address
space. It was possible to build a system with more memory
than could be used at one time, for redundancy purposes.
The ASRB instruction was interlocked. See below.
The system software could bypass or flush the cache on demand.
The software could also flush the cache on an APR or UMR basis.
The IIST (interprocessor interrupt and sanity timer) provided
a mechanism for one CPU to inform another CPU of a state change.
Locking mechanisms - Spin locks
-------------------------------
The basic locking mechanism in the PDP-11 implementation is the arithmetic
shift right byte (ASRB) instruction. That instruction is interlocked (sort of)
on the 11/74. Locks always have the value 0 or 1. ASRB lock shifts the low
bit out into the C-bit (leaving zero), so the C-bit comes back set or clear
according to the previous state.
Thus the basic lock is:
10$: ASRB lock
BCC 10$
This had a disadvantage in the original 11/74. ASRBs were quite detrimental
to the memory system because of the time needed to do the read/modify/write.
A user task issuing ASRBs (after all, they can be used for computation) could
interfere (greatly) with MASSBUS I/O due to bus bandwidth and latency. So the
following change was made (this is the "sort of" above, and the trick in an
earlier reply). The MKA11 memory system supports an exchange cycle, and all
locks are 0 after the ASRB has been performed. So the modified 11/74 exchanged
0 with the lock contents, then examined what it got back. If it was zero or
one (potentially a lock), the zero just written was correct; no problem.
Otherwise the ASRB was performed on the fetched value and the result was
re-written. Not interlocked, but no problem either, since such a value could
not have been a lock.
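For readers who'd rather see it in C, here is a rough model of the spin lock
above, using an atomic exchange to stand in for the interlocked ASRB (which,
as just described, is essentially what the modified 11/74 memory does for
values of 0 or 1). The names are illustrative.

    /* A C model of the ASRB spin lock: the lock byte holds 1 when free and
     * 0 when held.  atomic_exchange(lock, 0) plays the role of the
     * interlocked ASRB/exchange cycle: it leaves 0 behind and hands back
     * the previous contents. */
    #include <stdatomic.h>

    typedef atomic_uchar spinlock_t;     /* 1 = free, 0 = locked */

    void spin_lock(spinlock_t *lock)
    {
        /* 10$: ASRB lock / BCC 10$ -- keep trying until the previous value
         * had its low bit set, i.e. the lock was free. */
        while ((atomic_exchange(lock, 0) & 1) == 0)
            ;                            /* spin: another CPU holds it */
    }

    void spin_unlock(spinlock_t *lock)
    {
        atomic_store(lock, 1);           /* put the "free" value back */
    }
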
Locking mechanisms - Wait locks
-------------------------------
Spin locks are still expensive. Locks that are held for long times are
implemented via wait locks. These locks have a bit mask (protected by a
spin lock) of the CPUs waiting for the lock. When the lock is unlocked, one
and only one of the waiting processors is notified (via interprocessor
interrupt) that the lock is now free.
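A sketch of the same idea in C, building on the spin lock above: the waiter
mask and the held flag live behind a spin lock, and wait_unlock() wakes
exactly one waiting CPU. request_ipi() and wait_for_ipi() are hypothetical
stand-ins for the IIST doorbell, not RSX routines.

    /* Wait lock: cheap to hold for a long time because losers sleep on an
     * interprocessor interrupt instead of spinning. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdatomic.h>

    typedef atomic_uchar spinlock_t;          /* as in the spin-lock sketch */
    void spin_lock(spinlock_t *lock);
    void spin_unlock(spinlock_t *lock);

    extern void request_ipi(int cpu);         /* hypothetical: IIST interrupt */
    extern void wait_for_ipi(void);           /* hypothetical: stall this CPU;
                                                 assumed to return at once if
                                                 an IPI is already pending    */

    struct waitlock {
        spinlock_t guard;     /* spin lock protecting the fields below */
        bool       held;      /* is the resource currently owned?      */
        uint16_t   waiters;   /* one bit per CPU waiting for the lock  */
    };

    void wait_lock(struct waitlock *wl, int my_cpu)
    {
        for (;;) {
            spin_lock(&wl->guard);
            if (!wl->held) {
                wl->held = true;                  /* got it */
                spin_unlock(&wl->guard);
                return;
            }
            wl->waiters |= 1u << my_cpu;          /* register interest */
            spin_unlock(&wl->guard);
            wait_for_ipi();                       /* sleep until an unlock */
        }
    }

    void wait_unlock(struct waitlock *wl)
    {
        spin_lock(&wl->guard);
        wl->held = false;
        for (int cpu = 0; cpu < 16; cpu++)        /* wake one and only one */
            if (wl->waiters & (1u << cpu)) {
                wl->waiters &= ~(1u << cpu);
                request_ipi(cpu);
                break;
            }
        spin_unlock(&wl->guard);
    }
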
Cache handling basics (interaction with locks)
----------------------------------------------
The hardware provides cache flush and bypass. As it turns out, due to the
characteristics of the 11/70/74 cache, it was best to bypass the cache during
sections protected by spin locks and to flush the cache at the beginning of a
wait-locked section.
This may not be true of bigger caches. The 11/70 has two sets of valid bits
and clears the inactive set as it goes. When a flush is performed, if the
inactive set is fully cleared, the cache just switches sets. Hence the first
flush is free; the second requires a 1024-cycle wait (guess the size of the
cache), about a millisecond. I suspect a 780, etc. is much worse.
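The flush trick is easier to see in code. A toy model, purely illustrative:

    /* Two sets of valid bits, only one active.  A flush normally just
     * switches sets; it only pays the 1024-entry clearing cost if the
     * background scrub of the inactive set hasn't finished yet. */
    #include <stdbool.h>
    #include <string.h>

    #define CACHE_LINES 1024

    struct cache {
        bool valid[2][CACHE_LINES];   /* two sets of valid bits            */
        int  active;                  /* which set is live right now       */
        bool inactive_clean;          /* has the idle-time scrub finished? */
    };

    void cache_flush(struct cache *c)
    {
        if (!c->inactive_clean)                 /* the "second flush" case:  */
            memset(c->valid[1 - c->active], 0,  /* ~1024 cycles (about a ms) */
                   sizeof c->valid[0]);         /* on the real hardware      */
        c->active = 1 - c->active;              /* switch sets: every line   */
        c->inactive_clean = false;              /* now appears invalid       */
    }
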
Overview of RSX synchronization
-------------------------------
RSX doesn't use the CPU priority (IPL level) for synchronization. We use
a small scheduler that services basically six levels of operation: user
state (lowest priority), system state (the only place where shared system
data is valid for modification or examination), and four levels of
interrupt service. Interrupt routines which need to access shared data
queue a fork block, which is executed after all other system-state code is
done. I imagine VMS is similar in that respect. I don't know how hard VMS
would be to synchronize; I suspect it's similar.
Because of this, almost all of the system data is protected by a single
lock. It's locked when the CPU enters system state and unlocked when it
leaves. (This is also an unbelievably simple and effective deadlock-avoidance
scheme, albeit not too efficient.) Attempting multiple locks would have been
horrendously hard (although the TT driver does use some).
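In C terms, the single-lock discipline amounts to nothing more than this
(spin_lock/spin_unlock as in the spin-lock sketch above; the names are
illustrative, not RSX symbols):

    /* One big lock for shared system data: taken on entry to system state,
     * dropped on exit.  Deadlock is avoided because no CPU ever holds more
     * than this one lock (the TT driver excepted). */
    #include <stdatomic.h>

    typedef atomic_uchar spinlock_t;    /* as in the spin-lock sketch */
    void spin_lock(spinlock_t *lock);
    void spin_unlock(spinlock_t *lock);

    static spinlock_t exec_lock = 1;    /* 1 = free */

    void enter_system_state(void) { spin_lock(&exec_lock); }
    void leave_system_state(void) { spin_unlock(&exec_lock); }
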
At the very end of the scheduler, before going to a user task, the system
checks for other CPUs with pending work. If there is one (or more), the
first one (and only one) is selected and notified via interprocessor
interrupt that a state change has occurred. This is basically a way of
doing token passing for the locks and eliminates expensive collisions in
the contention for lock ownership.
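A sketch of that last step, with pending_work_mask and request_ipi() as
illustrative names:

    /* End-of-scheduler "token passing": before returning to a user task,
     * notify at most one other CPU that has pending work, so CPUs take
     * turns at the exec lock instead of colliding on it. */
    #include <stdint.h>

    extern volatile uint16_t pending_work_mask;  /* one bit per CPU with work  */
    extern void request_ipi(int cpu);            /* hypothetical IIST doorbell */

    void pass_the_token(int my_cpu)
    {
        uint16_t others = (uint16_t)(pending_work_mask & ~(1u << my_cpu));
        for (int cpu = 0; cpu < 16; cpu++)
            if (others & (1u << cpu)) {
                request_ipi(cpu);                /* the first, and only the
                                                    first, CPU is notified */
                break;
            }
    }
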
Distributed I/O
---------------
Each discrete section of the UNIBUS in an 11/74 system has an identifier
called a UNIBUS run mask (URM, not to be confused with UMR). Each fork block
has a required URM field in it that describes the segments needed to execute
the request. Each CPU maintains a mask of connected URMs. This mask is
compared against each fork block when the fork list is examined to find one
capable of executing on that processor.
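A sketch of the URM check in C; the structure and field names are
illustrative, not the actual fork-block layout:

    /* A fork block carries the URMs it needs; a CPU only picks fork blocks
     * whose required URMs are all among the UNIBUS runs wired to that CPU. */
    #include <stdint.h>
    #include <stddef.h>

    struct fork_block {
        struct fork_block *next;
        uint16_t urm;                      /* UNIBUS runs this request needs */
        void   (*routine)(struct fork_block *);
    };

    /* Return the first fork block in the list that this CPU can execute. */
    struct fork_block *next_runnable(struct fork_block *list,
                                     uint16_t cpu_urm_mask)
    {
        for (struct fork_block *fb = list; fb != NULL; fb = fb->next)
            if ((fb->urm & ~cpu_urm_mask) == 0)   /* all needed runs present? */
                return fb;
        return NULL;
    }
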
A major goal here was to avoid driver modifications. The CPU-switching code
is all in the called routines in the exec, so that neither A) DEC's drivers
nor B) any user-written drivers required modification.
Handling the cache on behalf of the user
----------------------------------------
Codewise, this was a bear. In order to solve the exclusion problem without
forcing application changes, we did the following. If any user task maps
a global section read-write, all APRs mapped to that section bypass the cache.
If a number of tasks are running with a section mapped read-only, and another
is activated which maps it read-write, the tasks are all converted to bypass.
This is quite a stunt if one of the tasks is active on a CPU other than the
one that discovers it.
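Roughly, in C (the structures are illustrative; RSX of course does this on
APRs and its own region data structures):

    /* The read-write upgrade rule: the moment any task maps a shared region
     * read-write, every mapping of that region -- including ones belonging
     * to tasks already running on other CPUs -- is switched to bypass the
     * cache. */
    #include <stdbool.h>
    #include <stddef.h>

    struct region;

    struct mapping {
        struct mapping *next;        /* other mappings of the same region */
        struct region  *region;
        bool            writable;
        bool            bypass_cache;
    };

    struct region {
        struct mapping *mappings;    /* every task's mapping of this region */
        bool            has_writer;
    };

    void map_region(struct region *r, struct mapping *m, bool writable)
    {
        m->region   = r;
        m->writable = writable;
        m->next     = r->mappings;
        r->mappings = m;

        if (writable && !r->has_writer) {
            r->has_writer = true;
            /* Convert everyone, even tasks active on another CPU. */
            for (struct mapping *p = r->mappings; p != NULL; p = p->next)
                p->bypass_cache = true;
        } else {
            m->bypass_cache = r->has_writer;
        }
    }
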
As it turns out, that solution isn't sufficient for some applications. If
the application uses non-interlocked instructions (bit set, let's say) to
manipulate data in a shared region for synchronization purposes, allowing
multiple tasks to run on separate CPUs at all isn't possible. We have a
switch, set when a task is installed, to declare its intent to use shared
data for synchronization. If a task has that attribute, no task mapping the
same regions may run on another CPU at the same time.
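A sketch of the scheduling restriction that the install-time switch implies
(again with illustrative structures; sync_via_shared_data is a made-up name
for the attribute):

    /* A task flagged as synchronizing through shared data may not run at
     * the same time as any other task mapping one of the same regions on a
     * different CPU. */
    #include <stdbool.h>

    struct region;                         /* as in the previous sketch */

    struct task {
        bool sync_via_shared_data;         /* set by the install-time switch */
        int  nregions;
        const struct region *regions[8];   /* regions this task has mapped   */
    };

    static bool shares_region(const struct task *a, const struct task *b)
    {
        for (int i = 0; i < a->nregions; i++)
            for (int j = 0; j < b->nregions; j++)
                if (a->regions[i] == b->regions[j])
                    return true;
        return false;
    }

    /* May `cand` start on this CPU while `running[]` are active elsewhere? */
    bool may_run(const struct task *cand,
                 const struct task *const running[], int nrunning)
    {
        for (int i = 0; i < nrunning; i++)
            if (shares_region(cand, running[i]) &&
                (cand->sync_via_shared_data ||
                 running[i]->sync_via_shared_data))
                return false;
        return true;
    }
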
-----------------------------------------------------------------------------
This description all makes sense to me, but then again, I understood it
before I read this. If I've been cryptic in the description anywhere, let
me know and I'll try to clarify it. Thanks.
Brian