
indirection in old architectures


Anton Ertl

Dec 29, 2023, 1:29:40 PM
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
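In sketch form, the scheme as described might look like this (the tag-bit layout, word size, and names are invented purely for illustration; real machines differed in where the bit lived):

```python
# Illustrative sketch only: a 16-bit word whose top bit marks
# "this word holds an address, not a datum".
INDIRECT = 1 << 15
ADDR_MASK = INDIRECT - 1

def load(mem, addr, limit=64):
    """Chase indirect words until one without the bit is found.
    The limit plays the role of the watchdog these machines needed
    to break address loops."""
    for _ in range(limit):
        word = mem[addr]
        if not word & INDIRECT:
            return word             # plain word: this is the operand
        addr = word & ADDR_MASK     # indirect word: follow the address
    raise RuntimeError("indirect address loop")

mem = {10: INDIRECT | 20, 20: INDIRECT | 30, 30: 42}
# load(mem, 10) follows 10 -> 20 -> 30 and yields 42
```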

The major question I have is why these architectures have this
feature.

The only use I can come up with for the arbitrarily repeated
indirection is the implementation of logic variables in Prolog.
However, Prolog was first implemented in 1970, and it did not become a
big thing until the 1980s (if then), so I doubt that this feature was
implemented for Prolog.

A use for a single indirection is the implementation of the memory
management in the original MacOS: Each dynamically allocated memory
block was referenced only from a single place (its handle), so that
the block could be easily relocated. Only the address of the handle
was freely passed around, and accessing the block then always required
double indirection. MacOS was implemented on the 68000, which did not
have the indirect bit; this demonstrates that the indirect bit is not
necessary for that. Nevertheless, such a usage pattern might be seen
as a reason to add the indirect bit. But is it enough?
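The handle pattern can be sketched like so (names invented; this models only the double indirection, not the real Memory Manager API):

```python
heap = {}       # "memory": address -> contents
handles = {}    # handle number -> current block address

def deref(handle):
    # double indirection: handle -> block address -> data
    return heap[handles[handle]]

def relocate(handle, new_addr):
    old = handles[handle]
    heap[new_addr] = heap.pop(old)   # move the block in memory
    handles[handle] = new_addr       # update the one reference to it

handles[1] = 100
heap[100] = "payload"
relocate(1, 200)    # the block moves; the handle stays valid
```

Since only the handle's address is passed around, the single update in relocate() suffices, which is exactly what makes compaction easy.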

Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?

One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Scott Lurndal

Dec 29, 2023, 2:05:01 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Some (many?) architectures of the 1960s (earlier? later?) have the
>feature that, when loading from an address, if a certain bit in the
>word is set, the CPU uses that word as the address to access, and this
>repeats until a word without this bit is found. At least that's how I
>understand the descriptions of this feature.

That's essentially accurate. The Burroughs medium systems
operands were described by an operand address that included
an 'address controller'. The address controller, a four-bit
field, specified two characteristics of the address; the
two-bit 'index' field contained the number of the index register
(there were three) to be used when calculating the final
address. The other two bits described how the data at the
final address should be treated by the processor:
0b00 Unsigned Numeric Data [UN] (BCD)
0b01 Signed Numeric Data [SN] (BCD, first digit 0b1100 = "+", 0b1101 = '-').
0b10 Unsigned Alphanumeric Data [UA] (EBCDIC)
0b11 Indirect Address [IA]

Consider the operand 053251: this described an unsigned
numeric value starting at the address 53251 with no indexing.

The operand 753251 described an address indexed by IX1
and of the type 'indirect address', which points to another
operand word (potentially resulting in infinite recursion,
which was detected by an internal timer that terminated
the process when triggered).

The actual operand data type was determined by the
address controller of the first operand that wasn't
marked IA.


>
>The major question I have is why these architectures have this
>feature.

Primarily for flexibility in addressing without adding substantial
hardware support.

>
>The only use I can come up with for the arbitrarily repeated
>indirection is the implementation of logic variables in Prolog.

The aforementioned system ran mostly COBOL code (with some BPL;
assemblers weren't generally provided to customers).


>Were there any other usage patterns? What happened to them when the
>indirect bit went out of fashion?

Consider following a linked list to the final element as an
example usage.

The aforementioned system also had a SLL (Search Linked List)
that would test each element for one of several conditions
and terminate the indirection when the condition was true.

>
>One other question is how the indirect bit works with stores. How do
>you change the first word in the chain, the last one, or any word in
>between?

I guess I don't understand the question. It's just a pointer in
a linked list.

John Levine

Dec 29, 2023, 2:36:05 PM
According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>Some (many?) architectures of the 1960s (earlier? later?) have the
>feature that, when loading from an address, if a certain bit in the
>word is set, the CPU uses that word as the address to access, and this
>repeats until a word without this bit is found. At least that's how I
>understand the descriptions of this feature.

More or less. Indirect addressing was always controlled by a bit in
the instruction. It was more common to have only a single level of
indirect addressing, just controlled by that instruction bit.
Multi-level wasn't much more useful and you had to have a way to break
address loops.

>One other question is how the indirect bit works with stores. How do
>you change the first word in the chain, the last one, or any word in
>between?

The CPU follows the indirect address chain to get the operand address
and then does the operation. On the PDP-10, this stores into the
word that FOO points to, perhaps after multiple indirections:

MOVEM AC,@FOO

while this stores into FOO itself:

MOVEM AC,FOO

>The major question I have is why these architectures have this
>feature.

Let's say you want to add up a list of numbers and your machine
doesn't have any index registers. What else are you going to do?

Indirect addressing was a big improvement over patching the
instructions and index registers were too expensive for small
machines. The IBM 70x mainframes had index registers, the early DEC
PDP series didn't other than the mainframe-esque PDP-6 and -10. The
PDP-11 mini was a complete rethink a decade after the PDP-1 with eight
registers usable for indexing and no indirect addressing.

>Were there any other usage patterns? What happened to them when the
>indirect bit went out of fashion?

They were also useful for argument lists which were invariably in
memory on machines without a lot of registers which was all of them
before S/360 and the PDP-6. On many machines a Fortran subroutine call
would leave the return address in an index register and the addresses
of the arguments were in the words after the call. The routine would
use something like @3(X) to get the third argument. Nobody other than
maybe Lisp cared about reentrant or recursive code, and if the number
of arguments in the call didn't match the number the routine expected
and your program blew up, well, don't do that.

As you suggested, a lot of uses boiled down to providing a fixed
address for something that can move, so instructions could indirect
through that fixed address without having to load it into a register.

For most purposes, index registers do indirection better, and now that
everything has a lot of registers, you can use some of them for the
fixed->movable stuff like the GOT in Unix/linux shared libraries.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

MitchAlsup

Dec 29, 2023, 3:30:56 PM
Anton Ertl wrote:

> Some (many?) architectures of the 1960s (earlier? later?) have the
> feature that, when loading from an address, if a certain bit in the
> word is set, the CPU uses that word as the address to access, and this
> repeats until a word without this bit is found. At least that's how I
> understand the descriptions of this feature.

> The major question I have is why these architectures have this
> feature.

Solves the memory access problem {arrays, nested arrays, linked lists,...}
The early machines had "insufficient" address generation means, and used
indirection as a trick to get around their inefficient memory addressing modes.

> The only use I can come up with for the arbitrarily repeated
> indirection is the implementation of logic variables in Prolog.
> However, Prolog was first implemented in 1970, and it did not become a
> big thing until the 1980s (if then), so I doubt that this feature was
> implemented for Prolog.

Some of the indirection machines had indirection-bit located in the
container at the address generated, others had the indirection in
the address calculation. In the case of the PDP-10 there was a
timeout counter, and there were applications that worked fine up to a
particular size and then simply failed when the indirection watchdog
counter kept "going off".

> A use for a single indirection is the implementation of the memory
> management in the original MacOS: Each dynamically allocated memory
> block was referenced only from a single place (its handle), so that
> the block could be easily relocated. Only the address of the handle
> was freely passed around, and accessing the block then always required
> double indirection. MacOS was implemented on the 68000, which did not
> have the indirect bit; this demonstrates that the indirect bit is not
> necessary for that. Nevertheless, such a usage pattern might be seen
> as a reason to add the indirect bit. But is it enough?

Two things: 1) the indirect bit is insufficient, 2) optimizing compilers
got to the point they were better at dereferencing linked lists than
the indirection machines were. {Reuse and all that rot.}

> Were there any other usage patterns? What happened to them when the
> indirect bit went out of fashion?

Arrays, matrices, scatter, gather, lists, queues, stacks, arguments,....
We did all sorts of infinite-indirect stuff in asm on the PDP-10 {KI}
when programming at college.

They went out of fashion when compilers got to the point they could
hold the intermediate addresses in registers and short circuit the
amount of indirection needed--improving performance due to accessing
fewer memory locations.

The large register files of RISC spelled their doom.

> One other question is how the indirect bit works with stores. How do
> you change the first word in the chain, the last one, or any word in
> between?

In the machines where the indirection is at the instruction level, this
was simple, in the machines where the indirection was at the target, it
was more difficult.

> - anton

Summary::

First the architects thought registers were expensive.
{Many doubled down by OP-Mem ISAs.}
The architects endowed memory addressing with insufficient capabilities.
{Many to satisfy the OP-Mem and Mem-OP ISA they had imposed upon themselves}
Then they added indirection to make up for insufficient addressing.
And then everyone waited until RISC showed up (1980) before realizing their
error in register counts.
{Along about this time, Compilers started getting good.}

John Levine

Dec 29, 2023, 4:59:29 PM
According to MitchAlsup <mitch...@aol.com>:
>Some of the indirection machines had indirection-bit located in the
>container at the address generated, others had the indirection in
>the address calculation. In the case of the PDP-10 there was a time-
>out counter and there were applications that worked fine up to a
>particular size, and then simply failed when the indirection watch
>dog counter kept "going off".

No, that's what the GE 635 did, a watchdog timer reset each time it
started a new instruction. The PDP-6 and -10 could take an interrupt
each time it calculated an address and would restart the instruction
when the interrupt returned. This worked because unlike on the 635 the
address calculation didn't change anything. (Well, except for the ILDB
and IDPB instructions that needed the first part done flag. But I
digress.)

You could tell how long the time between clock interrupts was by
making an ever longer indirect address chain and seeing where your
program stalled. It wouldn't crash, it just stalled as the very long
address chain kept being interrupted and restarted. I'm not being
hypothetical here.

>Two things: 1) the indirect bit is insufficient, 2) optimizing compilers
>got to the point they were better at dereferencing linked lists than
>the indirection machines were. {Reuse and all that rot.}

More importantly, index registers are a lot faster than indirect
addressing and at least since the IBM 801, we have good algorithms to
do register scheduling.

>> One other question is how the indirect bit works with stores. How do
>> you change the first word in the chain, the last one, or any word in
>> between?
>
>In the machines where the indirection is at the instruction level, this
>was simple, in the machines where the indirection was at the target, it
>was more difficult.

The indirection was always in the address word(s), not in the target.
It didn't matter if it was a load or a store.

Joe Pfeiffer

Dec 30, 2023, 2:26:08 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Some (many?) architectures of the 1960s (earlier? later?) have the
> feature that, when loading from an address, if a certain bit in the
> word is set, the CPU uses that word as the address to access, and this
> repeats until a word without this bit is found. At least that's how I
> understand the descriptions of this feature.
>
> The major question I have is why these architectures have this
> feature.

I'll hazard a guess that once you've got the indirect bit out in memory,
it's easier to just use the same logic on all memory reads than to only
let it happen once.

John Levine

Dec 30, 2023, 6:26:25 PM
According to Joe Pfeiffer <pfei...@cs.nmsu.edu>:
>I'll hazard a guess that once you've got the indirect bit out in memory,
>it's easier to just use the same logic on all memory reads than to only
>let it happen once.

That's not how indirect addressing worked.

There was always a bit in the instruction to say to do indirection.

Sometimes that was it, sometimes on machines where the word size was
bigger than the address size, it also looked at some other bit in the
indirect word to see whether to keep going. On the PDP-8, the words
were 12 bits and the addresses were 12 bits so there was no room, they
couldn't have done multilevel indirect if they wanted to.

As several of us noted, multilevel indirection needed something to
break loops, while single level didn't. In my experience, multiple
indirection wasn't very useful, I didn't miss it on the -8, and I
can't recall using it other than as a gimmick on the PDP-10.

Quadibloc

Dec 31, 2023, 3:00:18 AM
On Fri, 29 Dec 2023 17:20:43 +0000, Anton Ertl wrote:

> Some (many?) architectures of the 1960s (earlier? later?) have the
> feature that, when loading from an address, if a certain bit in the word
> is set, the CPU uses that word as the address to access, and this
> repeats until a word without this bit is found. At least that's how I
> understand the descriptions of this feature.
>
> The major question I have is why these architectures have this feature.

No doubt this answer has already been given.

The reason these architectures had that feature was because of a feature
they _didn't_ have: an index register.

So in order to access arrays and stuff like that, instead of doing surgery
on the short address inside an instruction, you can simply store a full
address in a word somewhere that points anywhere you would like.

> One other question is how the indirect bit works with stores. How do
> you change the first word in the chain, the last one, or any word in
> between?

Let's assume we do have an architecture that supports multi-level
indirection. So an instruction word looks like this:

(i)(x)(opcode)(p)(address)

and an address constant looks like this:

(i)(x)(address)

So in an address constant (some architectures that had index registers
kept indirection) you could specify indexing too, but now the address was
longer by the length of the opcode field.

If the address inside an instruction is too short to handle all of memory
(i.e. the word length is less than 24 bits) then you need a "page" bit in
the instruction: 0 means page zero, shared by the whole program, 1 means
the current page - the one the instruction is on.

Let's now say the instruction is a _store_ instruction. Then what? Well,
if the indirect bit is set, it acts like a *load* instruction, to fetch and
load the effective address. It only stores at the point where indirection
ends - where the address is now of the actual location to do the storing
in, rather than the location of the effective address, which must be read,
not written.
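That store rule can be sketched as follows (tag-bit layout invented for illustration: the top bit of a 16-bit word marks an indirect word):

```python
INDIRECT = 1 << 15
ADDR_MASK = INDIRECT - 1

def store(mem, addr, value):
    # the chain is only *read* while chasing indirection...
    while mem[addr] & INDIRECT:
        addr = mem[addr] & ADDR_MASK
    # ...and only the final, non-indirect location is written
    mem[addr] = value

mem = {5: INDIRECT | 9, 9: 0}
store(mem, 5, 77)   # leaves mem[5] untouched, writes 77 into mem[9]
```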

John Savard

MitchAlsup

Dec 31, 2023, 12:22:27 PM
Quadibloc wrote:

> On Fri, 29 Dec 2023 17:20:43 +0000, Anton Ertl wrote:

>> Some (many?) architectures of the 1960s (earlier? later?) have the
>> feature that, when loading from an address, if a certain bit in the word
>> is set, the CPU uses that word as the address to access, and this
>> repeats until a word without this bit is found. At least that's how I
>> understand the descriptions of this feature.
>>
>> The major question I have is why these architectures have this feature.

> No doubt this answer has already been given.

> The reason these architectures had that feature was because of a feature
> they _didn't_ have: an index register.

This is a better explanation than above. Instead of paying the high price
needed for index registers, they use main memory as their index registers.
{{A lot like building linked lists in FORTRAN 66}}.

> So in order to access arrays and stuff like that, instead of doing surgery
> on the short address inside an instruction, you can simply store a full
> address in a word somewhere that points anywhere you would like.

>> One other question is how the indirect bit works with stores. How do
>> you change the first word in the chain, the last one, or any word in
>> between?

> Let's assume we do have an architecture that supports multi-level
> indirection. So an instruction word looks like this:

> (i)(x)(opcode)(p)(address)

> and an address constant looks like this:

> (i)(x)(address)

> So in an address constant (some architectures that had index registers
> kept indirection) you could specify indexing too, but now the address was
> longer by the length of the opcode field.

> If the address inside an instruction is too short to handle all of memory
> (i.e. the word length is less than 24 bits) then you need a "page" bit in
> the instruction: 0 means page zero, shared by the whole program, 1 means
> the current page - the one the instruction is on.

Going all PDP-8 on us now ??

Vir Campestris

Dec 31, 2023, 12:40:58 PM
On 30/12/2023 23:26, John Levine wrote:
> and I
> can't recall using it other than as a gimmick on the PDP-10.

It's a very long time ago, but I'm sure I do recall seeing it used on a
DECSystem10 for arrays of pointers for indirection.

The fact that 40 years later I can remember the @ being used in
assembler must mean something.

Modern machines don't like wasting space so much. On the '10, the
word an address pointed to was a 36-bit value with an 18-bit address
in it. And the indirection bit. There was space for things like this.

Andy

Thomas Koenig

Dec 31, 2023, 12:54:49 PM
MitchAlsup <mitch...@aol.com> schrieb:
> Quadibloc wrote:
>
>> On Fri, 29 Dec 2023 17:20:43 +0000, Anton Ertl wrote:
>
>>> Some (many?) architectures of the 1960s (earlier? later?) have the
>>> feature that, when loading from an address, if a certain bit in the word
>>> is set, the CPU uses that word as the address to access, and this
>>> repeats until a word without this bit is found. At least that's how I
>>> understand the descriptions of this feature.
>>>
>>> The major question I have is why these architectures have this feature.
>
>> No doubt this answer has already been given.
>
>> The reason these architectures had that feature was because of a feature
>> they _didn't_ have: an index register.
>
> This is a better explanation than above. Instead of paying the high price
> needed for index registers, they use main memory as their index registers.
> {{A lot like building linked lists in FORTRAN 66}}.

The PDP-10 had both a recursive indirect bit and index registers (aka
memory locations 1 to 15), if I remember the manuals correctly
(I did a bit of reading, but I've never even come close to one of
these machines).

Paul A. Clayton

Dec 31, 2023, 1:25:11 PM
On 12/29/23 2:36 PM, John Levine wrote:
> According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>> Some (many?) architectures of the 1960s (earlier? later?) have the
>> feature that, when loading from an address, if a certain bit in the
>> word is set, the CPU uses that word as the address to access, and this
>> repeats until a word without this bit is found.
[snip]
>> Were there any other usage patterns? What happened to them when the
>> indirect bit went out of fashion?
[snip]
> As you suggested, a lot of uses boiled down to providing a fixed
> address for something that can move, so instructions could indirect
> through that fixed address without having to load it into a register.

Paged virtual memory as commonly implemented introduces one level
of indirection at page (rather than word) granularity.
Virtualization systems using nested page tables introduce a second
level of indirection.

Hierarchical/multi-level page tables add multiple layers of
indirection: instead of pointing at a complete page table, the page
table base pointer points at a typically page-sized array of address
and metadata entries, each entry in turn pointing at a similar array,
until the walk finally reaches the PTE.
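A much-simplified sketch of such a walk, assuming x86-64-like numbers (4 levels, 9 index bits per level, 4 KiB pages; all chosen for illustration), with the tables modeled as nested dicts:

```python
LEVELS = 4
INDEX_BITS = 9
PAGE_BITS = 12

def walk(root, vaddr):
    table = root
    for level in reversed(range(LEVELS)):
        shift = PAGE_BITS + level * INDEX_BITS
        index = (vaddr >> shift) & ((1 << INDEX_BITS) - 1)
        table = table[index]          # one indirection per level
    # after the last level, "table" is the physical frame base
    return table | (vaddr & ((1 << PAGE_BITS) - 1))

# one mapped page: virtual 0x12345678 -> frame base 0xABCDE000
root = {0: {0: {145: {325: 0xABCDE000}}}}
```

Each table lookup here stands in for one memory access of the hardware walker, which is why page table caching matters so much.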

Even with page table caching (and workloads that play well with
this kind of virtual memory), this is not free but it can be
"cheap enough". Using large pages for virtual-physical to physical
translation can help a lot. Presumably having an OS bias placement
of its translation table pages into large quasi-pages would help
caching for VPA-to-PA, i.e., many VPAs used by the OS for paging
would be in the same large page (e.g., 2MiB for x86).

(Andy Glew had suggested using larger pages for intermediate nodes
rather than limiting such to the last node in a hierarchical page
table. This has the same level-reducing effect of huge pages that
short-circuit the translation indirection at the end but allows
eviction and permission control at base-page size, with the
consequent larger number of PTEs active if there is spatial
locality at huge page granularity. Such merely assumes that
locality potentially exists at the intermediate nodes rather than
exclusively at the last node. Interestingly, with such a page
table design one might consider having rather small pages; e.g., a
perhaps insane 64-byte base page size (at least for the tables)
would only provide 3 bits per level but each level could be
flattened to provide 6, 9, 12, etc. bits. Such extreme flexibility
may well not make sense, but it seems interesting to me.)


> For most purposes, index registers do indirection better, and now that
> everything has a lot of registers, you can use some of them for the
> fixed->movable stuff like the GOT in Unix/linux shared libraries.

For x86-64 some of the segments can have non-zero bases, so these
provide an additional index register ("indirection").

Scott Lurndal

Dec 31, 2023, 1:26:01 PM
John Levine <jo...@taugh.com> writes:
>According to Joe Pfeiffer <pfei...@cs.nmsu.edu>:
>>I'll hazard a guess that once you've got the indirect bit out in memory,
>>it's easier to just use the same logic on all memory reads than to only
>>let it happen once.
>
>That's not how indirect addressing worked.
>
>There was always a bit in the instruction to say to do indirection.

In our case (B3500 et alia), there was a bit per operand, so a three operand
instruction could have all three addresses indirect. The processor treated
the value at the indirect address as an operand address allowing infinite
recursion (subject to a processor timer in case of loops).


Scott Lurndal

Dec 31, 2023, 1:28:11 PM
Quadibloc <quad...@servername.invalid> writes:
>On Fri, 29 Dec 2023 17:20:43 +0000, Anton Ertl wrote:
>
>> Some (many?) architectures of the 1960s (earlier? later?) have the
>> feature that, when loading from an address, if a certain bit in the word
>> is set, the CPU uses that word as the address to access, and this
>> repeats until a word without this bit is found. At least that's how I
>> understand the descriptions of this feature.
>>
>> The major question I have is why these architectures have this feature.
>
>No doubt this answer has already been given.
>
>The reason these architectures had that feature was because of a feature
>they _didn't_ have: an index register.

Not necessarily true. The B3500 had three index registers (special
locations in memory, not real registers). Later systems in the early
80's added an additional four register-based index registers, but
continued to support indirect addressing.

MitchAlsup

Dec 31, 2023, 2:00:57 PM
All of the PDP-10s at CMU had the register upgrade. {2×KI and 1×KL}
I believe that most PDP-10s ever sold had the register upgrade.

MitchAlsup

Dec 31, 2023, 2:00:57 PM
I had been thinking that since my large-page translation tables have
a count of the number of pages, that when forking off a new GuestOS
that I would allocate the HyperVisor tables as a single 8GB large
page, and when it needs more then switch to a more treeified page
table. This leaves the second level of DRAM translation at 1 very
cacheable and TLB-able PTE--dramatically reducing the table walking
overhead.

A single 8GB page mapping can allow access to one 8192B page up to
1M 8192B pages. Guest OS page tables can map any of these 8192B pages
to any virtual address it desires with permissions it desires.

> This has the same level-reducing effect of huge pages that
> short-circuit the translation indirection at the end but allows
> eviction and permission control at base-page size, with the
> consequent larger number of PTEs active if there is spatial
> locality at huge page granularity. Such merely assumes that
> locality potentially exists at the intermediate nodes rather than
> exclusively at the last node. Interestingly, with such a page
> table design one might consider having rather small pages; e.g., a
> perhaps insane 64-byte base page size (at least for the tables)
> would only provide 3 bits per level but each level could be
> flattened to provide 6, 9, 12, etc. bits. Such extreme flexibility
> may well not make sense, but it seems interesting to me.)


>> For most purposes, index registers do indirection better, and now that
>> everything has a lot of registers, you can use some of them for the
>> fixed->movable stuff like the GOT in Unix/linux shared libraries.

> For x86-64 some of the segments can have non-zero bases, so these
> provide an additional index register ("indirection").

This has more to do with 16 registers being insufficient than indirection
(segmentation) being better.

John Levine

Dec 31, 2023, 3:19:36 PM
According to Thomas Koenig <tko...@netcologne.de>:
>The PDP-10 had both a recursive indirect bit and index registers (aka
>memory locations 1 to 15), if I remember the manuals correctly
>(I did a bit of reading, but I've never even come close to one of
>these machines).

Yup. Each instruction had an 18 bit address, a four bit index register, and an indirect bit.
It took the address, and added the contents of the right half of the index register if non-zero.
If the indirect bit was off, that was the operand address. If the indirect bit was set, it
fetched the word at that location and did the whole thing over again, including the indexing.

You could in principle create extremely complicated address chains but
it was so confusing that nobody did.
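That effective-address rule can be modeled roughly like this (a sketch, not a cycle-accurate description; the indirect word's I/X/Y fields are bits 13, 14-17, and 18-35 of the 36-bit word, and index register 0 means "no indexing"):

```python
def effective_address(mem, regs, i, x, y, limit=64):
    """i = indirect bit, x = index register number, y = 18-bit address."""
    for _ in range(limit):
        addr = (y + (regs[x] if x else 0)) & 0o777777   # 18-bit add
        if not i:
            return addr              # indirect bit off: done
        word = mem[addr]             # fetch the indirect word and
        i = (word >> 22) & 1         # decode I (bit 13 of 0..35),
        x = (word >> 18) & 0o17      # X (bits 14-17),
        y = word & 0o777777          # Y (bits 18-35), then repeat
    raise RuntimeError("address chain too long")

regs = [0] * 16
regs[1] = 0o10
# indirect word at 100: I=0, X=1, Y=200 (all octal)
mem = {0o100: (1 << 18) | 0o200}
```

With this memory, an indirect reference to location 100 ends up at effective address 210 octal: the indirect word supplies Y=200, indexed by register 1.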

MitchAlsup

Dec 31, 2023, 3:46:04 PM
John Levine wrote:

> According to Thomas Koenig <tko...@netcologne.de>:
>>The PDP-10 had both a recursive indirect bit and index registers (aka
>>memory locations 1 to 15), if I remember the manuals correctly
>>(I did a bit of reading, but I've never even come close to one of
>>these machines).

> Yup. Each instruction had an 18 bit address, a four bit index register, and an indirect bit.
> It took the address, and added the contents of the right half of the index register if non-zero.
> If the indirect bit was off, that was the operand address. If the indirect bit was set, it
> fetched the word at that location and did the whole thing over again, including the indexing.

> You could in principle create extremely complicated address chanis but
> it was so confusing that nobody did.

At CMU I used this a lot for things like symbol table searches.
What I did not use was the index register stuff of the indirection (except
at the first level).

sarr.b...@alum.dartmouth.org

Jan 1, 2024, 3:31:32 PM
John Levine <jo...@taugh.com> wrote:

: More importantly, index registers are a lot faster than indirect
: addressing and at least since the IBM 801, we have good algorithms to
: do register scheduling.

Once upon a time saving an instruction was a big deal; the 801, and
RISC in general, was possible because memory got much cheaper.
Using index registers costs an extra instruction for loading the index
register.

Index registers were a scarce resource too (except for the Atlas) so
keeping all your pointers in index registers wasn't a good option
either.

sarr`

MitchAlsup

Jan 3, 2024, 8:42:02 PM
sarr.b...@alum.dartmouth.org wrote:

> John Levine <jo...@taugh.com> wrote:

> : More importantly, index registers are a lot faster than indirect
> : addressing and at least since the IBM 801, we have good algorithms to
> : do register scheduling.

> Once upon a time saving an instruction was a big deal; the 801, and
> RISC in general, was possible because memory got much cheaper.
> Using index registers costs an extra instrucion for loading the index
> register.

Mark Horowitz stated (~1983) MIPS executes 1.5× as many instructions
as VAX and at 6× the frequency for a 4× improvement in performance.

Now, imagine a RISC ISA that only needs 1.1× as many instructions as
VAX with no degradation WRT operating frequency.

EricP

Jan 5, 2024, 11:34:19 AM
PDP-11 and VAX had multiple address modes with a single level of indirection.
The VAX usage stats from 1984 show about 3% use on SPEC.

DG Nova had infinite indirection - if the indirect bit was set in the
instruction, the fetched address word was examined: if the msb of the
address was zero, it was the address of the 16-bit data; if the msb
was 1, it was the address of another address, looping until msb = 0.
I don't know how DG used it but, just guessing, because Nova only had
4 registers might be to create a kind of virtual register set in memory.

The best use I have for single-level indirection is compilers &
linkers. The compiler emits a variable reference without knowing
whether it is local to the linkage unit or imported from a DLL. When
the linker discovers it is a DLL export variable, it changes the
assigned variable to be a pointer to the imported value, patched by
the loader, and just flips the indirect bit on the instruction.

Doing the same thing without address indirection requires inserting
extra LD instructions and having a spare register allocated to the
linker to work with.

John Levine

Jan 5, 2024, 1:05:26 PM
According to EricP <ThatWould...@thevillage.com>:
>PDP-11 and VAX had multiple address modes with a single level of indirection.
>The VAX usage stats from 1984 show about 3% use on SPEC.

The main place the PDP-11 used indirect addressing was in @(PC)+ which
was the idiom for absolute addressing. It fetched the next word in the
instruction stream as an immediate via (PC)+ and then used it as an
address via indirection. The assembler let you write @#123 to generate
that address mode and put the 123 in line.

It was also useful for threaded code, where you had a register,
typically R4, pointing at a list of routine addresses and dispatched
with JMP @(R4)+

If you were feeling clever you could do this coroutine switch: JSR PC,@(SP)+

That popped the top word off the stack, then pushed the current PC, then jumped
to the address it had popped.

>DG Nova had infinite indirection - if the Indirect bits was set in the
>instruction then in the address register if the msb of the address was zero
>then it was the address of the 16-bit data, if the msb of the address was 1
>then it was the address of another address, looping until msb = 0.
>I don't know how DG used it but, just guessing, because Nova only had
>4 registers might be to create a kind of virtual register set in memory.

My guess is that it was cheap to implement and let them say look, here
is a cool thing that we do and DEC doesn't. I would be surprised if
there were many long indirect chains.

MitchAlsup

Jan 5, 2024, 6:21:02 PM
John Levine wrote:

> According to EricP <ThatWould...@thevillage.com>:
>>PDP-11 and VAX had multiple address modes with a single level of indirection.
>>The VAX usage stats from 1984 show about 3% use on SPEC.

> The main place the PDP-11 used indirect addressing was in @(PC)+ which
> was the idiom for absolute addressing. It fetched the next word in the
> instruction stream as an immediate via (PC)+ and then used it as an
> address via indirection. The assembler let you write @#123 to geerate
> that address mode and put the 123 in line.

> It was also useful for threaded code, where you had a register,
> typically R4, pointing at a list of routine addresses and dispatched
> with JMP @(R4)+

> If you were feeling clever you could do this coroutine switch JSR PC,@(SP)+

> That popped the top word off the stack, then pushed the current PC, then jumped
> to the address it had popped.

I used this in a real-time OS I developed at CMU to deal with laser power control.

Processes (no MMU or protection) would receive control via JSR PC,@(SP)+ and
return control with JSR PC,@(SP)+, at which time the OS would find the next thing
to do and JSR PC,@(SP)+ all over again. Really lightweight context switching.

Lawrence D'Oliveiro

Jan 17, 2024, 12:28:51 AM
On Fri, 29 Dec 2023 19:04:56 GMT, Scott Lurndal wrote:

> The [Burroughs] system ran mostly COBOL code (with some BPL;
> assemblers weren't generally provided to customers).

For an interesting reason: privilege protection was enforced in software,
not hardware.

Lawrence D'Oliveiro

Jan 17, 2024, 1:34:59 AM
On Thu, 4 Jan 2024 01:36:35 +0000, MitchAlsup wrote:

> Mark Horowitz stated (~1983) MIPS executes 1.5× as many instructions as
> VAX and at 6× the frequency for a 4× improvement in performance.

Mmm, maybe you got the last two multipliers the wrong way round?

Lawrence D'Oliveiro

Jan 17, 2024, 1:36:45 AM
On Sun, 31 Dec 2023 17:54:44 -0000 (UTC), Thomas Koenig wrote:

> ... but I've never even come close to one of these
> machines).

You could have one, or a software emulation of one, right in front of you,
just a SIMH install away.

Terje Mathisen

Jan 17, 2024, 2:14:39 AM
No, that seems correct: It needed 1.5 times as many instructions, so the
6X frequency must be divided by 1.5 for a final speedup of 4X?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Scott Lurndal

Jan 17, 2024, 11:02:37 AM
Lawrence D'Oliveiro <l...@nz.invalid> writes:
>On Fri, 29 Dec 2023 19:04:56 GMT, Scott Lurndal wrote:
>
[no assembler shipped to customers]

>> The [Burroughs] system ran mostly COBOL code (with some BPL;
>> assemblers weren't generally provided to customers).
>
>For an interesting reason: privilege protection was enforced in software,
>not hardware.

Actually, that is not the case.

Burroughs had multiple lines of mainframes: small, medium and large.

Small systems (b1700/b1800/b1900) had a writeable control store and the instruction set
would be dynamically loaded when the application was scheduled.

Medium systems were BCD systems (B[234][5789]xx) (descended from the original line
of Electrodata Datatron systems when Burroughs bought Electrodata
in the mid 1950s). Designed to efficiently run COBOL code. These
are the systems I was referring to above. They had hardware
enforced privilege protection.

Large systems (starting with the B5000/B5500) were stack systems
running ALGOL and ALGOL derivatives (DCALGOL, NEWP, etc)
(they also supported COBOL, Fortran, Basic, etc).

The systems you are thinking about were the Large systems. And
there were issues with that (a famous paper in the mid 1970s
showed how to set the 'compiler' flag on any application allowing
it to bypass security protections - put the application on a
tape, load it on an IBM system, patch the executable header,
and restore it on the Burroughs system).

MitchAlsup1

Jan 17, 2024, 12:40:44 PM
Performance is in millions of instructions per second.

If the instruction count was 1.0× a 6× frequency would yield 6× gain.

So, since there were 1.5× as many instructions and 6× as many instructions per
second, 6 / 1.5 = 4×

EricP

Jan 17, 2024, 1:10:52 PM
VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
machine and advertised it as such. But no one had actually measured it.
When they finally did and found it was 0.5 MIPS they just changed to
calling that "1 VUP" or "VAX-780 Units of Processing".

This also showed up in the Dhrystone benchmarks:

https://en.wikipedia.org/wiki/Dhrystone_Results

"Another common representation of the Dhrystone benchmark is the
DMIPS (Dhrystone MIPS) obtained when the Dhrystone score is divided
by 1757 (the number of Dhrystones per second obtained on the VAX 11/780,
nominally a 1 MIPS machine)."

I suppose they should have changed that to DVUPS.

Stanford MIPS (16 registers) in 1984 ran at 4 MHz with a 5 stage pipeline.
The paper I'm looking at compares it to an 8 MHz 68000 and has
Stanford MIPS averaging 5 times faster on their Pascal benchmark.

The MIPS R2000 with 32 registers launched in 1986 at 8.3, 12.5 and 15 MHz.
It supposedly could sustain 1 reg-reg ALU operation per clock.


John Levine

Jan 17, 2024, 2:14:05 PM
It appears that EricP <ThatWould...@thevillage.com> said:
>VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
>giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
>machine and advertised it as such.

No, they knew how fast it was. It was about as fast as an IBM 370/158
which IBM rated at 1 MIPS. A Vax instruction could do a lot more than
a 370 instruction so it wasn't implausible that the performance was
similar even though the instruction rate was about half.

>When they finally did and found it was 0.5 MIPS they just changed to
>calling that "1 VUP" or "VAX-780 Units of Processing".

Yeah, they got grief for the MIPS stuff.

EricP

Jan 17, 2024, 2:33:21 PM
John Levine wrote:
> It appears that EricP <ThatWould...@thevillage.com> said:
>> VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
>> giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
>> machine and advertised it as such.
>
> No, they knew how fast it was. It was about as fast as an IBM 370/158
> which IBM rated at 1 MIPS.

So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.

> A Vax instruction could do a lot more than
> a 370 instruction so it wasn't implausible that the performance was
> similar even though the instruction rate was about half.

And they define 1 VUP = 1 TOUP

>> When they finally did and found it was 0.5 MIPS they just changed to
>> calling that "1 VUP" or "VAX-780 Units of Processing".
>
> Yeah, they got grief for the MIPS stuff.

One just has to be careful comparing clock MIPS and VUPS.


John Levine

Jan 17, 2024, 2:55:16 PM
According to EricP <ThatWould...@thevillage.com>:
>John Levine wrote:
>> It appears that EricP <ThatWould...@thevillage.com> said:
>>> VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
>>> giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
>>> machine and advertised it as such.
>>
>> No, they knew how fast it was. It was about as fast as an IBM 370/158
>> which IBM rated at 1 MIPS.
>
>So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.

If you want. IBM mainframe MIPS was a well understood performance
measure at the time. In the mid 1970s, there were a few IBM clones
like Amdahl, but the other mainframe makers were already sinking into
obscurity. I can't think of anyone else making a 32 bit byte
addressable mainframe at the time that wasn't an IBM clone. I suppose
there were the Interdata machines but they were minis and sold mostly
for embedded realtime.

>> A Vax instruction could do a lot more than
>> a 370 instruction so it wasn't implausible that the performance was
>> similar even though the instruction rate was about half.
>
>And they define 1 VUP = 1 TOUP

Yes, but a TOUP really was an IBM MIPS.

EricP

Jan 17, 2024, 3:19:42 PM
John Levine wrote:
> According to EricP <ThatWould...@thevillage.com>:
>> John Levine wrote:
>>> It appears that EricP <ThatWould...@thevillage.com> said:
>>>> VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
>>>> giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
>>>> machine and advertised it as such.
>>> No, they knew how fast it was. It was about as fast as an IBM 370/158
>>> which IBM rated at 1 MIPS.
>> So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.
>
> If you want. IBM mainframe MIPS was a well understood performance
> measure at the time. In the mid 1970s, there were a few IBM clones
> like Amdahl, but the other mainframe makers were already sinking into
> obscurity. I can't think of anyone else making a 32 bit byte
> addressable mainframe at the time that wasn't an IBM clone. I suppose
> there were the Interdata machines but they were minis and sold mostly
> for embedded realtime.
>
>>> A Vax instruction could do a lot more than
>>> a 370 instruction so it wasn't implausible that the performance was
>>> similar even though the instruction rate was about half.
>> And they define 1 VUP = 1 TOUP
>
> Yes, but a TOUP really was an IBM MIPS.

Ok, but VAX-780 really was measured by DEC at 0.5 MIPS.
So either the assumption that a VUP = TOUP was wrong
or the assumption that a TOUP = MIPS was.

See section 5 and table 8.

Characterization of Processor Performance in the VAX-11/780, 1984
http://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf


John Levine

Jan 17, 2024, 3:27:50 PM
According to EricP <ThatWould...@thevillage.com>:
>>>> No, they knew how fast it was. It was about as fast as an IBM 370/158
>>>> which IBM rated at 1 MIPS.
>>> So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.
>>
>> If you want. IBM mainframe MIPS was a well understood performance
>> measure at the time. In the mid 1970s, there were a few IBM clones
>> like Amdahl, but the other mainframe makers were already sinking into
>> obscurity. I can't think of anyone else making a 32 bit byte
>> addressable mainframe at the time that wasn't an IBM clone. I suppose
>> there were the Interdata machines but they were minis and sold mostly
>> for embedded realtime.
>>
>>>> A Vax instruction could do a lot more than
>>>> a 370 instruction so it wasn't implausible that the performance was
>>>> similar even though the instruction rate was about half.
>>> And they define 1 VUP = 1 TOUP
>>
>> Yes, but a TOUP really was an IBM MIPS.
>
>Ok, but VAX-780 really was measured by DEC at 0.5 MIPS.
>So either the assumption that a VUP = TOUP was wrong
>or the assumption that a TOUP = MIPS was.

As I think I said above, a million IBM instructions did about as much
work as half a million VAX instructions. In that era, MIPS meant
either a million IBM instructions, or as some wag put it, Meaningless
Indication of Processor Speed.

Thomas Koenig

Jan 17, 2024, 5:37:04 PM
John Levine <jo...@taugh.com> schrieb:

> As I think I said above. a million IBM instructions did about as much
> work as half a million VAX instructions.

Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)

EricP

Jan 17, 2024, 6:15:01 PM
I was thinking the same thing. VAX address modes like auto-increment
would be equivalent to 2 instructions for each operand and likely used
in benchmarks.

VAX having 32-bit immediates and offsets and 64-bit float immediates
per operand vs 370 having to build constants or load them.

And POLY for transcendentals is one instruction.

All of those would add clocks to the VAX instruction execute time
but not its instruction count and MIPS.


Scott Lurndal

Jan 17, 2024, 6:56:28 PM
EricP <ThatWould...@thevillage.com> writes:
>Thomas Koenig wrote:
>> John Levine <jo...@taugh.com> schrieb:
>>
>>> As I think I said above. a million IBM instructions did about as much
>>> work as half a million VAX instructions.
>>
>> Why the big difference? Were fancy addressing modes really used so
>> much? Or did the code for the VAX mostly run POLY instructions? :-)
>
>I was thinking the same thing. VAX address modes like auto-increment
>would be equivalent to 2 instructions for each operand and likely used
>in benchmarks.

MOVC3 and MOVC5, perhaps?

MitchAlsup1

Jan 17, 2024, 7:00:52 PM
Fancy addressing modes {indirection, pre-decrement, post-increment,
constants, displacements, index, ADD-CMP-Branch, CRC, bit manipulation, ...}.
You could say these contribute to most of the gain.

Lawrence D'Oliveiro

Jan 17, 2024, 7:24:42 PM
On Wed, 17 Jan 2024 23:56:24 GMT, Scott Lurndal wrote:

> MOVC3 and MOVC5, perhaps?

Interruptible instructions ... wot fun ...

John Levine

Jan 17, 2024, 9:05:35 PM
I don't think anyone used the fancy addressing modes or complex
instructions much. But here's an example. Let's say A, B, and C are
floats in addressable memory and you want to do A = B + C

370 code

LE R0,B
AE R0,C
STE R0,A

VAX code

ADDF3 B,C,A

The VAX may not be faster, but that's one instruction rather than 3.

Or that old Fortran favorite I = I + 1

370 code

L R1,I
LA R2,1
AR R1,R2
ST R1,I

VAX code

INCL I

or if you have a lousy optimizer

ADDL2 #1,I

or if you have a really lousy optimizer

ADDL3 #1,I,I

It's still one instruction rather than four.

In 370 code you often also needed extra instructions to make data
addressable since it had no direct addressing and address offsets in
instructions were only 12 bits.

Lawrence D'Oliveiro

Jan 17, 2024, 11:51:13 PM
On Thu, 18 Jan 2024 02:05:30 -0000 (UTC), John Levine wrote:

> ADDF3 B,C,A
>
> The VAX may not be faster, but that's one instruction rather than 3.

If those were register operands, that instruction would be 4 bytes.

I think worst case, each operand could have an index register and a 4-byte
offset (in addition to the operand specifier byte), for a maximum
instruction length of 19 bytes.

So, saying “just one instruction” may not sound as good as you think.

Here’s an old example, from the VMS kernel itself. This instruction

PUSHR #^M<R0,R1,R2,R3,R4,R5>

pushes the first 6 registers onto the stack, and occupies just 2 bytes.
Whereas this sequence

PUSHL R5
PUSHL R4
PUSHL R3
PUSHL R2
PUSHL R1
PUSHL R0

does the equivalent thing, but takes up 2 × 6 = 12 bytes.

Guess which is faster?

Terje Mathisen

Jan 18, 2024, 4:08:52 AM
REP MOVS is the classic x86 example: Since all register usage is fixed
(r)si,(r)di,(r)cx the cpu can always accept an interrupt at any point,
it just needs to update those three registers and take the interrupt.

When the instruction resumes, any remaining moves are performed.

This was actually an early 8086/8088 bug: If you had multiple prefix
bytes, like you would need if you were moving data to the Stack segment
instead of the Extra, and the encoding was REP SEGSS MOVS, then only hte
last prefix byte was remembered in the saved IP/PC value.

I used to check for this bug by moving a block which was large enough
that it took over 55ms, so that a timer interrupt was guaranteed:

If the CX value wasn't zero after the instruction, then the bug had
happened.

EricP

Jan 18, 2024, 9:21:05 AM
VAX Fortran77 could optimize a DO loop array index to an autoincrement,
I think they called it strength reduction of loop induction variables.

do i = 1, N
A(i) = A(i) + B(i)
end do

ADDD (rB)+, (rA)+

VAX usage stats for compilers Basic, Bliss, Cobol, Fortran, Pascal, PL1,
show usage frequency per operand specifier of autoincrement ~4%, index ~7%
except Basic has 17% for autoincrement.
There is almost no usage of deferred addressing (address of address of data).


EricP

Jan 18, 2024, 9:40:17 AM
Lawrence D'Oliveiro wrote:
> On Thu, 18 Jan 2024 02:05:30 -0000 (UTC), John Levine wrote:
>
>> ADDF3 B,C,A
>>
>> The VAX may not be faster, but that's one instruction rather than 3.
>
> If those were register operands, that instruction would be 4 bytes.
>
> I think worst case, each operand could have an index register and a 4-byte
> offset (in addition to the operand specifier byte), for a maximum
> instruction length of 19 bytes.

The longest instruction I think might be an ADDH3 with two H format
16-byte float immediates with an indexed destination with 4 byte offset.

That should be something like 2 opcode, 1 opspec, 16 imm,
1 opspec, 16 imm, 1 opspec, 4 imm, 1 index = 42 bytes.

(Yes its a silly instruction but legal.)

Anton Ertl

Jan 18, 2024, 11:27:12 AM
EricP <ThatWould...@thevillage.com> writes:
>The MIPS R2000 with 32 registers launched in 1986 at 8.3, 12.5 and 15 MHz.
>It supposedly could sustain 1 reg-reg ALU operation per clock.

It could do at most one instruction per clock, and it certainly needed
to branch at some point, so no sustained 1/clock ALU instructions.
Also, a useful program would want to load or store at some point, so
even fewer ALU instructions. And with cache misses, also fewer than 1
IPC.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

John Levine

Jan 18, 2024, 11:31:34 AM
According to Lawrence D'Oliveiro <l...@nz.invalid>:
>On Thu, 18 Jan 2024 02:05:30 -0000 (UTC), John Levine wrote:
>
>> ADDF3 B,C,A
>>
>> The VAX may not be faster, but that's one instruction rather than 3.
>
>If those were register operands, that instruction would be 4 bytes.
>
>I think worst case, each operand could have an index register and a 4-byte
>offset (in addition to the operand specifier byte), for a maximum
>instruction length of 19 bytes.
>
>So, saying “just one instruction” may not sound as good as you think.

I wasn't saying they were always better, just pointing out that there
were straightforward reasons that 500K VAX instructions could do the
same work as 1M 370 instructions.

Considering that the 370 is still alive and the VAX died decades ago,
it should be evident that instruction count isn't a very useful
metric across architectures.

Anton Ertl

Jan 18, 2024, 12:01:57 PM