On 6/16/2022 6:56 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Tuesday, June 14, 2022 at 12:29:41 PM UTC-5, BGB wrote:
>>> On 6/14/2022 6:20 AM, Anton Ertl wrote:
>>>> Early MIPS CPUs were high-performance for their time.
>>>>
>>> Probably true, it would appear that (prior to the Pentium) x86 and
>>> friends weren't exactly all that high-performance either.
>>>
>> Pentium Pro was the first x86 that anyone would have considered fast.
>
> IMHO, high-perf x86 actually started with the Pentium:
>
> The PPro/P6 was simply the first model that was fast when running
> compiled code!
>
Going by what stats I could find, it seemed to be doing pretty well vs
386 and 486.
As noted before, my DMIPS/MHz score only slightly beats out a 486, but
how much is due to architecture vs compiler vs ... is uncertain.
Even as weak as my compiler is, though, architecturally I should have an
advantage over the 486 in nearly every area (well, apart from branch
latency, my core apparently having a somewhat longer pipeline than the 486).
My emulator currently shows a fair number of cycles now going into
branches, but it does not model the branch predictor at present.
Mostly because branching was relatively less of an issue back when the
core had more significant memory-access bottlenecks (e.g., L1<->L2 was
then the bottleneck, vs L2<->DRAM now).
Though, as-is it appears even a 100% L2 hit rate isn't enough to
consistently break 10fps in GLQuake at 50MHz.
Based on the "simulation status LEDs", I am starting to see a few more
stalls due to non-memory reasons. It would appear the FPU is starting to
make its presence known.
> By investing a lot of programmer resources, it was in fact possible to
> get really high performance out of the classic Pentium. My favorite
> algorithm to show this is word count where my 60 MHz Pentium could count
> characters/words/lines at an actual (measured) performance of 40 MB/s.
>
> Similarly, as soon as we got the MMX 64-bit SIMD extension, it was
> possible to do full rate DVD decoding with zero frame drops. (Zoran
> SoftDVD was the first program to achieve this, I helped out a little bit
> with the asm code.)
>
I have my doubts as to how much I could pull off here.
Based on my own experiences, I would suspect 640x480p30 MPEG decoding
would be a bit outside of what I could likely pull off on a 50 MHz CPU core.
I was having enough of a challenge as-is with simpler codec designs at
320x200.
Though, in this case, one faces both the issue of making the decoder
fast enough, and of keeping the bitrate low enough to not choke on the
SD card (e.g., why I couldn't just play MS-CRAM video; the bitrate
needed for CRAM to look "not completely awful" was pretty high).
But, as noted, what I had the best results with was mostly:
Differential color endpoint vectors;
Skipped if no delta;
1/2 bytes for small deltas;
3+ bytes for larger deltas.
6-bit lookup-based patterns (1);
LZ post-compressor.
Experimented with both byte-oriented and Huffman encoded LZ;
Huffman only sometimes won (usually with I-Frames).
Encoded frame data could effectively be seen as a stream of command
bytes, with commands in a few categories:
Draw a block (6b pattern, or a larger block if needed);
Update color endpoints (for a subsequent block);
Skip N blocks;
Combined color endpoint and block (and/or larger blocks and runs).
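As a concrete sketch of the command-byte dispatch (note: the 2-bit
opcode layout below is made up for illustration; the actual codec's bit
assignments aren't given here):

```c
#include <stdint.h>

enum cmd_kind { CMD_SKIP, CMD_DRAW, CMD_ENDPOINT, CMD_COMBINED };

#define CMD_ARG(b) ((b) & 0x3F)   /* low 6 bits: pattern / count / delta */

/* Classify one command byte by its (hypothetical) top 2 bits. */
static enum cmd_kind classify_cmd(uint8_t b)
{
    switch (b >> 6) {
    case 0:  return CMD_SKIP;      /* skip N blocks                 */
    case 1:  return CMD_DRAW;      /* draw block with 6-bit pattern */
    case 2:  return CMD_ENDPOINT;  /* update color endpoints        */
    default: return CMD_COMBINED;  /* endpoints + block in one go   */
    }
}
```

The decoder's inner loop then becomes a switch on `classify_cmd()`,
with any payload (delta bytes, run lengths) read after the command byte.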
1:
h_sign, v_sign, h_freq(2b), v_freq(2b)
With freq based on a pattern, eg (0..7):
00: 5555 / 2222 (S=1)
01: 6521 / 1256 (S=1)
10: 6116 / 1661 (S=1)
11: 7070 / 0707 (S=1)
Both the horizontal and vertical patterns would be combined, and then
used to generate the color-selection pattern.
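A rough sketch of how such patterns could be expanded (the freq tables
are the ones listed above; the bit layout of the index and the
sum-and-threshold combine rule are my guesses, not the codec's actual
rule):

```c
#include <stdint.h>

/* Frequency tables (values 0..7); the sign bit picks the reversed
   variant of each pair. */
static const uint8_t freq_tab[4][2][4] = {
    { {5,5,5,5}, {2,2,2,2} },  /* 00 */
    { {6,5,2,1}, {1,2,5,6} },  /* 01 */
    { {6,1,1,6}, {1,6,6,1} },  /* 10 */
    { {7,0,7,0}, {0,7,0,7} },  /* 11 */
};

/* Expand a 6-bit index (h_sign, v_sign, h_freq, v_freq) into a 16-bit
   4x4 color-selection mask, bit (4*y+x); combine rule assumed to be
   "sum the weights and threshold at the midpoint". */
static uint16_t expand_pattern(int idx6)
{
    int h_sign = (idx6 >> 5) & 1;
    int v_sign = (idx6 >> 4) & 1;
    const uint8_t *h = freq_tab[(idx6 >> 2) & 3][h_sign];
    const uint8_t *v = freq_tab[idx6 & 3][v_sign];
    uint16_t mask = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            if (h[x] + v[y] >= 8)   /* assumed threshold */
                mask |= (uint16_t)1 << (y * 4 + x);
    return mask;
}
```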
The color endpoints were seen as a pair of RGB555 values, but one
doesn't always want to encode a pair of RGB555 endpoints, so one would
want a scheme to compact this into a smaller representation (such as a
delta applied to the prior endpoints, a joint-coded endpoint, ...).
In the small-case, one can spend say, 5-bits, which indicates to
add/subtract 1 to each component (3^3=27), with 28..31 as a few escape
cases.
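A sketch of that small-delta decode (the base-3 digit order is my
assumption; codes at or above 27 are simply treated as escapes here,
per the 28..31 escape range mentioned above):

```c
/* Decode one of the 27 small-delta codes (0..26) into per-component
   steps of -1/0/+1, one base-3 digit per component.  Returns 0 for
   escape codes, where the caller would read the larger delta forms. */
static int decode_small_delta(int code, int step[3])
{
    if (code >= 27)
        return 0;                  /* escape case */
    for (int i = 0; i < 3; i++) {
        step[i] = (code % 3) - 1;  /* digit 0 -> -1, 1 -> 0, 2 -> +1 */
        code /= 3;
    }
    return 1;
}
```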
This codec didn't really do motion compensation, mostly block skip /
non-skip (translation was defined, but this effectively requires
double-buffering the decoder, which means skipped blocks need to be
copied from one frame to another).
This works well if the background stays static, but is not particularly
effective with camera movements (one effectively needs to re-encode the
whole frame in these cases). Though, CRAM and RPZA also have this
limitation.
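A sketch of what the double-buffered skip-copy amounts to (assuming 4x4
blocks of RGB555 pixels; the names and buffer layout are made up):

```c
#include <stdint.h>

/* With two frame buffers, a skipped block is not "free": its pixels
   must be copied from the previous frame into the current one.
   stride is the frame width in pixels; (bx,by) is the block index. */
static void copy_skipped_block(uint16_t *cur, const uint16_t *prev,
                               int stride, int bx, int by)
{
    const uint16_t *s = prev + (by * 4) * stride + bx * 4;
    uint16_t       *d = cur  + (by * 4) * stride + bx * 4;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y * stride + x] = s[y * stride + x];
}
```

With a single frame buffer, skipped blocks cost nothing, which is why
translation/motion support (which forces the two-buffer scheme) was
defined but not really used.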
> Terje
>
> PS. Funnily enough, my 60 MHz Pentium asm code would actually run slower
> on a 200 MHz PPro! The P6 needed an algorithm which avoided all the
> partial register updates my Pentium code took advantage of.
>
Yep.
I made a few codecs before which managed to be particularly fast on
Phenom and Bulldozer/Piledriver, but then took a relative performance
hit on Ryzen.
Partly, this was because on the former, it was often faster to load
values as a packed integer, and then bit-shift and mask out the parts
one was interested in.
On the Ryzen, it was often faster to load discrete byte or word values
from memory, without using any shift-and-mask steps.
This did not fare super well for codecs built heavily around "read 64
bits at a time and then shift-and-mask everything".
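Roughly, the two styles look like this (function names are made up; the
shift-and-mask version assumes a little-endian target):

```c
#include <stdint.h>
#include <string.h>

/* Style A: one wide load, then shift-and-mask out the field.
   This was the faster option on Phenom/Bulldozer. */
static uint8_t get_byte_shift(const uint8_t *buf, int i)
{
    uint64_t v;
    memcpy(&v, buf, 8);             /* unaligned-safe 64-bit load */
    return (uint8_t)(v >> (i * 8)); /* little-endian byte extract */
}

/* Style B: discrete narrow loads, no shifting.
   This was often the faster option on Ryzen. */
static uint8_t get_byte_direct(const uint8_t *buf, int i)
{
    return buf[i];
}
```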
There are also some differences for optimizing for 32-bit or 64-bit x86:
x86-32: favors using 32-bit types and "small and tight" loops;
Computed values are kept close to their point of use;
Using lookup tables rather than computing stuff works well;
...
x86-64: more favors 64-bit types and unrolling.
Computed values may be spread out a little more.
In comparison, BJX2 is sorta like x86-64 here, but more so:
It tends to favor fairly aggressive loop unrolling;
Favors distancing computing values from their point of use.
So, one calculates something, and then throws some other independent
expressions between this and the expression that uses the result, with a
fair amount of unrolling, ...
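As a toy illustration of the two styles (a plain reduction, rolled vs
4-way unrolled with independent accumulators to put distance between
each load and its use):

```c
/* Tight rolled loop: the x86-32 style, one dependency chain. */
static long sum_rolled(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled with 4 independent accumulators: the BJX2/x86-64 style.
   Each load's result is not consumed until 3 other independent
   operations have been issued in between. */
static long sum_unrolled(const int *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)    /* tail */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```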
Actually, in some ways, optimizing stuff for my core is kind of
suspiciously similar to the behavior of Bulldozer and friends (though I
don't have a particularly strong explanation for why this would be).
Though, for Bulldozer, there was also sort of a rule of "avoid 'if()'
blocks by any means possible" (use a bunch of extra bit-twiddly and
arithmetic operations, but whatever you do, don't use an 'if()' ...),
which is at least slightly less of an issue with BJX2.
Comparably, my Ryzen seems to not care about branching as much as on
Bulldozer, so the 'if()' branches are slightly less "avoid at any cost".
Though both are unlike the RasPi, which doesn't seem to care all that
much if or where one uses a branch (one can use 'if()' branches, get
clever with function pointers all over the place, ..., and the ARM CPU
doesn't seem to care).
But, if I try to run code optimized for BJX2 on a RasPi, in terms of
significant unrolling with a "whole truckload" of variables, ..., it
tends to perform like garbage (favoring smaller/tighter loops and closer
association between related expressions, more like on 32-bit x86).
Though, a lot of this is likely explainable by 32-bit ARM not liking it
when there are more in-flight variables than the CPU has registers to
hold.
...
One other thing I can note (both my prior and current CPU in my PC):
Don't enable AVX support in MSVC, as this basically ruins performance.
It seems AVX performs like crap compared with plain SSE; but if
enabled, MSVC's auto-vectorization will try to use it, and there is no
good way in MSVC to turn the auto-vectorization off.
Though, in this case I am currently using a CPU that is still using a
128-bit SIMD unit internally (Zen+).
...