68k, where it went wrong.

jacko

Feb 16, 2017, 8:02:38 PM

I've written a page on the reasons I think it all went sideways. Mainly in the use of the 11 bit pattern in the size field.

https://kring.co.uk/2017/02/where-the-68k-isa-went-wrong/

What do you think?

Walter Banks

Feb 16, 2017, 8:37:32 PM

A couple of things. You are judging the 68K 36 or 37 years after the ISA was
developed. The 68K had its roots in the PDP-11. The 68K and 6809 ISAs
both had significant influence on later microprocessor ISAs.

Without really commenting on the comments, they are hindsight. The 68K
was an early ISA that could run on a saleable platform. I had both 68000
and 68008 development systems. They had ISAs that were easy to generate
reasonably efficient code for and the 68K could be scaled into a
minicomputer type capability.

w..

Ivan Godard

Feb 16, 2017, 9:48:57 PM

I wouldn't want to be the guy to write your compiler.

jacko

Feb 17, 2017, 12:57:02 AM

Yes it is hindsight. And the arguments are to keep some of the ISA for a research development. The actual bootstrap instruction set is not relevant these days, and is only made relevant because code remapping to new instruction sets is considered difficult. I mean is 68k a source? Should it be?

jacko

Feb 17, 2017, 12:58:41 AM

> I wouldn't want to be the guy to write your compiler.

Sorry I missed out the BCPLR (build compiler instruction), but I can't fit everything in :D

jacko

Feb 17, 2017, 1:05:25 AM

Also judgement is a bit of a harsh slant. But yes, judging. The purpose of the judgement is not to say that it was done wrong out of spite, but that it was done wrong in my opinion. The main reasoning is that there is no easy use of the 11 pattern in the size field for 64 bit.
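
For readers who don't have the encoding in their head, here is a minimal sketch of the size-field point. It assumes the common 68k layout in which bits 7-6 of the opcode word hold the operand size (00 byte, 01 word, 10 long) and the 11 pattern is taken by other instructions. The function name and the example opcodes are only illustrative.

```python
# Sketch: the 2-bit size field used by many 68k instructions (bits 7-6).
# Assumption: the usual 00/01/10 = byte/word/long mapping; 0b11 is not a size.
SIZE_BYTES = {0b00: 1, 0b01: 2, 0b10: 4}


def operand_size(opcode_word):
    """Return the operand size in bytes, or None when the field holds 0b11."""
    field = (opcode_word >> 6) & 0b11
    return SIZE_BYTES.get(field)


# CLR.L D0 encodes as 0x4280: the size field is 0b10.
print(operand_size(0x4280))
# The same pattern with the field forced to 0b11 carries no size meaning,
# so there is no spare code point left for a 64-bit operand.
print(operand_size(0x42C0))
```

With only three of the four code points meaning a size, a 64-bit size has nowhere obvious to go, which is the complaint above.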

It all really went RISC about the death of it, not really that CISC was wrong in an absolute sense, but wrong in the relative trend. Intel demonstrates CISC works, and even ARM these days is getting CISCy, and getting a bit 68k thumby.

Have you seen the A500 Vampire FPGA core? Aros Icaros is a very fast boot x86 OS which shows off virtualbox well.

jacko

Feb 17, 2017, 1:08:06 AM

The part II link at the bottom of the page goes on to caching, which is relevant to all designs.

Quadibloc

Feb 17, 2017, 1:39:32 AM

I skimmed through it; some of it was too technical.

If the 68K skimped on floating-point and memory management - and I agree that
the decision to only make the integer part of the 68060 out-of-order was a bad
one - because, in their judgement, in the market conditions of the time, that
would make their chip more cost-effective for their likely customers, I have no
basis on which to judge them in hindsight.

Personally, the one thing that keeps me from recommending that we all toss out the x86 architecture, and go back to the more sensible, cleaner, 68000 architecture is this:

The original 68000 only let you add the contents of two registers to a
displacement in the instruction if that displacement was eight bits long. The
68020 introduced a mode with two registers and a 16-bit displacement, but
instructions like that were 48 bits long, not 32 bits long.
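
As a rough illustration of why the displacement was limited to eight bits, here is a sketch that decodes the 68000's brief extension word for the d8(An,Xn) mode. The layout is my understanding of it: bit 15 selects a data or address index register, bits 14-12 give the register number, bit 11 selects a word or long index, and bits 7-0 hold the signed displacement. The function name and dictionary keys are made up for the example.

```python
# Sketch: decoding the 68000 brief extension word used by d8(An,Xn).
# The base register An lives in the opcode word itself and is not shown here.
def decode_brief_extension(word):
    """Split a 16-bit brief extension word into its fields."""
    index_is_address = bool(word & 0x8000)   # bit 15: 0 = Dn, 1 = An
    index_reg = (word >> 12) & 0x7           # bits 14-12: register number
    index_long = bool(word & 0x0800)         # bit 11: 0 = word, 1 = long index
    disp = word & 0xFF                       # bits 7-0: signed displacement
    if disp & 0x80:
        disp -= 0x100                        # sign-extend the 8-bit value
    return {
        "index": ("A" if index_is_address else "D") + str(index_reg),
        "index_size": "L" if index_long else "W",
        "disp8": disp,
    }


print(decode_brief_extension(0x18FC))
```

Once the index register and its size are packed into the one extension word, only eight bits are left over, so a 16-bit displacement needs a further word and the instruction grows to 48 bits.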

This was BAD because that meant that the 68k architecture was treating a
fundamental, basic, ordinary, everyday means of memory access - indexed memory
reference - as though it were something exotic and rarely-used.

Everybody knows that register-to-register instructions should take 16 bits,
that's what an RR format instruction takes. And indexed memory-reference
instructions (which means base-index addressing) should take 32 bits, that's
what an RX format instruction takes.

So the 68k architecture gets a FAIL from me because it can't do what the IBM
System/360 already _proved_ was possible.

Except that the System/360 used a 12-bit displacement, not a 16-bit
displacement.

And that's why in my example of what I thought a CISC architecture _should_ be,
which became my "everything but a kitchen sink" attempt to include every feature
any computer ever had, so I could explain how those features worked...

I started with *this* format for memory-reference instructions:

opcode: 7 bits
destination register: 3 bits
index register: 3 bits
base register: 3 bits

displacement: 16 bits
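
A minimal sketch of how those fields pack into 32 bits. The field widths are the ones listed above; the bit order (opcode in the high bits, displacement in the low half) and the function names are assumptions made for the example.

```python
# Sketch: packing the 7/3/3/3 + 16 memory-reference format into one 32-bit word.
# Assumption: fields are laid out left to right in the order they are listed.
def encode(opcode, dest, index, base, disp):
    """Build a 32-bit instruction from its fields."""
    if not 0 <= opcode < 128:
        raise ValueError("opcode must fit in 7 bits")
    for name, reg in (("dest", dest), ("index", index), ("base", base)):
        if not 0 <= reg < 8:
            raise ValueError(name + " must fit in 3 bits")
    first_half = (opcode << 9) | (dest << 6) | (index << 3) | base
    return (first_half << 16) | (disp & 0xFFFF)


def decode(insn):
    """Return (opcode, dest, index, base, disp) from a 32-bit instruction."""
    first_half = insn >> 16
    return (
        (first_half >> 9) & 0x7F,
        (first_half >> 6) & 0x7,
        (first_half >> 3) & 0x7,
        first_half & 0x7,
        insn & 0xFFFF,
    )


word = encode(0x55, 1, 2, 3, 0x1234)
print(hex(word), decode(word))
```

The first halfword is exactly full (7 + 3 + 3 + 3 = 16), which is why the register fields cannot be wider than three bits in this scheme.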

Both the 68000 and the 8086 used sets of eight registers. The 68000 had eight
arithmetic registers and eight address registers. One could specify whether an
arithmetic register or an address register would supply any of the values to be
added to a displacement.

So I accepted that I would have eight general registers and eight floating-point
registers, instead of sixteen general registers like the System/360.

And I accepted an additional limitation: the index register could only and
always be a general register; the base register could only and always be an
address register. So the address registers are reserved for containing base
addresses - but they're not dedicated to different things (i.e., code versus
data), so they're not like the segment registers of the 8086. Arithmetic
registers are used for index registers, since array subscripts are the outcome
of calculations.

That, I felt, was a sound starting point for a big-endian CISC architecture.

But instead of continuing from there soundly, of course, I went off into a
wilderness.

John Savard

Bruce Hoult

Feb 17, 2017, 2:34:23 AM

I think that's amazingly hard to follow.

As for "Who is $Axxx reserved for?"

I'm sure at first Motorola thought they'd use it in later CPUs. But once their most important customer started using it for system calls they didn't have that opportunity.

Nick Maclaren

Feb 17, 2017, 6:29:52 AM

In article <67409560-e175-4505...@googlegroups.com>,

I'll look at it later, but the reason it failed was nothing to
do with the architecture. You can describe it in various ways,
but the most accurate is 'circumstance', though there were things
that Motorola got wrong.

The biggest technical mistake was to over-complicate the design;
even with the 68040, most code used only the older subset, and
that was already a well-known phenomenon.

Just as with the Itanic, later, the RISC propaganda succeeded
where the technology failed. 'Everybody knew' that RISCs were
so much faster than CISCs that the latter were obsolete - except
for the people who actually timed them on real uses. It
wasn't until after the 68K development had been halted that the
RISCs actually caught up. [ RISC fanatics: please don't produce
the hoary chestnuts of computational kernels, hand-tuned code
etc. to 'disprove' that. ]

What killed the 68K was all the major workstation vendors backing
either the x86 range or their own RISC design. Most customers
With Clue continued to buy 68K-based machines for as long as we
could get them - yes, the early RISCs were better in some respects
and for some uses but, for most uses, 68K beat the heck out of
all other architectures until Motorola pulled the plug. And the
reasons that the IBM PC users favoured the x86 were nothing to do
with the architecture.

Note that, while the x86 brigade claim(ed) that IBM chose that
architecture on technical grounds, they weren't even seriously
considered. At most, the cost argument favoured the x86 at the
time the decision was taken, assuming that the PC/XT would be the
last in that range.

Regards,
Nick Maclaren.

jacko

Feb 17, 2017, 7:29:28 AM

Yep, the d8 thing, with a high register which should have been specified in bits 11 to 9 (as this is sort of already decoded), perhaps a 9 bit displacement as there's also some MOVEQ issue with not using bit 9. There'd be even space for a third register along with a 9 bit displacement in a d8 (d9?) literal word. Maybe bit 9 should enable the A register in bits 11 to 9, and then bit 15 to 12 are still free.

Not sure why they kind of wasted space on the XXX.W mode ... And a few of those spare addressing modes following # would make SR, CCR modes and ...

Andy Valencia

Feb 17, 2017, 9:35:00 AM

n...@wheeler.UUCP (Nick Maclaren) writes:
> I'll look at it later, but the reason it failed was nothing to
> do with the architecture.

I remember following the twists and turns of the 010, 020, 030, and
then 040 (anybody else involved in trying to get SMP going on the
040?). Getting Unix running was a real hassle every single time.

Compare to 386 onward. The peripherals shuffled a little, but the
base OS ran correctly without modifications.

> What killed the 68K was all the major workstation vendors backing
> either the x86 range or their own RISC design.

We were *so* happy to kiss off Moto after dealing with all the
hair underneath the 040. PA RISC was no better (IMHO), but x86
with PCI was a comparative dream. Adequate (though register starved)
architecture, stable, and a nice I/O bus. They even got most of
SMP correct (although Sequent did their own interrupt arbitration).

Andy Valencia

Bruce Hoult

Feb 17, 2017, 9:40:08 AM

It's hard to say where the 680x0 series could have gone if Motorola had continued with it. I think we can assume that they tried reasonably hard with the 68060 at least.

A comparison of process and speeds achieved:

600 nm: 68060 50 MHz; PowerPC 601 80 MHz (presumably identical process)
500 nm: PowerPC 601+ 120 MHz, PowerPC 604 180 MHz
420 nm: 68060 75 MHz
350 nm: PowerPC 603e 300 MHz, PowerPC 604e 233 MHz

I enjoyed using and programming the 68k. The 40 MHz '040 was a beast in its time (e.g. Mac IIfx), but once the 486DX2 came out it felt as if Motorola didn't have an answer. Maybe the 68060 could hold its own against the Pentium and PPC601 at similar clock speeds -- I don't know as it was on the scene late and I've never used one. For sure it never increased its clock speed like they (and their successors) did.

thomas....@gmail.com

Feb 17, 2017, 10:44:19 AM

I think the ISA is the least likely reason for the decline of the 68k. If you can make even the x86 ISA (with instructions of up to 15 bytes) run fast, this should also have been possible with similar effort for the 68k... (although I think the longest instruction there is 22 bytes. But ColdFire reduced this to 6 bytes.)

Regards,

Thomas

eugene...@gmail.com

Feb 17, 2017, 11:02:30 AM

Based on a nice CISC vs. RISC comparison by John Mashey at
http://userpages.umbc.edu/~vijay/mashey.on.risc.html, I think the ISA
was part of the problem.

The problem comes with virtual memory - What happens when we get a page
fault? In the 68000, you could fetch the first operand and then have a page
fault on the second. Especially since the first fetch could be I/O or
involve modifying a register, you have to stop in mid-instruction and be able
to resume mid-instruction after taking care of the page fault.

With the x86, the vast majority of instructions only allow one memory operand,
so if there is a page fault, you can simply back up and redo the entire
instruction. This simplifies the control unit and
allows you to spend those transistors elsewhere.

I don't think the ISA was the only reason, but I think it did play a role.

Gene Styer

Quadibloc

Feb 17, 2017, 1:24:19 PM

On Friday, February 17, 2017 at 4:29:52 AM UTC-7, Nick Maclaren wrote:

> What killed the 68K was all the major workstation vendors backing
> either the x86 range or their own RISC design

But in that case, the web page in question has one very valid point: that
Motorola was favoring integer performance and neglecting floating-point
performance. That made sense if your sales are to the mass market (AMD sort of
repeated that mistake with Bulldozer) but if you need sales to workstation
vendors, it's suicide.

However, I'm not sure that could be what killed the 68K, because far fewer
Apollos and so on were sold than Macintosh computers.

So Motorola wouldn't need to have sold 68K chips to people like Silicon Graphics
who bought MIPS instead... what they instead needed was a _real_ mass market,
like the one Intel had.

And so if IBM had kept the legal wraps on its BIOS, with legal PC clones being
harder to develop than they were... so that people wanting an affordable 16-bit
computer were forced to settle for an Atari ST instead of a PC clone, so that
the Atari ST (and maybe also the more expensive Amiga) survived until the
present day (yes, there is a resurrected Amiga, but it isn't a 68K machine
because of the hiatus)... *that*, not workstations, would have kept the 68K
alive and well.

Even the Macintosh would still be using 68K chips under that set of
circumstances.

> Note that, while the x86 brigade claim(ed) that IBM chose that
> architecture on technical grounds, they weren't even seriously
> considered. At most, the cost argument favoured the x86 at the
> time the decision was taken, assuming that the PC/XT would be the
> last in that range.

My understanding is that there were technical considerations that led IBM to
choose the 8086 architecture for the PC. But it wasn't that this was "better"
than the 68000 in the sense of being more powerful. Instead, there were two
factors:

1) Intel had an 8086 SX part available. They didn't call it that, of course, but
the 8088 was an 8086 with an external 8-bit data bus. That allowed the IBM PC to
be manufactured at a price competitive with Z-80 based CP/M machines.

2) The 8086 was architecturally very similar to Intel's previous 8080. This
facilitated easy conversion of programs written for CP/M to PC-DOS.

At least, that's what I read in the industry magazine articles on the subject,
and I have no reason to think they're mistaken, as the reasons sound plausible.
Even though a wider bus shouldn't make a computer that much more expensive to
build, and people even then wrote spreadsheet programs and database programs in
C instead of assembler, didn't they?

John Savard

Nick Maclaren

Feb 17, 2017, 1:53:51 PM

In article <2017021714282...@a20.vsta.org>,

Andy Valencia <van...@vsta.org> wrote:
>> I'll look at it later, but the reason it failed was nothing to
>> do with the architecture.
>
>I remember following the twists and turns of the 010, 020, 030, and
>then 040 (anybody else involved in trying to get SMP going on the
>040?). Getting Unix running was a real hassle every single time.
>
>Compare to 386 onward. The peripherals shuffled a little, but the
>base OS ran correctly without modifications.

The IBM PC decision was 6+ years before that, even if one takes the
386 as a baseline. Another aspect is that the 68010 (1982) could
run Unix, but it took until the 386 (1985) before Intel could; the
286 just didn't cut the mustard, which was bemoaned by the OS/2
people, as well as users who knew what a true operating system
could do.

However, I am NOT claiming that the later 68Ks were superior - they
weren't - but that the 68K was already failing as an architecture
by the time the 68040 appeared, for non-technical reasons. Even if
the 68040 had been technically excellent, it would have failed.

Regards,
Nick Maclaren.

Nick Maclaren

Feb 17, 2017, 2:01:01 PM

In article <71ebac26-d52e-4eb0...@googlegroups.com>,

Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>> Note that, while the x86 brigade claim(ed) that IBM chose that
>> architecture on technical grounds, they weren't even seriously
>> considered. At most, the cost argument favoured the x86 at the
>> time the decision was taken, assuming that the PC/XT would be the
>> last in that range.
>
>My understanding is that there were technical considerations that led IBM to
>choose the 8086 architecture for the PC. But it wasn't that this was "better"
>than the 68000 in the sense of being more powerful. Instead, there were two
>factors:
>
>1) Intel had an 8086 SX part available. They didn't call it that, of
>course, but
>the 8088 was an 8086 with an external 8-bit data bus. That allowed the
>IBM PC to
>be manufactured at a price competitive with Z-80 based CP/M machines.
>
>2) The 8086 was architecturally very similar to Intel's previous 8080. This
>facilitated easy conversion of programs written for CP/M to PC-DOS.
>
>At least, that's what I read in the industry magazine articles on the subject,
>and I have no reason to think they're mistaken, as the reasons sound plausible.

Yeah. But those were justifications for the decision, not the reasons
for the decision. I worked within IBM with some of the people who
would have been involved if the decision had been technical, and
heard some of the inside story. Whether the decision would have
been different if it had been considered technically, I can't say.

Regards,
Nick Maclaren.

Quadibloc

Feb 17, 2017, 2:18:31 PM

On Friday, February 17, 2017 at 9:02:30 AM UTC-7, eugene...@gmail.com wrote:

> The problem comes with virtual memory - What happens when we get a page
> fault? In the 68000, you could fetch the first operand and then have a page
> fault on the second.

Presumably, a consequence of the 68000 being based too much on the PDP-11 and not
enough on the IBM System/360. The same problem would have applied to the VAX and
to the TI 9900, both more obviously modelled on the PDP-11, and allowing memory-
to-memory operations as a basic standard feature.

That problem would have been avoided by what I chose to do for a more trivial
reason - to make indexed memory-reference instructions 32 bits long instead of
48 bits long.

A PDP-11-like instruction format... say with four addressing modes like the 9900
and eight registers like the PDP-11, so one can have six bits instead of four
for the opcode, and include floating-point as a basic feature... would have to
either go to 12-bit displacements, or be resigned to having a virtual address
space of 65,536 bytes and no more.

There is, however, a workaround. You separate fetching the two operands from the
rest of executing the instruction, so that any page faults are fully handled
before anything but a read, which has changed nothing _to_ roll back, has
happened.
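
A toy model of that workaround, written as a sketch rather than a description of any real chip: a memory-to-memory add does both of its reads first, so a page fault can only surface before any state has changed, and the whole instruction can simply be retried. The class names, the page size and the "OS" that maps the missing page are all invented for the illustration.

```python
# Sketch: do every operand read before the single write, so a fault never
# leaves a half-finished instruction behind. All names here are hypothetical.
PAGE = 4096


class PageFault(Exception):
    def __init__(self, page):
        super().__init__(f"page {page} not resident")
        self.page = page


class Memory:
    def __init__(self, resident_pages):
        self.resident = set(resident_pages)
        self.cells = {}  # contents, including pages that are paged out

    def probe(self, addr):
        page = addr // PAGE
        if page not in self.resident:
            raise PageFault(page)

    def read(self, addr):
        self.probe(addr)
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.probe(addr)
        self.cells[addr] = value


def add_mem_to_mem(mem, src, dst):
    # Phase 1: reads only. Any fault here has nothing to roll back.
    a = mem.read(src)
    b = mem.read(dst)
    # Phase 2: the one write. dst was just read, so it is known to be resident.
    mem.write(dst, a + b)


def run_with_paging(mem, src, dst):
    """Retry the whole instruction after each fault; return the fault count."""
    faults = 0
    while True:
        try:
            add_mem_to_mem(mem, src, dst)
            return faults
        except PageFault as fault:
            mem.resident.add(fault.page)  # stand-in for the OS paging it in
            faults += 1


mem = Memory(resident_pages={0})
mem.cells[8] = 5
mem.cells[PAGE + 8] = 7
print(run_with_paging(mem, 8, PAGE + 8), mem.cells[PAGE + 8])
```

The sketch ignores addressing modes with side effects (postincrement and the like); those need the register updates deferred as well.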

And maybe I could make such an instruction format work... by going to the 360/20
for inspiration. So one has a choice of either the program's primary memory
space of 32,768 bytes... or eight foreign patches of memory with 12-bit
displacements in the instruction.

John Savard

MitchAlsup

Feb 17, 2017, 2:50:56 PM

There were two critical issues that concern the failure of the 68K family: one was technical, the other managerial.

The technical issue was that x86 was easier to pipeline than 68K and Intel got 486 out the door before Moto got a pipelined 68K out the door.

The managerial issue was that no manager in Moto was empowered to ever say yes to a good idea. This proved to be a greater burden than pipelineability.

John Dallman

Feb 17, 2017, 3:46:37 PM

In article <71ebac26-d52e-4eb0...@googlegroups.com>,

jsa...@ecn.ab.ca (Quadibloc) wrote:

> Even though a wider bus shouldn't make a computer that much
> more expensive to build,

There was a question of chip life cycles and prices. At the time, the
prices of chips usually fell significantly a while after they were
introduced. So there was the choice of using 8-bit peripheral chips that
had been out a few years, and were cheap, or Intel's newly-introduced,
and thus quite expensive, 16-bit ones.

> and people even then wrote spreadsheet programs and database
> programs in C instead of assembler, didn't they?

Not the ones that were popular at the time, which ran on 48K to 64K CP/M
machines. WordStar 3.x was definitely written in assembler, and I'm
fairly sure VisiCalc and dBase II were too.

The IBM PC was a terrible failure at one of its important requirements.
It was not supposed to damage the market for the IBM System/34 office
mini-computer.

John

Terje Mathisen

Feb 17, 2017, 4:27:54 PM

John Dallman wrote:
> In article <71ebac26-d52e-4eb0...@googlegroups.com>,
> jsa...@ecn.ab.ca (Quadibloc) wrote:
>
>> Even though a wider bus shouldn't make a computer that much more
>> expensive to build,
>
> There was a question of chip life cycles and prices. At the time,
> the prices of chips usually fell significantly a while after they
> were introduced. So there was the choice of using 8-bit peripheral
> chips that had been out a few years, and were cheap, or Intel's
> newly-introduced, and thus quite expensive, 16-bit ones.
>
>> and people even then wrote spreadsheet programs and database
>> programs in C instead of assembler, didn't they?
>
> Not the ones that were popular at the time, which ran on 48K to 64K
> CP/M machines. WordStar 3.x was definitely written in assembler, and
> I'm fairly sure VisiCalc and dBase II were too.

Lotus 123 was probably the first "I need to buy a PC to run that
program", it was written by a single guy over just a few months, in pure asm.

>
> The IBM PC was a terrible failure at one of its important
> requirements. It was not supposed to damage the market for the IBM
> System/34 office mini-computer.

To me who worked with PC compatibles from more or less day one it was
very early quite obvious that you should always run any given task on
the smallest/cheapest architecture that was large enough to actually do
so: This was at least an order of magnitude cheaper than the next step up.

I.e. PC vs mini vs mainframe in those days.

The only thing that surprised me at the time was how long the
transitions took. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

jacko

Feb 17, 2017, 4:51:57 PM

An interesting issue is the lex/yacc situation. It is possible to develop a file to specify the minimum RTL sub units needed to implement say C (for argument's sake), and then fill in the integer bit patterns for such. So surprisingly a C compiler is technically easier to bootstrap, although I don't have the source.

I guess then it comes down to usage counts of instruction sequences, and requests for re-interpretation. Then the bit patterns would be the outline index pivot, and the mnemonics would be interleaved in edit. Producing a disassembly, and it as the pivot outline index, has more of the same or bit patterns with spaces perhaps typed in again. Then the longest oft used pivot index outlines it again.

timca...@aol.com

Feb 17, 2017, 6:39:37 PM

You only need one instruction with multiple memory references to cause the headache of adding the mechanism to deal with it.

From an OS porting point of view, the 68K had the "stack puke" to be able to restart the instruction. The 386 just faulted (and I think gave you a faulting address). In theory, the 68K method was better because you could always make forward progress, the 386 you restarted from scratch. In practice, the stack puke contents changed with every generation and Moto decided not to make the documentation generally available.

In retrospect, for the amount of bang for # of transistors, the ARM architecture used slightly more transistors than the 8086, but provided better performance than a 68K (or at least as good). I wonder what a similar implementation of ARM64 would have required back then?

- Tim

jacko

Feb 17, 2017, 7:09:34 PM

> You only need one instruction with multiple memory references to cause the headache of adding the mechanism to deal with it.

MOVE -(An), -(Am)

It's as easy as only having 1 write cycle per instruction, and committing the register changes to An and Am also in that write cycle. All A registers need single purpose subtraction logic, and a mux An, An-1.
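
A rough software model of that idea, as a sketch only: the decremented addresses are computed on the side (the "An, An-1" mux), both are checked, and the registers are written back only together with the single memory write. The function name, the page size and the way pages are mapped are all made up for the illustration.

```python
# Sketch: MOVE -(An),-(Am) with register updates committed only alongside
# the one memory write, so a fault leaves An and Am untouched.
PAGE = 4096


class Fault(Exception):
    """Raised when an operand address lies on a page that is not mapped."""


def move_predec(regs, mem, mapped_pages, n, m, size=2):
    # Compute the predecremented addresses without touching the registers.
    src = regs[n] - size
    # If both operands use the same register, the second decrement stacks.
    dst = (src if m == n else regs[m]) - size
    for addr in (src, dst):
        if addr // PAGE not in mapped_pages:
            raise Fault(addr)
    # Commit point: the memory write and both register updates happen together.
    mem[dst] = mem.get(src, 0)
    regs[n] = src
    regs[m] = dst


regs = {0: 0x1004, 1: 0x2004}
mem = {0x1002: 0xBEEF}
mapped = {1}  # only the source page is mapped at first
try:
    move_predec(regs, mem, mapped, 0, 1)
except Fault:
    mapped.add(2)  # stand-in for the OS mapping the missing page
    move_predec(regs, mem, mapped, 0, 1)
print(hex(regs[0]), hex(regs[1]), hex(mem[0x2002]))
```

Because nothing is committed before the checks pass, the faulting attempt can be restarted from scratch with no saved mid-instruction state.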

> In retrospect, for the amount of bang for # of transistors, the ARM architecture used slightly more transistors than the 8086, but provided better performance than a 68K (or at least as good). I wonder what a similar implementation of ARM64 would have required back then?

If the ARM64 was 2 clock word serial, ummm ...?

Melzzzzz

Feb 17, 2017, 8:00:21 PM

On 2017-02-17, John Dallman <j...@cix.co.uk> wrote:
> In article <71ebac26-d52e-4eb0...@googlegroups.com>,
> jsa...@ecn.ab.ca (Quadibloc) wrote:
>
>> Even though a wider bus shouldn't make a computer that much
>> more expensive to build,
>
> There was a question of chip life cycles and prices. At the time, the
> prices of chips usually fell significantly a while after they were
> introduced. So there was the choice of using 8-bit peripheral chips that
> had been out a few years, and were cheap, or Intel's newly-introduced,
> and thus quite expensive, 16-bit ones.
>
>> and people even then wrote spreadsheet programs and database
>> programs in C instead of assembler, didn't they?
>
> Not the ones that were popular at the time, which ran on 48K to 64K CP/M
> machines. WordStar 3.x was definitely written in assembler, and I'm
> fairly sure VisiCalc and dBase II were too.

I had a CPC6128, which ran CP/M in the first 64 KB, leaving the other 64 KB for programs.
I had WordStar and dBase II and a real C compiler ;)

--
press any key to continue or any other to quit...

Bruce Hoult

Feb 17, 2017, 8:10:42 PM

On Saturday, February 18, 2017 at 12:27:54 AM UTC+3, Terje Mathisen wrote:
> > Not the ones that were popular at the time, which ran on 48K to 64K
> > CP/M machines. WordStar 3.x was definitely written in assembler, and
> > I'm fairly sure VisiCalc and dBase II were too.
>
> Lotus 123 was probably the first "I need to buy a PC to run that
> program", it was written by a single guy over just a few months, in pure asm.

Visicalc was four years earlier for "I'm buying whatever computer runs that!"

BGB

Feb 18, 2017, 12:24:39 AM

and, on a modern PC (an AMD FX-8350), can make the SuperH / SH4 ISA go
at pretty ok speeds in an emulator (over 300 MIPS).

relative to native:
tests like Dhrystone seem to show my real CPU as absurdly fast
like, a 80x speed difference (PC at like 17 kMIPS or such)
other more general tests show a much smaller difference
my other tests imply closer to a 10x difference for stuff like this.
could be better if the emulator's JIT were smarter, ...

here it is running Quake:
https://www.youtube.com/watch?v=OGocUiRImw8

though, in contrast, SH4 is using fixed-width 16-bit instructions.

I think it faced the problem that pretty much no one used it, but the
patents on it are expiring, so one can have fun with it.

but, in general, I liked the ISA design, and it resembled a lot of my
own hypothetical ISA designs, so I went with it.

note that the SH ISA can be either BE or LE, this is controlled by the
boot image (currently ELF). I had thought if I did raw ROM images
there might be some analog of a BOM.

left wondering if it could be faster if the ISA supported x86-style
memory addressing, ex: @(Rm, R0*Sc, Disp*Sc) instruction forms (though
probably via extended 32-bit instructions).

these could potentially be able to reduce the number of instructions
needed for memory accesses.
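
a small sketch of what such a form would compute. note the @(Rm, R0*Sc, Disp*Sc) form is hypothetical (it is not an existing SH4 encoding), and the function name and the choice of 1/2/4/8 as legal scales are assumptions for the example.

```python
# Sketch: effective address for a hypothetical @(Rm, R0*Sc, Disp*Sc) form.
# Assumption: the scale is the access size, and addresses wrap at 32 bits.
def effective_address(rm, r0, disp, scale):
    """Base register plus scaled index plus scaled displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (rm + r0 * scale + disp * scale) & 0xFFFFFFFF


# 32-bit elements: base 0x8000, index 5, displacement of 3 elements.
print(hex(effective_address(0x8000, 5, 3, 4)))
```

without a form like this, the scaling of the index and the add of the displacement each cost separate instructions before the actual load or store.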

well, that, and the various SH2A instruction forms could be nice
(though, SH2A is newer so not really "safe" as of yet).

some ugly hair with the ISA is that its MMU design is pretty much insane
(too much exposed internals and would likely be rather problematic to
implement efficiently in an emulator).

the partial strategy I have come up to deal with this is to not actually
bother implementing the SH4 MMU for now, and instead just stuck x86/ARM
style page-tables on it and called it good.

superficially faking enough of the SH4 MMU to try to make Linux happy,
where Linux basically just loads up some page tables and uses it like it
were x86; though this means Linux's MMU related interrupts are never
called, and some MMU-related address regions are no-op stubs.

theoretically, could add a fallback case and implement a full MMU (which
takes over from the "fast" MMU if needed), but this hasn't come up yet.

some other CPU instructions aren't implemented either, but are ones that
seem to not be used by the compiler (GCC).

one other questionable aspect is that the behavior of some instructions
depends on bits in control registers, which means that the instructions
need to be executed to determine their exact instruction form.

well, and some convoluted exposed-pipeline behavior (so the emulator
needs to partly mimic the behavior of the instruction pipeline).

some logic also exists to try to detect and optimize away memory loads
that only seem to exist just to load immediates (and replace them
internally with load-immediate operations).
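
a minimal sketch of that last idea, not the emulator's actual code: the op tuples, the "load_pcrel" and "load_imm" names and the rom dictionary are all invented for the example, and it assumes the constant pool sits in memory that the guest never modifies.

```python
# Sketch: replace PC-relative loads from a read-only constant pool with
# load-immediate operations. The op format here is hypothetical.
def fold_pc_relative_loads(ops, rom):
    """Return a new op list with foldable loads turned into immediates."""
    out = []
    for op in ops:
        if op[0] == "load_pcrel" and op[2] in rom:
            _, reg, addr = op
            # The value is known at translation time, so no memory access
            # is needed when the translated code runs.
            out.append(("load_imm", reg, rom[addr]))
        else:
            out.append(op)
    return out


rom = {0x100: 0x12345678}
ops = [
    ("load_pcrel", "r1", 0x100),   # constant pool entry: can be folded
    ("add", "r1", "r2"),
    ("load_pcrel", "r3", 0x9000),  # not in known read-only memory: kept
]
print(fold_pc_relative_loads(ops, rom))
```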

...

Terje Mathisen

Feb 18, 2017, 4:05:11 AM

Sure, I took it as given that everyone here knew about that.

Anyway, this did make the Lotus task far easier in that Mitch Kapor with
his "tame" asm programmer Jonathan Sachs knew what the target had to be.

Anton Ertl

Feb 18, 2017, 4:48:28 AM

n...@wheeler.UUCP (Nick Maclaren) writes:
>Another aspect is that the 68010 (1982) could
>run Unix, but it took until the 386 (1985) before Intel could; the
>286 just didn't cut the mustard

I used a 286 that ran Xenix. Looked good enough at the time.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

Feb 18, 2017, 6:07:39 AM

thomas....@gmail.com writes:
>I think the least reason for go-down of 68k is its ISA.

You are probably right, in some sense. The 88110 had similar problems
of being late with a slow clock as the 68060 (well, the 88110 was
introduced two years earlier, so the ISA had an effect).

My guess is that many of the microprocessor companies had some good
initial CPUs, but then had problems when the complexity exceeded a
certain point, with pipelining, superscalarity, OoO, deeper pipelines,
and whatever else they did for fast clocks (e.g., at one time, domino
logic) all adding to the complexity. Of those who persevered, most
took quite a lot of time until they were able to do it, but until
then, they had low clocks and project delays. Others gave up, or at
least gave up the ISA.

There are plenty of examples:

Inmos: The T400/T800 series was a big success, the T9000 was delayed
and slow, and Inmos eventually gave up.

MIPS/SGI: The R2000/R3000 were very competitive, R4000 initially, too,
but then they took until 1996 to produce the R10000, and its clock
was relatively slow for the time, and eventually MIPS abandoned the
general-purpose market.

Motorola/Freescale: the initial 68000 was a success, the 68040 was
relatively late, the 68060 even later, and then they abandoned the
general-purpose market. The 88100 was late, the 88110 was later and
relatively low-clocked, and they abandoned the ISA. For the PowerPC,
they could not keep up in the clock race, so they eventually abandoned
the general-purpose market.

IBM: The original Power was competitive, Power3 had a relatively slow
clock rate, but they finally kept up in the clock race with Power 4,
and have produced fast-clocked parts ever since. The S/360 successors
had a similar development, somewhat later.

Sun: The first SPARCs did ok, but they soon had trouble keeping up.
Fortunately, they had a market that was not that
performance-sensitive, so they could continue, and eventually they
managed to get fast clocks (not sure about IPC). Also, it seems that
Fujitsu managed to design SPARCs with competitive clock rate and
competitive IPC.

ARM: They gave up the desktop market pretty soon, and focused on the
embedded market. They got a good foothold in the phone market, and
when that started to need fast CPUs with smartphones, ARM and Apple
apparently were able to manage the complexity of the bigger, faster
CPUs. It's pretty clear to me how Apple did it (P.A.Semi had already
demonstrated that they could do it), the ARM side of this story is
unclear to me.

Intel: IA-64 was late from the start, and did not get competitive
clock rates; maybe they finally have a competitive product with
Poulson, but it's hard to know because apparently marketing (of all
involved companies) has given up on IA-64 and they don't publish SPEC
results for it.

I think that the ISA is a contributing factor in that a complex ISA
leads to having to invest complexity into the ISA implementation
instead of into performance features, and making performance features
harder to implement. That was the RISC advantage in 1985-1995,
and that resulted in small companies like Acorn and MIPS producing
pipelined CPUs in 1985/86 while Intel took until 1989 to introduce the
486 and Motorola took until 1990 for the 68040.

Nick Maclaren

unread,

Feb 18, 2017, 9:40:14 AM2/18/17

to

In article <2017Feb1...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>Another aspect is that the 68010 (1982) could
>>run Unix, but it took until the 386 (1985) before Intel could; the
>>286 just didn't cut the mustard
>
>I used a 286 that ran Xenix. Looked good enough at the time.

How familiar were you with the better Unices of the previous era?

The point is that the 80286 could run something a heck of a lot
better than an unprotected run-time executive (e.g. CP/M and its
descendants) but not a proper operating system, especially not
one with virtual memory, good RAS against wayward programs and
decent debuggers.

Regards,
Nick Maclaren.

Nick Maclaren

unread,

Feb 18, 2017, 9:50:10 AM2/18/17

to

In article <44a6f444-8104-445f...@googlegroups.com>,

By the time of the 486, the 68K line was failing, because all of
its major users had plans to abandon it, and the minor ones were
a dwindling market. Whether a pipelined 68K could have reversed
that, I can't say, though I doubt it. This is a slightly different
meaning of failure to the one you are using, of course.

The management were definitely to blame, because those plans were
all either public or should have been known by a competent vendor.
I have no idea whether its later success in the embedded market
was Motorola's competence or serendipity - do you?

Regards,
Nick Maclaren.

John Dallman

unread,

Feb 18, 2017, 12:11:41 PM2/18/17

to

In article <o886b4$dq9$1...@news.albasani.net>, Melz...@zzzzz.com (Melzzzzz)
wrote:

> I had CPC6128 which ran CP/M in first 64kb, leaving other 64kb for
> programs. I had WordStar and dBase II and real C compiler ;)

Yup, but that machine was a low-end box when it was introduced in 1985.
The design decisions on the IBM PC were taken in 1980, when CP/M was a
leading business OS. Things changed fast in those days.

John

Quadibloc

unread,

Feb 18, 2017, 12:53:08 PM2/18/17

to

On Saturday, February 18, 2017 at 7:50:10 AM UTC-7, Nick Maclaren wrote:

> By the time of the 486, the 68K line was failing, because all of
> its major users had plans to abandon it, and the minor ones were
> a dwindling market. Whether a pipelined 68K could have reversed
> that, I can't say, though I doubt it.

I doubt it too. The 68060 _was_ a pipelined 68k, but that came out after the
Pentium, not at the time of the 486, and so its existence doesn't count one way
or the other for that argument.

Why did Apple abandon the PowerPC for the x86? We know the answer to *that*
question. Because more x86 chips were sold in consumer machines than PowerPC
chips, there existed a wide choice of x86 chips from Intel designed to provide
high performance and low power consumption for use in laptops. IBM's focus with
the PowerPC was to develop chips for its server products; Apple didn't make
enough of its own laptops to make it worth IBM's while to develop a decent range
of PowerPC chips for laptops.

When it comes to the shift from the 68k to the PowerPC, I don't recall the
specific rationale. I _think_ it was because Motorola had indicated, whether
specifically to Apple, or to the world at large, that the PowerPC was the "wave
of the future" - that it was the architecture they would concentrate on in
future development efforts.

And _that_ decision may have been motivated in part by diminishing 68k sales.
But then one would have to explain what started them diminishing.

So here is what I think the root cause was:

a) There was, at the time, a _perceived_ technical superiority of RISC over
CISC. Note, too, that "technical superiority" isn't an absolute attribute of an
ISA, but rather a function of the interaction between an ISA and the
implementation technology of a given time.

So if your technology lets you put a 360/40 on a single chip - or a 386 - then
since you're doing it all in microcode, there's nothing at all wrong with CISC,
and it's nice to the programmer.

If your technology lets you put a 360/195 on a single chip - or a Pentium II
(both had OoO for FP only, both had L1 caches only, both used a fancy
high-performance division algorithm) - well, then you have enough transistors to
implement CISC with high performance.

So it's no accident that RISC got perceived as technically superior in the *486*
era. If RISC gives you enough registers so you can implement deep pipelines, but
CISC doesn't, since you can't yet afford the overhead of OoO, then RISC clearly
_is_ technically superior - it makes a high-performance CPU possible.

Stripped-down OoO that's enough to deal with cache misses may still be possible,
I used OoO above to refer to full-blown register-renaming Tomasulo.

b) A _declining_ market for the 68k chips wasn't required to set off the
implosion. In the heyday of the 68k, when the Macintosh, the Atari ST, and the
Commodore Amiga were all going strong, far fewer 68k chips were being sold than
x86 chips.

And so at the beginning of that heyday, that might not have been a problem, even
if Motorola would have been happier if it had a spot on the gravy train.

But as we have noticed recently as fewer and fewer companies that design
microprocessors have their own fabs, as processes shrink, the capital investment
required for a fab keeps growing.

So a market that was big enough to support making 68k microprocessors for that
market *did not have to contract* in order to change to being *not* big enough.
The inexorable advance of technology is enough to change that.

Now then. What I think I've stated above is "the obvious", and so I should be
able to say it "without fear of contradiction". However, I suspect that what
I've said above _will_ be debated, which is fine - I'd be interested in hearing
what I might have missed.

John Savard

Nick Maclaren

unread,

Feb 18, 2017, 1:13:26 PM2/18/17

to

In article <7658aa34-a30a-4f07...@googlegroups.com>,

Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>When it comes to the shift from the 68k to the PowerPC, I don't recall the
>specific rationale. I _think_ it was because Motorola had indicated, whether
>specifically to Apple, or to the world at large, that the PowerPC was the "wave
>of the future" - that it was the architecture they would concentrate on in
>future development efforts.

Look up the IBM/Motorola collaboration on that. The reason that the
PowerPC failed in the workstation market was complicated, but can be
assigned entirely to managerial incompetence.

Regards,
Nick Maclaren.

jacko

unread,

Feb 18, 2017, 1:46:54 PM2/18/17

to

I've put together a consistent ISA now, so the thought process has been completed unless physical realization becomes a need.

https://dl.dropboxusercontent.com/u/1615413/Own%20Work/68k2.pdf

There is quite a lot of density, and no spare instructions really. Some instructions have changes, and many opcodes were relocated (a simple map).

Bruce Hoult

unread,

Feb 18, 2017, 2:07:36 PM2/18/17

to

The Xeon Phi says that even Intel now understands that you can get more total performance from a lot of simple in-order processors using the same chip area and energy resources as fewer complex out of order processors.

In that situation, RISC is again technically superior, as I think we will see quite soon.

Intel hasn't even been able to break into the smartphone market, despite x86 being treated as a first class architecture by Android.

Quadibloc

unread,

Feb 18, 2017, 3:02:29 PM2/18/17

to

On Saturday, February 18, 2017 at 12:07:36 PM UTC-7, Bruce Hoult wrote:

> The Xeon Phi says that even Intel now understands that you can get more total
> performance from a lot of simple in-order processors using the same chip area
> and energy resources as fewer complex out of order processors.

> In that situation, RISC is again technically superior, as I think we will see
> quite soon.

Um, yes, but there is the question of *throughput* versus *latency*. Some people
need to minimize latency, not just get lots of throughput.

This is the only excuse for chips that consume lots of power and require
elaborate cooling, instead of having more reasonable designs with greater
parallelism.

And Intel has more capital behind it, which gives it an advantage that can
overcome a technical disadvantage.

John Savard

MitchAlsup

unread,

Feb 18, 2017, 3:36:05 PM2/18/17

to

On Saturday, February 18, 2017 at 1:07:36 PM UTC-6, Bruce Hoult wrote:

> The Xeon Phi says that even Intel now understands that you can get more total performance from a lot of simple in-order processors using the same chip area and energy resources as fewer complex out of order processors.

About a decade ago (almost exactly) I was promoting an x86 that went back
to in order mildly pipelined. The core would have been about 16× smaller
than the then current Opteron, and delivered just under 1/2 of the large
OoO core's perf. It probably would have come in close to 16× better in
power per instruction. Could not sell AMD management on its utility or need.
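
As a rough worked example of the figures above (a sketch only, not anything from the original proposal): if the small core is 16× smaller and delivers about half the performance, then the area of one big core holds 16 small cores with up to 8× the aggregate throughput, at half the single-thread speed. The function name `small_core_tradeoff` and the perfect-scaling assumption are mine.

```python
def small_core_tradeoff(area_ratio=16.0, perf_ratio=0.5):
    """Compare N small in-order cores against one big OoO core of equal area.

    area_ratio: big-core area divided by small-core area (16 in the post).
    perf_ratio: small-core performance divided by big-core performance
                ("just under 1/2" in the post, so 0.5 is an upper bound).
    """
    cores_in_same_area = area_ratio
    # Assumption: the workload splits perfectly across cores (best case),
    # and caches/uncore are ignored.
    aggregate_throughput = cores_in_same_area * perf_ratio
    return {
        "cores_in_same_area": cores_in_same_area,
        "aggregate_throughput_vs_big": aggregate_throughput,
        "single_thread_vs_big": perf_ratio,
    }
```

The 8× figure only applies to throughput-bound work; anything latency-bound still sees the 0.5 single-thread ratio.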

> In that situation, RISC is again technically superior, as I think we will see quite soon.

It is not clear that after one has a pipelined FP unit, the decode
overhead of x86 is more than the 10% level of overhead.

What has always been clear is that cubic dollars spent on engineering the
heck out of a somewhat deficient architecture will beat startup dollars
in the hands of brilliant minds dedicated to the task before them.

jacko

unread,

Feb 18, 2017, 3:40:14 PM2/18/17

to

If you had 12 bits to make an addressing mode with an input register as a given, what would it be?

(reg + b12) is quite boring, and already done. I mean those weird, peculiar modes that come in useful for certain programming constructs. What are proper scientific strides? Is there a dimension to the preferred strides curve?
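
For reference, the "boring" baseline mode can be sketched as below. This is only an illustration: it assumes b12 is a signed (two's complement) 12-bit displacement and a 32-bit address space, neither of which is stated above, and the names `sign_extend` and `ea_reg_disp12` are made up.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    mask = (1 << bits) - 1
    value &= mask
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def ea_reg_disp12(reg, field12, addr_bits=32):
    """Effective address for (reg + b12), wrapped to addr_bits bits.

    Assumption: the 12-bit field is signed, giving a reach of -2048..+2047.
    """
    return (reg + sign_extend(field12, 12)) & ((1 << addr_bits) - 1)
```

Any more exotic mode has to fit its whole parameter set (scale, stride, update rule) into the same 12 bits, which is what makes the question interesting.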

jacko

unread,

Feb 18, 2017, 3:45:22 PM2/18/17

to

Have you seen the Viper II+ FPGA board for old Amiga systems? AROS Icaros looking good too for an old new OS. :D I'm considering things inspired by it. Yes, it's not even top metal silicon, but if it ever were, it would need a fast to translate OS, and maybe an app base.

already...@yahoo.com

unread,

Feb 18, 2017, 6:35:17 PM2/18/17

to

Except that the Xeon Phi consisting of simple in-order processors was retired last year.
The new Xeon Phi is made out of not-so-simple OoO processors.
Hopefully, for the next iteration they will finally see the light and build it out of regular fat OoO cores with proportionally wider VPUs to compensate for the reduced number of cores.

>
> In that situation, RISC is again technically superior, as I think we will see quite soon.

For designs similar to the current iteration of Xeon Phi, i.e. ~60 cores with 2x512b VPUs per core, RISC will indeed be superior, because it will allow 3-way instruction decoders. The 2-way decoders in KNL look like a serious bottleneck.

But for "Xeon Phi done right", as proposed above, RISC vs x86 does not matter.
For a design with 1 VPU per core, like the original Xeon Phi, it also does not matter, because for 1 VPU 2-way instruction decoding is sufficient, and 2-way x86 decoding is relatively easy.

>
> Intel hasn't even been able to break into the smartphone market, despite x86 being treated as a first class architecture by Android.

But Intel appears quite successful in tablets. Too bad for them that the tablet market reached its peak 3 years ago and has been shrinking since then.

jacko

unread,

Feb 18, 2017, 6:51:32 PM2/18/17

to

Yep, https://dl.dropboxusercontent.com/u/1615413/Own%20Work/68k2-PC%23d12.pdf is now a better document, 10 not 3 extra addressing modes, and I still have a greater than 18 bit data space to put an opcode embedding in, within the 32 bit length instructions. All the other bits are almost a full 16 bit opcode space.

Still no vector unit though. What do you think of vector instruction set extensions?

Jacko

Quadibloc

unread,

Feb 19, 2017, 12:32:44 AM2/19/17

to

On Saturday, February 18, 2017 at 1:36:05 PM UTC-7, MitchAlsup wrote:

> About a decade ago (almost exactly) I was promoting an x86 that went back
> to in order mildly pipelined. The core would have been about 16× smaller
> than the then current Opteron, and delivered just under 1/2 of the large
> OoO core's perf. It probably would have come in close to 16× better in
> power per instruction. Could not sell AMD management on its utility or need.

I'm not surprised, but I presume that at the present time AMD is aware that a
processor like that would indeed have a market. Perhaps, however, they feel that
instead of directly competing with Intel's Atom (and Intel recently modified the
Atom design so that it is OoO, IIRC) while such a chip has a market, it would
have a better market, and work more effectively, if it were an ARM chip instead
of an x86 chip.

Are they crazy over there at AMD for believing that?

I don't think so.

Having 8 registers instead of 32 in a register bank *does* mean that the
performance of a pipelined CISC machine is crippled by not being OoO in a way
that the performance of a pipelined RISC machine is not. For a certain depth of
pipeline - if the pipeline is deeper, one might have to go to 128 registers like
an Itanium, and that brings with it certain other problems which is why few have
dared to go that way.

While the x86 ISA does have a technical disadvantage compared to RISC, it is
also true that under certain circumstances that disadvantage is slight or easily
mitigated. Under _other_ circumstances, specifically for processors targeted at
different performance levels, perhaps not so much.

So, if the x86 has a cost, when is it worth paying that cost?

Answer: if you're making a box to run Microsoft Windows software. (And these
days, OS X software too.) Then you need to be x86 in order to be compatible with
a giant pool of existing third-party software distributed as closed-source
binaries.

Thus, if your chip is so lightweight that it is intended to go into a smartphone
(where ARM dominates anyhow for Android) or if it's in some massively parallel
chip that is going to run the output of your FORTRAN compiler and *not* multiple
copies of off-the-shelf Windows programs in parallel... there's no Earthly
reason for doing it in x86.

So a chip like yours has basically one niche: like Intel's Atom, it is for
lightweight netbooks that run Windows but not Crysis for surfing the Web, doing
E-mail, and spreadsheets and word processing. With really good battery life.

Which are competing against 5-year-old hand-me-down laptops that run Windows 7
instead of that abomination, Windows 10.

So I feel your pain, but I also understand the tension between "technically
sweet" and "hey, if we make it, is anyone going to want to pay money to buy it".
Of course, the marketing suits in management are rather less competent at what
they do than the technical guys, which creates mistrust and makes this harder to
bear. But the fickle consumer is intrinsically hard to second-guess, so temper
one's sentiments with some sympathy for their difficult task.

John Savard

Quadibloc

unread,

Feb 19, 2017, 12:40:14 AM2/19/17

to

On Saturday, February 18, 2017 at 1:36:05 PM UTC-7, MitchAlsup wrote:

> On Saturday, February 18, 2017 at 1:07:36 PM UTC-6, Bruce Hoult wrote:

> > In that situation, RISC is again technically superior, as I think we will see quite soon.

> It is not clear that after one has a pipelined FP unit, the decode
> overhead of x86 is more than the 10% level of overhead.

The *decode overhead* of x86 is not the issue, as that is utterly trivial.

The issue is the number of registers - for *certain pipeline depths*, x86 needs
Tomasulo and RISC doesn't to get decent performance. That overhead is
non-trivial.

Intel can still beat some new startup with things like the Xeon Phi with one
hand tied behind its back, but IBM, AMD, TSMC, and Oracle, while they don't have
as much money as Intel, have enough money to have a fighting chance.

I don't know if the Xeon Phi will fail, though; isn't it possible to use it to
accelerate the performance of *Windows application programs*, for which only x86
will do?

But when Moore's Law starts to run out of steam, so that companies like Intel no
longer have a big process advantage and so on and so forth... in addition to
non-x86 players catching up with the Xeon Phi, could this *also* mean that
Oracle, with its interesting high-throughput SPARC designs, could finally end up
dethroning IBM mainframes?

Oracle has some other issues, and IBM's dominance is based on other things
besides chip-making smarts; they happen to have database expertise going 'way
back, and they understand the needs of businesses _very_ well... but this is
another point where the balance might change a bit.

John Savard

Bruce Hoult

unread,

Feb 19, 2017, 12:48:37 AM2/19/17

to

And Intel has always had at least a 10% lead in process. Their progress is stalling out now, but they still may be able to achieve what others can not at all.

A claim I've seen recently is that Moore's law has already ended, in that although smaller processes continue to be developed (slowly), the cost per transistor is rising in those new processes, not falling.

jacko

unread,

Feb 19, 2017, 1:16:03 AM2/19/17

to

Knights Landing was impressive, but as one man pointed out, the AVX does not do the IEEE math for his figures, and he would get a big fine for under-reporting risk, so the differing results stop him from using it. (Sounds strange; surely stats would help.)

So maybe one massive AVX array clouded up, and a 32 core super simple at 500 MHz, with oodles of L2. Even if half are on a stall or wait state, that's still an impressive 4 core equivalent double issue.

Niels Jørgen Kruse

unread,

Feb 19, 2017, 6:21:55 AM2/19/17

to

Quadibloc <jsa...@ecn.ab.ca> wrote:

> If your technology lets you put a 360/195 on a single chip - or a Pentium
> II (both had OoO for FP only, both had L1 caches only, both used a fancy
> high-performance division algorithm) - well, then you have enough
> transistors to implement CISC with high performance.

The Pentium II was fully OoO and had offchip L2.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

BGB

unread,

Feb 19, 2017, 11:11:42 AM2/19/17

to

I considered doing a custom 64-bit SH-based ISA, but haven't formally
done so yet.

so, this is basically using the good ol' SH4 ISA. there is the slightly
newer 4A and 2A ISA's, which have some tempting instructions, but it
isn't known safe to use them as there may(?) still exist patents which
cover them.

also, I wouldn't have a ready-made compiler supporting a custom ISA; I
would either need to modify GCC to do so (ick!), or write a dedicated C
compiler/backend (its own set of issues).

unlike the SH5 ISA (a completely new ISA for 64-bit mode), my ISA would
have worked more like x86-64, quietly expanding the registers to 64
bits, and allowing many existing opcodes to function in various widths.

most 32-bit ops remained 32-bits, but many 16-bit ops could be promoted
to 64-bits (with a few 32-bit ops promoting, and certain redundant
special cases remaining as 16-bit forms).

this would partially penalize using 16-bit types in 64-bit mode, which
would generally require less efficient instruction sequences.

also how the FPU worked was tweaked a bit, the FPU now having 16 double
registers (vs 16x float or 8x split-double).

unclear would be if this were the most sane way to do this.

as-before, instructions were still fixed-width 16-bit (it was otherwise
still basically the SH4 ISA).

I fiddled more, got the emulator up to around 410 MIPS in some tests
(ex: running Quake, which gives ~ 370..410 MIPS). some other tests give
different results (ex: decoding CRAM video is closer to 300 MIPS, ...).

I suspect getting much above this value would require adding register
allocation to the JIT compiler (and maybe finding some way to further
somehow shave cycles off of memory access).

one thing that was eating cycles was the SHAD and SHLD instructions,
which are surprisingly hairy (for being shift instructions).

did gain some speed from making the decoder interpret a few special
cases as super-ops:
MOV #imm, Rx; SHAD Rx, Rn
MOV #imm, Rx; SHLD Rx, Rn
reinterpreted as, effectively:
SHAD #imm, Rm, Rn
SHLD #imm, Rm, Rn

these immediate forms allow the elimination of the internal branches in
favor of using a constant shift.

similar could be applied to things like compare+branch:
CMPEQ Rm, Rn; BF lbl
CMPGT Rm, Rn; BT lbl
being reinterpreted as:
JCMPNE Rm, Rn, lbl
JCMPGT Rm, Rn, lbl

etc...
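
A minimal sketch of the shift super-op idea above, in Python rather than whatever the emulator is actually written in. The tuple instruction format and the names `fuse_superops`, `shad_dynamic` and `make_shad_const` are hypothetical; the SHAD behaviour (non-negative count shifts left, negative count shifts right arithmetically, only the low 5 bits of the count are used) is my reading of the SH-4 manual and should be checked against it.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a value as a signed 32-bit integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def shad_dynamic(rm, rn):
    """SHAD Rm, Rn with a run-time count: three cases to branch on."""
    rm = to_signed(rm)
    if rm >= 0:
        return (rn << (rm & 31)) & MASK32            # shift left
    if (rm & 31) == 0:
        return MASK32 if rn & 0x80000000 else 0      # right by 32: all sign bits
    return (to_signed(rn) >> (32 - (rm & 31))) & MASK32  # arithmetic right

def make_shad_const(imm):
    """Resolve the case analysis once at decode time for a known count."""
    imm = to_signed(imm)
    if imm >= 0:
        n = imm & 31
        return lambda rn: (rn << n) & MASK32
    n = 32 - (imm & 31) if (imm & 31) else 32
    return lambda rn: (to_signed(rn) >> n) & MASK32

def fuse_superops(ops):
    """Peephole pass over decoded ops within one basic block (assumed).

    ("MOV_IMM", imm, rx), ("SHAD", rx, rn) -> ("SHAD_IMM", imm, rx, rn)
    and likewise for SHLD. Rx stays in the fused op so its handler can
    still store the immediate there, as the original MOV would have.
    """
    out = []
    i = 0
    while i < len(ops):
        a = ops[i]
        b = ops[i + 1] if i + 1 < len(ops) else None
        if (b is not None and a[0] == "MOV_IMM"
                and b[0] in ("SHAD", "SHLD") and b[1] == a[2]):
            out.append((b[0] + "_IMM", a[1], a[2], b[2]))
            i += 2
        else:
            out.append(a)
            i += 1
    return out
```

The point of `make_shad_const` is that the sign test and the "count is a multiple of 32" test move out of the execution path, which is where the internal branches go away.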

super-ops aren't really great for raw MIPS counts, as they are currently
counted as fewer instructions by the emulator (1), but are generally
useful for improving performance.

1: I could use instruction-words as another metric, but this could skew
things the other way (any 32-bit ops would be counted as 2 instructions,
though I think this isn't too far off from the real SH hardware where
AFAIK the 32-bit I-forms eat an additional pipeline slot).

I may call the performance "good" for right now, as I reached the
original goal I set for myself here (breaking 400 MIPS, and generally
exceeding the theoretical CPU performance of the Dreamcast).

also I seem to now be often hitting 60 FPS in Quake 1 in 640x480, so...

potentially, part of my difficulty hitting 400 MIPS may have been due to
Quake internally sleep-throttling in some cases (as it has a 72Hz
framerate limiter); though going up to 800x600 would require more effort
(could just disable the limiter though).

otherwise debating whether to try to add a GPU, and if so how to go with it:
* emulate the DC's PowerVR chipset
** issues: neither maps that well to OpenGL nor is straightforward
** it is new enough that patents may still apply.
* do something that basically just maps to a GLES2 style interface
** probably sane option if I wanted serious GL with it
** ex: if I wanted to forward it to native OpenGL.
* do something sort of like the older S3 chipsets (ex: Virge)
** simpler, but leaves more work for the code running in the emulator
** triangles would be given in a buffer with screen-space coords

also debating whether to use the real hardware GPU (via OpenGL), or
going the "technically simpler in this case" route of using a software
rasterizer (makes most sense if using precooked screen-space triangles).

or such...

MitchAlsup

unread,

Feb 19, 2017, 1:03:15 PM2/19/17

to

On Saturday, February 18, 2017 at 11:40:14 PM UTC-6, Quadibloc wrote:
> On Saturday, February 18, 2017 at 1:36:05 PM UTC-7, MitchAlsup wrote:
>
> > It is not clear that after one has a pipelined FP unit, the decode
> > overhead of x86 is more than the 10% level of overhead.
>
> The *decode overhead* of x86 is not the issue, as that is utterly trivial.
>
> The issue is the number of registers - for *certain pipeline depths*, x86 needs
> Tomasulo and RISC doesn't to get decent performance. That overhead is
> non-trivial.

I am going to ignore the glaring error that x86-64 (the only machine up for
consideration) had 16 GPRs not 8.

To a good first order, All the OoO stuff gains a factor of 2× over a nicely
pipelined IO machine. So we have Great Big OoO getting 1.2 I/C and to
compete, the Little Bitty IO machine only has to get 0.6 I/C. There is no
reason the LBIO machine can't run at the same frequency, and there may be
frequency upside just because the power consumption is correspondingly
way down.

At 0.6 IPC and a 128-bit wide <unified cache> instructions can be fetched
<essentially> without interference with data accesses. IFetch load is
1 fetch every 4 cycles on average, DAccess is 2.2 Accesses every 4 cycles.
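
The numbers in this paragraph can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted here; the reading that the point is keeping a single unified cache port below one access per cycle is my interpretation, and the function name `fetch_budget` is made up.

```python
def fetch_budget(ipc=0.6, fetch_bytes=16, ifetch_per_window=1.0,
                 daccess_per_window=2.2, window_cycles=4):
    """Return (instructions per window, byte budget per instruction,
    unified-cache accesses per cycle) for the quoted figures.

    fetch_bytes=16 corresponds to the 128-bit wide unified cache.
    """
    instrs_per_window = ipc * window_cycles
    # How long the average instruction may be if one fetch per window
    # is to keep the pipeline fed.
    bytes_per_instr_budget = fetch_bytes * ifetch_per_window / instrs_per_window
    # Combined instruction + data accesses per cycle on the shared port.
    port_utilization = (ifetch_per_window + daccess_per_window) / window_cycles
    return instrs_per_window, bytes_per_instr_budget, port_utilization
```

With the defaults this gives 2.4 instructions per 4 cycles, a budget of roughly 6.7 bytes per instruction, and about 0.8 cache accesses per cycle, i.e. under one access per cycle for fetch and data combined.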

At 0.6 I/C the FP units can be operated at DIV2 clock for a big savings in
flip-flop count.

At 0.6 I/C the entire core can be deClocked on a cache miss saving lots of
power and not having any parts of the pipeline OoO wrt other parts of the
pipeline. The core is small enough that this decision can be made and instantiated in a single clock cycle.

The actual target for this effort was to make an x86 small enough one could
put it on the Southbridge* chip and run the slow high latency communications
parts of the OS down on the SB chip only interrupting the GBOoO machines
after the run Queues had been updated. No change in OS code was required,
only some table setup early in the boot process. In effect, the I/O drivers
ran down on the SB and performed all services they require somewhat like
IBM channels but with an x86 programming model.

(*) Southbridge or any ASIC one wanted.

Ivan Godard

unread,

Feb 19, 2017, 1:39:06 PM2/19/17

to

On 2/19/2017 10:03 AM, MitchAlsup wrote:

<snip>

> At 0.6 I/C the entire core can be deClocked on a cache miss saving lots of
> power and not having any parts of the pipeline OoO wrt other parts of the
> pipeline. The core is small enough that this decision can be made and instantiated in a single clock cycle.

How do you get the stall signal out so quickly? The core is several
clocks wide just in wire delay, isn't it? You can stop the master
assuming it is located near the cache, but what about the ticks that are
in flight already?

jacko

unread,

Feb 19, 2017, 2:02:53 PM2/19/17

to

I'd put the write backs to register all close to the master, then ALU clocking just does a small amount of stage forwarding.

Nick Maclaren

unread,

Feb 19, 2017, 2:46:55 PM2/19/17

to

In article <7fb0d01c-86c1-4703...@googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>
>To a good first order, All the OoO stuff gains a factor of 2 over a nicely
>pipelined IO machine. So we have Great Big OoO getting 1.2 I/C and to
>compete, the Little Bitty IO machine only has to get 0.6 I/C. There is no
>reason the LBIO machine can't run at the same frequency, and there may be
>frequency upside just because the power consumption is correspondingly
>way down.

Or have several times as many cores.

Regards,
Nick Maclaren.

George Neuner

unread,

Feb 19, 2017, 2:47:27 PM2/19/17

to

That's true. However, Xenix systems mostly ran office software rather
than being open systems. Xenix on a 286 could easily handle 4-6 users
for word processing, document management, etc.

Xenix itself was a pretty decent Unix. I didn't do any development on
the Intel version [I only saw it in operation], but I did some work
with the 68K version.

YMMV.

>Regards,
>Nick Maclaren.

George

jacko

unread,

Feb 19, 2017, 6:01:28 PM2/19/17

to

> I considered doing a custom 64-bit SH-based ISA, but haven't formally
> done so yet.
>
> so, this is basically using the good ol' SH4 ISA. there is the slightly
> newer 4A and 2A ISA's, which have some tempting instructions, but it
> isn't known safe to use them as there may(?) still exist patents which
> cover them.

A more final version of the spec is at https://dl.dropboxusercontent.com/u/1615413/Own%20Work/68k2-PC%23d12.pdf with only a few added instructions and some new addressing modes. There's even a table at the end to show the free instruction space.

Most of the patents come from implementation of instructions.

> also, I wouldn't have a ready-made compiler supporting a custom ISA; I
> would either need to modify GCC to do so (ick!), or write a dedicated C
> compiler/backend (its own set of issues).

Only needing a small mod is a big plus.

> unlike the SH5 ISA (a completely new ISA for 64-bit mode), my ISA would
> have worked more like x86-64, quietly expanding the registers to 64
> bits, and allowing many existing opcodes to function in various widths.

I used the size = 11 option and removed many 68020+ instructions, and relocated a lot.

> most 32-bit ops remained 32-bits, but many 16-bit ops could be promoted
> to 64-bits (with a few 32-bit ops promoting, and certain redundant
> special cases remaining as 16-bit forms).

I used some of the 8 bit instructions as 64, and the 8 bit ones in more complex patterns as they are mainly IO rare, and often maskable in 16 bit.

> this would partially penalize using 16-bit types in 64-bit mode, which
> would generally require less efficient instruction sequences.

Yes, this would be a problem with an addressing mode which is capable of using a 64 bit register as 4 * 16 bit registers.

> also how the FPU worked was tweaked a bit, the FPU now having 16 double
> registers (vs 16x float or 8x split-double).

There is no hard spec on the length of the FP registers.

> unclear would be if this were the most sane way to do this.
>
> as-before, instructions were still fixed-width 16-bit (it was otherwise
> still basically the SH4 ISA).

<snip>

> super-ops aren't really great for raw MIPS counts, as they are currently
> counted as fewer instructions by the emulator (1), but are generally
> useful for improving performance.

<snip>

> also I seem to now be often hitting 60 FPS in Quake 1 in 640x480, so...
>

<snip>

> also debating whether to use the real hardware GPU (via OpenGL), or
> going the "technically simpler in this case" route of using a software
> rasterizer (makes most sense if using precooked screen-space triangles).

OpenGLES2 is the most popular "on the market"

jacko

unread,

Feb 19, 2017, 6:05:56 PM2/19/17

to

> Xenix itself was a pretty decent Unix. I didn't do any development on
> the Intel version [I only saw it in operation], but I did some work
> with the 68K version.

It was the sweet spot. 8 bit was too simple. 16 bit had too short a time before 32 bit came along. 32 bit really opened up the market for doing apps and OS things. 64 bit is just an afterthought. Yes, faster in some ways, but also an easy make-everything-64-bit waste of high order bits.

68K, a nice to map 32 bit opcode space. x86 was at that point heavily "opcode overloaded"

gg...@yahoo.com

unread,

Feb 19, 2017, 6:34:05 PM2/19/17

to

On Sunday, February 19, 2017 at 12:03:15 PM UTC-6, MitchAlsup wrote:
> On Saturday, February 18, 2017 at 11:40:14 PM UTC-6, Quadibloc wrote:
> > On Saturday, February 18, 2017 at 1:36:05 PM UTC-7, MitchAlsup wrote:
> >
> > > It is not clear that after one has a pipelined FP unit, the decode
> > > overhead of x86 is more than the 10% level of overhead.
> >
> > The *decode overhead* of x86 is not the issue, as that is utterly trivial.
> >
> > The issue is the number of registers - for *certain pipeline depths*, x86 needs
> > Tomasulo and RISC doesn't to get decent performance. That overhead is
> > non-trivial.
>
> I am going to ignore the glaring error that x86-64 (the only machine up for
> consideration) had 16 GPRs not 8.
>
> To a good first order, All the OoO stuff gains a factor of 2× over a nicely
> pipelined IO machine. So we have Great Big OoO getting 1.2 I/C and to
> compete, the Little Bitty IO machine only has to get 0.6 I/C. There is no
> reason the LBIO machine can't run at the same frequency, and there may be
> frequency upside just because the power consumption is correspondingly
> way down.

I have heard that rename does not scale and limits ILP; if true, the 68k having two named register files would seem to be an advantage for very high end performance.

x86 is stuck at 4 wide rename, Apple does 6 wide but is at half the clock so you have twice the time to do work. IBM POWER does 6 wide but is liquid cooled, and IBM may be doing things in the way they group instruction chains to reduce the effective rename they need. (squash dead renames in a chain?)

One third of instructions are loads and stores. Using the 68k address registers, you can only have 32 or so loads and stores pending, so you would give the address instructions one quarter of the 128 physical registers. You would issue at most 3 address instructions a cycle (2 loads and a store), so three wide rename should do for the address registers.

Now you have the data registers, and you would do full 4 wide rename for those, giving a total of 7 wide rename. Once you are doing 7 wide rename, the single register file may become an issue, prompting you to split the physical register file to get more write ports, because register write ports do not scale well either.
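
As a rough tally of the numbers in the paragraphs above, here is a minimal Python sketch. The function name is hypothetical, and giving the remaining 96 physical registers to the data side is an assumption for illustration; the post only states the address-side share.

```python
# Tally of the split-rename budget described above.
# Figures (128 registers, one quarter, 2 loads + 1 store, 4 wide data rename)
# come from the post; the function and key names are hypothetical.

def split_rename_budget(physical_regs=128, address_share=0.25,
                        loads_per_cycle=2, stores_per_cycle=1,
                        data_rename_width=4):
    address_regs = int(physical_regs * address_share)    # loads/stores in flight
    address_width = loads_per_cycle + stores_per_cycle   # address-side rename width
    return {
        "address_regs": address_regs,
        # Assumption: everything not given to the address side goes to data.
        "data_regs": physical_regs - address_regs,
        "address_rename_width": address_width,
        "data_rename_width": data_rename_width,
        "total_rename_width": address_width + data_rename_width,
    }

budget = split_rename_budget()
print(budget)
```

The point of the split is that neither file ever sees the full 7 wide rename; each side only has to support its own 3 or 4 writers per cycle.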

Does this sound reasonable to you, or are rename and write ports not such an issue?

I am well aware that ILP averages 1; I am talking about computational loads, and winning the bench marketing wars with the highest peak performance. (And doing so with perhaps better thermals as well.)

Note that were one to split the register file like 68k did, today one would design a better instruction set to go with it.

jacko

unread,

Feb 19, 2017, 6:41:08 PM2/19/17

to

2 bits on each pipe stage:

00 empty
01 filled
10 spilled
11 stalled

The spilled state means the output was written in parallel to the stage's spill register and didn't get written out last cycle. So with a bit of mux, the spill register gets re-inserted. The stall appears a cycle later at the pipe input stage.
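
A minimal sketch of just the 2-bit encoding listed above, written in Python for readability. It only packs and unpacks per-stage state bits; it does not try to model the spill and re-insert timing, and the names (`StageState`, `pack_states`, `unpack_states`) and the choice of putting stage 0 in the low bits are assumptions.

```python
from enum import IntEnum

class StageState(IntEnum):
    """The four per-stage states, using the bit patterns from the post."""
    EMPTY = 0b00
    FILLED = 0b01
    SPILLED = 0b10
    STALLED = 0b11

def pack_states(states):
    """Pack per-stage states into one integer, stage 0 in the low two bits."""
    word = 0
    for index, state in enumerate(states):
        word |= int(state) << (2 * index)
    return word

def unpack_states(word, n_stages):
    """Recover the list of per-stage states from a packed integer."""
    return [StageState((word >> (2 * i)) & 0b11) for i in range(n_stages)]

pipe = [StageState.FILLED, StageState.SPILLED,
        StageState.STALLED, StageState.EMPTY]
word = pack_states(pipe)
print(f"{word:08b}")
print([s.name for s in unpack_states(word, len(pipe))])
```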

Ivan Godard

unread,

Feb 19, 2017, 7:31:02 PM2/19/17

to

Yeah, but how do you tell it to stop? A handshake between stages is way
expensive. And how do you get all the pipes (which may be time remote
from each other) to stop at the same point? And how do you capture the
intermediate state for replay when the stall is actually a trap
requiring handler entry?

But what do I know? I'm no hardware guy, as my question shows.

Ivan Godard

unread,

Feb 19, 2017, 7:48:16 PM2/19/17

to

On 2/19/2017 3:34 PM, gg...@yahoo.com wrote:
<snip>

> I have heard that rename does not scale and limits ILP, if true the
> 68k having two named register files would seem to be an advantage for
> very high end performance.
>
> x86 is stuck at 4 wide rename, Apple does 6 wide but is at half the
> clock so you have twice the time to do work. IBM power does 6 wide
> but is liquid cooled, and IBM may be doing things the way they group
> instruction chains to reduce the effective rename they need. (squash
> dead renames in a chain?)

One of the big wins in Mill is freedom from rename while not paying the
2x cost of OOO->IO. Your guess about IBM may be right; we find that the
great majority of dataflow chains are very short, so renaming the chains
rather than the individual ops may well win.

The lifetime distribution is extremely bi-modal; nearly everything is
use-once, so if you rename anything that is used more than once and
chain the rest then you should cut the rename load dramatically. Bill
Wulf had an interesting ISA once where the fixed-length instruction had
three arguments and two opcodes and computed (A op1 B) op2 C, with the
intermediate result hidden; sort of a similar idea.
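
To make the idea concrete, here is a small Python sketch under stated assumptions. `two_op` mimics a two-opcode, three-argument instruction whose intermediate result never gets a name, and `use_counts` counts how often each produced value is read, so that only values read more than once would need a rename. The tuple program format, the helper names, and the single-assignment assumption (each destination written once) are all hypothetical.

```python
import operator
from collections import Counter

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def two_op(a, op1, b, op2, c):
    """Compute (a op1 b) op2 c; the intermediate is never given a name."""
    return OPS[op2](OPS[op1](a, b), c)

def use_counts(program):
    """program: list of (dest, src1, src2), each dest written exactly once.

    Returns how many times each produced value is read later on.
    """
    dests = {dest for dest, _, _ in program}
    reads = Counter(src for _, s1, s2 in program
                    for src in (s1, s2) if src in dests)
    return {dest: reads[dest] for dest in dests}

def needs_rename(program):
    """Values read more than once; use-once values could ride a chain."""
    return sorted(d for d, n in use_counts(program).items() if n > 1)

program = [
    ("t1", "a", "b"),    # t1 is read once, by the next op
    ("t2", "t1", "c"),   # t2 is read twice below
    ("t3", "t2", "t2"),
]
print(two_op(2, "add", 3, "mul", 4))
print(use_counts(program), needs_rename(program))
```

In this toy program only `t2` would need a rename; `t1` is consumed once and could be folded into a chain.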

Sarr Blumson

unread,

Feb 19, 2017, 9:44:35 PM2/19/17

to

Quadibloc <jsa...@ecn.ab.ca> Wrote in message:
>
> My understanding is that there were technical considerations that led IBM to
> choose the 8086 architecture for the PC. But it wasn't that this was "better"
> than the 68000 in the sense of being more powerful. Instead, there were two
> factors:
>
> 1) Intel had an 8086 SX part available. They didn't call it that, of course, but
> the 8088 was an 8086 with an external 8-bit data bus. That allowed the IBM PC to
> be manufactured at a price competitive with Z-80 based CP/M machines.

There was a 68008 with an 8 bit bus, but it may have been too late.

> 2) The 8086 was architecturally very similar to Intel's previous 8080. This
> facilitated easy conversion of programs written for CP/M to PC-DOS.
>
> At least, that's what I read in the industry magazine articles on the subject,
> and I have no reason to think they're mistaken, as the reasons sound plausible.

3) The x86 had better _hardware_ development tools, I've been told.

--

----Android NewsGroup Reader----
http://usenet.sinaapp.com/

jim.bra...@ieee.org

unread,

Feb 19, 2017, 10:23:16 PM2/19/17

to

One should also consider the Motorola mentality: there were strong ties to the automobile market which the 68xx served well? The 68xxx did well in the communications market which was also part of their mindset? They might not have wanted to join the fray in the PC market?

jacko

unread,

Feb 20, 2017, 12:17:51 AM2/20/17

to

This is why the little IO co-processor idea works. Either the trap is not real time, in which case the simple equivalent of a jump inserted at the beginning of the pipe is good enough, perhaps with a condition code write just before it.

Or it's real time, and a dirty flush with an immediate jump needs doing, implying a short and perhaps slow pipe. A micro-controller with enough buffer space is essential for fast ACK service, as a proxy to the main CPU taking over 100 cycles to pick up the buffer. To be honest, real time interrupts on such a GHz beast are perhaps an L3 fetch away anyway, so stopping the pipe for even 50 cycles would be a mistake most of the time.

The pipe doesn't stop, it spills. If a stage spills, it will appear as stalled to its predecessor in the next cycle, and will give out its spill to its successor. Its successor will then pick it up in the next cycle if it is not stalled in turn, which this pipe stage knows. So the present stage will know whether it should un-stall this cycle.

already...@yahoo.com

unread,

Feb 20, 2017, 5:01:07 AM2/20/17

to

Automotive and communication controllers were based on the '010 ISA with very few (if any, I'm not sure) of the '020 additions.
Before the '020, 68K was not much more CISCy than S/360 or i386.

already...@yahoo.com

unread,

Feb 20, 2017, 5:07:54 AM2/20/17

to

On Monday, February 20, 2017 at 7:17:51 AM UTC+2, jacko wrote:
> This is why the little IO co-processor idea works.

All industrial evidence suggests that it does not work.
If you want I/O related intelligence, then the right way to do it is to couple it to the I/O device itself rather than to the main CPU.

already...@yahoo.com

unread,

Feb 20, 2017, 5:23:02 AM2/20/17

to

On Sunday, February 19, 2017 at 8:03:15 PM UTC+2, MitchAlsup wrote:
> On Saturday, February 18, 2017 at 11:40:14 PM UTC-6, Quadibloc wrote:
> > On Saturday, February 18, 2017 at 1:36:05 PM UTC-7, MitchAlsup wrote:
> >
> > > It is not clear that after one has a pipelined FP unit, the decode
> > > overhead of x86 is more than the 10% level of overhead.
> >
> > The *decode overhead* of x86 is not the issue, as that is utterly trivial.
> >
> > The issue is the number of registers - for *certain pipeline depths*, x86 needs
> > Tomasulo and RISC doesn't to get decent performance. That overhead is
> > non-trivial.
>
> I am going to ignore the glaring error that x86-64 (the only machine up for
> consideration) had 16 GPRs not 8.
>
> To a good first order, All the OoO stuff gains a factor of 2× over a nicely
> pipelined IO machine. So we have Great Big OoO getting 1.2 I/C and to
> compete, the Little Bitty IO machine only has to get 0.6 I/C. There is no
> reason the LBIO machine can't run at the same frequency, and there may be
> frequency upside just because the power consumption is correspondingly
> way down.
>
> At 0.6 IPC and a 128-bit wide <unified cache> instructions can be fetched
> <essentially> without interference with data accesses. IFetch load is
> 1 fetch every 4 cycles on average, DAccess is 2.2 Accesses every 4 cycles.
>
> At 0.6 I/C the FP units can be operated at DIV2 clock for a big savings in
> flip-flop count.

I don't know what's exactly wrong about your theory, but Intel tried it (original Atom) and found that it does not work.

>
> At 0.6 I/C the entire core can be deClocked on a cache miss saving lots of
> power and not having any parts of the pipeline OoO wrt other parts of the
> pipeline. The core is small enough that this decision can be made and instantiated in a single clock cycle.
>
> The actual target for this effort was to make an x86 small enough one could
> put it on the Southbridge* chip and run the slow high latency communications
> parts of the OS down on the SB chip only interrupting the GBOoO machines
> after the run Queues had been updated. No change in OS code was required,
> only some table setup early in the boot process. In effect, the I/O drivers
> ran down on the SB and performed all services they require somewhat like
> IBM channels but with a x86 programming model.
>
> (*) Southbridge or any ASIC one wanted.

For an unchanged OS you need a cache-coherent link (not just I/O coherence, which was always available on the PC/PCI platform, but true coherence as between CPU cores) between CPU+Northbridge and Southbridge. That completely kills the idea, because one slow device participating in CC will make the whole system slow. Alternatively, you could make your small processor cacheless, but then it would be too slow even for "easy" Southbridge tasks.

John Dallman

unread,

Feb 20, 2017, 6:00:40 AM2/20/17

to

In article <o8dl30$um9$1...@dont-email.me>, sarr.b...@alum.dartmouth.org

(Sarr Blumson) wrote:

> There was a 68008 with an 8 bit bus, but it may have been too late.

It was really slow. I used one for a day, in a Sinclair QL. It felt
slower than a good 8-bit system of the period.

John

Robert Wessel

unread,

Feb 20, 2017, 6:21:35 AM2/20/17

to

On Sun, 19 Feb 2017 21:44:31 -0500 (EST), Sarr Blumson
<sarr.b...@alum.dartmouth.org> wrote:

>Quadibloc <jsa...@ecn.ab.ca> Wrote in message:
>>
>> My understanding is that there were technical considerations that led IBM to
>> choose the 8086 architecture for the PC. But it wasn't that this was "better"
>> than the 68000 in the sense of being more powerful. Instead, there were two
>> factors:
>>
>> 1) Intel had an 8086 SX part available. They didn't call it that, of course, but
>> the 8088 was an 8086 with an external 8-bit data bus. That allowed the IBM PC to
>> be manufactured at a price competitive with Z-80 based CP/M machines.
>
>There was a 68008 with an 8 bit bus, but it may have been too late.

The 68008 did not ship until 1982, which would have been far too late
for PC development; the 68000 it was derived from was released in
1979. The 16-bit-bus 8086 was released in 1978, and the 8-bit-bus
version of that, the 8088, was released in 1979.

>> 2) The 8086 was architecturally very similar to Intel's previous 8080. This
>> facilitated easy conversion of programs written for CP/M to PC-DOS.
>>
>> At least, that's what I read in the industry magazine articles on the subject,
>> and I have no reason to think they're mistaken, as the reasons sound plausible.
>
>3) The x86 had better _hardware_ development tools, I've been told.

I'm not sure exactly what you mean by that, but Intel had a couple of
advantages. First, the 8088 and 8086 worked very well with 8080
peripherals, and there were *vast* numbers of those, many both well
tested and quite inexpensive. The 68000 worked OK with 6800/6500
peripherals, although there were far fewer of those available than
8080-type parts. And at the time there were very few 68000
peripherals. Now it was not that hard to drive most 8080 peripherals
off a 6800 or 68000 (although a bit more of a pain if you needed to
translate 8080 wait states into clock stretches), but it was one more
thing needing hardware. Even driving 6800 devices from a 68000 needed
some external hardware (although less than 8080-style devices).

Actually the reverse (driving 6800 peripherals from a 8080) was also a
PITA, but given the variety of 8080 compatible parts, often including
many with considerably more functionality, almost no one ever
bothered. One of the few exceptions was the MC6845 video controller
(it was, in fact, a pretty good display controller in its segment),
which was used on both the MDA and CGA cards (as well as a bunch of
other places and systems). But the choice of the 6845 on the display
adapter didn't impact the motherboard at all, since any issues
supporting the 6800-style interface were local to the adapter card.

Read another way, it's true that one of the reasons that Intel had
so many design wins was that they focused heavily on support for the
designers of systems, including many tools, both software and
hardware. That actually came from the top at Intel, from a period
where they were analyzing why customers were not successful using
their products. Motorola was much more laissez-faire about that.

Robert Wessel

unread,

Feb 20, 2017, 6:29:20 AM2/20/17

to

On Sat, 18 Feb 2017 14:38:21 -0000 (UTC), n...@wheeler.UUCP (Nick
Maclaren) wrote:

>In article <2017Feb1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>>Another aspect is that the 68010 (1982) could
>>>run Unix, but it took until the 386 (1985) before Intel could; the
>>>286 just didn't cut the mustard
>>
>>I used a 286 that ran Xenix. Looked good enough at the time.
>
>How familiar were you with the better Unices of the previous era?
>
>The point is that the 80286 could run something a heck of a lot
>better than an unprotected run-time executive (e.g. CPM and its
>descendents) but not a proper operating system, especially not
>one with virtual memory, good RAS against wayward programs and
>decent debuggers.

While I don't have much familiarity with the 286 version of Xenix,
there was nothing preventing a 286 from running an OS with virtual
memory and decent isolation between processes. You just had to do it
with segments. OS/2 did just that, for example, and I believe that
Xenix did have at least some of that.

Of course OS/2 took a major trip into the weeds trying to support
real-mode (MS-DOS) programs on the 286.

Robert Wessel

unread,

Feb 20, 2017, 6:38:06 AM2/20/17

to

On Sun, 19 Feb 2017 10:03:12 -0800 (PST), MitchAlsup
<Mitch...@aol.com> wrote:

>The actual target for this effort was to make an x86 small enough one could
>put it on the Southbridge* chip and run the slow high latency communications
>parts of the OS down on the SB chip only interrupting the GBOoO machines
>after the run Queues had been updated. No change in OS code was required,
>only some table setup early in the boot process. In effect, the I/O drivers
>ran down on the SB and performed all services they require somewhat like
>IBM channels but with a x86 programming model.
>
>(*) Southbridge or any ASIC one wanted.

That answers at least one of the issues with excessive smarts on a
peripheral, namely that over time the main CPU got enough faster (at
least historically) that you'd soon find the device's onboard CPU
actually slowing things down. Since this would be in the Southbridge,
or wherever, it would change (and presumably be upgraded) with the
system. SAPs on zArch do more or less the same thing (and dodge the
problem in the same way - by being specific to the system), but are
actually the identical GBOoO cores running normal code, just dedicated
to the I/O function. Of course the cost considerations for a very low
volume device like a zArch box are rather different, as are things
like environmental issues (so a core optimized to the I/O support
function has substantial NREs to spread over a small volume, while
just using the GBOoO core has little NRE).

Robert Wessel

unread,

Feb 20, 2017, 7:31:10 AM2/20/17

to

The 8088 was badly bus-bound. For most cases you could ignore the
instruction timings, and just count the number of bytes of memory
accessed by the instructions (including, crucially, instruction
fetch).

The 68008 had it even worse. Not only were many memory references for
bigger operands (although in many cases these would be equivalent to
two shorter references on an 8086), but the instruction stream was
significantly bulkier. The QL made a bad situation even worse by
putting the display buffer in main memory, and it too needed to hit
the one eight bit bus (sure, the Mac did the same* thing, but the
display requirement came off the 16-bit bus, leaving much more for the
CPU).

*The size of the frame buffer on the Mac was actually a bit smaller,
but I don't know the relative refresh rates on the two systems.

Nick Maclaren

unread,

Feb 20, 2017, 8:21:34 AM2/20/17

to

In article <68ff7b8e-dcc6-4c8d...@googlegroups.com>,

<already...@yahoo.com> wrote:
>On Sunday, February 19, 2017 at 8:03:15 PM UTC+2, MitchAlsup wrote:
>>
>> At 0.6 I/C the FP units can be operated at DIV2 clock for a big savings in
>> flip-flop count.
>
>I don't know what's exactly wrong about your theory, but Intel tried it
>(original Atom) and found that it does not work.

Well, there could be something wrong with yours. Perhaps Intel just
cocked it up? Lots of technologies have been falsely damned because
the first implementation that people had heard of was a cock-up.

Regards,
Nick Maclaren.

Nick Maclaren

unread,

Feb 20, 2017, 8:23:47 AM2/20/17

to

In article <tkilac98suf1cl8na...@4ax.com>,

Robert Wessel <robert...@yahoo.com> wrote:
>On Sun, 19 Feb 2017 21:44:31 -0500 (EST), Sarr Blumson
><sarr.b...@alum.dartmouth.org> wrote:
>>Quadibloc <jsa...@ecn.ab.ca> Wrote in message:
>>>
>>> My understanding is that there were technical considerations that led IBM to
>>> choose the 8086 architecture for the PC. But it wasn't that this was
>"better"
>>> than the 68000 in the sense of being more powerful. Instead, there were two
>>> factors:
>

>... Intel had a couple of

>advantages. First, the 8088 and 8086 worked very well with 8080
>peripherals, and there were *vast* numbers of those, many both well
>tested and quite inexpensive. The 68000 worked OK with 6800/6500
>peripherals, although there were far fewer of those available than
>8080-type parts. And at the time there were very few 68000
>peripherals. Now it was not that hard to drive most 8080 peripherals
>off a 6800 or 68000 (although a bit more of a pain if you needed to
>translate 8080 wait states into clock stretches), but it was one more
>thing needing hardware. Even driving 6800 devices from a 68000 needed
>some external hardware (although less than 8080-style devices).

That was irrelevant to IBM's decision, as the IBM PC was not
intended to be a general-purpose computer. Seriously.

Regards,
Nick Maclaren.

David Brown

unread,

Feb 20, 2017, 8:46:19 AM2/20/17

to

One of the most popular automotive 68k devices was the 68332, which
had a CPU32 core that was based on the 68020 (but missing the
coprocessor support, IIRC, and with a few extra instructions added).

Terje Mathisen

unread,

Feb 20, 2017, 8:56:02 AM2/20/17

to

Robert Wessel wrote:
> On Mon, 20 Feb 2017 11:00 +0000 (GMT Standard Time), j...@cix.co.uk
> (John Dallman) wrote:
>
>> In article <o8dl30$um9$1...@dont-email.me>, sarr.b...@alum.dartmouth.org
>> (Sarr Blumson) wrote:
>>
>>> There was a 68008 with an 8 bit bus, but it may have been too late.
>>
>> It was really slow. I used one for a day, in a Sinclair QL. It felt
>> slower than a good 8-bit system of the period.
>
>
> The 8088 was badly bus-bound. For most cases you could ignore the
> instruction timings, and just count the number of bytes of memory
> accessed by the instructions (including, crucially, instruction
> fetch).

Exactly right.

For every single integer instruction except DIV and to a lesser degree
MUL, you would count the number of data and code bytes (a
read-modify-store instruction would count the data size twice) and
multiply by 4 (the number of cycles needed for an 8-bit bus transfer),
then add 5-10% for memory refresh and other bus overhead.

On the classic 4.77 MHz IBM PC this meant barely more than 1 MB/s,
typically 60-80% of it being used for code.

The times calculated this way would correspond extremely well with
actual measured times.
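
The rule of thumb above can be sketched in a few lines of Python. The function names are hypothetical, and the 7.5% default is simply the midpoint of the 5-10% overhead range quoted above, not a measured figure.

```python
CYCLES_PER_BUS_BYTE = 4  # cycles for one 8-bit bus transfer, per the post

def estimate_8088_cycles(code_bytes, data_bytes, rmw=False, overhead=0.075):
    """Estimate cycles for one integer instruction (not MUL/DIV).

    code_bytes: instruction length in bytes
    data_bytes: operand bytes moved over the bus
    rmw:        read-modify-store instructions count the data size twice
    overhead:   memory refresh and other bus overhead (assumed 7.5%)
    """
    if rmw:
        data_bytes *= 2
    return (code_bytes + data_bytes) * CYCLES_PER_BUS_BYTE * (1 + overhead)

def peak_bus_bytes_per_second(clock_hz=4.77e6):
    """Raw bus bandwidth before refresh and other overhead."""
    return clock_hz / CYCLES_PER_BUS_BYTE

# A 2-byte instruction doing a 16-bit read-modify-store: 2 + 2*2 = 6 bus bytes.
print(estimate_8088_cycles(2, 2, rmw=True))
print(peak_bus_bytes_per_second() / 1e6, "MB/s before overhead")
```

The raw figure for a 4.77 MHz clock comes out a little under 1.2 MB/s, which lands at "barely more than 1 MB/s" once the refresh overhead is taken off.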

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

already...@yahoo.com

unread,

Feb 20, 2017, 8:56:04 AM2/20/17

to

No, CPU32 is not based on the '020.

Anton Ertl

unread,

Feb 20, 2017, 9:26:33 AM2/20/17

to

Quadibloc <jsa...@ecn.ab.ca> writes:
>On Saturday, February 18, 2017 at 1:36:05 PM UTC-7, MitchAlsup wrote:
>

>> About a decade ago (almost exactly) I was promoting an x86 that went back
>> to in order mildly pipelined. The core would have been about 16× smaller
>> than the then current Opteron, and delivered just under 1/2 of the large
>> OoO core's perf. It probably would have come in close to 16× better in
>> power per instruction. Could not sell AMD management on its utility or need.
>
>I'm not surprised, but I presume that at the present time AMD is aware that a
>processor like that would indeed have a market. Perhaps, however, they feel that
>instead of directly competing with Intel's Atom (and Intel recently modified the
>Atom design so that it is OoO, IIRC)

Already Silvermont (2013) is OoO. And given that Intel first had an
in-order design for these low-power parts, and then switched to OoO,
it looks like Intel thinks that OoO is better for the market for which
these chips are designed. AMD decided to go with OoO for their
lower-power designs right from the start. It seems that they both
think that OoO is better in the 1W-10W market.

Here's the performance I see on our LaTeX benchmark:

time System
2.323s Bonnell Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit
1.216s Bobcat AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit
1.052s Silvermont Celeron J1900 2416MHz (Shuttle XS35V4) Ubuntu16.10 64-bit
0.712s Goldmont Celeron J3455 2300MHz, ASRock J3455-ITX, Ubuntu16.10 64-bit

The Bonnell is in-order, the others OoO.
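
Since the clocks differ, one way to read the table is to normalize by clock. The Python sketch below multiplies each time by the listed MHz to get a rough cycle count, then compares against the in-order part. It assumes each chip actually ran at the clock listed in the table (no turbo or throttling) and ignores the OS and 32/64-bit differences, so treat the ratios as indicative only. The function names are hypothetical.

```python
# (core, seconds, listed clock in MHz), copied from the table above.
RESULTS = [
    ("Bonnell", 2.323, 1600),
    ("Bobcat", 1.216, 1650),
    ("Silvermont", 1.052, 2416),
    ("Goldmont", 0.712, 2300),
]

def cycles_millions(seconds, mhz):
    """Seconds times MHz gives millions of clock cycles."""
    return seconds * mhz

def per_clock_speedup(baseline="Bonnell"):
    """Ratio of baseline cycles to each core's cycles (higher is better)."""
    base = next(cycles_millions(s, m) for n, s, m in RESULTS if n == baseline)
    return {n: base / cycles_millions(s, m) for n, s, m in RESULTS}

for name, ratio in per_clock_speedup().items():
    print(f"{name:<11} {ratio:.2f}x per clock vs Bonnell")
```

On these numbers the OoO cores need roughly 1.5 to 2.3 times fewer cycles than the in-order core for the same job.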

>while such a chip has a market, it would
>have a better market, and work more effectively, if it were an ARM chip instead
>of an x86 chip.

There are a bunch of in-order ARM cores around, starting with the
Cortex-M0.

>So a chip like yours has basically one niche: like Intel's Atom, it is for
>lightweight netbooks that run Windows but not Crysis for surfing the Web, doing
>E-mail, and spreadsheets and word processing. With really good battery life.

Actually, I used the Atom 330 in an X-Terminal (aka Thin Client). We
replaced it with the Celeron J3455 not because the CPU was
insufficient, but because the Atom 330 board does not support 4K
screens, and the board with the J3455 does. The Bobcat is actually
indeed in a laptop, and while the CPU is not particularly fast, it is
sufficient for my needs, and the laptop is quiet.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

unread,

Feb 20, 2017, 9:36:53 AM2/20/17

to

n...@wheeler.UUCP (Nick Maclaren) writes:
>In article <2017Feb1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>>Another aspect is that the 68010 (1982) could
>>>run Unix, but it took until the 386 (1985) before Intel could; the
>>>286 just didn't cut the mustard
>>
>>I used a 286 that ran Xenix. Looked good enough at the time.
>
>How familiar were you with the better Unices of the previous era?

What "previous era" are you referring to? At the time I was not
familiar with other Unixes.

>The point is that the 80286 could run something a heck of a lot
>better than an unprotected run-time executive (e.g. CPM and its
>descendents) but not a proper operating system, especially not
>one with virtual memory, good RAS against wayward programs and
>decent debuggers.

We used the system in a systems programming course with hundreds of
different users (and maybe a dozen simultaneously), so it probably had
to deal with a lot of different "wayward" programs. It held up
nicely.

I don't know if it swapped segments out if memory was tight, but in
any case, I did not get ENOMEM, processes killed by the OOM killer or
other problems that made me wish for better memory management.

Anton Ertl

unread,

Feb 20, 2017, 9:38:19 AM2/20/17

to

Bruce Hoult <bruce...@gmail.com> writes:
>A claim I've seen recently is that Moore's law has already ended in that although
>smaller processes continue to be developed (slowly) the cost per transistor is
>rising in those new processes not falling.

If that was the case, why would Intel invest money in smaller processes?

David Brown

unread,

Feb 20, 2017, 9:40:27 AM2/20/17

to

I can't answer for the hardware design itself, but the ISA is often
described as being based on the 68020. It was fully 32-bit internally
(but the 68332 had a 16-bit external bus), and had the same addressing
modes as the 68020. It did not support the bitfield instructions, but
had some additional ones for table lookup and interpolation. A key
point is that the 68000 (and 68010) is often called (by Motorola) a
16/32 bit microprocessor, but the CPU32 is called a 32-bit core.

The CPU32 reference manual describes it as being based on the MC68000
with many of the features of the MC68010 and MC68020.

already...@yahoo.com

unread,

Feb 20, 2017, 9:45:37 AM2/20/17

to

So, ARM cocked it up too?
Their Cortex-A53 is an excellent LBIO according to most observers. And still, on the same process, it is only about twice as power-efficient as their GBOoO Cortex-A73.
Yes, the difference in performance per mm^2 is much bigger than 2x, but that's almost totally irrelevant in practice, mostly because to do anything useful you need caches and prefetch and interconnects and memory controllers etc., all of them serving as equalizers.
Also, the difference in absolute performance is more like 2.5x than 2x.
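
Taking the two ratios above at face value, the implied power draw follows from efficiency = performance / power. A minimal Python sketch, with a hypothetical function name and no measured data behind it:

```python
def implied_power_ratio(perf_ratio, efficiency_ratio):
    """Power(big) / Power(little).

    perf_ratio:       perf(big) / perf(little), about 2.5 per the post
    efficiency_ratio: perf-per-watt(little) / perf-per-watt(big), about 2
    Since efficiency = perf / power, power = perf / efficiency.
    """
    return perf_ratio * efficiency_ratio

print(implied_power_ratio(2.5, 2.0))
```

So the big core would draw about five times the power of the little one for 2.5 times the performance, if both estimates hold.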

jacko

unread,

Feb 20, 2017, 9:49:35 AM2/20/17

to

The simplest form of OoO is jump over stall if no dependants. The question is how many pipe elements to jump over in one clock?
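
One way to picture "jump over stall if no dependants" is the Python sketch below. It scans a small window in program order and returns the first instruction that has no RAW, WAW, or WAR hazard against anything it would jump over. The instruction format, the function name, and the `max_skip` parameter (how many elements may be jumped in one clock) are assumptions for illustration, not a description of any real core.

```python
def pick_issue(window, not_ready, max_skip):
    """Return the index of the instruction to issue this clock, or None.

    window:    list of (dest, srcs) tuples in program order
    not_ready: registers whose values are still pending (e.g. a missed load)
    max_skip:  how many instructions may be jumped over in one clock
    """
    pending_writes = set(not_ready)   # values not yet produced
    pending_reads = set()             # sources of skipped instructions
    for index, (dest, srcs) in enumerate(window):
        if index > max_skip:
            break
        independent = (pending_writes.isdisjoint(srcs)     # no RAW hazard
                       and dest not in pending_writes      # no WAW hazard
                       and dest not in pending_reads)      # no WAR hazard
        if independent:
            return index
        # This instruction is skipped: later ones must respect it.
        pending_writes.add(dest)
        pending_reads.update(srcs)
    return None

window = [
    ("r2", ("r1", "r3")),   # stalled: r1 is still pending
    ("r4", ("r2", "r5")),   # depends on the stalled instruction
    ("r6", ("r7", "r8")),   # independent, can jump over both
]
print(pick_issue(window, not_ready={"r1"}, max_skip=2))
print(pick_issue(window, not_ready={"r1"}, max_skip=1))
```

The hazard sets grow with every element skipped, which is one reason the answer to "how many in one clock" tends to be a small number.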

EricP

unread,

Feb 20, 2017, 12:01:17 PM2/20/17

to

Quadibloc wrote:
>
> So here is what I think the root cause was:
>
> a) There was, at the time, a _perceived_ technical superiority of RISC over
> CISC. Note, too, that "technical superiority" isn't an absolute attribute of an
> ISA, but rather a function of the interaction between an ISA and the
> implementation technology of a given time.
>
> So if your technology lets you put a 360/40 on a single chip - or a 386 - then
> since you're doing it all in microcode, there's nothing at all wrong with CISC,
> and it's nice to the programmer.
>
> If your technology lets you put a 360/195 on a single chip - or a Pentium II
> (both had OoO for FP only, both had L1 caches only, both used a fancy
> high-performance division algorithm) - well, then you have enough transistors to
> implement CISC with high performance.
>
> So it's no accident that RISC got perceived as technically superior in the *486*
> era. If RISC gives you enough registers so you can implement deep pipelines, but
> CISC doesn't, since you can't yet afford the overhead of OoO, then RISC clearly
> _is_ technically superior - it makes a high-performance CPU possible.

It took a while to round up benchmarks from that 1992 time frame.

Bottom line: It is pretty consistent in these lists
that RISCs benchmark as 1.5 to 5 times faster than CISCs.
So it was more than just a perception.
Also you need to consider R&D cost and time.

The first is a 1992 article that compares the 80486, 68040
(both pipelined) and a bunch of others including RISCs.
About half way down are the Dhrystone integer and
then Linpack DP performance benchmarks.

CISC: The Intel 80486 vs. The Motorola MC68040
Advanced Microprocessors by Daniel Tabak, 1992
https://textfiles.meulie.net/computers/486vs040.txt

This has a performance table of many processors:
https://en.wikipedia.org/wiki/Instructions_per_second

Not on the lists from 1992 was the heavily pipelined
90 MHz NVAX which was 25 MIPS-VUPS (times the 11/780),
so slightly faster than the 25 MHz 68040.
Also missing is the 200 MHz pipelined, dual issue
Alpha 21064 which lists as 135 MIPS-VUPS.

Those two are important because they both
use the same 0.75u triple-metal CMOS process.
The 200 MHz Alpha benchmarks as 5 times faster
than the NVAX.
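
The same-process comparison can be split into a clock part and a per-clock part. The Python sketch below uses only the VUPS and MHz figures quoted above; the dictionary layout and function names are hypothetical.

```python
NVAX = {"vups": 25, "mhz": 90}
ALPHA_21064 = {"vups": 135, "mhz": 200}

def speed_ratio(a, b):
    """Overall throughput ratio a/b."""
    return a["vups"] / b["vups"]

def clock_ratio(a, b):
    """Clock frequency ratio a/b."""
    return a["mhz"] / b["mhz"]

def per_clock_ratio(a, b):
    """Throughput per MHz, a relative to b."""
    return (a["vups"] / a["mhz"]) / (b["vups"] / b["mhz"])

print(f"overall   {speed_ratio(ALPHA_21064, NVAX):.2f}x")
print(f"clock     {clock_ratio(ALPHA_21064, NVAX):.2f}x")
print(f"per clock {per_clock_ratio(ALPHA_21064, NVAX):.2f}x")
```

On these figures the roughly 5x gap is about 2.2x from clock rate and about 2.4x from work done per clock.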

This has results comparing VAX, MIPS, PA-RISC,
Pentium, Celeron, Alpha 21064, Alpha 20162:
http://www.xanthos.se/~joachim/results.pdf

> Stripped-down OoO that's enough to deal with cache misses may still be possible,
> I used OoO above to refer to full-blown register-renaming Tomasulo.

As I've noted before, you could add renaming to an in-order core.
That no one does this may be because it doesn't help.

Eric

MitchAlsup

unread,

Feb 20, 2017, 12:15:29 PM2/20/17

to

On Sunday, February 19, 2017 at 12:39:06 PM UTC-6, Ivan Godard wrote:
> On 2/19/2017 10:03 AM, MitchAlsup wrote:
>
> <snip>

>
> > At 0.6 I/C the entire core can be deClocked on a cache miss saving lots of
> > power and not having any parts of the pipeline OoO wrt other parts of the
> > pipeline. The core is small enough that this decision can be made and instantiated in a single clock cycle.
>

> How do you get the stall signal out so quickly? The core is several
> clocks wide just in wire delay, isn't it? You can stop the master
> assuming it is located near the cache, but what about the ticks that are
> in flight already?

No, the core is but one clock wide. Remember it is a very small core
with no OoO, almost no branch prediction, 1-wide, in-order, shallow pipeline;
and the cache tags are positioned at the center of the core itself.

MitchAlsup

unread,

Feb 20, 2017, 12:15:39 PM2/20/17

to

On Sunday, February 19, 2017 at 1:46:55 PM UTC-6, Nick Maclaren wrote:
> In article <7fb0d01c-86c1-4703...@googlegroups.com>,

> MitchAlsup <Mitch...@aol.com> wrote:
> >
> >To a good first order, All the OoO stuff gains a factor of 2 over a nicely
> >pipelined IO machine. So we have Great Big OoO getting 1.2 I/C and to
> >compete, the Little Bitty IO machine only has to get 0.6 I/C. There is no
> >reason the LBIO machine can't run at the same frequency, and there may be
> >frequency upside just because the power consumption is correspondingly
> >way down.
>

> Or have several times as many cores.

Something on the order of 8× the number of cores.

MitchAlsup

unread,

Feb 20, 2017, 12:15:44 PM2/20/17

to

On Monday, February 20, 2017 at 8:26:33 AM UTC-6, Anton Ertl wrote:
>
> Already Silvermont (2013) is OoO. <snip>

>
> Here's the performance I see on our LaTeX benchmark:
>
> time System
> 2.323s Bonnell Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit
> 1.216s Bobcat AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit
> 1.052s Silvermont Celeron J1900 2416MHz (Shuttle XS35V4) Ubuntu16.10 64-bit
> 0.712s Goldmont Celeron J3455 2300MHz, ASRock J3455-ITX, Ubuntu16.10 64-bit

One should not be comparing 1 LBIO core with 1 GBOoO core, but a cluster of
cores that occupy the same die footprint.

Thus <let's say> 6 LBIO cores should be compared to 1 GBOoO core.

Then one needs to find a workload which can keep a multiplicity of cores occupied.......
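
How parallel the workload has to be can be estimated with an Amdahl-style model. The Python sketch below assumes the 0.6 and 1.2 I/C figures quoted earlier in the thread, equal clock frequency, and perfect scaling of the parallel part across the small cores; all three are simplifications, and the function names are hypothetical.

```python
def cluster_time(parallel_fraction, cores, ipc):
    """Cycles per unit of work on a cluster of identical in-order cores."""
    serial = (1 - parallel_fraction) / ipc
    parallel = parallel_fraction / (ipc * cores)
    return serial + parallel

def break_even_fraction(cores, little_ipc, big_ipc):
    """Parallel fraction at which the cluster matches one big core.

    Solves cluster_time(p, cores, little_ipc) == 1 / big_ipc for p.
    Requires cores > 1.
    """
    gap = 1 / little_ipc - 1 / big_ipc
    gain = 1 / little_ipc - 1 / (little_ipc * cores)
    return gap / gain

p = break_even_fraction(cores=6, little_ipc=0.6, big_ipc=1.2)
print(f"break-even parallel fraction: {p:.2f}")
```

With 6 small cores the cluster only pulls ahead once about 60% of the work can be spread across them; below that, the single big core wins.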

BGB

unread,

Feb 20, 2017, 12:48:30 PM2/20/17

to

On 2/19/2017 5:01 PM, jacko wrote:
>> I considered doing a custom 64-bit SH-based ISA, but haven't formally
>> done so yet.
>>
>> so, this is basically using the good ol' SH4 ISA. there is the slightly
>> newer 4A and 2A ISA's, which have some tempting instructions, but it
>> isn't known safe to use them as there may(?) still exist patents which
>> cover them.
>
> a more final version spec
> https://dl.dropboxusercontent.com/u/1615413/Own%20Work/68k2-PC%23d12.pdf
> with only a few added instructions and some new addressing modes.
> There's even a table at the end to show the free instruction space.
>
> Most of the patents come from implementation of instructions.
>

here is a summary spec showing the current 32-bit ISA:
https://github.com/cr88192/bgbtech_shxemu/wiki/SH-ISA

the Hitachi SH-4 spec is also usable, though a few rarely used
SH-4 instructions aren't implemented, and my MMU works differently.

here is a more official spec of the ISA:
http://www.st.com/content/ccc/resource/technical/document/user_manual/69/23/ed/be/9b/ed/44/da/CD00147165.pdf/files/CD00147165.pdf/jcr:content/translations/en.CD00147165.pdf

don't currently have an up-to-date / complete spec for the 64-bit ISA
idea (which basically needs to not conflict with the 32-bit ISA).

I may modify it some, as the old design had 4 modes:
32-bit mode;
a mode with 32-bit addressing and 32/64-bit data ops;
a mode with 64-bit addressing and 16/32-bit data ops;
a mode with 64-bit addressing and 32/64-bit data ops.

I may reduce this down to a simple mode-select, and if 32-bit addressing
is desired, it would be handled more like in X32.

originally, the 4 modes were partly because in an earlier version,
32/64-bit data ops would completely eliminate 16-bit W ops.

if I limited it to a single-bit mode select, the only W loads/stores
available would be:
85m0 MOV.W @Rm, R0
81n0 MOV.W R0, @Rn
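
As a hedged illustration of the two remaining word forms: in the published SH-4 encoding, 85md and 81nd are the R0-based MOV.W forms with a 4-bit displacement in the low nibble, so 85m0 and 81n0 are the zero-displacement cases. A minimal C sketch of how a decoder might recognise just those two patterns follows. The names wform_t and classify_w_form are invented for this example, and what the proposed 64-bit mode would do with non-zero low nibbles is not stated above, so the sketch simply rejects them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { W_NONE, W_LOAD_R0, W_STORE_R0 } wform_t;

/* Classify a 16-bit opcode as 85m0 (MOV.W @Rm, R0), 81n0 (MOV.W R0, @Rn),
   or neither. On a match, *reg receives the m or n register field. */
wform_t classify_w_form(uint16_t op, unsigned *reg) {
    assert(reg != NULL);
    if ((op & 0x000F) != 0)          /* low nibble is the displacement; must be 0 */
        return W_NONE;
    switch (op & 0xFF00) {
    case 0x8500: *reg = (op >> 4) & 0xF; return W_LOAD_R0;
    case 0x8100: *reg = (op >> 4) & 0xF; return W_STORE_R0;
    default:     return W_NONE;
    }
}
```

For example, 0x8530 classifies as a word load from @R3 into R0, and 0x8170 as a word store of R0 to @R7.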

>> also, I wouldn't have a ready-made compiler supporting a custom ISA; I
>> would either need to modify GCC to do so (ick!), or write a dedicated C
>> compiler/backend (its own set of issues).
>
> This is a big plus to only need a small mod.
>

taking a 32-bit ISA and making it 64-bit is an easy feature for
emulators and assemblers, but C compilers deal with it much less gracefully.

though, in this case, the C ABI would be mostly unchanged (apart from
being 64-bit), which is very much unlike the case for the x86 vs x86-64
(now we have 3 somewhat different ABI's).

>> unlike the SH5 ISA (a completely new ISA for 64-bit mode), my ISA would
>> have more worked like in x86-64, quietly expanding the registers to 64
>> bits, and allowing many existing opcodes to function in various widths.
>
> I used the size = 11 option and removed many 68020+ instructions, and
> relocated a lot.
>

SH-4 had instructions for Byte/Word/Long.
I needed Byte/Word/Long/Quad, but didn't want extensive modification to
the existing ISA, or to require significantly more opcodes.

>> most 32-bit ops remained 32-bits, but many 16-bit ops could be promoted
>> to 64-bits (with a few 32-bit ops promoting, and certain redundant
>> special cases remaining as 16-bit forms).
>
> I used some of the 8 bit instructions as 64, and the 8 bit ones in
> more complex patterns as they are mainly IO rare, and often maskable in
> 16 bit.
>
>> this would partially penalize using 16-bit types in 64-bit mode, which
>> would generally require less efficient instruction sequences.
>
> Yes, this would be a problem with an addressing mode which is capable
> of using a 64 bit register as 4 * 16 bit registers.
>

no direct analog here.

the issue here is mostly with loads/stores, where in the simple case,
one has:
MOV.W @Rm, R0
MOV.W R0, @Rn
rather than:
MOV.W @Rm, Rn
MOV.W Rm, @Rn
MOV.W @(Rm, disp), R0
MOV.W R0, @(Rn, disp)
...

this means a displacement+word load would require something like:
MOV R3, R8
ADD #6, R8
MOV.W @R8, R0
rather than:
MOV.W @(R3,6), R0
...

the word case is already penalized vs the long case, which has:
MOV.L @(Rm, disp), Rn
though, could use a similar hack to the word case and be like:
5nm0 MOV.W @Rm, Rn
given:
5nm0 MOV.L @(Rm, 0), Rn
is functionally equivalent to:
6nm2 MOV.L @Rm, Rn
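
The redundancy that frees up 5nm0 can be checked mechanically. The sketch below decodes both SH-4 long-load forms and compares the registers and effective byte offset they produce; it assumes the standard SH-4 encodings (5nmd with the displacement scaled by 4, and 6nm2), and the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded long load: destination Rn, base Rm, byte offset from the base. */
typedef struct {
    int      valid;
    unsigned rn, rm;
    uint32_t offset;
} movl_load_t;

/* Decode the two SH-4 long-load forms discussed above. */
movl_load_t decode_movl_load(uint16_t op) {
    movl_load_t d = {0, 0, 0, 0};
    if ((op & 0xF000) == 0x5000) {            /* 5nmd: MOV.L @(Rm, disp), Rn */
        d.valid  = 1;
        d.offset = (uint32_t)(op & 0xF) * 4u; /* displacement is scaled by 4 */
    } else if ((op & 0xF00F) == 0x6002) {     /* 6nm2: MOV.L @Rm, Rn */
        d.valid  = 1;
        d.offset = 0;
    }
    d.rn = (op >> 8) & 0xF;
    d.rm = (op >> 4) & 0xF;
    return d;
}

/* Address a decoded load would read, given the 16 general registers. */
uint32_t long_load_address(movl_load_t d, const uint32_t regs[16]) {
    assert(d.valid);
    return regs[d.rm] + d.offset;
}

/* Nonzero if both opcodes are long loads with the same registers and offset. */
int same_long_load(uint16_t a, uint16_t b) {
    movl_load_t x = decode_movl_load(a), y = decode_movl_load(b);
    return x.valid && y.valid &&
           x.rn == y.rn && x.rm == y.rm && x.offset == y.offset;
}
```

With this, same_long_load(0x5120, 0x6122) holds (both are MOV.L @R2, R1), while 0x5121 differs because its displacement is non-zero.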

previously, this I-form was spec'ed as a way to allow 64-bit loads when
dealing mostly with 16-bit data; but it is a tradeoff vs making all this
depend on context bits (like it is with the FPU instructions).

>> also how the FPU worked was tweaked a bit, the FPU now having 16 double
>> registers (vs 16x float or 8x split-double).
>
> There is no hard spec on the length of the FP registers.
>

in SH4, they were fixed at 32-bits, and dealing with 64-bit doubles
effectively worked with them split in-half across a pair of registers;
similar to the EDX:EAX system in x86.

in the emulator, because the halves are backwards of the useful order,
this generally means that double operations involve loading like:
MOV RAX, [...]
ROL RAX, 32
MOVQ XMMn, RAX
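
The same load-rotate-move sequence can be written in C as a rough sketch. It assumes the emulated pair is stored as two consecutive 32-bit words with the high word first (so a little-endian 64-bit load gets the halves in the wrong order), and that the host double is IEEE binary64; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Swap the two 32-bit halves of a 64-bit value (the effect of ROL RAX, 32). */
uint64_t swap_halves(uint64_t v) {
    return (v << 32) | (v >> 32);
}

/* Build a host double from two consecutive 32-bit FP register images,
   assuming fr[0] holds the high word and fr[1] the low word. */
double load_split_double(const uint32_t fr[2]) {
    assert(fr != NULL);
    /* What a little-endian 64-bit load of the pair would produce. */
    uint64_t raw = (uint64_t)fr[0] | ((uint64_t)fr[1] << 32);
    raw = swap_halves(raw);          /* put the high word back on top */
    double d;
    memcpy(&d, &raw, sizeof d);      /* reinterpret the bits as a double */
    return d;
}
```

A pair holding 0x3FF00000 and 0x00000000 comes back as 1.0.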

GCC seems to be helpful in that it seems to treat "double" more as a
guideline than as an actual requirement, so much of the time, even if
"double" is given to the compiler, it still emits code working on 32-bit
float values...

it seems to have a sort of cleverness that, if it sees that the end
result is truncated to float, any of the calculations which lead to it
are truncated to float as well.
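
One case where such narrowing cannot change the result is a single multiply: the exact product of two 24-bit float significands fits in a double's 53 bits, so rounding it to float gives the same value as a float multiply. Whether this is what GCC is actually doing in the code described above is a guess. The sketch assumes IEEE 754 float and double, round-to-nearest, and float expressions evaluated in float precision; the function names are invented.

```c
#include <assert.h>

/* Multiply with double intermediates, result truncated to float. */
float mul_via_double(float a, float b) {
    return (float)((double)a * (double)b);
}

/* The same multiply done entirely in float. */
float mul_in_float(float a, float b) {
    return a * b;
}

/* Nonzero if both versions agree for the given inputs. */
int narrowing_is_invisible(float a, float b) {
    float x = mul_via_double(a, b), y = mul_in_float(a, b);
    assert(x == x && y == y);   /* this sketch does not handle NaN results */
    return x == y;
}
```

Longer chains of operations do not get the same guarantee in general, so a compiler narrowing those would need either a proof or relaxed floating-point options.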

>> unclear would be if this were the most sane way to do this.
>>
>> as-before, instructions were still fixed-width 16-bit (it was otherwise
>> still basically the SH4 ISA).
> <snip>
>> super-ops aren't really great for raw MIPS counts, as they are currently
>> counted as fewer instructions by the emulator (1), but are generally
>> useful for improving performance.
> <snip>
>> also I seem to now be often hitting 60 FPS in Quake 1 in 640x480, so...
>>
> <snip>
>> also debating whether to use the real hardware GPU (via OpenGL), or
>> going the "technically simpler in this case" route of using a software
>> rasterizer (makes most sense if using precooked screen-space triangles).
>
> OpenGLES2 is the most popular "on the market"
>

though, OTOH, a real/modern GPU is rather unlikely to be available if an
implementation were done with an FPGA. dealing with potentially a
software-emulated GPU, or something more along the lines of the S3 Virge
(say: limited fixed-function triangle drawing, *), seems more viable.

*: though the actual Virge seems to do polygon drawing, and apparently
is fed a single primitive at a time; I was more thinking of triangles
fed in via a queue.

I am left wondering how little I can get away with and still have Quake3
be able to work ok (I did a software-rendered OpenGL before, but this time
I am considering something a little more limited).

GLES2 is not likely to work well with a Virge-like design, though, and is
likely to need either SW emulation or a GPU a bit more advanced
than can likely be shoved into an FPGA (alongside a few CPU cores and
some IO peripherals).

like, even if the GPU is effectively just a multicore processor, making
typical fragment shader code perform well is itself non-trivial (more so
if the GPU cores are too small to afford things like an FPU and SIMD,
say, if SH2 cores were used as a GPU).

sadly, the emulator is currently itself a bit faster than what I could
generally expect from a possible FPGA version.

like, right now (after some more JIT optimization, 1), experimentally I
am getting ~ 410-460 MIPS from the emulator (for a single thread), and
can effectively get ~1.6k MIPS with 4 threads.

1: added a basic register allocator, which per-trace may map SH4
registers onto x86-64 registers (currently R12D..R14D, EBX, and ESI).
tested using EBP for register allocation as well, but for some
unexplained reason, performance fell off a cliff while doing so.

the register allocator gained an average of about 50 MIPS or so (vs
always using loads/stores to the emulated CPU context).

already...@yahoo.com

Feb 20, 2017, 12:52:00 PM

On Monday, February 20, 2017 at 7:15:44 PM UTC+2, MitchAlsup wrote:
> On Monday, February 20, 2017 at 8:26:33 AM UTC-6, Anton Ertl wrote:
> >
> > Already Silvermont (2013) is OoO. <snip>
> >
> > Here's the performance I see on our LaTeX benchmark:
> >
> > time System
> > 2.323s Bonnell Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit
> > 1.216s Bobcat AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit
> > 1.052s Silvermont Celeron J1900 2416MHz (Shuttle XS35V4) Ubuntu16.10 64-bit
> > 0.712s Goldmont Celeron J3455 2300MHz, ASRock J3455-ITX, Ubuntu16.10 64-bit
>
> One should not be comparing 1 LBIO core with 1 GBOoO core, but a cluster of
> cores that occupy the same die footprint.

Bobcat and Silvermont are certainly not GBOoO cores.
It's very likely that Bobcat is *smaller* than Bonnell when manufactured on the same process. Silvermont is bigger, but certainly not 6 times bigger. More like 1.5-2 times.
Goldmont in my book qualifies as "small GBOoO", but even Goldmont is not 6 times bigger than Bonnell.

As to perf/power it's hard to tell, because of differences in manufacturing. I'd guess that on the same process Silvermont will be the most efficient, then Bonnell/Saltwell. Bobcat/Jaguar/Puma and Goldmont are likely less efficient, but not by much.

>
> Thus <let's say> 6 LBIO cores should be compared to 1 GBOoO core.
>
> Then one needs to find a workload which can keep a multiplicity of cores occupied.......

ARM had a really small application processor in the form of the Cortex-A5. It was less successful than the still small, but bigger, Cortex-A7.

Quadibloc

Feb 20, 2017, 1:02:48 PM

On Sunday, February 19, 2017 at 7:44:35 PM UTC-7, sarr wrote:

> There was a 68008 with an 8 bit bus, but it may have been too late.

Yes, it was. It was not available in 1981 for the IBM PC, but when it did become
available, it was used in the Sinclair QL.

John Savard

Stefan Monnier

Feb 20, 2017, 1:14:56 PM

> interesting ISA once where the fixed-length instruction had three arguments
> and two opcodes and computed (A op1 B) op2 C, with the intermediate result
> hidden; sort of a similar idea.

Not only does it save register renames, but it also saves you from placing
latches between op1 and op2 (and it additionally lets you use a simpler
forwarding network between the supported combinations of op1 and op2) if
you're interested in using a slower clock rate.
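
A small software model of that instruction form, as a sketch only: it shows that the intermediate result of (A op1 B) op2 C lives in a temporary and never needs an architectural register. It says nothing about latches or forwarding, which are hardware concerns. The opcode set and names are invented for the example and are not taken from the ISA being recalled.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR } alu_op_t;

/* One ALU step on 32-bit operands. */
uint32_t alu(alu_op_t op, uint32_t x, uint32_t y) {
    switch (op) {
    case OP_ADD: return x + y;
    case OP_SUB: return x - y;
    case OP_AND: return x & y;
    case OP_OR:  return x | y;
    case OP_XOR: return x ^ y;
    }
    assert(0 && "unknown op");
    return 0;
}

/* Execute one fused instruction: (a op1 b) op2 c. */
uint32_t exec_fused(alu_op_t op1, alu_op_t op2,
                    uint32_t a, uint32_t b, uint32_t c) {
    uint32_t hidden = alu(op1, a, b);   /* never written to a named register */
    return alu(op2, hidden, c);
}
```

For instance, exec_fused(OP_ADD, OP_XOR, 1, 2, 7) computes (1 + 2) ^ 7.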

Stefan

Ivan Godard

Feb 20, 2017, 2:09:34 PM

On 2/20/2017 9:48 AM, BGB wrote:
> On 2/19/2017 5:01 PM, jacko wrote:

<snip>

> here is a summary spec showing the current 32-bit ISA:
> https://github.com/cr88192/bgbtech_shxemu/wiki/SH-ISA
>

Amazing how much complexity goes away when data defines its own width.

Megol

Feb 20, 2017, 3:37:49 PM

On Monday, February 20, 2017 at 3:38:19 PM UTC+1, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >A claim I've seen recently is that Moore's law has already ended in that
> >although smaller processes continue to be developed (slowly) the cost per
> >transistor is rising in those new processes not falling.
>
> If that was the case, why would Intel invest money in smaller processes?

As long as there is a performance benefit there's a reason to continue improving process technology. If not, another actor will be able to increase their market influence by reducing per-chip profits.

However, I doubt that transistors are getting more expensive already; at 7 nm to 5 nm, perhaps. Quadruple multi-patterning is expensive and EUV would still be problematic: EUV needs multi-patterning on those nodes, and EUV masks are more expensive, easier to destroy, harder to verify and hard to correct.

Megol

Feb 20, 2017, 3:43:42 PM

In the instruction set? Yes, and it saves some instruction bits too. But for the actual execution hardware?
The difference is where the size information comes from; otherwise there shouldn't be any difference AFAIK. There can be some problems too, as specifying the operand size in the instruction allows execution width to be changed per execution unit; doing that with size-tagged data is more complicated.
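
To make the two options concrete, here is a minimal sketch in C: one add whose width is supplied by the instruction decoder, and one whose width travels with the operands. It is a toy model under assumed semantics (wrap-around arithmetic, equal operand widths), not a description of how the Mill or any real machine implements its metadata; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for a width given in bytes (1, 2, 4 or 8). */
uint64_t width_mask(unsigned bytes) {
    assert(bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8);
    return bytes == 8 ? ~(uint64_t)0 : (((uint64_t)1 << (bytes * 8)) - 1);
}

/* Width named by the instruction: the decoder supplies `bytes`. */
uint64_t add_sized(unsigned bytes, uint64_t a, uint64_t b) {
    return (a + b) & width_mask(bytes);
}

/* Width carried by the data: each operand is tagged with its size. */
typedef struct { uint64_t bits; unsigned bytes; } tagged_t;

tagged_t add_tagged(tagged_t a, tagged_t b) {
    assert(a.bytes == b.bytes);   /* mixed widths are out of scope here */
    tagged_t r = { (a.bits + b.bits) & width_mask(a.bytes), a.bytes };
    return r;
}
```

The arithmetic is the same in both functions; only the source of `bytes` differs, which matches the point that the difference lies in where the size information comes from.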

BGB

Feb 20, 2017, 3:44:43 PM

?...

do you mean this in that SuperH only has fixed-width registers, and
limits 8/16 bit operations to memory loads/stores? (as opposed to being
like x86 and having 8/16/32/64 bit operations for every register operation).

though, luckily this ISA isn't all that bad, and managed to fit pretty
much everything into fixed-width 16-bit instruction words.

ATM, I am not really sure what is usually considered good performance
for a CPU emulator, still generally at around 410-460 MIPS (running on a
4GHz AMD FX-8350).

I am not actually sure where all the time is going in this case, as
nothing seems to be obviously eating a mountain of cycles (and a lot of
the instructions are mapping pretty close to native x86 instructions,
intuitively I would expect it a little better).

checking frame times (same resolution and viewpoint), Quake in the
emulator is ~ 7x slower than a version compiled as 32-bit x86 (~ 35 ms
frames vs ~ 5 ms frames).

I guess probably not too terrible.

though, if it were 1:1, for a ~ 7x slowdown I would be expecting to be
seeing something like ~ 600-700 MIPS from the emulator or something?...
(given that a 4GHz CPU presumably pulls off 4000 MIPS and SH4
instructions are presumably worth slightly less than x86 instructions).

so, apparently, I am pulling around a 7x slowdown at around 420 MIPS, hrrm.

or such...
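
The back-of-envelope numbers above can be reproduced with a few lines of C. This is only the arithmetic from the post (35 ms vs 5 ms frames, a presumed 4000 MIPS native rate, and a 1:1 instruction mapping), all of which are rough assumptions; the function names are made up.

```c
#include <assert.h>

/* Slowdown factor from two frame times in milliseconds. */
double slowdown(double emu_ms, double native_ms) {
    assert(emu_ms > 0.0 && native_ms > 0.0);
    return emu_ms / native_ms;
}

/* Emulator MIPS expected if guest instructions mapped 1:1 onto host ones. */
double expected_emu_mips(double native_mips, double factor) {
    assert(factor > 0.0);
    return native_mips / factor;
}

/* Native instruction rate implied by an observed emulator rate. */
double implied_native_mips(double emu_mips, double factor) {
    return emu_mips * factor;
}
```

With these figures, slowdown(35, 5) is 7, expected_emu_mips(4000, 7) is about 571, and implied_native_mips(420, 7) is 2940, so the observed rate would be consistent with the native build retiring nearer 3000 than 4000 MIPS.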

EricP

Feb 20, 2017, 4:11:53 PM

EricP wrote:
>
> Bottom line: It is pretty consistent in these lists
> that RISC's benchmark as 1.5 to 5 times faster than CISCs.
> So it was more than just a perception.

Here is another: a pretty detailed analysis by DEC
comparing a MIPS M/2000 to a VAX 8700, and finding
"resulting [MIPS] advantage in cycles per program ranges
from slightly under a factor of 2 to almost a factor of 4,
with a geometric mean of 2.7"

VAX 8700 has an integer SPECmark of 5.0, MIPS M/2000 is 19.7.

Performance from Architecture: Comparing a RISC and a CISC
with Similar Hardware Organization (1991)
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.5563

Eric

already...@yahoo.com

Feb 20, 2017, 4:55:46 PM

On Monday, February 20, 2017 at 10:44:43 PM UTC+2, BGB wrote:
> On 2/20/2017 1:09 PM, Ivan Godard wrote:
> > On 2/20/2017 9:48 AM, BGB wrote:
> >> On 2/19/2017 5:01 PM, jacko wrote:
> >
> > <snip>
> >
> >> here is a summary spec showing the current 32-bit ISA:
> >> https://github.com/cr88192/bgbtech_shxemu/wiki/SH-ISA
> >>
> >
> > Amazing how much complexity goes away when data defines its own width.
> >
>
> ?...
>
> do you mean this in that SuperH only has fixed-width registers, and
> limits 8/16 bit operations to memory loads/stores? (as opposed to being
> like x86 and having 8/16/32/64 bit operations for every register operation).
>
>

No, Ivan means that Mill metadata is absolutely great! :-)

Bruce Hoult

Feb 20, 2017, 5:29:43 PM

On one benchmark that I've run across a wide variety of machines (counting prime numbers) qemu-riscv is 3.4x slower than native on an i7 6700K, while qemu-arm is 8.58x slower and qemu-aarch64 is 8.73x slower than native.

qemu-riscv uses 114% of the clock cycles on the i7 compared to the same code native on the FE310 core (a 32 bit "Rocket" single issue in-order).

qemu-arm uses 246% of the clock cycles on the i7 compared to native on an A7 (Raspberry Pi 2), and 452% of the clock cycles compared to native on an A15 (Odroid XU4)
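
For anyone wanting to re-derive these ratios from the timing table in the program's header comment, the arithmetic is just seconds times clock rate, then a percentage. A minimal sketch, with invented helper names:

```c
#include <assert.h>

/* Clock cycles, in billions, for a run of `seconds` at `mhz` MHz. */
double billions_of_clocks(double seconds, double mhz) {
    assert(seconds >= 0.0 && mhz > 0.0);
    return seconds * mhz / 1000.0;
}

/* a expressed as a percentage of b. */
double percent_of(double a, double b) {
    assert(b != 0.0);
    return 100.0 * a / b;
}
```

For example, billions_of_clocks(9.740, 4200) gives about 40.9 and percent_of(40.9, 35.9) about 114, matching the qemu-riscv figures. The same calculation for the Pi2 row (47.910 s at 900 MHz) comes out nearer 43.1 than the listed 42.1, so one of that row's figures may have been mistyped.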

Would be interesting to know how an SH and your emulator do.

// Program to count primes. Not great code, but I wanted something that
// could run in 16 KB and took time, and not optimizable (and with
// unpredictable branches). Size is for just countPrimes() with gcc -O1
//
// SZ = 1000 -> 3713160 primes, all primes up to 7919^2 = 62710561
// 2.872 sec i7 6700K @ 4200 MHz 240 bytes 12.1 billion clocks
// 4.868 sec i7 3770 @ 3900 MHz 240 bytes 19.0 billion clocks
// 9.740 sec i7 6700K qemu-riscv 182 bytes 40.9 billion clocks
// 11.445 sec Odroid XU4 A15 @ 2 GHz 204 bytes 22.9 billion clocks
// 19.500 sec Odroid C2 A53 @ 1.536 GHz A64 276 bytes 30.0 billion clocks
// 23.940 sec Odroid C2 A53 @ 1.536 GHz T32 204 bytes 36.8 billion clocks
// 24.636 sec i7 6700K qemu-arm 204 bytes 103.5 billion clocks
// 25.060 sec i7 6700K qemu-aarch64 276 bytes 105.3 billion clocks
// 30.420 sec Pi3 Cortex A53 @ 1.2 GHz 204 bytes 36.5 billion clocks
// 47.910 sec Pi2 Cortex A7 @ 900 MHz 204 bytes 42.1 billion clocks
// 112.163 sec HiFive1 RISCV @ 320 MHz 182 bytes 35.9 billion clocks
// 140.241 sec HiFive1 RISCV @ 256 MHz 182 bytes 35.9 billion clocks

#include <stdio.h>

#define SZ 1000
int primes[SZ], sieve[SZ];
int nSieve = 0;

int countPrimes(){
    primes[0] = 2; sieve[0] = 4; ++nSieve;
    int nPrimes = 1, trial = 3, sqr=2;
    while (1){
        while (sqr*sqr <= trial) ++sqr;
        --sqr;
        for (int i=0; i<nSieve; ++i){
            if (primes[i] > sqr) goto found_prime;
            while (sieve[i] < trial) sieve[i] += primes[i];
            if (sieve[i] == trial) goto try_next;
        }
        break;
    found_prime:
        if (nSieve < SZ){
            //printf("Saving %d: %d\n", nSieve+1, trial);
            primes[nSieve] = trial;
            sieve[nSieve] = trial*trial;
            ++nSieve;
        }
        ++nPrimes;
    try_next:
        ++trial;
    }
    return nPrimes;
}

int main(){
    int res = countPrimes();
    printf("%d primes found\n", res);
    return 0;
}

BGB

Feb 20, 2017, 5:44:45 PM

does it come mostly for free, or is it something that one has to pay for
via runtime costs? (ex: checking type-tags or similar eats a certain
number of clock-cycles and/or involves branching based on status flags,
...).

FWIW, I didn't design this ISA, mostly the original design was done by
Hitachi back in the 90s (and has hit-or-miss support by the Linux kernel
and by GCC).

then the J-Core people started reviving it (due to expired patents, 1),
and I encountered the project, and liked the ISA design, so wrote an
emulator.

though they are mostly focusing on SH2 (also supported by the emulator),
I went for SH4, partly as it has some more features.

1: apparently also because it is simpler than 32-bit x86, more capable
and has better code density than ARMv3, has GCC support, ...

Stephen Fuld

Feb 20, 2017, 6:29:02 PM

On 2/20/2017 6:37 AM, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
>> A claim I've seen recently is that Moore's law has already ended in that
>> although smaller processes continue to be developed (slowly) the cost per
>> transistor is rising in those new processes not falling.
>
> If that was the case, why would Intel invest money in smaller processes?

Well, smaller transistors means smaller dies for the same number of
transistors. This means more dies per wafer, thus lower cost. That
saving may outweigh any increased cost per transistor. Or, Intel could
spend the die area on a more parallel GPU unit or even larger cache.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)