
Some More of the 68000's Greatest Mistakes


Quadibloc

May 25, 2018, 2:20:36 AM
I could not resist this title, given the title of the existing thread.

It is true that at the beginning, the 68000 "went wrong" by not offering an option like the 68008 in time for IBM to have considered it for the PC.

Given the similarities of the 8086 to the 8080, and thus the ease of converting programs from CP/M to PC-DOS that resulted, though... I think that although many at resolutely big-endian IBM would have liked the 68008, it still wouldn't be a slam dunk.

At the end of the life of the 68000 architecture, though, I can think of three actions on Motorola's part that were mistakes...

1) The 68060 was brought out to compete with the Pentium.

However, it made what seems to me to be a mistake... although a most reasonable
one. One that AMD keeps making as well.

Given that floating-point is an exotic data type, mainly of use to scientists
and FORTRAN programmers, and integer performance is what matters...

then AMD's decision with Bulldozer must have seemed reasonable at the time, to
de-emphasize floating-point vector performance to save die space...

and the decision with the 68060 to do the opposite of what Intel did with the
Pentium AND the opposite of what IBM did with the Model 91, and make the
*integer* unit pipelined, but *not* the *floating-point* unit must have seemed
equally sensible.

2) The decision to abandon the 68000 architecture when it still had
*customers*... Apple, with its Macintosh, and as well the Atari ST and the
Commodore Amiga... well, that branded Motorola as an unreliable supplier.

3) And then there's ColdFire. Maybe the 68000 had too many baroque addressing
modes, and even doing what Intel was about to do with the Pentium Pro, and AMD
did by buying another company's design for the... was it the K6? ... and
converting code into RISC-like micro-ops would not have saved it.

However, ColdFire went a step too far. One of the addressing modes abandoned was
16-bit displacement + base register + index register. Without that, one has to
compile code that uses arrays into long, cumbersome sequences involving extra
instructions. This is one of the most basic standard addressing modes - RX
format - not some VAX-like foray into an exotic realm.

Had they not done this, one could have imagined 68000 customers converting to
the more streamlined ColdFire architecture, if only the features omitted were
ones they could live without.

However, as ColdFire was positioned as being for the embedded market, it is also
not as if beefy processors with that architecture would be available.

John Savard

matt...@gmail.com

May 25, 2018, 2:31:03 PM
On Friday, May 25, 2018 at 1:20:36 AM UTC-5, Quadibloc wrote:
> I could not resist this title, given that of the existing thread.
>
> It is true that at the beginning, the 68000 "went wrong" by not offering an option like the 68008 in time for IBM to have considered it for the PC.
>
> Given the similarities of the 8086 to the 8080, and thus the ease of converting programs from CP/M to PC-DOS that resulted, though... I think that although many at resolutely big-endian IBM would have liked the 68008, it still wouldn't be a slam dunk.

I agree. IBM and Intel already had a business relationship, and the 8088 was available in large quantities for cheap (likely due to poor performance and weak demand). Motorola probably could not have supplied large quantities of even the 68000 to one customer because they were selling so many to many different customers. Many of these were also higher end, higher margin CPU customers for workstations. The focus was on the high margin, high end CPU market, which later disappeared quickly with the RISC hype and propaganda.

Had Motorola expanded more quickly into the embedded and mass-production affordable PC CPU markets with the 68008, it would have helped the 68k family survive longer even without IBM choosing it for the IBM PC. It would be interesting to know whether the 68008 would also have been chosen for the Apple Macintosh, Atari ST and/or C= Amiga had it been available earlier. As it was, only the short-lived Sinclair QL received the 68008 (the QL had the first preemptively multitasking PC OS, followed shortly after by AmigaOS, showing how much easier compiler support is for an orthogonal CPU with many GP registers).

> At the end of the life of the 68000 architecture, though, I can think of three actions on Motorola's part that were mistakes...
>
> 1) The 68060 was brought out to compete with the Pentium.
>
> However, it made what seems to me to be a mistake... although a most reasonable
> one. One that AMD keeps making as well.
>
> Given that floating-point is an exotic data type, mainly of use to scientists
> and FORTRAN programmers, and integer performance is what matters...
>
> then AMD's decision with Bulldozer must have seemed reasonable at the time, to
> de-emphasize floating-point vector performance to save die space...
>
> and the decision with the 68060 to do the opposite of what Intel did with the
> Pentium AND the opposite of what IBM did with the Model 91, and make the
> *integer* unit pipelined, but *not* the *floating-point* unit must have seemed
> equally sensible.

I don't know that the 68060 deliberately "de-emphasized" floating point performance. The 68040 FPU design had already simplified the FPU by trapping many of the 6888x instructions and handling them in software (the x86 FPU retained most legacy FPU instructions in hardware). I expect the lack of a fully pipelined FPU on the 68060 came down to lack of time and/or the 2.5 million transistor budget. The 68060 design team probably realized they had to do more with less than the Pentium to compete. The 68k won the battle with the 68060, but the 68k lost the war.

Pentium@75MHz 80502, 3.3V, 0.6um, 3.2 million transistors, 9.5W max
68060@75MHz 3.3V, 0.6um, 2.5 million transistors, ~5.5W max *1
PPC 601@75MHz 3.3V, 0.6um, 2.8 million transistors, ~7.5W max *2

*1 estimate based on 68060@50MHz 3.9W max, 68060@66MHz 4.9W max
*2 estimate based on 601@66MHz 7W max, 601@80MHz 8W max

The 68060 had the best integer performance even though benchmarks often showed the Pentium to be competitive if not slightly ahead due to much better compiler support. The Pentium did have better theoretical FPU performance but likely required hand laid assembler to achieve it. The 68060 FPU is more compiler friendly and performs well on mixed code as integer instructions can often operate in parallel (although FPU instructions using immediates annoyingly can't probably due to another transistor saving strategy). From optimizing FPU code, I suspect the small 8kB data cache becomes the bottleneck on these older processors with games like Quake which became so important to sales at the time. My 68060@75MHz Amiga with Voodoo 4 (no T&L) can play Quake 512x384x16 at ~25 fps. It is obviously not as well optimized as the PC version but shows the Pentium FPU advantage was better for marketing than performance. Certainly at that time it was better to focus on integer performance but the 68060 still had a good FPU.

> 2) The decision to abandon the 68000 architecture when it still had
> *customers*... Apple, with its Macintosh, and as well the Atari ST and the
> Commodore Amiga... well, that branded Motorola as an unreliable supplier.

Apple was in the AIM (Apple, IBM and Motorola) alliance, so it was already on board with switching ISAs to PPC. Atari ST sales were dropping and the Amiga was grossly mismanaged. Commodore did try to license the 68k from Motorola to make a cheap SoC Amiga before they went bankrupt; I suspect it would have been a 68020 or 68030 rather than the 68060, though. A single-chip Amiga with a 68060 would have been awesome, especially if Commodore could have continued to improve the 68060.

Commodore had bought MOS (Chuck "6502" Peddle joined C=, but they wasted his talents in typical C= fashion), was using FPGAs for custom chip development back then, and had added instructions to PA-RISC for their Hombre gfx chipset, so they had the technology to improve it if not sabotaged by management. Many of the C= engineers had bought into the RISC hype, so there was no guarantee the 68k would have been further developed by C=. The 68060 also performed better than the comparable PA-RISC CPU of the time at the same clock speed. I did the following 68060 comparison a while ago...

The PA-RISC 7100@99MHz (L1: 256kB ICache/256kB DCache) without SIMD could decode MPEG 320x240 video at 18.7 fps. My 68060@75MHz (L1: 8kB ICache/8kB DCache) using the old RiVA 0.50 decodes MPEG video between 18-22fps (average ~20fps). An update to the new RiVA 0.52 works now giving 21-29 fps (average is ~26fps with more 68060 optimization possible). Note that the PA-RISC 7100 was introduced in 1992 and used in technical and graphical workstations and computing servers while the 68060 was introduced in 1994 for desktop and embedded applications (less demanding and lower cost applications). The PA-RISC 7100LC@60MHz (L1: 32kB ICache/32kB DCache) introduced in 1994 with SIMD (initially 32 bit MAX but may have been upgraded to MAX-2 later?) could do 26fps decoding 320x240 MPEG. MAX not only improved the performance (finally better than the 68060 at MPEG fps) but improved the code density by replacing many RISC instructions allowing the cache sizes to be reduced tremendously. The PA-RISC 7100LC@80MHz (L1: 128kB ICache/128kB DCache) with MAX SIMD could do 33fps decoding 320x240 MPEG. As we can see, the PA-RISC had unimpressive performance even with an SIMD and lots of resources.

> 3) And then there's ColdFire. Maybe the 68000 had too many baroque addressing
> modes, and even doing what Intel was about to do with the Pentium Pro, and AMD
> did by buying another company's design for the... was it the K6? ... and
> converting code into RISC-like micro-ops would not have saved it.

The 68k doesn't need to waste energy on moving to OoO and breaking instructions down as far as x86/x86_64 does, which means it could make a better Atom-like CPU. The powerful addressing modes may reduce the clock speed some, but there are indications that it would have very good single-core performance/clock. A lower clock speed is bad for marketing but good for embedded applications.

> However, ColdFire went a step too far. One of the addressing modes abandoned was
> 16-bit displacement + base register + index register. Without that, one has to
> compile code that uses arrays into long, cumbersome sequences involving extra
> instructions. This is one of the most basic standard addressing modes - RX
> format - not some exotic VAX-like foray into an exotic realm.

I altered a 68k disassembler to analyze Amiga 68k code and didn't see (bd16,An,Xi*SF) used very often. Maybe other operating systems use it more. Maybe compilers were already breaking it down into multiple instructions (because of the EA calc cost of this addressing mode?). Most modern large 68k programs use absolute addressing because there is no cheap (d32,An) or (d32,PC) when programs become large, and absolute addressing is cheaper than (bd,An,Xi*SF). This is a waste, as many accesses would fit in (d16,An) or (d16,PC), which would improve code density. There is a simple and compact way to encode (d32,PC) but not (d32,An), which is why I suggested an ABI code model which merges all sections and allows most accesses to be PC relative (this requires allowing PC-relative writes, which are currently illegal). PC-relative addressing is more compact and saves a base address register. Certainly absolute addressing makes no sense when moving to a 64-bit 68k CPU, where an efficient (d32,PC) is needed anyway.

Here are the EA calc times of the 68060.

Dn Data Register Direct 0(0/0)
An Address Register Direct 0(0/0)
(An) Address Register Indirect 0(0/0)
(An)+ Address Register Indirect with Postincrement 0(0/0)
–(An) Address Register Indirect with Predecrement 0(0/0)
(d16,An) Address Register Indirect with Displacement 0(0/0)
(d8,An,Xi*SF) Address Register Indirect with Index and Byte Displacement 0(0/0)
(bd,An,Xi*SF) Address Register Indirect with Index and Base 16/32 Bit Displacement 1(0/0)
([bd,An,Xn],od) Memory Indirect Preindexed Mode 3(1/0)
([bd,An],Xn,od) Memory Indirect Postindexed Mode 3(1/0)
(xxx).W Absolute Short 0(0/0)
(xxx).L Absolute Long 0(0/0)
(d16,PC) Program Counter with Displacement 0(0/0)
(d8,PC,Xi*SF) Program Counter with Index and Byte Displacement 0(0/0)
(bd,PC,Xi*SF) Program Counter with Index and Base 16/32 Bit Displacement 1(0/0)
#<data> Immediate 0(0/0)
([bd,PC,Xn],od) Program Counter Memory Indirect Preindexed Mode 3(1/0)
([bd,PC],Xn,od) Program Counter Memory Indirect Postindexed Mode 3(1/0)

I expect the powerful addressing modes could be done in fewer cycles today.
If (bd,An,Xi*SF) and (bd,PC,Xi*SF) could be reduced from 1 cycle to 0 cycles then it could be used more. This should reduce the double indirect mode cost to 1 cycle which would be great (compilers should split these instructions when other instructions can be scheduled in between).

> Had they not done this, one could have imagined 68000 customers converting to
> the more streamlined ColdFire architecture, if only the features omitted were
> ones they could live without.
>
> However, as ColdFire was positioned as being for the embedded market, it is also
> not as if beefy processors with that architecture would be available.

The big mistake of ColdFire was not allowing full 68k compatibility with traps to software. Motorola did a poor job of marketing the 68k for embedded, but that was where it was really good, and it did catch on with a loyal following of 68k developers and fans, who slowly went over to ARM due to ColdFire's lack of 68k compatibility and support. The CPU32 ISA with the MVS, MVZ, BYTESWAP and BITSWAP instructions would have been better than ColdFire. ColdFire aimed too low, where the 68k can't compete with minimalist RISC and isn't powerful anymore, but abandoned the higher end embedded market where 68k code density and ease of use make it a good choice.

Motorola tried to shove PPC down the throats of developers for high end embedded as well as desktop PC processors after the AIM agreement. The huge success of the 68k disappeared practically overnight due to Motorola incompetence. The 68060 was one of the greatest processors of its time but had an identity crisis, so Motorola threw the baby out with the bathwater. The greener pastures of PPC on the other side of the fence don't seem so green now.

Quadibloc

May 25, 2018, 3:56:40 PM
On Friday, May 25, 2018 at 12:31:03 PM UTC-6, matt...@gmail.com wrote:

> The greener pastures of PPC on the other side of the fence don't seem so green
> now.

That may be, but in a way that explains why it wasn't a mistake at the time to
try hopping to the PowerPC. Yes, that didn't succeed; the PowerPC didn't become
a very popular architecture. But they didn't have much to lose: the 68000
architecture had a... *presence*... given the Macintosh, the Atari ST, and the
Amiga, but still, this presence was not remotely as big as what the x86 had with
the IBM PC and its clones.

Neither the Macintosh, the Atari ST, nor the Amiga could be cloned.

Only the Macintosh of those three was something that businesses took seriously;
the other two were strictly home computers, even if the Atari ST looked the
part, and the Amiga, in most of its versions, sort of _looked_ like an office
desktop.

Apple's tendency to keep everything proprietary and high-priced was established
back then; it isn't something new today.

So there was no 68000 platform with growth potential, no standard 68K box that
could rival the PC.

John Savard

matt...@gmail.com

May 25, 2018, 5:15:11 PM
On Friday, May 25, 2018 at 2:56:40 PM UTC-5, Quadibloc wrote:
> On Friday, May 25, 2018 at 12:31:03 PM UTC-6, matt...@gmail.com wrote:
>
> > The greener pastures of PPC on the other side of the fence don't seem so green
> > now.
>
> That may be, but in a way that explains why it wasn't a mistake at the time to
> try hopping to the PowerPC. Yes, that didn't succeed; the PowerPC didn't become
> a very popular architecture. But they didn't have much to lose: the 68000
> architecture had a... *presence*... given the Macintosh, the Atari ST, and the
> Amiga, but still, this presence was not remotely as big as what the x86 had with
> the IBM PC and its clones.

It was no doubt disconcerting when the bread-and-butter, profitable workstation makers jumped from 68k to RISC. With the 68k, it was still Amiga, Macintosh and Atari ST (and should have been embedded too) vs x86 clones. With PPC, it was just Macintosh, plus PPC shoved down embedded developers' throats, vs x86 clones - which turned into dead PPC vs x86 clones vs ARM for embedded. Now Motorola is a Chinese company which doesn't make CPUs, and Freescale/NXP is a Dutch company which pays license fees to ARM for most of their CPU designs. ARM won the war because they were willing to get dirty in the trenches while the mighty Motorola panicked and surrendered at the first opportunity.

> Neither the Macintosh, the Atari ST, nor the Amiga could be cloned.

There were Mac clones for a little while but Apple changed their mind. Apple also changed their mind about PPC. I expect more than a few business partners of Apple lost piles of money and some probably went bankrupt. Some people still do business with Apple and take them seriously though.

> Only the Macintosh of those three was something that businesses took seriously;
> the other two were strictly home computers, even if the Atari ST looked the
> part, and the Amiga, in most of its versions, sort of _looked_ like an office
> desktop.

The Amiga was the computer for desktop video with the Toaster (near real-time operations no competitor could match), so it found a niche (also some very good paint programs). The Atari ST wasn't bad for DTP, audio and databases. The Mac was really only DTP on the 68k. They did get some MS software, but it was generally considered inferior to the PC clone versions.

> Apple's tendency to keep everything proprietary and high-priced was established
> back then, it isn't something that's new today.
>
> So there was no 68000 platform with growth potential, no standard 68K box that
> could rival the PC.

Apple had trouble back then too. MS bailed them out (bought something like 25% of Apple) or they likely would have gone bankrupt. Motorola could have bought Apple, Amiga or Atari ST for a song at the right time if they wanted to be vertically integrated as they would have benefited the most from an open 68k platform (clone makers would have bought 68k CPUs too). The desktop and gaming markets are cyclical while the embedded markets are consistent and defensive so I would have pushed more and faster into them with the 68k CPU that developers loved. I probably would have developed PPC or the 88k to have a RISC offering as well. Choice and happy customers are good.

MitchAlsup

May 25, 2018, 10:08:05 PM
On Friday, May 25, 2018 at 2:56:40 PM UTC-5, Quadibloc wrote:
> On Friday, May 25, 2018 at 12:31:03 PM UTC-6, matt...@gmail.com wrote:
>
> > The greener pastures of PPC on the other side of the fence don't seem so green
> > now.
>
> That may be, but in a way that explains why it wasn't a mistake at the time to
> try hopping to the PowerPC. Yes, that didn't succeed; the PowerPC didn't become
> a very popular architecture. But they didn't have much to lose: the 68000
> architecture had a... *presence*... given the Macintosh, the Atari ST, and the
> Amiga, but still, this presence was not remotely as big as what the x86 had with
> the IBM PC and its clones.

When I was at Moto, we had a saying "Apple paid for the FAB, but you made
no profit on them"

The same can be said for being a supplier to SPARC, too.

Terje Mathisen

May 26, 2018, 3:11:11 AM
matt...@gmail.com wrote:
> The 68060 had the best integer performance even though benchmarks
> often showed the Pentium to be competitive if not slightly ahead due
> to much better compiler support. The Pentium did have better
> theoretical FPU performance but likely required hand laid assembler
> to achieve it. The 68060 FPU is more compiler friendly and performs
> well on mixed code as integer instructions can often operate in
> parallel (although FPU instructions using immediates annoyingly can't
> probably due to another transistor saving strategy). From optimizing
> FPU code, I suspect the small 8kB data cache becomes the bottleneck
> on these older processors with games like Quake which became so
> important to sales at the time. My 68060@75MHz Amiga with Voodoo 4
> (no T&L) can play Quake 512x384x16 at ~25 fps. It is obviously not as
> well optimized as the PC version but shows the Pentium FPU advantage
> was better for marketing than performance. Certainly at that time it
> was better to focus on integer performance but the 68060 still had a
> good FPU.

Since I got my name into the Quake manual for the work I did helping
optimize the asm code, I still remember quite a bit from that time:

Comparing the original Quake (pure sw rendering) with a later version
supporting the Voodoo card is not even apples vs oranges!

Mike Abrash (with a little bit of help from me, maybe 5%?) managed to
triple (!) the speed of John Carmack's original C code (which is what
you would have to run on that 68K).

This code was extremely tightly coupled with the instruction latencies
and throughput of the Pentium, both integer and FPU, among many other
things it did a proper division for correct perspective once every 16
pixels, and the latency of that 32-bit FDIV (17 cycles afair) was very
carefully overlapped with other parts of the code.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

David Brown

May 26, 2018, 9:26:40 AM
On 25/05/18 20:31, matt...@gmail.com wrote:
> On Friday, May 25, 2018 at 1:20:36 AM UTC-5, Quadibloc wrote:
<snip>
>>
>> However, as ColdFire was positioned as being for the embedded
>> market, it is also not as if beefy processors with that
>> architecture would be available.
>
> The big mistake of ColdFire was not to allow full 68k compatibility
> with traps to software. Motorola did a poor job of marketing the 68k
> for embedded but that was where it was really good and did catch on
> with a loyal following of 68k developers and fans which slowly went
> over to ARM due to ColdFire's lack of 68k compatibility and support.
> The CPU32 ISA with the MVS, MVZ, BYTESWAP and BITSWAP instructions
> would have been better than ColdFire. ColdFire aimed too low where
> the 68k can't compete with minimalist RISC and isn't powerful anymore
> but abandoned the higher end embedded market where the 68k code
> density and ease of use make it a good choice. Motorola tried to
> shove PPC down the throats of developers for high end embedded as
> well as desktop PC processors after the AIM agreement. The huge
> success of the 68k disappeared practically overnight due to Motorola
> incompetence. The 68060 was one of the greatest processors of its
> time but had an identity crisis so Motorola threw the baby out with
> the bathwater. The greener pastures of PPC on the other side of the
> fence don't seem so green now.
>

The 68K architecture was, AFAIK, used in two embedded arenas - network
equipment, and automotive (engine controllers and the like). Networking
split into two main branches - chips for routers, firewalls, etc.,
requiring more general processing, and chips for switches and security
products where speed was the top requirement. For the general usage,
ColdFire became popular - it was the original target for ucLinux, and
the devices with MMU were a major choice for "normal" embedded Linux, as
well as VxWorks and other OSes. But it was never fast enough for the heavy
switches - they used PPC cores.

In the automotive world, and a bit wider in industrial electronics, the
68332 was an immensely popular chip. (I loved it.) Motorola, then
Freescale, tried to kill it off for years and move customers over to
ColdFire or PPC, but failed - people kept buying the 68332 and its
immediate relatives like the 68376.

Motorola/Freescale's biggest problem with ColdFire is that it had too
many cores. It had the old 68K that wouldn't die, the ColdFire, the
PPC, MCore, and a range of smaller chips and DSP cores of various
strengths - but the popular market was moving to ARM. It went through a
few stages of mixes, like families of microcontrollers where you had the
same peripheral set and pinouts but could choose 8-bit 68xx cores or
32-bit ColdFires. But in the end, it settled on ARM cores and PPC.

matt...@gmail.com

May 26, 2018, 12:26:17 PM
On Saturday, May 26, 2018 at 2:11:11 AM UTC-5, Terje Mathisen wrote:
> matthey wrote:
> > The 68060 had the best integer performance even though benchmarks
> > often showed the Pentium to be competitive if not slightly ahead due
> > to much better compiler support. The Pentium did have better
> > theoretical FPU performance but likely required hand laid assembler
> > to achieve it. The 68060 FPU is more compiler friendly and performs
> > well on mixed code as integer instructions can often operate in
> > parallel (although FPU instructions using immediates annoyingly can't
> > probably due to another transistor saving strategy). From optimizing
> > FPU code, I suspect the small 8kB data cache becomes the bottleneck
> > on these older processors with games like Quake which became so
> > important to sales at the time. My 68060@75MHz Amiga with Voodoo 4
> > (no T&L) can play Quake 512x384x16 at ~25 fps. It is obviously not as
> > well optimized as the PC version but shows the Pentium FPU advantage
> > was better for marketing than performance. Certainly at that time it
> > was better to focus on integer performance but the 68060 still had a
> > good FPU.
>
> Since I got my name into the Quake manual for the work I did helping
> optimize the asm code, I still remember quite a bit from that time:
>
> Comparing the original Quake (pure sw rendering) with a later version
> supporting the Voodoo card is not even apples vs oranges!

Part of the code is shared and part is no longer needed between the SW-only and GL versions of Quake. I have done 68k optimizations and bug fixes for SW and GL Quake I and Quake II, as well as the Amiga GL 3D gfx drivers and the 68k vbcc compiler (the game is a good performance benchmark and has historical significance). There is a surprising amount of 68k assembler code in some of the SW-rendering Quake ports on the Amiga, but they only manage about half of the GL rendering frame rate.

> Mike Abrash (with a little bit of help from me, maybe 5%?) managed to
> triple (!) the speed of John Carmack's original C code (which is what
> you would have to run on that 68K).
>
> This code was extremely tightly coupled with the instruction latencies
> and throughput of the Pentium, both integer and FPU, among many other
> things it did a proper division for correct perspective once every 16
> pixels, and the latency of that 32-bit FDIV (17 cycles afair) was very
> carefully overlapped with other parts of the code.

The level of optimization for Quake, and the waste of man-hours it cost because of the poor x86 ISA, is most impressive. The original C code must have been poor to see a 3x performance boost too. FDIVs are indeed a good place to look for optimizations: first try an FMUL by an immediate reciprocal (Frank Wille and I added this optimization to the 68k assembler vasm used by vbcc), and then do parallel integer instructions while the FDIV is calculating. I optimized a SW-rendering function in Quake II that moved a good portion of the integer code under a couple of FDIVs and was proud of myself until I benchmarked something like a 0.2 fps performance increase. Examples like this made me think the DCache size was a bottleneck for the large amount of data, but maybe the x86 FPU simply performed better with hand-laid assembler code.

matt...@gmail.com

May 26, 2018, 2:21:56 PM
On Saturday, May 26, 2018 at 8:26:40 AM UTC-5, David Brown wrote:
uCLinux's first target was the 68k (CPU32 on the 68332, not the 68328 DragonBall as the wiki incorrectly states). I have Jeff Dionne's e-mail telling me this. He is involved with the open-core J-core (SuperH ISA) CPU project, where he wants to use mass-produced embedded sensors for his business in Japan. He thinks the SuperH will scale from microcontrollers with DSP add-ons, where a very simple SH-3 is well suited, up to powerful 64-bit CPUs. I tried to convince him that a more robust CPU design like a 64-bit 68k (the patents have also expired) with a SIMD unit would be better suited. He admits he can program the 68k like the wind compared to the SuperH and acknowledged major flaws in the SuperH ISA (see posts by BGB and me on this forum), but he likes the simple SuperH CPU design.

Motorola was slow to ramp up the features and performance of 68k CPUs after the AIM agreement. They did not want the 68k competing with the PPC (which they pushed for embedded too) so it was weakened and demoted to the low end embedded cellar where it is not particularly well suited. No 68k or ColdFire CPU from Motorola has exceeded the single core performance/MHz of the 68060 despite massive leaps in technology (the 68060 design was good but had lots of room for improvement). Low end FPGA 68k CPU cores today give better performance than Motorola/Freescale/NXP offerings. Developers loved and preferred the 68k ISA to ARM offerings but Motorola/Freescale let them slip away by not upgrading features and performance and shoving PPC down their throats instead. ARM has this *huge* market but nobody really wants to compete with them.

> In the automotive world, and a bit wider in industrial electronics, the
> 68332 was an immensely popular chip. (I loved it.) Motorola, then
> Freescale, tried to kill it off for years and move customers over to
> ColdFire or PPC, but failed - people kept buying the 68332 and its
> immediate relatives like the 68376.

The CPU32 ISA is pretty good for a simplified 68k ISA. That is what InnovASIC's FIDO CPU uses as well. Transistors are cheap for all but the lowest end embedded CPUs today and the decoder penalty of full 68k is practically nothing.

> Motorola/Freescale's biggest problem with ColdFire is that it had too
> many cores. It had the old 68K that wouldn't die, the ColdFire, the
> PPC, MCore, and a range of smaller chips and DSP cores of various
> strengths - but the popular market was moving to ARM. It went through a
> few stages of mixes, like families of microcontrollers where you had the
> same peripheral set and pinouts but could choose 8-bit 68xx cores or
> 32-bit ColdFire's. But in the end, it settled on ARM cores and PPC.

Yep, poor management and lack of understanding of their products by Motorola/Freescale. ColdFire could have been a subset of 68k able to execute 68k code with software traps while new ColdFire instructions could have been added to higher end 68k designs as most encodings are open. Instead they needed more CPU offerings which required more support. PPC never was well suited for embedded. It is a pain to program in assembler, doesn't have good code density, is complex for RISC and lacks embedded features like ARM has. MCore was the right idea for low end embedded but too late, simplified too far again and abandoned quickly.

already...@yahoo.com

May 27, 2018, 2:57:13 AM
On Saturday, May 26, 2018 at 9:21:56 PM UTC+3, matt...@gmail.com wrote:
>
> Developers loved and preferred the 68k ISA to ARM offerings
>

I'd guess the statement above is correct, but incomplete.

The complete statement would be: "A small minority of developers loved and preferred the 68k ISA to ARM, another small minority of developers loved and preferred the ARM ISA to the 68k, while the absolute majority of developers didn't care about the ISA."

> PPC never was well suited for embedded.
> It is a pain to program in assembler,

True, but doesn't matter.

> doesn't have good code density,

Did you look at variant of PPC ISA implemented by e200 cores?
I never bothered to measure, but on paper it looks like its code density (compiled code; asm is irrelevant) should be excellent - at least as good as ColdFire, but likely somewhat better.
Of course, measurements are better than feelings.

>is complex for RISC


True, but doesn't matter.

>and lacks embedded features like ARM has

What features?
Intuitively, I would think that wider immediate field in PPC load/store instructions is useful for embedded.
IMHO, "ARM classic" is a reasonable embedded ISA, but nothing to write home about. Quite comparable with "PPC classic". On the other hand, the Thumb2 ISA is really quite good.


David Brown (May 27, 2018, 7:03:14 AM)
OK. The 68332 is an odd target for it - it had only a 16-bit external
databus, so it would be rather slow for uClinux. I still have a
ColdFire 2 uClinux board somewhere in the office.

> He is involved with the open core J-core (SuperH
> ISA) CPU project where he wants to use mass produced embedded sensors
> for his business in Japan. He thinks the SuperH will scale from
> micro-contoller with DSP add-ons where a very simple SH-3 is well
> suited to powerful 64 bit CPUs. I tried to convince him a more robust
> CPU design like a 64 bit 68k (patents have also expired) with a SIMD
> unit would be better suited. He admits he can program the 68k like
> the wind compared to the SuperH and acknowledged major flaws in the
> SuperH ISA (see posts by BGB and me on this forum) but he likes the
> simple SuperH CPU design.

The one thing that I see as a potential issue for bigger and faster 68K
devices is the limited number of registers - 8 general-purpose data
registers and 7 address registers. As far as I have seen in the history
of the 68k, and of processors in general, there has been a trend towards
using more registers and fewer complex addressing modes. This would be
more noticeable for a 64-bit design with more pipelining, superscalar
execution, etc.

Apart from that, I have always thought the 68K was a very nice ISA.
PPC took a while to get established for embedded work like industrial
and automotive applications. (For networking, it seemed to be
successful - especially the 64-bit version. But I have not worked in
that area myself.)

The first PPC microcontrollers from Freescale were devices like the
MPC555 and MPC565. These were seen as direct successors to the 68332 -
that is how we used them. (I preferred the ColdFire MCF5234 as a
replacement to the 68332, but it was not available until a little
later). The MPC5xx suffered from poorer, more limited and more
expensive development tools compared to the 68332, but its key problem
for such systems was interrupt handling. It was very inefficient, and
difficult to get right - as you say, it was not fun to program in assembly.

But the modern PPC microcontroller cores (like the e200z6), good
interrupt controllers, and newer tools make these far nicer to work
with. I did a couple of PPC microcontroller projects a few years ago,
and was mostly happy with them.



already...@yahoo.com (May 27, 2018, 7:59:11 AM)
No, 64-bit PPC never was successful outside of IBM servers, where it was called POWER.
In networking the most successful PPC was the 32-bit PowerQUICC series, especially PowerQUICC I. The same chips that you mentioned below.

>
> The first PPC microcontrollers from Freescale were devices like the
> MPC555 and MPC565. These were seen as direct successors to the 68332 -
> that is how we used them. (I preferred the ColdFire MCF5234 as a
> replacement to the 68332, but it was not available until a little
> later). The MPC5xx suffered from poorer, more limited and more
> expensive development tools compared to the 68332,

We used Diab Data tools. Worked fine, as far as I remember. Licensing was a bit nasty, but nothing extraordinary. Same for price.

> but its key problem
> for such systems was interrupt handling. It was very inefficient, and
> difficult to get right - as you say, it was not fun to program in assembly.

Somehow, for us it never was a problem. Maybe because we never needed an especially fast interrupt response.

>
> But the modern PPC microcontroller cores (like the e200z6), good
> interrupt controllers, and newer tools make these far nicer to work
> with. I did a couple of PPC microcontroller projects a few years ago,
> and was mostly happy with them.

I didn't touch PPC MCUs since the middle of the previous decade. The last one was the IBM 405 (or 440? I don't remember). I liked Freescale gear much better, but it was probably overkill outside of the range that could take full advantage of the communication co-processor. IBM's was more general-purpose, less ambitious. But I didn't like it, less so the core, more so the peripherals.

Today I don't have much use for the class of MCUs that are equipped with the e200z6 or its peers (ARM Cortex-R5?). For most of our tasks the e200z1 would probably be insufficient, while the e200z3 would be overkill. But even if there were something in the middle, I see little reason to use e200 over MCUs based on the ARM Cortex-M4. Variety for the sake of variety? Thanks, that's not my way of thinking.


Terje Mathisen (May 27, 2018, 11:20:48 AM)
matt...@gmail.com wrote:
> On Saturday, May 26, 2018 at 2:11:11 AM UTC-5, Terje Mathisen wrote:
>> Mike Abrash (with a little bit of help from me, maybe 5%?) managed
>> to triple (!) the speed of John Carmack's original C code (which is
>> what you would have to run on that 68K).
>>
>> This code was extremely tightly coupled with the instruction
>> latencies and throughput of the Pentium, both integer and FPU,
>> among many other things it did a proper division for correct
>> perspective once every 16 pixels, and the latency of that 32-bit
>> FDIV (17 cycles afair) was very carefully overlapped with other
>> parts of the code.
>
> The level of optimization and waste of human man hours because of the
> poor x86 ISA is most impressive for Quake. The original C code must

"Waste"?

When you have a breakthrough game that more or less created an entire
industry, with millions and millions of users and a few orders of
magnitude more hours spent playing it, the fact that a few man-years were
spent writing and optimizing it really doesn't matter imho. :-)

> have been poor to see a 3x performance boost too. FDIVs are indeed a
> good place to look for optimizations, first to try FMUL times an
> immediate reciprocal (Frank Wille and I added this optimization to
> the 68k assembler vasm used by vbcc) and then to do parallel integer
> instructions while the FDIV is calculating. I optimized a SW
> rendering function in Quake II that moved a good portion of the
> integer code under a couple of FDIVs and was proud of myself until I
> benchmarked something like a .2 fps performance increase. Examples
> like this made me think the DCache size was a bottleneck for the
> large amount of data but maybe the x86 FPU was just better
> performance with hand laid assembler code.

The Pentium FPU is and was very hard to compile for, but a very nice
(even if somewhat mind-bending) puzzle to figure out for an x86 asm
hacker.

BGB (May 27, 2018, 12:44:55 PM)
The 68k's ability to have 48+ bit instructions, and to perform multiple
memory accesses in a single instruction, seems likely to be very
problematic IMO.

On a previous topic seen earlier, IME, (Base+(Index+Disp)*Sc) and
similar addressing modes ended up very rarely used in my tests, and it
seems they are fairly infrequent.

Most commonly used/useful cases IME:
(Reg)
(Reg, Disp*Sc)
(SP, Disp*Sc): Very common
(Reg, Reg*Sc)
(PC, Disp)
And, much less commonly:
(Reg+) / @Reg+
(Reg-) / @-Reg
(Reg, (Reg+Disp)*Sc)
... others ...
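For what it's worth, the common modes in that list map onto everyday C constructs. Which mode a given compiler actually picks varies, so the comments below are only typical lowerings, and the function itself is a made-up example:

```c
#include <stdint.h>

/* Illustrative only: which C constructs typically lower to the
   simple addressing modes listed above (compiler-dependent). */
int32_t sum_fields(int32_t *p, int32_t *arr, int32_t i)
{
    int32_t local = 7;      /* (SP, Disp*Sc): spilled locals          */
    int32_t a = *p;         /* (Reg): plain pointer dereference       */
    int32_t b = p[3];       /* (Reg, Disp*Sc): struct field / array   */
    int32_t c = arr[i];     /* (Reg, Reg*Sc): indexed array access    */
    static int32_t g = 5;   /* (PC, Disp): position-independent data  */
    return local + a + b + c + g;
}
```

The rarer modes (auto-increment, double-indirect) mostly correspond to pointer-walking idioms that optimizing compilers often rewrite anyway.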


FWIW: That part of the software renderer was one area which posed a
major problem for my BJX1 effort:
The use of floating point FDIV operations right in the middle of the
rasterization loop basically ruined attempts to use alternatives to a
fast hardware divide and still get plausible performance.

It would have been necessary to modify the renderer in order to
eliminate any inner-loop floating-point ops, to allow a slower divider
to be usable (ex: one using iterative Newton-Raphson for the reciprocal).

But, even then, timing was still hard, and I probably would have
ultimately reduced the FPU clock to 1/4 (ex: 25MHz) in the hope that I
could get FMUL and similar to pass timing.


The alternative being to probably split it up into internal pipelines,
where I do a trick similar to the integer multiply of doing 16*16->32
multiplies and then adding the results together afterwards:
(AA, AB)*(BA, BB)
Clk 0: EX (sets up multiplier vars)
Clk 1: A=AA*BA; B=AB*BA; C=AA*BB; D=AB*BB
Also a few more values for signed multiply:
A[31]?(~B[31:16]):0 and similar.
Clk 2: Add intermediate results.
Clk 3: Store results back to MACH:MACL.

This being because a direct 32*32->64 multiply also seems to fail timing
even when done by itself.
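That four-partial-product scheme is easy to sanity-check in C. This mirrors the (AA,AB)*(BA,BB) decomposition above for the unsigned case (the extra signed-multiply correction terms are omitted):

```c
#include <stdint.h>

/* 32x32->64 unsigned multiply built from four 16x16->32 multiplies,
   matching the pipeline sketch: A=AA*BA, B=AB*BA, C=AA*BB, D=AB*BB. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    uint32_t aa = a >> 16, ab = a & 0xFFFF;   /* (AA, AB) */
    uint32_t ba = b >> 16, bb = b & 0xFFFF;   /* (BA, BB) */
    uint64_t d   = (uint64_t)ab * bb;                       /* low  */
    uint64_t mid = (uint64_t)aa * bb + (uint64_t)ab * ba;   /* C+B  */
    uint64_t a_  = (uint64_t)aa * ba;                       /* high */
    return d + (mid << 16) + (a_ << 32);   /* align and sum partials */
}
```

The adds in the "Clk 2" stage above are exactly the `+` operations here, just done in parallel hardware instead of sequentially.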

This makes only providing a 16-bit unsigned multiply (MULU.W) in the ISA
tempting (since this can be done directly), except for the issue of
every integer multiply now needing to be a runtime call to fake the
common case of a 32*32->32 multiply (lame).


SHAD/SHLD isn't nearly as steep, ex:
Clk 0: EX stage, setup for SHAD/SHLD
Clk 1: Do the SHAD/SHLD (a big "case()")

With the alternative being fixed shifts of 1/2/4*/8/16 bits (*: the 4
bit shift being an extended feature).


Funny enough, Quake didn't seem to mind so much when using a runtime
call for integer multiply, or the crappiness of doing shifts via a
computed branch into a series of SHLL or SHAR instructions... But, so
help you if that FDIV isn't fast...

It seems possible though, that if floating-point were eliminated from
the software renderer, then much of the rest of the engine could live on
with a much slower FPU (or possibly FPU emulation).


But, I suspect this particular effort (trying to make an FPGA CPU core
capable of running Quake) may not continue much more as-is (unless maybe
I get a much higher-end FPGA dev-board).


The narrower scope (BSR1) effort (namely focusing on microcontroller
tasks with a more "cleaned-up" ISA relative to SH) at least seems more
doable.

Have noted that I seem to be seeing code density of around 3-5kB per
kLOC of C (tending towards the lower end with plain integer code, and
towards the higher end with a lot of "long long" and similar thrown in).


Not all perfect though, as while doing a lot of stuff via 'DLR' seems to
be working out in-general, some instruction forms end up existing which
would not have needed to exist if the register could be addressed like a
GPR.

OTOH, cutting more registers off of the GPR space would leave fewer GPRs
available and would have required modifying the C ABI.

I guess it mostly is a matter for those who judge ISA complexity mostly
by counting the number of superficial instruction forms and similar,
while ignoring any special-case behaviors (It is like claiming the
MSP430 only has 27 instruction forms... Yeah, about that...).

BGB (May 27, 2018, 3:45:04 PM)
Mentioned already, yeah, complex addressing modes seem to be fairly
infrequent vs simpler modes.

For example, if one takes away cases relative to SP, then usage of Rm+
and -Rn addressing nearly drops off the table (there aren't nearly
enough "*cs++" and similar operations in typical C code to keep them
worthwhile).

It seems nicer to have a few simpler modes which can adequately emulate
more complex modes as needed, for example, in my newer ISA:
(Reg)
(Reg, DLR)
(PC, DLR)
(DLR)
(DLR_i4) //(code density)
(PC, DLR_i4) //(code density)

Pretty much all of the addressing modes (from SH and BJX1) can be
emulated via compound sequences (typically a 32-bit instruction pair).
(Reg), 16-bit (1 op)
(Reg, Reg*Sc), 32-bit (2-op)
(Reg, Disp13s), 32-bit (2-op)
(PC, disp17s), 32-bit (2-op)
(Abs16), 32-bit (2-op)
(Reg+), 32-bit (2-op)
(-Reg), 32-bit (2-op)
(Reg, R0), 48-bit (3-op, *)
*: Need to fake non-scaled R0 (to emulate SH behavior).
In the compiler output, R0 cases are now fairly rare though.
(Reg, Reg, Disp13s), 48-bit (3-op)
...

Most of these, as can be noted, are done via the DLR register.

Had spec'ed some modes with GBR and TBR as base registers, but these
have since been demoted to compound sequences.

While functionally not that different from the way R0 was used in SH, it
does carry the advantage that its use is much more specific; so it
isn't needing to fight with also being used as a function return value
and as an implicit source/destination for various other operations
(forcing a lot of hackish fallback cases).

For immediate/displacement values, can now have the high-level compiler
logic mostly ignore value ranges (far fewer special and fallback cases
needed). Similarly, the produced sequences can potentially still be
decoded as larger variable-width instructions (without some of the
drawbacks of an "actual" variable-length instruction encoding for
simpler cores).



As for GPRs, 16 or 32 seem about optimal in my tests.

16 GPRs: works pretty well in general, but sometimes register pressure
is enough to start causing thrashing (particularly if working with
values which require GPR pairs).
32 GPRs: may be helpful in higher register pressure situations, but only
a minority of functions seem to benefit significantly.
64 GPRs: from what I can tell, there is "hardly ever" enough register
pressure to justify this.

OTOH: with 8 GPRs, thrashing is a serious problem.


Granted, 8 or 16 GPRs is better for a 16-bit instruction coding, as 32
GPRs would leave few bits left over for the opcode field.

So, I suspect 16 GPRs is probably optimal for an ISA with 16-bit
instructions, and 32 GPRs for an ISA with 24 or 32-bit instructions.


Some of my BJX1 variants had 32 GPRs.

BSR1 currently only does 16 GPRs. If I spec a version with
variable-width 16/32 instruction coding, it is likely it would also
expand back to 32 GPRs.

But, granted, this expanded version probably wouldn't be for a small
microcontroller use-case (larger microcontroller? or to compete with
things I am currently doing with a RasPi?...).

matt...@gmail.com (May 27, 2018, 3:51:16 PM)
On Sunday, May 27, 2018 at 1:57:13 AM UTC-5, already...@yahoo.com wrote:
> On Saturday, May 26, 2018 at 9:21:56 PM UTC+3, matt...@gmail.com wrote:
> >
> > Developers loved and preferred the 68k ISA to ARM offerings
> >
>
> I'd guess, the statement above is correct, but incomplete.
>
> The complete statement would be: "Small minority of developers loved and preferred the 68k ISA to ARM, another loved small minority of developers and preferred the ARM ISA to 68k, while absolute majority of developers didn't care about ISA.

It would be difficult to do an unbiased poll. It didn't matter as ARM evolved (perhaps too much, with all the modes and variations) and the 68k did not. ARM has good support while Motorola/Freescale anti-marketed the 68k. There was no choice for most developers, as the cut-down '90s 68k designs could not meet their requirements and needs after a while. The ISA is less important today with most embedded code being compiled, but it is still important to be able to debug and look for inefficiencies in compiler-generated assembler code.

> > PPC never was well suited for embedded.
> > It is a pain to program in assembler,
>
> True, but doesn't matter.

The PPC only started to catch on for embedded when compilers became more common for embedded. I expect most embedded PPC CPUs today are high end only due to the difficulty of debugging and optimizing.

> > doesn't have good code density,
>
> Did you look at variant of PPC ISA implemented by e200 cores?
> I never bothered to measure, but on paper it looks like its code density (compiled code, asm is irrelevent) should be excellent - at least as good as Coldfire, but likely somewhat better.
> Of course, measurements are better than feelings.

PPC Book E VLE? I couldn't find much analysis or real-world data on it. NXP claims a 30% overall code size reduction with a <10% execution path increase over normal PPC code. The 68020/CPU32 ISA code is generally 35%-50% smaller than PPC code, and ColdFire code is 0%-5% worse than that, so VLE could be approaching their code density but likely falls short. Vince Weaver obtained an embedded board which supports PPC Book E VLE so maybe someday he will add the results to his code density web site.

PPC did not encode the lower 2 bits of displacements in branches, making decompress-on-fetch (DF) compression challenging. IBM's CodePack for the PPC claimed 60% code compression, but it was a decompress-on-cache-fill (DCF) dictionary-based compression between the L1 ICache and memory. The L1 ICache held uncompressed instructions, so it did not benefit from a reduced L1 size or improved instruction fetch bandwidth from L1. DCF is usually considered less efficient than DF, and CodePack often gave reduced performance. CodePack did allow the full PPC instruction set including all 32 GP registers though.

Most DF-based RISC compression formats like ARM Thumb, PPC Book E VLE, RV32C, RV64C, MIPS16, microMIPS and SPARC16 reduce or restrict the number of accessible registers, which increases the number of instructions, load/stores and program size (16 GP registers is the sweet spot). There are more factors to look at than just code compression, obviously. This is why I suggested Vince Weaver add categories for the number of instructions, average instruction size, data size, number of branch instructions and number of memory access instructions.

> >is complex for RISC
>
>
> True, but doesn't matter.
>
> >and lacks embedded features like ARM has
>
> What features?
> Intuitively, I would think that wider immediate field in PPC load/store instructions is useful for embedded.
> IMHO, "ARM classic" is a reasonable embedded ISA, but not something to read home about. Quite comparable with "PPC classic". on the other hand, Thumb2 ISA is really quite good.

ARM has specialized ISA extensions for about everything embedded, like security, DSP, SIMD, byte-code support, etc. ARMv8 AArch64 is more standardized and indeed much like PPC, but better. It is a little too complex and heavy for many embedded applications and has only modestly better code density than PPC. Thumb2 is like a lower-end, lighter RISC version of the 68k, and is in several ways a better ISA than SuperH and ColdFire, which were based on the 68k.

matt...@gmail.com (May 27, 2018, 5:02:38 PM)
On Sunday, May 27, 2018 at 6:03:14 AM UTC-5, David Brown wrote:
> The one thing that I see as a potential issue for bigger and faster 68K
> devices is the limited number of registers - 8 general purpose data
> registers and 7 address registers. As far as I have seen in the history
> of 68k, and processors in general, there has been a trend towards using
> more registers and fewer complex addressing modes. This would be more
> noticeable for a 64-bit design with more pipelining, superscaling, etc.

16 GP registers is optimal. The following paper predicted an overall performance increase of less than 2% from going above 12 GP registers on x86_64.

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf

A paper called "High-Performance Extendable Instruction Set Computing" found that load/stores on a MIPS CPU increased about 14% going from 16 to 8 GP registers, but only about 2% going from 27 to 16 GP registers.

The sweet spot is 16 GP registers. It is important to use them efficiently though and the 68k can be improved. Some ideas.

1) Suppress the frame pointer by using the stack pointer (vbcc compiler with 68k target does this by default)
2) Merge executable sections or place in memory pools with proximity and use PC relative addressing where possible including in libraries
3) Open up address register sources so an intermediate register is not needed
4) MOVEQ trash register is no longer needed with a simple immediate (pseudo)addressing mode which auto compresses immediates
5) Fast bit field instructions use fewer registers

There are many more small ways to use 68k registers more efficiently. These would require ISA and ABI changes. The 68k can already access all 16 GP registers without a code size increase (unlike x86_64 when accessing the upper 8 GP registers), and CISC is a register miser compared to RISC. A few more registers would be nice, but not worthwhile if it means introducing prefixes or tiered registers.

> PPC took a while to get established for embedded work like industrial
> and automotive applications. (For networking, it seemed to be
> successful - especially the 64-bit version. But I have not worked in
> that area myself.)
>
> The first PPC microcontrollers from Freescale were devices like the
> MPC555 and MPC565. These were seen as direct successors to the 68332 -
> that is how we used them. (I preferred the ColdFire MCF5234 as a
> replacement to the 68332, but it was not available until a little
> later). The MPC5xx suffered from poorer, more limited and more
> expensive development tools compared to the 68332, but its key problem
> for such systems was interrupt handling. It was very inefficient, and
> difficult to get right - as you say, it was not fun to program in assembly.
>
> But the modern PPC microcontroller cores (like the e200z6), good
> interrupt controllers, and newer tools make these far nicer to work
> with. I did a couple of PPC microcontroller projects a few years ago,
> and was mostly happy with them.

If you needed modern performance or features then you had no choice but to move away from the 68k and ColdFire to PPC or ARM. PPC is not a bad architecture and has some good ideas. However, it is unfriendly, boring and practically requires a good compiler and source level debugger. Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.

already...@yahoo.com (May 27, 2018, 5:40:27 PM)
On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com wrote:

> Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.

Why sadly?
What could be wrong when developers like more pleasant asm coding experience better than less pleasant asm coding experience?

matt...@gmail.com (May 27, 2018, 5:46:24 PM)
On Sunday, May 27, 2018 at 11:44:55 AM UTC-5, BGB wrote:
> The 68k's ability to have 48+ bit instructions and ability to have
> multiple memory accesses in a single instruction seemed likely very
> problematic IMO.

I don't think my 64-bit 68k ISA will increase the maximum instruction length, at least. It sure is nice to have one instruction for immediates and load+calc+store rather than a chain of dependent instructions. MOVE mem,mem is common and I believe can be done in 1 cycle in most cases. It is CISC, so multi-cycle instructions are tolerable. It all needs more resources, but so does a 64-bit CPU. I prefer less obfuscation and a more friendly ISA to go with the complexity of 64 bit.

> On a previous topic seen earlier, IME, (Base+(Index+Disp)*Sc) and
> similar addressing modes ended up very rarely used in my tests, and it
> seem are fairly infrequent.
>
> Most commonly used/useful cases IME:
> (Reg)
> (Reg, Disp*Sc)
> (SP, Disp*Sc): Very common
> (Reg, Reg*Sc)
> (PC, Disp)
> And, much less commonly:
> (Reg+) / @Reg+
> (Reg-) / @-Reg
> (Reg, (Reg+Disp)*Sc)
> ... others ...

You are probably talking about the frequency with which the instructions appear in the code, not the run-time frequency. Post-increment and pre-decrement are commonly used in loops. Primitive compilers have trouble generating them, and they have an additional cost in some CPU designs, which discourages their use. They are good for code density and can be free (no additional EA calc cost). I would not remove them. Some of the addressing modes are not particularly common but simplify compiler support and complete the addressing-mode support. I don't like the idea of removing addressing modes based solely on frequency of use. Addressing modes are very powerful (a multiplier effect) with an orthogonal ISA.
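The post-increment mode exists for exactly this sort of loop, where a 68k compiler can fold both pointer bumps into the move itself (e.g. a body of just MOVE.B (A0)+,(A1)+ plus the loop branch):

```c
#include <stddef.h>

/* The classic C idiom that post-increment addressing serves: on the
   68k both pointer updates come free as part of the EA calculation. */
static void copy_bytes(char *dst, const char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```

The static-count vs run-time-count distinction matters here: this is one instruction in the binary, but it executes n times per call.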
That is what the 68000 does. Lots of partial sum 16*16->32 multiplies.

> SHAD/SHLD isn't nearly as steep, ex:
> Clk 0: EX stage, setup for SHAD/SHLD
> Clk 1: Do the SHAD/SHLD (a big "case()")
>
> With the alternative being fixed shifts of 1/2/4*/8/16 bits (*: the 4
> bit shift being an extended feature).
>
>
> Funny enough, Quake didn't seem to mind so much when using a runtime
> call for integer multiply, or the crappiness of doing shifts via a
> computed branch into a series of SHLL or SHAR instructions... But, so
> help you if that FDIV isn't fast...
>
> It seems possible though, that if floating-point were eliminated from
> the software renderer, then much of the rest of the engine could live on
> with a much slower FPU (or possibly FPU emulation).

I heard a rumor about an integer only version of Quake for some console. There is always non-FP Doom too.

MitchAlsup (May 27, 2018, 7:46:27 PM)
I went the other direction: the key data addressing mode in the MY 66000
ISA is :: [Rbase+Rindex<<SC+Disp]

When Rbase == R0 then IP is used in lieu of any base register
When Rindex == R0 then there is no indexing (or scaling)
Disp comes in 3 flavors:: Disp16, Disp32, and Disp64

The assembler/linker is tasked with choosing the appropriate instruction form
from the following:

MEM Rd,[Rbase+Disp16]
MEM Rd,[Rbase+Rindex<<SC]
MEM Rd,[Rbase+Rindex<<SC+Disp32]
MEM Rd,[Rbase+Rindex<<SC+Disp64]

Earlier RISC machines typically only had the first 2 variants. My experience
with x86-64 convinced me that adding the last 2 variants was of low cost
to the HW and of value to the SW.

In a low end machine, the displacement will be coming out of the decoder
and this adds nothing to the AGEN latency or data path width. The 2 gates
of delay (3-input adder) are accommodated by the 2 gates of delay associated
with the scaling of the Rindex register, (Rbase+Disp)+(Rindex<<SC), without
adding any delay to AGEN.

Any high end machine these days will have 3-operand FMAC instructions. Those
few memory references that need 3 operands are easily serviced on those paths.

Having SW create immediates and displacements by executing instructions is
simply BAD FORM*. Immediates and displacements should never pass through the
data cache nor consume registers from the file(s), nor should they be found
in memory that may be subject to malicious intent.

(*) or lazy architecting--of which there is way too much.

The same issues were involved in adding 32-bit and 64-bit immediates to the
calculation parts of the ISA.

DIV R7,12345678901234,R19
is handled as succinctly as:
DIV R7,R19,12345678901234

Almost like somebody actually tried to encode it that way.

BGB (May 28, 2018, 12:01:03 AM)
This design was partly motivated by working within the limits of a
fixed-width 16-bit instruction coding, trying to save encoding space,
and not wanting to have 3 register read ports, ...

I was able to get everything done with 2 register read ports, and with
some other registers (PC, SP, DLR, ...) routed directly through the
execute unit (they form sort of a "loop" between the register-file and
execute unit).


> Any high end machine these days will have 3-operand FMAC instructions. Those
> few memory references that need 3 operands are easily serviced on those paths.
>

Yeah, what I am aiming for right now probably won't even have an FPU.

There is an FPU in the spec, but this is more intended for future
expansion.

There are some 3-register arithmetic ops, but internally these are also
done as 2-instruction sequences.



> Having SW create immediates and displacements by executing instructions is
> simply BAD FORM*. Immediates and displacements should never pass through the
> data cache nor consume registers from the file(s), nor should they be found
> in memory that may be subject to malicious intent.
>
> (*) or lazy architecting--of which there is way too much.
>

The BSR1 ISA does not load immediate or displacement values from memory;
rather, they are typically composed inline by a load-shift sequence
through a hard-wired special-purpose register.

One cost is that the whole Axxx/Bxxx space (or about 1/8 of the total
encoding space), was used to load a 13-bit value into the DLR register.

Likewise, 26xx will tack on an additional 8 bits (or,
"DLR=(DLR<<8)|Imm8"), and these can be chained as-needed.

So, after "A123 2645 2667" DLR will hold the value 0x1234567.

If I extend it: "A123 2645 2667 48B8", it magically becomes,
essentially, "MOV 0x12345678, R11".

Granted, this takes 4 cycles, and is internally done via a 4-instruction
sequence, but oh well.
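The DLR composition is simple enough to model directly. BSR1 is the poster's own ISA, so the helper names below are made up, but the arithmetic follows the A123/26xx example above:

```c
#include <stdint.h>

/* Model of BSR1's inline immediate composition: an Axxx opcode loads a
   13-bit value into DLR, and each following 26xx shifts in 8 more bits. */
static uint32_t dlr_load(uint16_t imm13)
{
    return imm13 & 0x1FFF;          /* Axxx: DLR = Imm13 */
}

static uint32_t dlr_extend(uint32_t dlr, uint8_t imm8)
{
    return (dlr << 8) | imm8;       /* 26xx: DLR = (DLR << 8) | Imm8 */
}
```

Chaining dlr_load(0x123) through dlr_extend with 0x45 and 0x67 reproduces the 0x1234567 result quoted above.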


This is quite different from SH, which mostly relied on PC-relative
memory loads.



> The same issues were involved in adding 32-bit and 64-bit immediates to the
> calculation parts of the ISA.
>
> DIV R7,12345678901234,R19
> is handled as succinctly as:
> DIV R7,R19,12345678901234
>
> Almost like somebody actually tried to encode it that way.
>

OK.

There is no DIV instruction, but it is possible to encode:
"XOR R3, 0x12345, R9"
As:
"A123 2645 5C93"

BGB (May 28, 2018, 12:16:02 AM)
On 5/27/2018 4:46 PM, matt...@gmail.com wrote:
> On Sunday, May 27, 2018 at 11:44:55 AM UTC-5, BGB wrote:
>> The 68k's ability to have 48+ bit instructions and ability to have
>> multiple memory accesses in a single instruction seemed likely very
>> problematic IMO.
>
> I don't think my 64 bit 68k ISA will increase the maximum instruction length at least. It sure is nice to have one instruction for immediates and load+calc+store rather than a chain of dependent instructions. MOVE mem,mem is common and I believe can be done in 1 cycle in most cases. It is CISC so multi-cycle instructions are tolerable. It all needs more resources but so does a 64 bit CPU. I prefer less obfuscation and a more friendly ISA with the complexity of 64 bit.
>

A 64-bit ISA need not be all that much more complicated than a 32-bit
ISA, and interestingly the width of the ALU and GPRs doesn't really seem
to affect overall cost all that much (at least compared with the
number of GPRs or the number of register-file ports; and excluding
shift/multiply, which scale sharply).

As-is, a move between two memory locations will take ~ 8 cycles with my
current design, possibly memory accesses could be made cheaper, but
doing so would add cost and complexity.



>> On a previous topic seen earlier, IME, (Base+(Index+Disp)*Sc) and
>> similar addressing modes ended up very rarely used in my tests, and it
>> seem are fairly infrequent.
>>
>> Most commonly used/useful cases IME:
>> (Reg)
>> (Reg, Disp*Sc)
>> (SP, Disp*Sc): Very common
>> (Reg, Reg*Sc)
>> (PC, Disp)
>> And, much less commonly:
>> (Reg+) / @Reg+
>> (Reg-) / @-Reg
>> (Reg, (Reg+Disp)*Sc)
>> ... others ...
>
> You are probably talking about the frequency the instructions appear in the code and not the run time frequency. Post-increment and pre-decrement are commonly used in loops. Primitive compilers have trouble generating them and they have an additional cost in some CPU designs which discourages their use. They are good for code density and can be free (no additional EA calc cost). I would not remove them. Some of the addressing modes are not particularly common but simplify compiler support and complete addressing mode support. I don't like the idea of removing addressing modes based solely on frequency of use. Addressing modes are very powerful (multiplier effect) with an orthogonal ISA.
>

Addressing modes add cost in terms of encoding space and the necessary
logic to support them.


I left out postinc/predec modes from the BSR1 ISA partly because they
may end up needing to update multiple GPRs in a single operation. With
BJX1, this required using a state machine, but without this mode,
"MOV.x" does not need any state, and can simply assert a "hold" status
(blocking the pipeline) until the memory access either completes or
reports an error condition.

The cost now is that if the assembler or emitter sees one of these
operations, it has to fake it, ex:
MOV.L R3, (-R4)
would be emitted as:
ADD #-4, R4
MOV.L R3, (R4)

But, as noted, they were fairly infrequent in generated code, so this
doesn't really seem to have much impact on the overall code footprint.


I am tuning this ISA more for footprint than performance, wanting to get
as much as I can out of 32kB or so of ROM, within the limits of core and
decoder complexity.

There are still PUSH/POP operations, which still update SP, but this is
partly because SP (along with DLR and PC and similar) are fed directly
through the EX unit and as such can be updated more directly without
going through the register-file.

Unlike SH, there is a RET operation, which exists as special-case
semantics for "POP PC" (which can save a few bytes over "POP LR; RTS")


However, given that there are no special-case instructions to help
implement integer division or strcmp efficiently, it is much less clear
how well BSR1 would compare doing things like running Dhrystone (which
is disproportionately affected by things like strcmp and divider speed).

As-is, I am doing division via a shift/compare/subtract loop (not
necessarily the highest-performance option here, but it basically works
and doesn't have a huge footprint).
I did the same for B64V, and Quake somehow didn't suffer too badly from
this in my tests, despite it being kinda silly.
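A minimal C model of such a shift/compare/subtract (restoring) divider, just as an illustration of the technique (not BSR1's actual implementation):

```c
#include <stdint.h>

/* Restoring division via shift/compare/subtract:
   builds the quotient one bit at a time, MSB first. */
uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* shift in next dividend bit */
        if (r >= d) {                   /* compare */
            r -= d;                     /* subtract */
            q |= 1u << i;
        }
    }
    *rem = r;
    return q;
}
```

One iteration per result bit, so latency is fixed at the operand width regardless of the values involved.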


>> SHAD/SHLD isn't nearly as steep, ex:
>> Clk 0: EX stage, setup for SHAD/SHLD
>> Clk 1: Do the SHAD/SHLD (a big "case()")
>>
>> With the alternative being fixed shifts of 1/2/4*/8/16 bits (*: the 4
>> bit shift being an extended feature).
>>
>>
>> Funny enough, Quake didn't seem to mind so much when using a runtime
>> call for integer multiply, or the crappiness of doing shifts via a
>> computed branch into a series of SHLL or SHAR instructions... But, so
>> help you if that FDIV isn't fast...
>>
>> It seems possible though, that if floating-point were eliminated from
>> the software renderer, then much of the rest of the engine could live on
>> with a much slower FPU (or possibly FPU emulation).
>
> I heard a rumor about an integer only version of Quake for some console. There is always non-FP Doom too.
>

I had considered Doom as well.

It could make sense with BJX1, or if I later do a BSR1 core which uses
external DRAM.


I was wanting to target a cheaper FPGA, such as an XC6SLX9 or similar
for this. This gives me ~ 70 kB of Block-RAM to work with (with things
like external DRAM depending on which exact board I get).


The Arty S7 board I have been using thus far (with an XC7S50) has around
200kB, and a 256MB DRAM chip, but is somewhat more expensive (at
present, getting another one would cost ~ $140).

This would be overkill for a lathe, so I was looking to target something
a little cheaper for this (like a Mimas or other similar class board).

Ivan Godard

May 28, 2018, 12:53:38 AM
Mill: predicated, base + optional index + displacement. Predicated is
true/false/always; true/false take a value from the belt, always does
not. Index takes a belt value, optionally scaled by the width of the
operation (b/h/w/d/q/vector widths). Displacement is 0/1/2/4 bytes,
optionally ones-complemented. Base is a value from the belt or one of
the specRegs DP/FP/INP/OUTP/TLP.

Most compact encoding of a load is (Silver, belt 16):
*p:
opcode 4 bits (typical, depends on other ops in slot)
base 4 bits
no index 2 bits (implied by the belt count)
no disp 2 bits (implied by the byte count)
no comp 1 bit
width 3 bits
delay 4 bits
= 20 bits
largest is:
b?A[i].f : NaR
opcode 4 bits (typical, depends on other ops in slot)
predicate 4 bits
base 4 bits
yes index 2 bits (implied by the belt count)
index 4 bits
yes disp 2 bits (implied by the byte count)
displacement 8/16/32 bits
no comp 1 bit
width 3 bits
delay 4 bits
= 60 bits max
The three-input address adder is signed. While the displacement is 32
bits max, it (possibly ones complemented) is treated as a 64 bit value.
There is no absolute address mode, and no data references (load/store)
based on the PC.
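As a cross-check, the field widths listed above really do sum to 20 and 60 bits:

```c
/* Field widths for the Mill (Silver) load encodings described above:
   most compact form, and the largest form with a 32-bit displacement. */
static const int compact[] = { 4, 4, 2, 2, 1, 3, 4 };           /* = 20 */
static const int largest[] = { 4, 4, 4, 2, 4, 2, 32, 1, 3, 4 }; /* = 60 */

int sum(const int *v, int n)
{
    int s = 0;
    while (n--)
        s += *v++;
    return s;
}
```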

There are skinny and svelte encodings of popular ops that reduce both the
instruction-level and operation-level entropy. If the instruction has
nothing but "belt <- *p" the whole instruction is ~11 bits including the
belt reference and delay.

Terje Mathisen

May 28, 2018, 2:08:11 AM
<BG>

I have been waiting to see this view offered, I do agree that x86 asm
was very pleasant indeed.

The architecture is restrictive enough that you can often come up with
algorithms which you know are more or less optimal, simply because of
all the constraints.

Sort of like poets claiming that having very strict rules about how to
construct verses makes it possible to write better instead of worse poetry?

already...@yahoo.com

May 28, 2018, 5:03:19 AM
On Monday, May 28, 2018 at 9:08:11 AM UTC+3, Terje Mathisen wrote:
> already...@yahoo.com wrote:
> > On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com
> > wrote:
> >
> >> Sadly, I expect more developers like x86/x86_64 assembler than PPC
> >> assembler.
> >
> > Why sadly? What could be wrong when developers like more pleasant asm
> > coding experience better than less pleasant asm coding experience?
> >
> <BG>
>
> I have been waiting to see this view offered, I do agree that x86 asm
> was very pleasant indeed.

For the record, it's not my view.
Personally, I find programming in the original 16-bit x86 asm unpleasant.
The 386 is a completely different story; I like it.
But even on the 386, I find programming the x87 FPU part, how to say it... challenging in an unsatisfactory way. I feel that here the creativity of the programmer is spent in a non-productive way.
And the micro-architecture that I like least happens to be the one you like most - the P5.

>
> The architecture is restrictive enough that you can often come up with
> algorithms which you know are more or less optimal, simply because of
> all the contraints.
>
> Sort of like poets claiming that having very strict rules about how to
> construct verses makes it possible to write better instead of worse poetry?
>

I think chess is a better analogy for x86. You have 6 different sorts of pieces.
Except that in 16-bit x86, not 6, but all 8 registers are different in some ways.

And yes, I like chess very much. But at the same time I am glad that it's not my profession.
As for poetry, I can't say that for me sonnets are inherently superior to blank verse.

John Levine

May 28, 2018, 9:48:33 AM
In article <4274b57d-9bbf-4035...@googlegroups.com>,
Well, it is unless you are in an environment where you care about code
size. Does anyone understand how all of the prefix byte and irregular
instruction encodings interact? (Leaving the manual open on your
screen all the time doesn't count.)

--
Regards,
John Levine, jo...@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Terje Mathisen

May 28, 2018, 11:22:11 AM
John Levine wrote:
> In article <4274b57d-9bbf-4035...@googlegroups.com>,
> <already...@yahoo.com> wrote:
>> On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com wrote:
>>
>>> Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.
>>
>> Why sadly?
>> What could be wrong when developers like more pleasant asm coding experience better than less pleasant asm coding experience?
>
> Well, it is unless you are in an environment where you care about code
> size. Does anyone understand how all of the prefix byte and irregular
> instruction encodings interact? (Leaving the manual open on your
> screen all the time doesn't count.)
>
Even if I don't remember every detail now, I more or less did so in the
years when I wrote most of my asm code.

Proof by example: I wrote the following code during an Easter ski
vacation, with no computer, just a list of the ~70 asm instructions that
only use MIME/Base64 ascii values.

This is a program encoded as a text file.

To use the program, remove all extra lines before and after
the text executable and save the remainder to disk as DU.COM.
The result is a normal executable .COM file!

*********** Remove this line and all lines above it! ***********
ZRYPQIQDYLRQRQRRAQX,2,NPPa,R0Gc,.0Gd,PPu.F2,QX=0+r+E=0=tG0-Ju E=
EE(-(-GNEEEEEEEEEEEEEEEF 5BBEEYQEEEE=DU.COM=======(c)TMathisen95
&&P4Vw4+Vww7AjAyoAwAAzJzAEAJP7JzBaANQ+wAIxJzAeAJX9JzIxANJ0JzKzAK
WwwAP4JzPwAJi8JzXxANA3wAa4JzavAKA9wAX8JzXyALy8au1+Wwe7CA0/G7LBW7
Www7v7h8CAi8A9A6x8CAP7G7AK0VKASzAza6vAAaLAh90TI9H9LAx90QH8I8Vww7
Agavy9AmAj6+B7V7gA6PAzG72+H9f7AtIxAvCzAkCzAgAkQ/AiYxXzYxQ/U+DaT+
Q/AkCZAMAUCZQ/AtBaAKAWAXANCZAQQ/YxPwPwSzXzQ/GwAXAQBaAMAHQ/AKARQ/
AkCzAgAkXzFwArAtQ/DaARAQAICZAMAKCZAMA4A5Q+=6C8E/L6BAOjyATwND4SzA


The first two lines are the primary bootstrap. It employs the absolute
minimum possible amount of self-modifying code (a single two-byte
backwards branch instruction) but can still survive quite a lot of text
reformatting, i.e. replacing line ends (currently CRLF) with any zero,
one or two-byte combination. It constructs the secondary bootstrap by
picking up pairs of Base64 chars until the '=' terminator, ignoring all
other characters, and combines them into arbitrary byte values.

The algorithm I found to do the latter does "0 xor a xor b - b - b"
since this was the first combination I tested which used only MIME
opcodes and still allowed all byte values to be reached.
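That combination is easy to brute-force. A quick C sketch (assuming the standard 64 Base64 data characters, i.e. excluding the '=' terminator) that counts how many byte values ((0 xor a xor b) - b - b) mod 256 can reach:

```c
#include <string.h>

/* Check that ((0 ^ a ^ b) - b - b) mod 256 can produce every byte value
   when a and b are both drawn from the Base64 character set. */
static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int count_reachable(void)
{
    unsigned char hit[256];
    memset(hit, 0, sizeof hit);
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++) {
            unsigned a = (unsigned char)b64[i];
            unsigned b = (unsigned char)b64[j];
            hit[((a ^ b) - b - b) & 0xFF] = 1;
        }
    int n = 0;
    for (int k = 0; k < 256; k++)
        n += hit[k];
    return n;  /* 256 if every byte value is reachable */
}
```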

Paul A. Clayton

May 28, 2018, 11:47:42 AM
On Monday, May 28, 2018 at 2:08:11 AM UTC-4, Terje Mathisen wrote:
[snip]
> <BG>
>
> I have been waiting to see this view offered, I do agree that x86 asm
> was very pleasant indeed.

I am not that familiar with x86, but it seems to me
that the number of instructions is relatively high,
that the operations are not especially orthogonal
(perhaps particularly with condition code results?;
not being able to avoid setting the condition code
seems a relatively useless constraint other than for
code density), and the encoding (which can matter
for density or alignment optimizations) is somewhat
complex. The complexity might make the discovery of
a clever use more satisfying and perhaps the
complexity adds constraints that assist creation
(perhaps similar to a jigsaw puzzle with more
piece-shapes, where fit as well as image matching
constrain placement?)

Your previous posts on the subject imply that the
modest register count with semi-dedicated purposes
facilitated mental tracking of availability and
allocation.

Humans seem to need some complexity to find
intellectual enjoyment, but I suspect that much of
the bookkeeping aspects of a register rich ISA
could be handled with software assistance or by
iterative refinement.

Of course, the complexity of the microarchitecture
matters.

> The architecture is restrictive enough that you can often come up with
> algorithms which you know are more or less optimal, simply because of
> all the contraints.
>
> Sort of like poets claiming that having very strict rules about how to
> construct verses makes it possible to write better instead of worse poetry?

For poetry, constraints of meter, rhyme, etc. tend to force further thinking (related to your "know are more or less
optimal", knowing that something is not quite right, and
possibly related to complexity forcing actual thought) and
introduce "random" perturbations to move thinking off
regular courses. In my limited verse writing, sometimes
a rhyme (or even meter) requirement has introduced a
metaphor/association or word choice that would not come
otherwise.

BGB

May 28, 2018, 3:02:44 PM
On 5/28/2018 10:47 AM, Paul A. Clayton wrote:
> On Monday, May 28, 2018 at 2:08:11 AM UTC-4, Terje Mathisen wrote:
> [snip]
>> <BG>
>>
>> I have been waiting to see this view offered, I do agree that x86 asm
>> was very pleasant indeed.
>
> I am not that familiar with x86, but it seems to me
> that the number of instructions is relatively high,
> that the operations are not especially orthogonal

As for x86 instruction counts being high:
Yep, pretty much. The x86 ISA has far more instruction forms than a
typical RISC or similar. Likewise, many opcodes are encoded through
layers of re-purposed prefixes.

The original ISA (8086):
Single-byte opcodes
Many with a Mod/RM byte
Many with a displacement
Many with an immediate
Some prefixes, like REP/REPNE, segment overrides, ...

The 286: Some bytes started being used to encode longer opcodes.
For example, "0F XX Mod/RM ..." vs just "XX Mod/RM", ...

The 386: Added a 32-bit mode, new Mod/RM scheme, and more prefix bytes.
Prefixes: address and data size overrides, FS/GS overrides, ...

By around the time MMX and SSE were being added, lacking much else to
do, they started tacking on the various prefix bytes to other operations
where they were previously not defined, which would give them new
meanings. This causes many SSE operations to effectively have a soup of
prefix bytes as part of their opcode field.

For x86-64, some less frequently used single-byte (INC/DEC) instruction
forms were dropped (forcing their two-byte encodings to be used), and
then were reused as the REX prefixes (needed for QWORD operations and to
access R8-R15).

By AVX, they took another instruction and redefined it so that certain
invalid encodings would be interpreted as a VEX prefix, with its Mod/RM
bits and similar encoding the equivalent of the chain of prefix bytes,
including the REX bits, and potentially an additional register argument, ...

How many instruction-forms exist? Several thousand last I checked...


So, now it is sort of a hairy mess on this front, and implementing a CPU
for x86 would probably be fairly non-trivial. If I were to do something
with x86 support, would probably just implement a RISC style core and
use an emulator to run any x86 code. Likely the emulator would decoded
the ISA in software and then JIT compile it into the native ISA (with
maybe several MB or so for translated instruction traces).


However, from the perspective of someone writing ASM code, it isn't
nearly so bad. The assembler can deal with most of the encoding details,
and most instructions present a fairly consistent interface.

Similarly, most decoding for most operations is basically the same once
you get to the Mod/RM byte.


The situation is much less friendly on a typical RISC, where one might
battle with which instruction forms exist for which combinations of
parameters.

There might also be other issues, like needing to deal with delay slots
and other funkiness. For example, a memory load might not take effect
for several instructions, or the effects of executing a branch
instruction might not take effect until one or two instructions later
(causing instructions after the branch to be executed), ...

Similarly, many have a habit of requiring loading constants from memory,
which may need to be placed within a certain distance of the code being
executed, ...

Some of this is a lot harder to gloss over with an assembler, so writing
ASM code is a bit more painful if compared with x86.

However, for the CPU it is easier, given the instruction format itself
is typically fixed-width and fairly regular.

Some partial exceptions exist, like Thumb, where the layout of the bits
within the various instruction forms is a bit chaotic. Most other RISC's
are a bit more regular here.


> (perhaps particularly with condition code results?;
> not being able to avoid setting the condition code
> seems a relatively useless constraint other than for
> code density), and the encoding (which can matter
> for density or alignment optimizations) is somewhat
> complex.

IMO: x86 style condition codes are needlessly inconvenient in some areas.

One alternative that is nicer IMO is simply having a True/False status
code, with only a small subset of instructions affecting it. But in this
case, one now needs comparison operators that perform a specific
comparison, and there are fewer possible conditions to branch on.


> The complexity might make the discovery of
> a clever use more satisfying and perhaps the
> complexity adds constraints that assist creation
> (perhaps similar to a jigsaw puzzle with more
> piece-shapes, where fit as well as image matching
> constrain placement?)
>

This puzzle aspect is probably more true of dealing with many of the
small RISC ISAs than of dealing with x86, IMO.

Bigger RISC ISA's (with 32-bit instruction words) are typically a little
more regular here, but with 16-bit instruction words there is often a
need to fight with instruction coding a bit more to make everything fit
nicely.

There are tradeoffs here (code density, performance, complexity, ...).


> Your previous posts on the subject imply that the
> modest register count with semi-dedicated purposes
> facilitated mental tracking of availability and
> allocation.
>
> Humans seem to need some complexity to find
> intellectual enjoyment, but I suspect that much of
> the bookkeeping aspects of a register rich ISA
> could be handled with software assistance or by
> iterative refinement.
>

Having a lot of registers is not really an issue for writing ASM; if you
don't need them, you don't use them.

It isn't exactly hard to write comments to say which variable is in
which register, more so if there are enough registers that the same
variable can be kept in the same register the whole lifetime of the
function.

Likewise, most non-x86 archs simply name them by number, and use names
only for registers with special defined meanings. Even on x86-64, it may
make sense in some cases (such as compilers or JITs) to mostly abandon
the use of symbolic names in favor of identifying them as R0-R15.
Ex: R0=RAX, R1=RCX, R2=RDX, R3=RBX, ...
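For reference, a sketch of that numbering as a table (this is the standard ModRM/REX encoding order):

```c
/* x86-64 general-purpose registers in encoding order (ModRM/REX numbering),
   so that x64_reg_name[n] is the conventional name of Rn. */
static const char *x64_reg_name[16] = {
    "RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
    "R8",  "R9",  "R10", "R11", "R12", "R13", "R14", "R15"
};
```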


The tradeoff in an ISA would mostly relate to how one trades off the
usage of bits, vs how much of the time will be spent loading/storing
memory values, vs other issues.

But, as can be noted, 16 or 32 registers seem to be roughly optimal in
most cases.

MitchAlsup

May 28, 2018, 4:44:37 PM
When I left AMD in 2006 there were at least 1500 instructions, many with
the same spelling for the opcode. And this is one thing I do with my RISC
ISA spellings:

ADD R7,R8,immed16
and
ADD R7,R8,R9

instead of
ADDI R7,R8,immed16
and
ADD R7,R8,R9

One can say this adds some uncertainty, but it comes in handy when
# define immed16 R9
>
>
> So, now it is sort of a hairy mess on this front, and implementing a CPU
> for x86 would probably be fairly non-trivial. If I were to do something
> with x86 support, would probably just implement a RISC style core and
> use an emulator to run any x86 code. Likely the emulator would decoded
> the ISA in software and then JIT compile it into the native ISA (with
> maybe several MB or so for translated instruction traces).
>
>
> However, from the perspective of someone writing ASM code, it isn't
> nearly so bad. The assembler can deal with most of the encoding details,
> and most instructions present a fairly consistent interface.
>
> Similarly, most decoding for most operations is basically the same once
> you get to the Mod/RM byte.

SW decoding of x86 is actually pretty easy--you do it with 256 entry tables
(6 of them when I left AMD, probably 8 or 9 tables now) and each table entry
contains a termination bit, an opcode, and a carrier. In my AMD decoder
the termination bit was any positive table entry. If the table entry was zero
this was an Undefined inst, and if the table entry was negative, the negative
table entry was an index into an array of 256 entry tables. So the decoder was something like:

loop:
LDB R7,[Rpc]
LDW R8,[Rtable+R7<<2]
BZE R8,UNDEFINED
BGEZ R8,done
LDD Rtable,[R8+TABLEARRAY]
BA loop
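A C sketch of the same multi-level table walk (toy tables, defining only NOP and the two-byte 0F A2 CPUID; everything else decodes as undefined):

```c
#include <stdint.h>
#include <stddef.h>

/* Table entries: 0 = undefined, >0 = terminal opcode id,
   <0 = negated index of the next 256-entry table. */
enum { OP_NOP = 1, OP_CPUID = 2 };

static int16_t table0[256];            /* primary opcode map */
static int16_t table1[256];            /* reached via the 0F escape */
static int16_t *table_array[2] = { table0, table1 };

static void init_tables(void)
{
    table0[0x90] = OP_NOP;             /* 90    = NOP */
    table0[0x0F] = -1;                 /* 0F escapes into table1 */
    table1[0xA2] = OP_CPUID;           /* 0F A2 = CPUID */
}

/* Walk tables until a non-negative entry; returns opcode id (0 if
   undefined) and the number of bytes consumed. */
int decode(const uint8_t *pc, size_t *len)
{
    int16_t *table = table0;
    size_t n = 0;
    for (;;) {
        int16_t e = table[pc[n++]];
        if (e >= 0) {                  /* terminal or undefined */
            *len = n;
            return e;
        }
        table = table_array[-e];       /* descend into next table */
    }
}
```

The per-byte loop body is just a load, a test, and either a return or a table swap, which is what makes this scheme cheap in software.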


>
> The situation is much less friendly on a typical RISC, where one might
> battle with which instruction forms exist for which combinations of
> parameters.
>
> There might also be other issues, like needing to deal with delay slots
> and other funkiness. For example, a memory load might not take effect
> for several instructions, or the effects of executing a branch
> instruction might not take effect until one or two instructions later
> (causing instructions after the branch to be executed), ...

These should all have been eradicated by now.
>
> Similarly, many have a habit of requiring loading constants from memory,
> which may need to be placed within a certain distance of the code being
> executed, ...

These should have been, but seem not to have been, eradicated by now.
>
> Some of this is a lot harder to gloss over with an assembler, so writing
> ASM code is a bit more painful if compared with x86.
>
> However, for the CPU it is easier, given the instruction format itself
> is typically fixed-width and fairly regular.
>
> Some partial exceptions exist, like Thumb, where the layout of the bits
> within the various instruction forms is a bit chaotic. Most other RISC's
> are a bit more regular here.
>
>
> > (perhaps particularly with condition code results?;
> > not being able to avoid setting the condition code
> > seems a relatively useless constraint other than for
> > code density), and the encoding (which can matter
> > for density or alignment optimizations) is somewhat
> > complex.
>
> IMO: x86 style condition codes are needlessly inconvenient in some areas.
>
> One alternative that is nicer IMO is simply having a True/False status
> code, with only a small subset of instructions effecting it. But in this
> case, now one needs comparison operators that perform a specific
> comparison, and there are fewer possible conditions to branch on.

T/F is one way of addressing this, another is a bit vector of all possible
comparisons--ala. M88K.

In my latest ISA, the compare instruction can compare integer or FP operands
and also sample whether memory interference has occurred (SW multi-instruction
ATOMIC stuff). One can also include 0<x<y (FORTRAN) or 0<=x<=y (C) for fast
boundary checks. All in all, my CMP instruction delivers 20 individual bits.
I have also extended the Branch-on-comparison to have integer and FP forms
(things like isNAN(), isMINUSzero(),...)

BGB

May 28, 2018, 9:34:29 PM
OK, IIRC I was going off a listing which also had a lot of AVX stuff and
similar.


> And this is one thing I do with my RISC
> ISA spellings:
>
> ADD R7,R8,immed16
> and
> ADD R7,R8,R9
>
> instead of
> ADDI R7,R8,immed16
> and
> ADD R7,R8,R9
>
> One can say this adds some uncertainty, but it comes in handy when
> # define immed16 R9


My BSR1 ISA is also doing something similar, as it is possible to tell
in most cases what the intended behavior is based on the parameters.

For example (vs the SH family):
BRA, BRA/N, BRAF, JMP -> BRA
BSR, BSR/N, BSRF, JSR -> BSR
MOV, LDC, STC, ... -> MOV
...

So, there are currently ~70 mnemonics, vs ~219 on the SH/BJX1
side of things (or ~150 if I exclude the BJX1 ops).


Quickly counting up from the listings from my ISAs:

BJX1-32 has ~ 400 I-forms in the 16-bit Base-ISA (mostly overlaps with
SH4), and ~ 180 in the 8Exx block.

BJX1-64C has ~ 336 I-forms in the 16-bit Base-ISA (superset of B64V),
and ~ 338 in the 8Exx/CExx block.

B32V has ~ 270 I-forms; it more-or-less overlaps with the normal SH4
ISA, is a 16-bit-only subset of BJX1-32, and omits the FPU and various
misc I-forms from SH (such as MAC.W and MAC.L).

B64V has ~ 159 I-forms (B64V was a simplified and reorganized
version of the SH ISA, modified to be 64-bit).



For the BSR1 ISA, it currently has ~ 212 I-forms.

If I omit redundant I-forms which exist mostly for code-density reasons,
it drops to 147 I-forms. Could go lower, but this would start to come at
the cost of ISA features.


>>
>>
>> So, now it is sort of a hairy mess on this front, and implementing a CPU
>> for x86 would probably be fairly non-trivial. If I were to do something
>> with x86 support, would probably just implement a RISC style core and
>> use an emulator to run any x86 code. Likely the emulator would decoded
>> the ISA in software and then JIT compile it into the native ISA (with
>> maybe several MB or so for translated instruction traces).
>>
>>
>> However, from the perspective of someone writing ASM code, it isn't
>> nearly so bad. The assembler can deal with most of the encoding details,
>> and most instructions present a fairly consistent interface.
>>
>> Similarly, most decoding for most operations is basically the same once
>> you get to the Mod/RM byte.
>
> SW decoding of x86 is actually pretty easy--you do it with 256 entry tables
> (6 of them when I left AMD, probably 8 or 9 tables now) and each table entry
> contains a termination bit, an opcode, and a carrier. In my AMD decoder
> the termination bit was any positive table entry. If the table entry was zero
> this was an Undefined inst, and if the table entry was negative, the negative
> table entry was an index into an array of 256 entry tables. So the decoder was something like::
>
> loop:
> LDB R7,[Rpc]
> LDW R8,[Rtable+R7<<2]
> BZE R8,UNDEFINED
> BGEZ R8,done
> LDD Rtable,[R8+TABLEARRAY]
> BA loop
>

In an x86 emulator I once wrote (for a 486-like subset), I used a lookup
table for the first byte into a table of pattern strings (adapted from
my assembler/disassembler). While not necessarily the "best" option, it
basically worked.

The thing wasn't particularly fast in retrospect, but it was the first
to use an interpretation strategy I had used in most of my later VMs:
Decoding sequences of instructions into a "trace" consisting of a series
of "opcode" structures with function pointers to the functions
implementing the opcode behavior.

Typically, executing a trace involves calling into a function pointer,
which will hold an unrolled loop for calling the other function pointers.

This also allows a relatively simple JIT strategy:
Walk the opcode list in the trace, either emitting behavioral logic for
the opcode (if it is recognized), or spit out a call into the associated
function pointer (essentially resulting in call-threaded code).

A pointer to the resulting JIT'ed function would be stored back into
the trace structure, and the emulator's main trampoline loop doesn't
need to care if it is dealing with interpreted or JIT compiled traces.
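A stripped-down C sketch of this style of trace interpreter; the structure and names here are illustrative, not BGB's actual VM:

```c
#include <stdint.h>
#include <stddef.h>

struct vm;
struct op {
    void (*fn)(struct vm *, struct op *);  /* behavior of this opcode */
    int32_t imm;                           /* decoded immediate, if any */
};

struct vm {
    int32_t acc;        /* toy machine state: a single accumulator */
};

static void op_addi(struct vm *vm, struct op *op) { vm->acc += op->imm; }
static void op_muli(struct vm *vm, struct op *op) { vm->acc *= op->imm; }

/* Execute a trace: call each decoded opcode's function pointer in turn.
   A JIT would instead emit either inline logic or a call to fn per op,
   producing call-threaded code. */
void run_trace(struct vm *vm, struct op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ops[i].fn(vm, &ops[i]);
}
```

The trampoline only ever sees "something callable", so swapping an interpreted trace for a JIT-compiled one is transparent to it.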


>
>>
>> The situation is much less friendly on a typical RISC, where one might
>> battle with which instruction forms exist for which combinations of
>> parameters.
>>
>> There might also be other issues, like needing to deal with delay slots
>> and other funkiness. For example, a memory load might not take effect
>> for several instructions, or the effects of executing a branch
>> instruction might not take effect until one or two instructions later
>> (causing instructions after the branch to be executed), ...
>
> These should all have been eradicated by now.

I don't think many new ISA designs use delay slots.

Being based on SH, BJX1 had inherited the use of branch delay slots.

In the design of BSR1, I have dropped the use of delay-slots.


>>
>> Similarly, many have a habit of requiring loading constants from memory,
>> which may need to be placed within a certain distance of the code being
>> executed, ...
>
> These should have been, but seem not to have been, eradicated by now.

Likewise.

BSR1 also drops these in favor of using a load-shift mechanism.

There are still PC-rel addressing modes, but these are more for
accessing things like global variables and similar, sort of like
RIP-relative addressing in x86-64.

BSR1 stays with the use of a T/F flag (in the Status Register).


Basic compare ops:
CMPEQ, CMPGT, CMPHI
With branch ops:
BT, BF

So:
(a==b): CMPEQ a, b; BT lbl
(a!=b): CMPEQ a, b; BF lbl
(a> b): CMPGT b, a; BT lbl
(a< b): CMPGT a, b; BT lbl
(a>=b): CMPGT a, b; BF lbl
(a<=b): CMPGT b, a; BF lbl

With CMPHI used for unsigned (a>b), and is basically used in the same
way as CMPGT.

There are also CMPGE and CMPHS ops when dealing with DLR:
CMPEQ DLR, Rn
CMPGT DLR, Rn
CMPGE DLR, Rn
...

Mostly because flipping A and B is less of an option when dealing with
an immediate (and this is basically what the reused compiler logic had
expected).

There are MOVT and MOVNT instructions to copy the T/F status bit into a GPR.
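The table above can be transcribed directly into a C model; cmpeq/cmpgt here model the T flag with SH-style operand order (CMPGT m, n sets T when n > m), which is an assumption based on the listed mappings:

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the T flag: CMPxx m, n sets T from n relative to m. */
static bool cmpeq(int32_t m, int32_t n) { return n == m; }
static bool cmpgt(int32_t m, int32_t n) { return n > m; }

typedef enum { EQ, NE, GT, LT, GE, LE } rel;

/* Evaluate (a REL b) using only CMPEQ/CMPGT plus the branch sense,
   exactly as in the mapping table above. */
bool eval(rel r, int32_t a, int32_t b)
{
    switch (r) {
    case EQ: return  cmpeq(a, b);    /* CMPEQ a, b; BT */
    case NE: return !cmpeq(a, b);    /* CMPEQ a, b; BF */
    case GT: return  cmpgt(b, a);    /* CMPGT b, a; BT */
    case LT: return  cmpgt(a, b);    /* CMPGT a, b; BT */
    case GE: return !cmpgt(a, b);    /* CMPGT a, b; BF */
    case LE: return !cmpgt(b, a);    /* CMPGT b, a; BF */
    }
    return false;
}
```

All six signed relations collapse onto two compare ops by swapping operands and/or inverting the branch sense.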

thereis...@gmail.com

May 28, 2018, 10:55:54 PM
On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > Similarly, many have a habit of requiring loading constants from memory,
> > which may need to be placed within a certain distance of the code being
> > executed, ...
>
> These should have been, but seem not to have been, eradicated by now.

Constants require either an absurdly long instruction, or variable-length instructions.

I can't speak for other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."

Sadly, I can't speak to the validity of the claim, but I take it you disagree?

> T/F is one was of addressing this, another is a bit vector of all possible
> comparisons--ala. M88K.

I thought everyone hated condition codes these days? :P

BGB

May 29, 2018, 1:27:32 AM
On 5/28/2018 9:55 PM, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
>> On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
>>> Similarly, many have a habit of requiring loading constants from memory,
>>> which may need to be placed within a certain distance of the code being
>>> executed, ...
>>
>> These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>

Or, a series of short instructions each extending the value a little bit
at a time.

The logic for the operation is fairly simple:
BSR_UCMD_ALU_LDISH: begin
tCtlOutDlr = { ctlInDlr[23:0], immValRi[7:0] };
end


The load/shift mechanism can express an arbitrary sized constant via a
series of fixed-width instructions.

Even as such, given these instructions will take 1 cycle each, and I am
looking at 3 or 4 cycles for a typical memory access, on average I
expect the load/shift mechanism to work out faster than it would be to
fetch values from memory.


While these operations are limited to loading into a special register
(DLR), on the final operation the contents of DLR are "consumed" and
potentially transferred to another register (or used to compute a
memory address or similar).


In my ISA:
MOV #0x1234, R9
MOV #0x123456, R10
MOV #0x12345678, R11
Can become instruction sequences:
A123 4894
A123 2645 48A6
A123 2645 2667 48B8

From the user and compiler perspective, it looks like a variable length
coding, but as far as the CPU is concerned, it is dealing with a series
of fixed-width instructions.

Ex:
26jj LDISH8 #imm8u //DLR=(DLR<<8)|Imm8u;
Ajjj LDIZ #imm12u //DLR=Imm12u;
Bjjj LDIN #imm12u //DLR=(~4095)|Imm12u;
48nj MOV DLR_i4, Rn //Rn=(DLR<<4)|Imm4u;
( ... in the current Verilog ... this is actually a LEA ... )
49nj ADD DLR_i4, Rn //Rn=Rn+((DLR<<4)|Imm4u);
( ... and so is this one ... )
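As a cross-check of the example sequences above, a small C sketch of an emitter for non-negative constants (a real emitter would presumably also use LDIN for negative values):

```c
#include <stdint.h>
#include <stddef.h>

/* Emit the 16-bit instruction words for "MOV #imm, Rn" using the
   LDIZ / LDISH8 / MOV DLR_i4 scheme described above.
   Handles non-negative imm only; returns the word count. */
size_t emit_mov_imm(uint32_t imm, unsigned rn, uint16_t *out)
{
    uint32_t rest = imm >> 4;          /* final MOV consumes the low 4 bits */
    unsigned low4 = imm & 0xF;
    size_t n = 0;

    /* How many LDISH8 steps are needed after the initial LDIZ? */
    int shifts = 0;
    while ((rest >> (12 + 8 * shifts)) != 0)
        shifts++;

    out[n++] = 0xA000 | ((rest >> (8 * shifts)) & 0xFFF);  /* LDIZ   */
    for (int i = shifts - 1; i >= 0; i--)
        out[n++] = 0x2600 | ((rest >> (8 * i)) & 0xFF);    /* LDISH8 */
    out[n++] = 0x4800 | (rn << 4) | low4;                  /* MOV DLR_i4, Rn */
    return n;
}
```

Running this on the three constants above reproduces the listed sequences (A123 4894, A123 2645 48A6, A123 2645 2667 48B8).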


> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."
>

This is partly why I went back to fixed-width 16-bit instructions for my
newer ISA.

I did variable-width before, but it was kind of a pain to work with and
made latency issues harder to manage. This, combined with wanting to
target a smaller FPGA, pushed me back to fixed-width.

Also 16-bit because I wanted decent code density.


A bigger/fancier version would probably go back to variable-width, but
probably also add 64-bit support and go to 32 GPRs at roughly the same
time (to hopefully limit fragmentation this time around).

already...@yahoo.com

unread,
May 29, 2018, 3:41:39 AM5/29/18
to
On Tuesday, May 29, 2018 at 5:55:54 AM UTC+3, thereis...@gmail.com wrote:
>
> I thought everyone hated condition codes these days? :P

Same as above (see "Developers loved and preferred the 68k ISA to ARM offerings"). A small minority hates condition codes, another small minority loves condition codes. An overwhelming majority does not care.

In the real world, outside of hate/love relationships, several ISAs with condition codes prosper. ISAs without condition codes are either dead or dying or not quite alive yet. But the reasons for that appear to have no relationship to presence/absence of condition codes.

The only area where ISAs without condition codes are dominant is commercial soft cores, but IMHO even that has nothing to do with the presence/absence of condition codes and everything to do with ARM Inc.'s poor understanding of this particular segment of the market.

Megol

unread,
May 29, 2018, 8:00:46 AM5/29/18
to
On Tuesday, May 29, 2018 at 4:55:54 AM UTC+2, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> > On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > > Similarly, many have a habit of requiring loading constants from memory,
> > > which may need to be placed within a certain distance of the code being
> > > executed, ...
> >
> > These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>
> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."

For the most simplified processors perhaps; otherwise IMO it is just a
matter of design. RISC-V is a hardline RISC updated with some wisdom from
the "recent" past, and variable-length instructions aren't a good fit for
that design goal.

But they still have the compressed instruction extension which in fact gives
a simple variable length instruction set (16 and 32 bit instructions) and
they seem to expect high performance implementations to do instruction
fusion. Both of those complicate decode without providing all advantages of
a proper variable instruction length design.

Simple processors can have variable length without many problems, but it
requires proper design.
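For reference, the compressed extension keeps length determination trivial: per the RISC-V base spec, the two low bits of the first 16-bit parcel distinguish 16-bit RVC instructions from standard 32-bit ones (longer 48/64-bit formats reserve further low-bit patterns, omitted here). A sketch:

```python
def riscv_insn_len(first_parcel: int) -> int:
    """Instruction length in bytes, from the first 16-bit parcel.

    Per the RISC-V spec: if the two low bits are not 0b11, the
    instruction is a 16-bit compressed (RVC) instruction; otherwise
    it is (at least) a standard 32-bit instruction. The rarely-used
    48-bit and longer formats are not handled in this sketch.
    """
    if (first_parcel & 0b11) != 0b11:
        return 2
    return 4

assert riscv_insn_len(0x4501) == 2   # c.li a0, 0 (compressed)
assert riscv_insn_len(0x0513) == 4   # low parcel of addi a0, a0, 0
```

So a decoder needs only two bits of lookahead to find instruction boundaries, which is why RVC is tolerable even in small implementations.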

> Sadly, I can't speak of the validity of the claim, but I take you disagree?
>
> > T/F is one way of addressing this, another is a bit vector of all possible
> > comparisons--ala. M88K.
>
> I thought everyone hated condition codes these days? :P

That is not a type of condition codes; it is a comparison instruction
giving a binary encoded comparison result that can be tested.
Instructions don't set or read condition codes; the comparison instruction
depends only on the input registers/immediate values and produces a bitmask
written to a normal register.

No bottleneck like that with real condition codes, no complexities from
selective updates of condition codes etc.
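A sketch of that scheme in Python (the bit positions here are illustrative; the actual M88100 cmp instruction defines its own fixed bit assignments, which differ from these):

```python
# M88K-style compare: one instruction computes every comparison
# predicate at once and writes the results as a bitmask into an
# ordinary general-purpose register. No implicit condition-code
# state is carried between instructions.
EQ, NE, LT, LE, GT, GE = (1 << i for i in range(6))

def cmp_bits(a: int, b: int) -> int:
    """Return a bitmask of all (signed) comparison results of a vs b."""
    r = 0
    if a == b: r |= EQ
    if a != b: r |= NE
    if a <  b: r |= LT
    if a <= b: r |= LE
    if a >  b: r |= GT
    if a >= b: r |= GE
    return r

# A conditional branch then just tests one bit of the result register:
r = cmp_bits(3, 7)
assert (r & LT) and (r & NE) and not (r & EQ)
```

Because the result lives in a normal register, the rename/dependency machinery needs no special cases, which is the point being made above.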

matt...@gmail.com

unread,
May 29, 2018, 12:08:54 PM5/29/18
to
On Monday, May 28, 2018 at 8:48:33 AM UTC-5, John Levine wrote:
> In article <4274b57d-9bbf-4035...@googlegroups.com>,
> <already> wrote:
> >On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com wrote:
> >
> >> Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.
> >
> >Why sadly?
> >What could be wrong when developers like more pleasant asm coding experience better than less pleasant asm coding experience?
>
> Well, it is unless you are in an environment where you care about code
> size. Does anyone understand how all of the prefix byte and irregular
> instruction encodings interact? (Leaving the manual open on your
> screen all the time doesn't count.)

I expect there are a few x86/x86_64 developers who understand practically everything about the ISA due to the amount of money involved. In contrast, most experienced 68k asm programmers should be able to look at asm code and determine instruction sizes with no manual. Even immediate and displacement sizes are generally 8, 16 or 32 bits which is easy to remember and calculate (usually not true for RISC). An ISA with a variable length encoding should be able to provide a pleasant experience with minimal memorization and less complex logic puzzles.

MitchAlsup

unread,
May 29, 2018, 12:34:07 PM5/29/18
to
On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> > On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > > Similarly, many have a habit of requiring loading constants from memory,
> > > which may need to be placed within a certain distance of the code being
> > > executed, ...
> >
> > These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>
> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."
>
> Sadly, I can't speak of the validity of the claim, but I take you disagree?

Yes, I disagree.
>
> > T/F is one way of addressing this, another is a bit vector of all possible
> > comparisons--ala. M88K.
>
> I thought everyone hated condition codes these days? :P

I still do, but if I end up in a situation where some value needs to be
returned, and that value has lots of bits (32+) available, I will send back
basically every useful bit I can.

MitchAlsup

unread,
May 29, 2018, 12:41:01 PM5/29/18
to
On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> > On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > > Similarly, many have a habit of requiring loading constants from memory,
> > > which may need to be placed within a certain distance of the code being
> > > executed, ...
> >
> > These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>
> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."

I want to address this point::

These days even simple implementations are reading out (1/2) a cache line of
data in an instruction fetch. Once you do this, you have an instruction buffer.
Once you have an instruction buffer, you have everything necessary to provide
constants to the data path except the read-out multiplexers. Thus, the cost
of delivering constants is insignificant.

When one merges the I and D caches into one, this argument becomes even
stronger. One needs a wide fetch so that the number of fetch cycles is
minimized and the LD/ST stream gets the majority of the access cycles.

thereis...@gmail.com

unread,
May 29, 2018, 8:15:08 PM5/29/18