
Some More of the 68000's Greatest Mistakes


Quadibloc

May 25, 2018, 2:20:36 AM
I could not resist this title, given that of the existing thread.

It is true that at the beginning, the 68000 "went wrong" by not offering an option like the 68008 in time for IBM to have considered it for the PC.

Given the similarities of the 8086 to the 8080, and thus the ease of converting programs from CP/M to PC-DOS that resulted, though... I think that although many at resolutely big-endian IBM would have liked the 68008, it still wouldn't be a slam dunk.

At the end of the life of the 68000 architecture, though, I can think of three actions on Motorola's part that were mistakes...

1) The 68060 was brought out to compete with the Pentium.

However, it made what seems to me to be a mistake... although a most reasonable
one. One that AMD keeps making as well.

Given that floating-point is an exotic data type, mainly of use to scientists
and FORTRAN programmers, and integer performance is what matters...

then AMD's decision with Bulldozer must have seemed reasonable at the time, to
de-emphasize floating-point vector performance to save die space...

and the decision with the 68060 to do the opposite of what Intel did with the
Pentium AND the opposite of what IBM did with the Model 91, and make the
*integer* unit pipelined, but *not* the *floating-point* unit must have seemed
equally sensible.

2) The decision to abandon the 68000 architecture when it still had
*customers*... Apple, with its Macintosh, and as well the Atari ST and the
Commodore Amiga... well, that branded Motorola as an unreliable supplier.

3) And then there's ColdFire. Maybe the 68000 had too many baroque addressing
modes, and even doing what Intel was about to do with the Pentium Pro, and AMD
did by buying another company's design (NexGen's, which became the K6), and
converting code into RISC-like micro-ops, would not have saved it.

However, ColdFire went a step too far. One of the addressing modes abandoned was
16-bit displacement + base register + index register. Without that, one has to
compile code that uses arrays into long, cumbersome sequences involving extra
instructions. This is one of the most basic standard addressing modes - RX
format - not some exotic VAX-like foray.
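To make the cost concrete, here is a sketch in C. The struct layout, offset
and register assignments are illustrative, and the ColdFire sequence is one
plausible workaround rather than verified compiler output:

struct rec { char pad[1232]; long arr[100]; };  /* 32-bit longs assumed */

long get(struct rec *r, long i)    /* say r arrives in a0, i in d1 */
{
    /* 68020+: one instruction, using the dropped mode:
     *    move.l (1232,a0,d1.l*4),d0
     * ColdFire kept only the 8-bit displacement indexed form, so the
     * base + 16-bit displacement must be materialized first:
     *    lea    (1232,a0),a1
     *    move.l (a1,d1.l*4),d0
     */
    return r->arr[i];
}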

Had they not done this, one could have imagined 68000 customers converting to
the more streamlined ColdFire architecture, if only the features omitted were
ones they could live without.

However, as ColdFire was positioned as being for the embedded market, it is also
not as if beefy processors with that architecture would be available.

John Savard

matt...@gmail.com

May 25, 2018, 2:31:03 PM
On Friday, May 25, 2018 at 1:20:36 AM UTC-5, Quadibloc wrote:
> I could not resist this title, given that of the existing thread.
>
> It is true that at the beginning, the 68000 "went wrong" by not offering an option like the 68008 in time for IBM to have considered it for the PC.
>
> Given the similarities of the 8086 to the 8080, and thus the ease of converting programs from CP/M to PC-DOS that resulted, though... I think that although many at resolutely big-endian IBM would have liked the 68008, it still wouldn't be a slam dunk.

I agree. IBM and Intel already had a business relationship and the 8088 was available in large quantities for cheap (likely due to poor performance and weak demand). Motorola probably could not have supplied large quantities of even the 68000 to one customer because they were selling so many to many different customers. Many of those customers were also higher end, higher margin customers buying CPUs for workstations. The focus was on the high margin, high end CPU market, which later disappeared quickly with the RISC hype and propaganda.

Had Motorola expanded more quickly into the embedded and mass production affordable PC CPU markets with the 68008, it would have helped the 68k family survive longer even without IBM choosing it for the IBM PC. It would be interesting to know whether the 68008 would also have been chosen for the Apple Macintosh, Atari ST and/or C= Amiga had it been available earlier. As it was, only the short lived Sinclair QL received the 68008 (its QDOS was the first preemptively multitasking PC OS, followed shortly after by AmigaOS; both showed how much easier compiler support is for an orthogonal CPU with many GP registers).

> At the end of the life of the 68000 architecture, though, I can think of three actions on Motorola's part that were mistakes...
>
> 1) The 68060 was brought out to compete with the Pentium.
>
> However, it made what seems to me to be a mistake... although a most reasonable
> one. One that AMD keeps making as well.
>
> Given that floating-point is an exotic data type, mainly of use to scientists
> and FORTRAN programmers, and integer performance is what matters...
>
> then AMD's decision with Bulldozer must have seemed reasonable at the time, to
> de-emphasize floating-point vector performance to save die space...
>
> and the decision with the 68060 to do the opposite of what Intel did with the
> Pentium AND the opposite of what IBM did with the Model 91, and make the
> *integer* unit pipelined, but *not* the *floating-point* unit must have seemed
> equally sensible.

I don't know that the 68060 deliberately "de-emphasized" floating point performance. The 68040 FPU design had already simplified the FPU by trapping many of the 6888x instructions and handling them in software (the x86 FPU retained most legacy FPU instructions in hardware). I expect the lack of a fully pipelined FPU on the 68060 came down to lack of time and/or the 2.5 million transistor budget. The 68060 design team probably realized they had to do more with less than the Pentium to compete. The 68k won the battle with the 68060, but the 68k lost the war.

Pentium@75MHz 80502, 3.3V, 0.6um, 3.2 million transistors, 9.5W max
68060@75MHz 3.3V, 0.6um, 2.5 million transistors, ~5.5W max *1
PPC 601@75MHz 3.3V, 0.6um, 2.8 million transistors, ~7.5W max *2

*1 estimate based on 68060@50MHz 3.9W max, 68060@66MHz 4.9W max
*2 estimate based on 601@66MHz 7W max, 601@80MHz 8W max

The 68060 had the best integer performance even though benchmarks often showed the Pentium to be competitive if not slightly ahead due to much better compiler support. The Pentium did have better theoretical FPU performance but likely required hand laid assembler to achieve it. The 68060 FPU is more compiler friendly and performs well on mixed code as integer instructions can often operate in parallel (although FPU instructions using immediates annoyingly can't probably due to another transistor saving strategy). From optimizing FPU code, I suspect the small 8kB data cache becomes the bottleneck on these older processors with games like Quake which became so important to sales at the time. My 68060@75MHz Amiga with Voodoo 4 (no T&L) can play Quake 512x384x16 at ~25 fps. It is obviously not as well optimized as the PC version but shows the Pentium FPU advantage was better for marketing than performance. Certainly at that time it was better to focus on integer performance but the 68060 still had a good FPU.

> 2) The decision to abandon the 68000 architecture when it still had
> *customers*... Apple, with its Macintosh, and as well the Atari ST and the
> Commodore Amiga... well, that branded Motorola as an unreliable supplier.

Apple was in the AIM (Apple, IBM and Motorola) alliance, so it was already on board with switching ISAs to PPC. Atari ST sales were dropping and the Amiga was grossly mismanaged. Commodore did try to license the 68k from Motorola to make a cheap SoC Amiga before they went bankrupt, though I suspect it would have been a 68020 or 68030 rather than the 68060. A single chip Amiga with a 68060 would have been awesome, especially if Commodore could have continued to improve the 68060.

Commodore had bought MOS (Chuck "6502" Peddle joined C= but they wasted his talents in typical C= fashion), was using FPGAs for custom chip development back then, and had added instructions to the PA-RISC for their Hombre gfx chipset, so they had the technology to improve it if not sabotaged by management. Many of the C= engineers had bought into the RISC hype, so there was no guarantee the 68k would have been further developed by C=. The 68060 also performed better than the comparable PA-RISC CPU of the time at the same clock speed. I did the following 68060 comparison a while ago...

The PA-RISC 7100@99MHz (L1: 256kB ICache/256kB DCache) without SIMD could decode MPEG 320x240 video at 18.7 fps. My 68060@75MHz (L1: 8kB ICache/8kB DCache) using the old RiVA 0.50 decodes MPEG video between 18-22 fps (average ~20 fps). An update to the new RiVA 0.52 works now, giving 21-29 fps (average ~26 fps, with more 68060 optimization possible). Note that the PA-RISC 7100 was introduced in 1992 and used in technical and graphical workstations and computing servers, while the 68060 was introduced in 1994 for desktop and embedded applications (less demanding and lower cost applications).

The PA-RISC 7100LC@60MHz (L1: 32kB ICache/32kB DCache), introduced in 1994 with SIMD (initially 32 bit MAX, but may have been upgraded to MAX-2 later?), could do 26 fps decoding 320x240 MPEG. MAX not only improved the performance (finally better than the 68060 at MPEG fps) but improved the code density by replacing many RISC instructions, allowing the cache sizes to be reduced tremendously. The PA-RISC 7100LC@80MHz (L1: 128kB ICache/128kB DCache) with MAX SIMD could do 33 fps decoding 320x240 MPEG. As we can see, the PA-RISC had unimpressive performance even with SIMD and lots of resources.

> 3) And then there's ColdFire. Maybe the 68000 had too many baroque addressing
> modes, and even doing what Intel was about to do with the Pentium Pro, and AMD
> did by buying another company's design for the... was it the K6? ... and
> converting code into RISC-like micro-ops would not have saved it.

The 68k doesn't need to waste energy moving to OoO and breaking instructions down as far as x86/x86_64 does, which means it could make a better Atom-like CPU. The powerful addressing modes may reduce the clock speed some, but there are indications that it would have very good single core performance/clock. A lower clock speed is bad for marketing but good for embedded applications.

> However, ColdFire went a step too far. One of the addressing modes abandoned was
> 16-bit displacement + base register + index register. Without that, one has to
> compile code that uses arrays into long, cumbersome sequences involving extra
> instructions. This is one of the most basic standard addressing modes - RX
> format - not some exotic VAX-like foray into an exotic realm.

I altered a 68k disassembler to analyze Amiga 68k code and didn't see (bd16,An,Xi*SF) used very often. Maybe other operating systems use it more, or maybe compilers were already breaking it down into multiple instructions (because of the EA calc cost of this addressing mode?).

Most modern large 68k programs use absolute addressing because there is no cheap (d32,An) or (d32,PC) when programs become large, and absolute addressing is cheaper than (bd,An,Xi*SF). This is a waste, as many accesses would fit in (d16,An) or (d16,PC), which would improve code density. There is a simple and compact way to encode (d32,PC) but not (d32,An), which is why I suggested an ABI code model which merges all sections and allows most accesses to be PC relative (this requires allowing PC relative writes, which are currently illegal). PC relative addressing is more compact and saves a base address register. Certainly absolute addressing makes no sense when moving to a 64 bit 68k CPU, where an efficient (d32,PC) is needed anyway.
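As a sketch of the trade-off (the global is hypothetical; the byte counts
follow from the 68k encodings, and the middle variant assumes a4 is reserved
as a small-data base register):

extern long hits;    /* hypothetical global in a merged section */

long get_hits(void)
{
    /* absolute long:  move.l (hits).l,d0   ; 6 bytes, reaches anywhere
     * (d16,An):       move.l (d16,a4),d0   ; 4 bytes, but ties up a4
     * (d16,PC):       move.l (hits,pc),d0  ; 4 bytes, +/-32KB reach,
     *                                      ; no base register needed
     */
    return hits;
}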

Here are the EA calc times of the 68060.

Dn Data Register Direct 0(0/0)
An Address Register Direct 0(0/0)
(An) Address Register Indirect 0(0/0)
(An)+ Address Register Indirect with Postincrement 0(0/0)
–(An) Address Register Indirect with Predecrement 0(0/0)
(d16,An) Address Register Indirect with Displacement 0(0/0)
(d8,An,Xi*SF) Address Register Indirect with Index and Byte Displacement 0(0/0)
(bd,An,Xi*SF) Address Register Indirect with Index and Base 16/32 Bit Displacement 1(0/0)
([bd,An,Xn],od) Memory Indirect Preindexed Mode 3(1/0)
([bd,An],Xn,od) Memory Indirect Postindexed Mode 3(1/0)
(xxx).W Absolute Short 0(0/0)
(xxx).L Absolute Long 0(0/0)
(d16,PC) Program Counter with Displacement 0(0/0)
(d8,PC,Xi*SF) Program Counter with Index and Byte Displacement 0(0/0)
(bd,PC,Xi*SF) Program Counter with Index and Base 16/32 Bit Displacement 1(0/0)
#<data> Immediate 0(0/0)
([bd,PC,Xn],od) Program Counter Memory Indirect Preindexed Mode 3(1/0)
([bd,PC],Xn,od) Program Counter Memory Indirect Postindexed Mode 3(1/0)

I expect the powerful addressing modes could be done in fewer cycles today.
If (bd,An,Xi*SF) and (bd,PC,Xi*SF) could be reduced from 1 cycle to 0 cycles then they could be used more. That should reduce the double indirect mode cost to 1 cycle, which would be great (compilers should split these instructions when other instructions can be scheduled in between).

> Had they not done this, one could have imagined 68000 customers converting to
> the more streamlined ColdFire architecture, if only the features omitted were
> ones they could live without.
>
> However, as ColdFire was positioned as being for the embedded market, it is also
> not as if beefy processors with that architecture would be available.

The big mistake of ColdFire was not to allow full 68k compatibility with traps to software. Motorola did a poor job of marketing the 68k for embedded, but that was where it was really good; it did catch on with a loyal following of 68k developers and fans, who slowly went over to ARM due to ColdFire's lack of 68k compatibility and support. The CPU32 ISA with the MVS, MVZ, BYTESWAP and BITSWAP instructions would have been better than ColdFire. ColdFire aimed too low, where the 68k can't compete with minimalist RISC and isn't powerful anymore, but abandoned the higher end embedded market where the 68k code density and ease of use make it a good choice.

Motorola tried to shove PPC down the throats of developers for high end embedded as well as desktop PC processors after the AIM agreement. The huge success of the 68k disappeared practically overnight due to Motorola incompetence. The 68060 was one of the greatest processors of its time but had an identity crisis, so Motorola threw the baby out with the bathwater. The greener pastures of PPC on the other side of the fence don't seem so green now.

Quadibloc

May 25, 2018, 3:56:40 PM
On Friday, May 25, 2018 at 12:31:03 PM UTC-6, matt...@gmail.com wrote:

> The greener pastures of PPC on the other side of the fence don't seem so green
> now.

That may be, but in a way that explains why it wasn't a mistake at the time to
try hopping to the PowerPC. Yes, that didn't succeed; the PowerPC didn't become
a very popular architecture. But they didn't have much to lose: the 68000
architecture had a... *presence*... given the Macintosh, the Atari ST, and the
Amiga, but still, this presence was not remotely as big as what the x86 had with
the IBM PC and its clones.

Neither the Macintosh, the Atari ST, nor the Amiga could be cloned.

Only the Macintosh of those three was something that businesses took seriously;
the other two were strictly home computers, even if the Atari ST looked the
part, and the Amiga, in most of its versions, sort of _looked_ like an office
desktop.

Apple's tendency to keep everything proprietary and high-priced was established
back then, it isn't something that's new today.

So there was no 68000 platform with growth potential, no standard 68K box that
could rival the PC.

John Savard

matt...@gmail.com

May 25, 2018, 5:15:11 PM
On Friday, May 25, 2018 at 2:56:40 PM UTC-5, Quadibloc wrote:
> On Friday, May 25, 2018 at 12:31:03 PM UTC-6, matt...@gmail.com wrote:
>
> > The greener pastures of PPC on the other side of the fence don't seem so green
> > now.
>
> That may be, but in a way that explains why it wasn't a mistake at the time to
> try hopping to the PowerPC. Yes, that didn't succeed; the PowerPC didn't become
> a very popular architecture. But they didn't have much to lose: the 68000
> architecture had a... *presence*... given the Macintosh, the Atari ST, and the
> Amiga, but still, this presence was not remotely as big as what the x86 had with
> the IBM PC and its clones.

It was no doubt disconcerting when the bread and butter, profitable workstation makers jumped from 68k to RISC. With the 68k, it was still Amiga, Macintosh, Atari ST and what should have been embedded vs x86 clones. With PPC, it was just Macintosh, plus PPC shoved down embedded developers' throats, vs x86 clones; that turned into dead PPC vs x86 clones vs ARM for embedded. Now Motorola is a Chinese company which doesn't make CPUs, and Freescale/NXP is a Dutch company which pays license fees to ARM for most of their CPU designs. ARM won the war because they were willing to get dirty in the trenches while the mighty Motorola panicked and surrendered at the first opportunity.

> Neither the Macintosh, the Atari ST, nor the Amiga could be cloned.

There were Mac clones for a little while but Apple changed their mind. Apple also changed their mind about PPC. I expect more than a few business partners of Apple lost piles of money and some probably went bankrupt. Some people still do business with Apple and take them seriously though.

> Only the Macintosh of those three was something that businesses took seriously;
> the other two were strictly home computers, even if the Atari ST looked the
> part, and the Amiga, in most of its versions, sort of _looked_ like an office
> desktop.

The Amiga was the computer for desktop video with the Video Toaster (near real time operations no competitor could match), so it found a niche (plus some very good paint programs). The Atari ST wasn't bad for DTP, audio and databases. The Mac was really only DTP on the 68k. They did get some MS software, but it was generally considered inferior to the PC clone versions.

> Apple's tendency to keep everything proprietary and high-priced was established
> back then, it isn't something that's new today.
>
> So there was no 68000 platform with growth potential, no standard 68K box that
> could rival the PC.

Apple had trouble back then too. MS bailed them out (bought something like 25% of Apple) or they likely would have gone bankrupt. Motorola could have bought Apple, Amiga or Atari ST for a song at the right time if they wanted to be vertically integrated as they would have benefited the most from an open 68k platform (clone makers would have bought 68k CPUs too). The desktop and gaming markets are cyclical while the embedded markets are consistent and defensive so I would have pushed more and faster into them with the 68k CPU that developers loved. I probably would have developed PPC or the 88k to have a RISC offering as well. Choice and happy customers are good.

MitchAlsup

May 25, 2018, 10:08:05 PM
On Friday, May 25, 2018 at 2:56:40 PM UTC-5, Quadibloc wrote:
> On Friday, May 25, 2018 at 12:31:03 PM UTC-6, matt...@gmail.com wrote:
>
> > The greener pastures of PPC on the other side of the fence don't seem so green
> > now.
>
> That may be, but in a way that explains why it wasn't a mistake at the time to
> try hopping to the PowerPC. Yes, that didn't succeed; the PowerPC didn't become
> a very popular architecture. But they didn't have much to lose: the 68000
> architecture had a... *presence*... given the Macintosh, the Atari ST, and the
> Amiga, but still, this presence was not remotely as big as what the x86 had with
> the IBM PC and its clones.

When I was at Moto, we had a saying: "Apple paid for the FAB, but you made
no profit on them."

The same can be said for being a supplier to SPARC, too.

Terje Mathisen

May 26, 2018, 3:11:11 AM
matt...@gmail.com wrote:
> The 68060 had the best integer performance even though benchmarks
> often showed the Pentium to be competitive if not slightly ahead due
> to much better compiler support. The Pentium did have better
> theoretical FPU performance but likely required hand laid assembler
> to achieve it. The 68060 FPU is more compiler friendly and performs
> well on mixed code as integer instructions can often operate in
> parallel (although FPU instructions using immediates annoyingly can't
> probably due to another transistor saving strategy). From optimizing
> FPU code, I suspect the small 8kB data cache becomes the bottleneck
> on these older processors with games like Quake which became so
> important to sales at the time. My 68060@75MHz Amiga with Voodoo 4
> (no T&L) can play Quake 512x384x16 at ~25 fps. It is obviously not as
> well optimized as the PC version but shows the Pentium FPU advantage
> was better for marketing than performance. Certainly at that time it
> was better to focus on integer performance but the 68060 still had a
> good FPU.

Since I got my name into the Quake manual for the work I did helping
optimize the asm code, I still remember quite a bit from that time:

Comparing the original Quake (pure sw rendering) with a later version
supporting the Voodoo card is not even apples vs oranges!

Mike Abrash (with a little bit of help from me, maybe 5%?) managed to
triple (!) the speed of John Carmack's original C code (which is what
you would have to run on that 68K).

This code was extremely tightly coupled with the instruction latencies
and throughput of the Pentium, both integer and FPU, among many other
things it did a proper division for correct perspective once every 16
pixels, and the latency of that 32-bit FDIV (17 cycles afair) was very
carefully overlapped with other parts of the code.
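The shape of that trick, reduced to a C sketch (the names and interpolation
details are mine, not id's code; real versions run the pixel loop in
fixed-point and arrange for the divide's latency to hide under it):

#include <stdint.h>

extern uint16_t texel(float u, float v);   /* hypothetical texture fetch */

void draw_span(uint16_t *dst, int count,
               float uz, float vz, float iz,     /* u/z, v/z, 1/z at start */
               float duz, float dvz, float diz)  /* per-pixel steps */
{
    float u0 = uz / iz, v0 = vz / iz;   /* one true divide at span start */

    while (count > 0) {
        int n = count > 16 ? 16 : count;

        /* Issue the divide for the end of this 16-pixel run now; its
         * long latency overlaps the integer pixel loop below. */
        float z1 = 1.0f / (iz + n * diz);
        float u1 = (uz + n * duz) * z1;
        float v1 = (vz + n * dvz) * z1;
        float du = (u1 - u0) / n, dv = (v1 - v0) / n;  /* linear inside run */

        for (int i = 0; i < n; i++) {
            *dst++ = texel(u0, v0);
            u0 += du;
            v0 += dv;
        }
        uz += n * duz;  vz += n * dvz;  iz += n * diz;
        u0 = u1;  v0 = v1;
        count -= n;
    }
}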

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

David Brown

May 26, 2018, 9:26:40 AM
On 25/05/18 20:31, matt...@gmail.com wrote:
> On Friday, May 25, 2018 at 1:20:36 AM UTC-5, Quadibloc wrote:
<snip>
>>
>> However, as ColdFire was positioned as being for the embedded
>> market, it is also not as if beefy processors with that
>> architecture would be available.
>
> The big mistake of ColdFire was not to allow full 68k compatibility
> with traps to software. Motorola did a poor job of marketing the 68k
> for embedded but that was where it was really good and did catch on
> with a loyal following of 68k developers and fans which slowly went
> over to ARM due to ColdFire's lack of 68k compatibility and support.
> The CPU32 ISA with the MVS, MVZ, BYTESWAP and BITSWAP instructions
> would have been better than ColdFire. ColdFire aimed too low where
> the 68k can't compete with minimalist RISC and isn't powerful anymore
> but abandoned the higher end embedded market where the 68k code
> density and ease of use make it a good choice. Motorola tried to
> shove PPC down the throats of developers for high end embedded as
> well as desktop PC processors after the AIM agreement. The huge
> success of the 68k disappeared practically overnight due to Motorola
> incompetence. The 68060 was one of the greatest processors of its
> time but had an identity crisis so Motorola threw the baby out with
> the bathwater. The greener pastures of PPC on the other side of the
> fence don't seem so green now.
>

The 68K architecture was, AFAIK, used in two embedded arenas - network
equipment, and automotive (engine controllers and the like). Networking
split into two main branches - chips for routers, firewalls, etc.,
requiring more general processing, and chips for switches and security
products where speed was the top requirement. For the general usage,
ColdFire became popular - it was the original target for ucLinux, and
the devices with MMU were a major choice for "normal" embedded Linux, as
well as VxWorks and other OSes. But it was never fast enough for the heavy
switches - they used PPC cores.

In the automotive world, and a bit wider in industrial electronics, the
68332 was an immensely popular chip. (I loved it.) Motorola, then
Freescale, tried to kill it off for years and move customers over to
ColdFire or PPC, but failed - people kept buying the 68332 and its
immediate relatives like the 68376.

Motorola/Freescale's biggest problem with ColdFire is that it had too
many cores. It had the old 68K that wouldn't die, the ColdFire, the
PPC, MCore, and a range of smaller chips and DSP cores of various
strengths - but the popular market was moving to ARM. It went through a
few stages of mixes, like families of microcontrollers where you had the
same peripheral set and pinouts but could choose 8-bit 68xx cores or
32-bit ColdFires. But in the end, it settled on ARM cores and PPC.

matt...@gmail.com

May 26, 2018, 12:26:17 PM
On Saturday, May 26, 2018 at 2:11:11 AM UTC-5, Terje Mathisen wrote:
> matthey wrote:
> > The 68060 had the best integer performance even though benchmarks
> > often showed the Pentium to be competitive if not slightly ahead due
> > to much better compiler support. The Pentium did have better
> > theoretical FPU performance but likely required hand laid assembler
> > to achieve it. The 68060 FPU is more compiler friendly and performs
> > well on mixed code as integer instructions can often operate in
> > parallel (although FPU instructions using immediates annoyingly can't
> > probably due to another transistor saving strategy). From optimizing
> > FPU code, I suspect the small 8kB data cache becomes the bottleneck
> > on these older processors with games like Quake which became so
> > important to sales at the time. My 68060@75MHz Amiga with Voodoo 4
> > (no T&L) can play Quake 512x384x16 at ~25 fps. It is obviously not as
> > well optimized as the PC version but shows the Pentium FPU advantage
> > was better for marketing than performance. Certainly at that time it
> > was better to focus on integer performance but the 68060 still had a
> > good FPU.
>
> Since I got my name into the Quake manual for the work I did helping
> optimize the asm code, I still remember quite a bit from that time:
>
> Comparing the original Quake (pure sw rendering) with a later version
> supporting the Voodoo card is not even apples vs oranges!

Part of the code is shared and part is no longer needed between the SW-only and GL versions of Quake. I have done 68k optimizations and bug fixes for SW and GL Quake I and Quake II, as well as for the Amiga GL 3D gfx drivers and the 68k vbcc compiler (the game is a good performance benchmark and has historical significance). There is a surprising amount of 68k assembler code in some of the SW rendering Quake ports on the Amiga, but they only manage about half of the GL rendering frame rate.

> Mike Abrash (with a little bit of help from me, maybe 5%?) managed to
> triple (!) the speed of John Carmack's original C code (which is what
> you would have to run on that 68K).
>
> This code was extremely tightly coupled with the instruction latencies
> and throughput of the Pentium, both integer and FPU, among many other
> things it did a proper division for correct perspective once every 16
> pixels, and the latency of that 32-bit FDIV (17 cycles afair) was very
> carefully overlapped with other parts of the code.

The level of optimization, and the man-hours spent because of the poor x86 ISA, is most impressive for Quake. The original C code must have been poor to see a 3x performance boost, too. FDIVs are indeed a good place to look for optimizations: first try an FMUL by an immediate reciprocal (Frank Wille and I added this optimization to the 68k assembler vasm used by vbcc), and then execute integer instructions in parallel while the FDIV is calculating. I optimized a SW rendering function in Quake II that moved a good portion of the integer code under a couple of FDIVs and was proud of myself until I benchmarked something like a 0.2 fps performance increase. Examples like this made me think the DCache size was a bottleneck for the large amount of data, but maybe the x86 FPU was just better performing with hand laid assembler code.
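The reciprocal trick in C terms, as a minimal sketch (whether it is legal
depends on the constant, which is why it has to be an optional, opt-in
optimization in an assembler or compiler):

/* FDIV is long-latency and not pipelined on these FPUs; FMUL is cheap.
 * When the reciprocal is exactly representable (any power of two) the
 * rewrite is exact; for constants like 10.0 the result can differ in
 * the last bit. */
static inline float scale16(float x)
{
    return x * (1.0f / 16.0f);   /* folds to one FMUL instead of an FDIV */
}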

matt...@gmail.com

May 26, 2018, 2:21:56 PM
On Saturday, May 26, 2018 at 8:26:40 AM UTC-5, David Brown wrote:
uClinux's first target was the 68k (CPU32 on the 68332, not the 68328 DragonBall as the wiki incorrectly states); I have Jeff Dionne's e-mail telling me this. He is involved with the open core J-core (SuperH ISA) CPU project, where he wants to use mass produced embedded sensors for his business in Japan. He thinks the SuperH will scale from microcontrollers with DSP add-ons, where a very simple SH-3 is well suited, to powerful 64 bit CPUs. I tried to convince him that a more robust CPU design like a 64 bit 68k (the patents have also expired) with a SIMD unit would be better suited. He admits he can program the 68k like the wind compared to the SuperH and acknowledged major flaws in the SuperH ISA (see posts by BGB and me on this forum), but he likes the simple SuperH CPU design.

Motorola was slow to ramp up the features and performance of 68k CPUs after the AIM agreement. They did not want the 68k competing with the PPC (which they pushed for embedded too) so it was weakened and demoted to the low end embedded cellar where it is not particularly well suited. No 68k or ColdFire CPU from Motorola has exceeded the single core performance/MHz of the 68060 despite massive leaps in technology (the 68060 design was good but had lots of room for improvement). Low end FPGA 68k CPU cores today give better performance than Motorola/Freescale/NXP offerings. Developers loved and preferred the 68k ISA to ARM offerings but Motorola/Freescale let them slip away by not upgrading features and performance and shoving PPC down their throats instead. ARM has this *huge* market but nobody really wants to compete with them.

> In the automotive world, and a bit wider in industrial electronics, the
> 68332 was an immensely popular chip. (I loved it.) Motorola, then
> Freescale, tried to kill it off for years and move customers over to
> ColdFire or PPC, but failed - people kept buying the 68332 and its
> immediate relatives like the 68376.

The CPU32 ISA is pretty good for a simplified 68k ISA. That is what InnovASIC's FIDO CPU uses as well. Transistors are cheap for all but the lowest end embedded CPUs today and the decoder penalty of full 68k is practically nothing.

> Motorola/Freescale's biggest problem with ColdFire is that it had too
> many cores. It had the old 68K that wouldn't die, the ColdFire, the
> PPC, MCore, and a range of smaller chips and DSP cores of various
> strengths - but the popular market was moving to ARM. It went through a
> few stages of mixes, like families of microcontrollers where you had the
> same peripheral set and pinouts but could choose 8-bit 68xx cores or
> 32-bit ColdFire's. But in the end, it settled on ARM cores and PPC.

Yep, poor management and lack of understanding of their products by Motorola/Freescale. ColdFire could have been a subset of 68k able to execute 68k code with software traps while new ColdFire instructions could have been added to higher end 68k designs as most encodings are open. Instead they needed more CPU offerings which required more support. PPC never was well suited for embedded. It is a pain to program in assembler, doesn't have good code density, is complex for RISC and lacks embedded features like ARM has. MCore was the right idea for low end embedded but too late, simplified too far again and abandoned quickly.

already...@yahoo.com

May 27, 2018, 2:57:13 AM
On Saturday, May 26, 2018 at 9:21:56 PM UTC+3, matt...@gmail.com wrote:
>
> Developers loved and preferred the 68k ISA to ARM offerings
>

I'd guess the statement above is correct, but incomplete.

The complete statement would be: "A small minority of developers loved and preferred the 68k ISA to ARM, another small minority of developers loved and preferred the ARM ISA to the 68k, while the absolute majority of developers didn't care about ISA."

> PPC never was well suited for embedded.
> It is a pain to program in assembler,

True, but doesn't matter.

> doesn't have good code density,

Did you look at the variant of the PPC ISA implemented by the e200 cores?
I never bothered to measure, but on paper it looks like its code density (compiled code; asm is irrelevant) should be excellent - at least as good as ColdFire, but likely somewhat better.
Of course, measurements are better than feelings.

>is complex for RISC


True, but doesn't matter.

>and lacks embedded features like ARM has

What features?
Intuitively, I would think that the wider immediate field in PPC load/store instructions is useful for embedded.
IMHO, "ARM classic" is a reasonable embedded ISA, but not something to write home about - quite comparable with "PPC classic". On the other hand, the Thumb2 ISA is really quite good.


David Brown

May 27, 2018, 7:03:14 AM
OK. The 68332 is an odd target for it - it had only a 16-bit external
databus, so it would be rather slow for uCLinux. I still have a
ColdFire 2 ucLinux board somewhere in the office.

> He is involved with the open core J-core (SuperH
> ISA) CPU project where he wants to use mass produced embedded sensors
> for his business in Japan. He thinks the SuperH will scale from
> microcontrollers with DSP add-ons, where a very simple SH-3 is well
> suited to powerful 64 bit CPUs. I tried to convince him a more robust
> CPU design like a 64 bit 68k (patents have also expired) with a SIMD
> unit would be better suited. He admits he can program the 68k like
> the wind compared to the SuperH and acknowledged major flaws in the
> SuperH ISA (see posts by BGB and me on this forum) but he likes the
> simple SuperH CPU design.

The one thing that I see as a potential issue for bigger and faster 68K
devices is the limited number of registers - 8 general purpose data
registers and 7 address registers. As far as I have seen in the history
of 68k, and processors in general, there has been a trend towards using
more registers and fewer complex addressing modes. This would be more
noticeable for a 64-bit design with more pipelining, superscalar execution, etc.

Apart from that, I have always thought the 68K was a very nice ISA.
PPC took a while to get established for embedded work like industrial
and automotive applications. (For networking, it seemed to be
successful - especially the 64-bit version. But I have not worked in
that area myself.)

The first PPC microcontrollers from Freescale were devices like the
MPC555 and MPC565. These were seen as direct successors to the 68332 -
that is how we used them. (I preferred the ColdFire MCF5234 as a
replacement to the 68332, but it was not available until a little
later). The MPC5xx suffered from poorer, more limited and more
expensive development tools compared to the 68332, but its key problem
for such systems was interrupt handling. It was very inefficient, and
difficult to get right - as you say, it was not fun to program in assembly.

But the modern PPC microcontroller cores (like the e200z6), good
interrupt controllers, and newer tools make these far nicer to work
with. I did a couple of PPC microcontroller projects a few years ago,
and was mostly happy with them.



already...@yahoo.com

May 27, 2018, 7:59:11 AM
No, 64-bit PPC never was successful outside of IBM servers, where it was called POWER.
In networking the most successful PPC was the 32-bit PowerQUICC series, esp. the PowerQUICC I - the same chips that you mentioned below.

>
> The first PPC microcontrollers from Freescale were devices like the
> MPC555 and MPC565. These were seen as direct successors to the 68332 -
> that is how we used them. (I preferred the ColdFire MCF5234 as a
> replacement to the 68332, but it was not available until a little
> later). The MPC5xx suffered from poorer, more limited and more
> expensive development tools compared to the 68332,

We used Diab Data tools. Worked fine, as far as I remember. Licensing was a bit nasty, but nothing extraordinary. Same for price.

> but its key problem
> for such systems was interrupt handling. It was very inefficient, and
> difficult to get right - as you say, it was not fun to program in assembly.

Somehow, for us it never was a problem. Maybe because we never needed especially fast interrupt response.

>
> But the modern PPC microcontroller cores (like the e200z6), good
> interrupt controllers, and newer tools make these far nicer to work
> with. I did a couple of PPC microcontroller projects a few years ago,
> and was mostly happy with them.

I haven't touched PPC MCUs since the middle of the previous decade. The last one was an IBM 405 (or 440? I don't remember). I liked the Freescale gear much better, but it was probably overkill outside of the range that could take full advantage of the communication co-processor. IBM's was more general-purpose and less ambitious, but I didn't like it - less so the core, more so the peripherals.

Today I don't have much use for the class of MCUs that are equipped with the e200z6 or its peers (ARM Cortex-R5?). For most of our tasks the e200z1 would probably be insufficient, while the e200z3 would be overkill. But even if there was something in the middle, I see little reason to use e200 over MCUs based on the ARM Cortex-M4. Variety for the sake of variety? Thanks, it's not my way of thinking.


Terje Mathisen

May 27, 2018, 11:20:48 AM
matt...@gmail.com wrote:
> On Saturday, May 26, 2018 at 2:11:11 AM UTC-5, Terje Mathisen wrote:
>> Mike Abrash (with a little bit of help from me, maybe 5%?) managed
>> to triple (!) the speed of John Carmack's original C code (which is
>> what you would have to run on that 68K).
>>
>> This code was extremely tightly coupled with the instruction
>> latencies and throughput of the Pentium, both integer and FPU,
>> among many other things it did a proper division for correct
>> perspective once every 16 pixels, and the latency of that 32-bit
>> FDIV (17 cycles afair) was very carefully overlapped with other
>> parts of the code.
>
> The level of optimization and waste of human man hours because of the
> poor x86 ISA is most impressive for Quake. The original C code must

"Waste"?

When you have a breakthrough game that more or less created an entire
industry, with millions and millions of users and a few orders of
magnitude more hours spent playing it, the fact that a few man-years were
spent writing and optimizing it really doesn't matter imho. :-)

> have been poor to see a 3x performance boost too. FDIVs are indeed a
> good place to look for optimizations, first to try FMUL times an
> immediate reciprocal (Frank Wille and I added this optimization to
> the 68k assembler vasm used by vbcc) and then to do parallel integer
> instructions while the FDIV is calculating. I optimized a SW
> rendering function in Quake II that moved a good portion of the
> integer code under a couple of FDIVs and was proud of myself until I
> benchmarked something like a .2 fps performance increase. Examples
> like this made me think the DCache size was a bottleneck for the
> large amount of data but maybe the x86 FPU was just better
> performance with hand laid assembler code.

The Pentium FPU is and was very hard to compile for, but a very nice
(even if somewhat mind-bending) puzzle to figure out for an x86 asm
hacker.

BGB

May 27, 2018, 12:44:55 PM
The 68k's ability to have 48+ bit instructions, and to have multiple memory
accesses in a single instruction, seems likely to be very problematic IMO.

On a previous topic seen earlier, IME, (Base+(Index+Disp)*Sc) and
similar addressing modes ended up very rarely used in my tests, and it
seems they are fairly infrequent.

Most commonly used/useful cases IME:
(Reg)
(Reg, Disp*Sc)
(SP, Disp*Sc): Very common
(Reg, Reg*Sc)
(PC, Disp)
And, much less commonly:
(Reg+) / @Reg+
(Reg-) / @-Reg
(Reg, (Reg+Disp)*Sc)
... others ...


FWIW: That part of the software renderer was one area which posed a
major problem for my BJX1 effort:
The use of floating point FDIV operations right in the middle of the
rasterization loop basically ruined attempts to use alternatives to a
fast hardware divide and still get plausible performance.

It would have been necessary to modify the renderer then in order to
eliminate any inner-loop floating-point ops, to allow a slower divider
to be usable (ex: one using iterative Newton-Raphson for the reciprocal).

But, even then, timing was still hard, and I (probably) would have just
ultimately reduced the clock to the FPU to 1/4 (ex: 25MHz) hopefully so
that I could get FMUL and similar to pass timing.


The alternative being to probably split it up into internal pipelines,
where I do a trick similar to the integer multiply of doing 16*16->32
multiplies and then adding the results together afterwards:
(AA, AB)*(BA, BB)
Clk 0: EX (sets up multiplier vars)
Clk 1: A=AA*BA; B=AB*BA; C=AA*BB; D=AB*BB
Also a few more values for signed multiply:
A[31]?(~B[31:16]):0 and similar.
Clk 2: Add intermediate results.
Clk 3: Store results back to MACH:MACL.

This being because a direct 32*32->64 multiply also seems to fail timing
even when done by itself.

This makes only providing a 16-bit unsigned multiply (MULU.W) in the ISA
tempting (since this can be done directly), except for the issue of
every integer multiply now needing to be a runtime call to fake the
common case of a 32*32->32 multiply (lame).
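The unsigned half of that scheme, as a C model of the steps (not the actual
hardware; the signed fix-up terms mentioned above are omitted):

#include <stdint.h>

uint64_t mul32x32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;   /* low/high halves */
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    /* Clk 1: four independent 16*16->32 multiplies */
    uint32_t ll = al * bl;
    uint32_t lh = al * bh;
    uint32_t hl = ah * bl;
    uint32_t hh = ah * bh;

    /* Clk 2: sum the shifted partial products */
    return (uint64_t)ll
         + ((uint64_t)lh << 16)
         + ((uint64_t)hl << 16)
         + ((uint64_t)hh << 32);
}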


SHAD/SHLD isn't nearly as steep, ex:
Clk 0: EX stage, setup for SHAD/SHLD
Clk 1: Do the SHAD/SHLD (a big "case()")

With the alternative being fixed shifts of 1/2/4*/8/16 bits (*: the 4
bit shift being an extended feature).
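The fixed-shift fallback, sketched in C: one conditional step per power of
two instead of a full single-cycle barrel shifter:

#include <stdint.h>

uint32_t shll_var(uint32_t x, unsigned n)   /* n in 0..31 */
{
    if (n & 16) x <<= 16;
    if (n & 8)  x <<= 8;
    if (n & 4)  x <<= 4;   /* the extended 4-bit step */
    if (n & 2)  x <<= 2;
    if (n & 1)  x <<= 1;
    return x;
}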


Funny enough, Quake didn't seem to mind so much when using a runtime
call for integer multiply, or the crappiness of doing shifts via a
computed branch into a series of SHLL or SHAR instructions... But, so
help you if that FDIV isn't fast...

It seems possible though, that if floating-point were eliminated from
the software renderer, then much of the rest of the engine could live on
with a much slower FPU (or possibly FPU emulation).


But, I suspect this particular effort (trying to make an FPGA CPU core
capable of running Quake) may not continue much more as-is (unless maybe
I get a much higher-end FPGA dev-board).


The narrower scope (BSR1) effort (namely focusing on microcontroller
tasks with a more "cleaned-up" ISA relative to SH) at least seems more
doable.

Have noted that I seem to be seeing code density of around 3-5kB per
kLOC of C (tending towards the lower end with plain integer code, and
towards the higher end with a lot of "long long" and similar thrown in).


Not all perfect though, as while doing a lot of stuff via 'DLR' seems to
be working out in-general, some instruction forms end up existing which
would not have needed to exist if the register could be addressed like a
GPR.

OTOH, cutting more registers off of the GPR space would leave fewer GPRs
available and would have required modifying the C ABI.

I guess it mostly is a matter for those who judge ISA complexity mostly
by counting the number of superficial instruction forms and similar,
while ignoring any special-case behaviors (It is like claiming the
MSP430 only has 27 instruction forms... Yeah, about that...).

BGB

May 27, 2018, 3:45:04 PM
Mentioned already, yeah, complex addressing modes seem to be fairly
infrequent vs simpler modes.

For example, if one takes away cases relative to SP, then usage of Rm+
and -Rn addressing nearly drops off the table (there aren't nearly
enough "*cs++" and similar operations in typical C code to keep them
worthwhile).

It seems nicer to have a few simpler modes which can adequately emulate
more complex modes as needed, for example, in my newer ISA:
(Reg)
(Reg, DLR)
(PC, DLR)
(DLR)
(DLR_i4) //(code density)
(PC, DLR_i4) //(code density)

Pretty much all of the addressing modes (from SH and BJX1) can be
emulated via compound sequences (typically a 32-bit instruction pair).
(Reg), 16-bit (1 op)
(Reg, Reg*Sc), 32-bit (2-op)
(Reg, Disp13s), 32-bit (2-op)
(PC, disp17s), 32-bit (2-op)
(Abs16), 32-bit (2-op)
(Reg+), 32-bit (2-op)
(-Reg), 32-bit (2-op)
(Reg, R0), 48-bit (3-op, *)
*: Need to fake non-scaled R0 (to emulate SH behavior).
In the compiler output, R0 cases are now fairly rare though.
(Reg, Reg, Disp13s), 48-bit (3-op)
...

Most of these, as can be noted, are done via the DLR register.

Had spec'ed some modes with GBR and TBR as base registers, but these
have since been demoted to compound sequences.

While functionally not that different from the way R0 was used in SH, it
does carry the advantage that its use is much more specific; so it
doesn't have to fight with also being used as a function return value
and as an implicit source/destination for various other operations
(forcing a lot of hackish fallback cases).

For immediate/displacement values, can now have the high-level compiler
logic mostly ignore value ranges (far fewer special and fallback cases
needed). Similarly, the produced sequences can potentially still be
decoded as larger variable-width instructions (without some of the
drawbacks of an "actual" variable-length instruction encoding for
simpler cores).



As for GPRs, 16 or 32 seem about optimal in my tests.

16 GPRs: works pretty well in general, but sometimes register pressure
is enough to start causing thrashing (particularly if working with
values which require GPR pairs).
32 GPRs: may be helpful in higher register pressure situations, but only
a minority of functions seem to benefit significantly.
64 GPRs: from what I can tell, there is "hardly ever" enough register
pressure to justify this.

OTOH: with 8 GPRs, thrashing is a serious problem.


Granted, 8 or 16 GPRs is better for a 16-bit instruction coding, as 32
GPRs would leave few bits left over for the opcode field.

So, I suspect 16 GPRs is probably optimal for an ISA with 16-bit
instructions, and 32 GPRs for an ISA with 24 or 32-bit instructions.


Some of my BJX1 variants had 32 GPRs.

BSR1 currently only does 16 GPRs. If I spec a version with
variable-width 16/32 instruction coding, it is likely it would also
expand back to 32 GPRs.

But, granted, this expanded version probably wouldn't be for a small
microcontroller use-case (larger microcontroller? or to compete with
things I am currently doing with a RasPi?...).

matt...@gmail.com

May 27, 2018, 3:51:16 PM
On Sunday, May 27, 2018 at 1:57:13 AM UTC-5, already...@yahoo.com wrote:
> On Saturday, May 26, 2018 at 9:21:56 PM UTC+3, matt...@gmail.com wrote:
> >
> > Developers loved and preferred the 68k ISA to ARM offerings
> >
>
> I'd guess, the statement above is correct, but incomplete.
>
> The complete statement would be: "A small minority of developers loved and preferred the 68k ISA to ARM, another small minority of developers loved and preferred the ARM ISA to the 68k, while the absolute majority of developers didn't care about ISA."

It would be difficult to do an unbiased poll. It didn't matter, as ARM evolved (perhaps too much, with all the modes and variations) and the 68k did not. ARM has good support while Motorola/Freescale anti-marketed the 68k. There was no choice for most developers, as the cut down '90s 68k designs could not meet their requirements and needs after a while. The ISA is less important today with most embedded code being compiled, but it is still important to be able to debug and look for compiler inefficiencies in compiler generated assembler code.

> > PPC never was well suited for embedded.
> > It is a pain to program in assembler,
>
> True, but doesn't matter.

The PPC only started to catch on for embedded when compilers became more common for embedded. I expect most embedded PPC CPUs today are high end only due to the difficulty of debugging and optimizing.

> > doesn't have good code density,
>
> Did you look at variant of PPC ISA implemented by e200 cores?
> I never bothered to measure, but on paper it looks like its code density (compiled code; asm is irrelevant) should be excellent - at least as good as ColdFire, but likely somewhat better.
> Of course, measurements are better than feelings.

PPC Book E VLE? I couldn't find much analysis or real world data on it. NXP claims a 30% overall code size reduction with a <10% execution path increase from normal PPC code. The 68020/CPU32 ISA code is generally 35%-50% smaller than PPC code and ColdFire code 0%-5% worse so it could be approaching the code density but likely falls short. Vince Weaver obtained an embedded board which supports the PPC Book E VLE so maybe someday he will add the results to his code density web site.

PPC did not encode the lower 2 bits of displacements in branches, making decompress on fetch (DF) compression challenging. IBM's CodePack for the PPC claimed a 60% code compression, but it was a decompress on cache fill (DCF) dictionary based compression between the L1 ICache and memory. The L1 ICache held uncompressed instructions, so it did not benefit from a reduced L1 size or improved instruction fetch bandwidth from L1. DCF is usually considered to be less efficient than DF, and CodePack often gave reduced performance. CodePack did allow the full PPC instruction set including all 32 GP registers though.

Most DF based RISC compression formats like ARM Thumb, PPC Book E VLE, RV32C, RV64C, MIPS16, microMIPS and SPARC16 reduce or restrict the number of accessible registers, which increases the number of instructions, load/stores and program size (16 GP registers is the sweet spot). There are more factors to look at than just code compression, obviously. This is why I suggested Vince Weaver add categories for the number of instructions, average instruction size, data size, number of branch instructions and number of memory access instructions.

> >is complex for RISC
>
>
> True, but doesn't matter.
>
> >and lacks embedded features like ARM has
>
> What features?
> Intuitively, I would think that wider immediate field in PPC load/store instructions is useful for embedded.
> IMHO, "ARM classic" is a reasonable embedded ISA, but not something to read home about. Quite comparable with "PPC classic". on the other hand, Thumb2 ISA is really quite good.

ARM has specialized ISA extensions for about everything embedded like security, DSP, SIMD, byte code support, etc. ARMv8 AArch64 is more standardized and indeed much like PPC but better. It is a little too complex and heavy for many embedded applications and has only modestly better code density than PPC. Thumb2 is like a lower end and lighter RISC version of the 68k and is in several ways a better ISA than SuperH and ColdFire which were based on the 68k.

matt...@gmail.com

May 27, 2018, 5:02:38 PM
On Sunday, May 27, 2018 at 6:03:14 AM UTC-5, David Brown wrote:
> The one thing that I see as a potential issue for bigger and faster 68K
> devices is the limited number of registers - 8 general purpose data
> registers and 7 address registers. As far as I have seen in the history
> of 68k, and processors in general, there has been a trend towards using
> more registers and fewer complex addressing modes. This would be more
> noticeable for a 64-bit design with more pipelining, superscaling, etc.

16 GP registers is optimal. The following paper predicted an overall performance increase of less than 2% for going above 12 GP registers on x86_64.

https://link.springer.com/content/pdf/10.1007/11688839_14.pdf

A paper called "High-Performance Extendable Instruction Set Computing" found that load/stores on a MIPS CPU increased about 14% when going from 16 down to 8 GP registers, but only about 2% when going from 27 down to 16 GP registers.

The sweet spot is 16 GP registers. It is important to use them efficiently though and the 68k can be improved. Some ideas.

1) Suppress the frame pointer by using the stack pointer (vbcc compiler with 68k target does this by default)
2) Merge executable sections or place in memory pools with proximity and use PC relative addressing where possible including in libraries
3) Open up address register sources so an intermediate register is not needed
4) MOVEQ trash register is no longer needed with a simple immediate (pseudo)addressing mode which auto compresses immediates
5) Fast bit field instructions use fewer registers

There are many more small ways to use 68k registers more efficiently. These would require ISA and ABI changes. The 68k can already access all 16 GP registers without a code size increase (unlike x86_64, which needs a prefix byte to access the upper 8 GP registers), and CISC is a register miser compared to RISC. A few more registers would be nice, but not worthwhile if it means introducing prefixes or tiered registers.

> PPC took a while to get established for embedded work like industrial
> and automotive applications. (For networking, it seemed to be
> successful - especially the 64-bit version. But I have not worked in
> that area myself.)
>
> The first PPC microcontrollers from Freescale were devices like the
> MPC555 and MPC565. These were seen as direct successors to the 68332 -
> that is how we used them. (I preferred the ColdFire MCF5234 as a
> replacement to the 68332, but it was not available until a little
> later). The MPC5xx suffered from poorer, more limited and more
> expensive development tools compared to the 68332, but its key problem
> for such systems was interrupt handling. It was very inefficient, and
> difficult to get right - as you say, it was not fun to program in assembly.
>
> But the modern PPC microcontroller cores (like the e200z6), good
> interrupt controllers, and newer tools make these far nicer to work
> with. I did a couple of PPC microcontroller projects a few years ago,
> and was mostly happy with them.

If you needed modern performance or features then you had no choice but to move away from the 68k and ColdFire to PPC or ARM. PPC is not a bad architecture and has some good ideas. However, it is unfriendly, boring and practically requires a good compiler and source level debugger. Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.

already...@yahoo.com

May 27, 2018, 5:40:27 PM
On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com wrote:

> Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.

Why sadly?
What could be wrong when developers like more pleasant asm coding experience better than less pleasant asm coding experience?

matt...@gmail.com

May 27, 2018, 5:46:24 PM
On Sunday, May 27, 2018 at 11:44:55 AM UTC-5, BGB wrote:
> The 68k's ability to have 48+ bit instructions and ability to have
> multiple memory accesses in a single instruction seemed likely very
> problematic IMO.

At least I don't think my 64 bit 68k ISA will increase the maximum instruction length. It sure is nice to have one instruction for immediates and for load+calc+store rather than a chain of dependent instructions. MOVE mem,mem is common and I believe can be done in 1 cycle in most cases. It is CISC, so multi-cycle instructions are tolerable. It all needs more resources, but so does a 64 bit CPU. I prefer less obfuscation and a more friendly ISA given the complexity of 64 bit.

> On a previous topic seen earlier, IME, (Base+(Index+Disp)*Sc) and
> similar addressing modes ended up very rarely used in my tests, and it
> seem are fairly infrequent.
>
> Most commonly used/useful cases IME:
> (Reg)
> (Reg, Disp*Sc)
> (SP, Disp*Sc): Very common
> (Reg, Reg*Sc)
> (PC, Disp)
> And, much less commonly:
> (Reg+) / @Reg+
> (Reg-) / @-Reg
> (Reg, (Reg+Disp)*Sc)
> ... others ...

You are probably talking about the frequency with which the instructions appear in the code, not the run time frequency. Post-increment and pre-decrement are commonly used in loops; primitive compilers have trouble generating them, and they have an additional cost in some CPU designs, which discourages their use. They are good for code density and can be free (no additional EA calc cost), so I would not remove them. Some of the addressing modes are not particularly common but simplify compiler support and round out the addressing modes. I don't like the idea of removing addressing modes based solely on frequency of use. Addressing modes are very powerful (a multiplier effect) with an orthogonal ISA.
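A copy loop shows the run-time weight post-increment carries; the asm in the
comment is a plausible 68k compilation, not output from any particular
compiler:

void copy_longs(long *dst, const long *src, unsigned n)
{
    while (n--)
        *dst++ = *src++;

    /* 68k inner loop, src in a0, dst in a1, n-1 in d0:
     * loop:  move.l (a0)+,(a1)+   ; both post-increments are free
     *        dbra   d0,loop       ; dbra counts in 16 bits
     */
}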
Regarding the 16*16 partial products: that is what the 68000 does. Lots of partial sum 16*16->32 multiplies.

> SHAD/SHLD isn't nearly as steep, ex:
> Clk 0: EX stage, setup for SHAD/SHLD
> Clk 1: Do the SHAD/SHLD (a big "case()")
>
> With the alternative being fixed shifts of 1/2/4*/8/16 bits (*: the 4
> bit shift being an extended feature).
>
>
> Funny enough, Quake didn't seem to mind so much when using a runtime
> call for integer multiply, or the crappiness of doing shifts via a
> computed branch into a series of SHLL or SHAR instructions... But, so
> help you if that FDIV isn't fast...
>
> It seems possible though, that if floating-point were eliminated from
> the software renderer, then much of the rest of the engine could live on
> with a much slower FPU (or possibly FPU emulation).

I heard a rumor about an integer only version of Quake for some console. There is always non-FP Doom too.

MitchAlsup

unread,
May 27, 2018, 7:46:27 PM5/27/18
to
I went the other direction: the key data addressing mode in the MY 66000
ISA is :: [Rbase+Rindex<<SC+Disp]

When Rbase == R0 then IP is used in lieu of any base register
When Rindex == R0 then there is no indexing (or scaling)
Disp comes in 3 flavors:: Disp16, Disp32, and Disp64

The assembler/linker is tasked with choosing the appropriate instruction form
from the following:

MEM Rd,[Rbase+Disp16]
MEM Rd,[Rbase+Rindex<<SC]
MEM Rd,[Rbase+Rindex<<SC+Disp32]
MEM Rd,[Rbase+Rindex<<SC+Disp64]

Earlier RISC machines typically only had the first 2 variants. My experience
with x86-64 convinced me that adding the last 2 variants was of low cost
to the HW and of value to the SW.

In a low end machine, the displacement will be coming out of the decoder
and this adds nothing to the AGEN latency or data path width. The 2 gates
of delay (3-input adder) are accommodated by the 2 gates of delay
associated with the scaling of the Rindex register, computing
(Rbase+Disp)+(Rindex<<SC) without adding any delay to AGEN.
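
In C terms, the AGEN rule above amounts to something like this (a
sketch; the function name and the register-file array are illustrative
only, not part of My 66000):

#include <stdint.h>

/* R0 as base selects IP-relative addressing; R0 as index means no
   indexing or scaling. The register file is faked as a plain array. */
uint64_t agen(const uint64_t R[32], uint64_t ip,
              unsigned rbase, unsigned rindex, unsigned sc, int64_t disp)
{
    uint64_t base  = (rbase  == 0) ? ip : R[rbase];
    uint64_t index = (rindex == 0) ? 0  : (R[rindex] << sc);
    return base + index + (uint64_t)disp;   /* the 3-input address add */
}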

Any high end machine these days will have 3-operand FMAC instructions. Those
few memory references that need 3 operands are easily serviced on those paths.

Having SW create immediates and displacements by executing instructions is
simply BAD FORM*. Immediates and displacements should never pass through the
data cache nor consume registers from the file(s), nor should they be found
in memory that may be subject to malicious intent.

(*) or lazy architecting--of which there is way too much.

The same issues were involved in adding 32-bit and 64-bit immediates to the
calculation parts of the ISA.

DIV R7,12345678901234,R19
is handled as succinctly as:
DIV R7,R19,12345678901234

Almost like somebody actually tried to encode it that way.

BGB

unread,
May 28, 2018, 12:01:03 AM5/28/18
to
This design was partly motivated by working within the limits of a
fixed-width 16-bit instruction coding, trying to save encoding space,
and not wanting to have 3 register read ports, ...

I was able to get everything done with 2 register read ports, and with
some other registers (PC, SP, DLR, ...) routed directly through the
execute unit (they form sort of a "loop" between the register-file and
execute unit).


> Any high end machine these days will have 3-operand FMAC instructions. Those
> few memory references that need 3 operands are easily serviced on those paths.
>

Yeah, what I am aiming for right now probably won't even have an FPU.

There is an FPU in the spec, but it is more intended for future
expansion.

There are some 3-register arithmetic ops, but internally these are also
done as 2-instruction sequences.



> Having SW create immediates and displacements by executing instructions is
> simply BAD FORM*. Immediates and displacements should never pass through the
> data cache nor consume registers from the file(s), nor should they be found
> in memory that may be subject to malicious intent.
>
> (*) or lazy architecting--of which there is way too much.
>

The BSR1 ISA does not load immediate or displacement values from memory;
rather, they are typically composed inline via a load-shift sequence
through a hard-wired special-purpose register.

One cost is that the whole Axxx/Bxxx space (or about 1/8 of the total
encoding space), was used to load a 13-bit value into the DLR register.

Likewise, 26xx will tack on an additional 8 bits (or,
"DLR=(DLR<<8)|Imm8"), and these can be chained as-needed.

So, after "A123 2645 2667" DLR will hold the value 0x1234567.

If I extend it: "A123 2645 2667 48B8", it magically becomes,
essentially, "MOV 0x12345678, R11".

Granted, this takes 4 cycles, and is internally done via a 4-instruction
sequence, but oh well.


This is quite different from SH, which mostly relied on PC-relative
memory loads.



> The same issues were involved in adding 32-bit and 64-bit immediates to the
> calculation parts of the ISA.
>
> DIV R7,12345678901234,R19
> is handled as succinctly as:
> DIV R7,R19,12345678901234
>
> Almost like somebody actually tried to encode it that way.
>

OK.

There is no DIV instruction, but it is possible to encode:
"XOR R3, 0x12345, R9"
As:
"A123 2645 5C93"

BGB

unread,
May 28, 2018, 12:16:02 AM5/28/18
to
On 5/27/2018 4:46 PM, matt...@gmail.com wrote:
> On Sunday, May 27, 2018 at 11:44:55 AM UTC-5, BGB wrote:
>> The 68k's ability to have 48+ bit instructions and ability to have
>> multiple memory accesses in a single instruction seemed likely very
>> problematic IMO.
>
> I don't think my 64 bit 68k ISA will increase the maximum instruction length at least. It sure is nice to have one instruction for immediates and load+calc+store rather than a chain of dependent instructions. MOVE mem,mem is common and I believe can be done in 1 cycle in most cases. It is CISC so multi-cycle instructions are tolerable. It all needs more resources but so does a 64 bit CPU. I prefer less obfuscation and a more friendly ISA with the complexity of 64 bit.
>

A 64-bit ISA need not be all that much more complicated than a 32-bit
ISA, and interestingly the width of the ALU and GPRs doesn't really seem
to affect overall cost all that much (at least if compared with the
number of GPRs or the number of register file ports; and excluding
shift/multiply, which scale sharply).

As-is, a move between two memory locations will take ~ 8 cycles with my
current design; memory accesses could possibly be made cheaper, but
doing so would add cost and complexity.



>> On a previous topic seen earlier, IME, (Base+(Index+Disp)*Sc) and
>> similar addressing modes ended up very rarely used in my tests, and it
>> seem are fairly infrequent.
>>
>> Most commonly used/useful cases IME:
>> (Reg)
>> (Reg, Disp*Sc)
>> (SP, Disp*Sc): Very common
>> (Reg, Reg*Sc)
>> (PC, Disp)
>> And, much less commonly:
>> (Reg+) / @Reg+
>> (Reg-) / @-Reg
>> (Reg, (Reg+Disp)*Sc)
>> ... others ...
>
> You are probably talking about the frequency the instructions appear in the code and not the run time frequency. Post-increment and pre-decrement are commonly used in loops. Primitive compilers have trouble generating them and they have an additional cost in some CPU designs which discourages their use. They are good for code density and can be free (no additional EA calc cost). I would not remove them. Some of the addressing modes are not particularly common but simplify compiler support and complete addressing mode support. I don't like the idea of removing addressing modes based solely on frequency of use. Addressing modes are very powerful (multiplier effect) with an orthogonal ISA.
>

Addressing modes add cost in terms of encoding space and the necessary
logic to support them.


I left out postinc/postdec modes from the BSR1 ISA partly because they
may end up needing to update multiple GPRs in a single operation. With
BJX1, this required using a state machine, but without this mode,
"MOV.x" does not need any state, and can simply assert a "hold" status
(blocking the pipeline) until the memory access either completes or
reports an error condition.

The cost now is that if the assembler or emitter sees one of these
operations, it has to fake it, ex:
MOV.L R3, (-R4)
would be emitted as:
ADD #-4, R4
MOV.L R3, (R4)

But, as noted, they were fairly infrequent in generated code, so this
doesn't really seem to have much impact on the overall code footprint.


I am tuning this ISA more for footprint than performance, wanting to get
as much as I can out of 32kB or so of ROM, within the limits of core and
decoder complexity.

There are still PUSH/POP operations, which still update SP, but this is
partly because SP (along with DLR and PC and similar) are fed directly
through the EX unit and as such can be updated more directly without
going through the register-file.

Unlike SH, there is a RET operation which exists as special-case
semantics for "POP PC" (which can save a few bytes over "POP LR; RTS")


However, given that there are no special-case instructions to help
implement integer division or strcmp efficiently, it is much less clear
how well BSR1 would compare when doing things like running Dhrystone
(which is disproportionately affected by things like strcmp and divider
speed).

As-is, I am doing division via a shift/compare/subtract loop (not
necessarily the highest-performance option here, but basically works and
doesn't have a huge footprint).
I did it also as a thing for B64V, and Quake somehow didn't suffer too
badly from this in my tests despite this being kinda silly.
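
For reference, the textbook shape of that shift/compare/subtract loop
in C (a generic restoring divider, not the actual B64V code; assumes
d != 0):

#include <stdint.h>

uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);   /* shift in next dividend bit */
        if (r >= d) {                    /* compare */
            r -= d;                      /* subtract */
            q |= 1u << i;
        }
    }
    if (rem)
        *rem = r;
    return q;
}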


>> SHAD/SHLD isn't nearly as steep, ex:
>> Clk 0: EX stage, setup for SHAD/SHLD
>> Clk 1: Do the SHAD/SHLD (a big "case()")
>>
>> With the alternative being fixed shifts of 1/2/4*/8/16 bits (*: the 4
>> bit shift being an extended feature).
>>
>>
>> Funny enough, Quake didn't seem to mind so much when using a runtime
>> call for integer multiply, or the crappiness of doing shifts via a
>> computed branch into a series of SHLL or SHAR instructions... But, so
>> help you if that FDIV isn't fast...
>>
>> It seems possible though, that if floating-point were eliminated from
>> the software renderer, then much of the rest of the engine could live on
>> with a much slower FPU (or possibly FPU emulation).
>
> I heard a rumor about an integer only version of Quake for some console. There is always non-FP Doom too.
>

I had considered Doom as well.

It could make sense with BJX1, or if I later do a BSR1 core which uses
external DRAM.


I was wanting to target a cheaper FPGA, such as an XC6SLX9 or similar
for this. This gives me ~ 70 kB of Block-RAM to work with (with things
like external DRAM depending on which exact board I get).


The Arty S7 board I have been using thus far (with an XC7S50) has around
200kB, and a 256MB DRAM chip, but is somewhat more expensive (at
present, getting another one would cost ~ $140).

This would be overkill for a lathe, so I was looking to target something
a little cheaper for this (like a Mimas or other similar class board).

Ivan Godard

unread,
May 28, 2018, 12:53:38 AM5/28/18
to
Mill: predicated, base + optional index + displacement. Predicated is
true/false/always; true/false take a value from the belt, always does
not. Index takes a belt value, optionally scaled by the width of the
operation (b/h/w/d/q/vector widths). Displacement is 0/1/2/4 bytes,
optionally ones complemented. Base is a value from the belt or one of
specRegs DP/FP/INP/OUTP/TLP.

Most compact encoding of a load is (Silver, belt 16):
*p:
opcode 4 bits (typical, depends on other ops in slot)
base 4 bits
no index 2 bits (implied by the belt count)
no disp 2 bits (implied by the byte count)
no comp 1 bit
width 3 bits
delay 4 bits
= 20 bits
largest is:
b?A[i].f : NaR
opcode 4 bits (typical, depends on other ops in slot)
predicate 4 bits
base 4 bits
yes index 2 bits (implied by the belt count)
index 4 bits
yes disp 2 bits (implied by the byte count)
displacement 8/16/32 bits
no comp 1 bit
width 3 bits
delay 4 bits
= 60 bits max
The three-input address adder is signed. While the displacement is 32
bits max, it (possibly ones complemented) is treated as a 64 bit value.
There is no absolute address mode, and no data references (load/store)
based on the PC.

There are skinny and svelte encodings of popular ops that reduce both the
instruction level and operation level entropy. If the instruction has
nothing but "belt <- *p" the whole instruction is ~11 bits including the
belt reference and delay.

Terje Mathisen

unread,
May 28, 2018, 2:08:11 AM5/28/18
to
<BG>

I have been waiting to see this view offered, I do agree that x86 asm
was very pleasant indeed.

The architecture is restrictive enough that you can often come up with
algorithms which you know are more or less optimal, simply because of
all the constraints.

Sort of like poets claiming that having very strict rules about how to
construct verses makes it possible to write better instead of worse poetry?

already...@yahoo.com

unread,
May 28, 2018, 5:03:19 AM5/28/18
to
On Monday, May 28, 2018 at 9:08:11 AM UTC+3, Terje Mathisen wrote:
> already...@yahoo.com wrote:
> > On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com
> > wrote:
> >
> >> Sadly, I expect more developers like x86/x86_64 assembler than PPC
> >> assembler.
> >
> > Why sadly? What could be wrong when developers like more pleasant asm
> > coding experience better than less pleasant asm coding experience?
> >
> <BG>
>
> I have been waiting to see this view offered, I do agree that x86 asm
> was very pleasant indeed.

For the record, it's not my view.
Personally, I find programming in original 16-bit x86 asm unpleasant.
x386 is a completely different story; I like it.
But even under x386, I find programming the x87 FPU part, how to say it... challenging in an unsatisfactory way. I feel that there the creativity of the programmer is spent in a non-productive way.
And the micro-architecture that I like least happens to be the one you like most - P5.

>
> The architecture is restrictive enough that you can often come up with
> algorithms which you know are more or less optimal, simply because of
> all the contraints.
>
> Sort of like poets claiming that having very strict rules about how to
> construct verses makes it possible to write better instead of worse poetry?
>

I think chess is a better analogy for x86. You have 6 different sorts of pieces.
Except that in 16-bit x86 not 6, but all 8 registers are different in some ways.

And yes, I like chess very much. But at the same time I am glad that it's not my profession.
As to the poetry, I can't say that, for me, sonnets are inherently superior to blank verse.

John Levine

unread,
May 28, 2018, 9:48:33 AM5/28/18
to
In article <4274b57d-9bbf-4035...@googlegroups.com>,
Well, it is unless you are in an environment where you care about code
size. Does anyone understand how all of the prefix byte and irregular
instruction encodings interact? (Leaving the manual open on your
screen all the time doesn't count.)

--
Regards,
John Levine, jo...@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Terje Mathisen

unread,
May 28, 2018, 11:22:11 AM5/28/18
to
John Levine wrote:
> In article <4274b57d-9bbf-4035...@googlegroups.com>,
> <already...@yahoo.com> wrote:
>> On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com wrote:
>>
>>> Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.
>>
>> Why sadly?
>> What could be wrong when developers like more pleasant asm coding experience better than less pleasant asm coding experience?
>
> Well, it is unless you are in an environment where you care about code
> size. Does anyone understand how all of the prefix byte and irregular
> instruction encodings interact? (Leaving the manual open on your
> screen all the time doesn't count.)
>
Even if I don't remember every detail now, I more or less did so in the
years when I wrote most of my asm code.

Proof by example: I wrote the following code during an Easter ski
vacation, with no computer, just a list of the ~70 asm instructions that
only use MIME/Base64 ascii values.

This is a program encoded as a text file.

To use the program, remove all extra lines before and after
the text executable and save the remainder to disk as DU.COM.
The result is a normal executable .COM file!

*********** Remove this line and all lines above it! ***********
ZRYPQIQDYLRQRQRRAQX,2,NPPa,R0Gc,.0Gd,PPu.F2,QX=0+r+E=0=tG0-Ju E=
EE(-(-GNEEEEEEEEEEEEEEEF 5BBEEYQEEEE=DU.COM=======(c)TMathisen95
&&P4Vw4+Vww7AjAyoAwAAzJzAEAJP7JzBaANQ+wAIxJzAeAJX9JzIxANJ0JzKzAK
WwwAP4JzPwAJi8JzXxANA3wAa4JzavAKA9wAX8JzXyALy8au1+Wwe7CA0/G7LBW7
Www7v7h8CAi8A9A6x8CAP7G7AK0VKASzAza6vAAaLAh90TI9H9LAx90QH8I8Vww7
Agavy9AmAj6+B7V7gA6PAzG72+H9f7AtIxAvCzAkCzAgAkQ/AiYxXzYxQ/U+DaT+
Q/AkCZAMAUCZQ/AtBaAKAWAXANCZAQQ/YxPwPwSzXzQ/GwAXAQBaAMAHQ/AKARQ/
AkCzAgAkXzFwArAtQ/DaARAQAICZAMAKCZAMA4A5Q+=6C8E/L6BAOjyATwND4SzA


The first two lines are the primary bootstrap; it employs the absolute
minimum possible amount of self-modifying code (a single two-byte
backwards branch instruction) but can still survive quite a lot of text
reformatting, i.e. replacing line ends (currently CRLF) with any zero,
one or two-byte combination. It constructs the secondary bootstrap by
picking up pairs of Base64 chars until the '=' terminator, ignoring all
other characters, and combines them into arbitrary byte values.

The algorithm I found to do the latter does "0 xor a xor b - b - b"
since this was the first combination I tested which used only MIME
opcodes and still allowed all byte values to be reached.
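
That claim is easy to re-verify by brute force; a throwaway C search
over the Base64 alphabet (not the tool I used back then) would look
like:

#include <stdio.h>

int main(void)
{
    const char *b64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    int missing = 0;
    /* for every target byte, look for a pair (a,b) of Base64 chars
       with ((0 ^ a ^ b) - b - b) & 0xFF equal to the target */
    for (int t = 0; t < 256; t++) {
        int found = 0;
        for (const char *a = b64; *a && !found; a++)
            for (const char *b = b64; *b && !found; b++)
                if ((((*a ^ *b) - 2 * *b) & 0xFF) == t)
                    found = 1;
        missing += !found;
    }
    if (missing)
        printf("%d byte values unreachable\n", missing);
    else
        printf("all 256 byte values reachable\n");
    return 0;
}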

Paul A. Clayton

unread,
May 28, 2018, 11:47:42 AM5/28/18
to
On Monday, May 28, 2018 at 2:08:11 AM UTC-4, Terje Mathisen wrote:
[snip]
> <BG>
>
> I have been waiting to see this view offered, I do agree that x86 asm
> was very pleasant indeed.

I am not that familiar with x86, but it seems to me
that the number of instructions is relatively high,
that the operations are not especially orthogonal
(perhaps particularly with condition code results?;
not being able to avoid setting the condition code
seems a relatively useless constraint other than for
code density), and the encoding (which can matter
for density or alignment optimizations) is somewhat
complex. The complexity might make the discovery of
a clever use more satisfying and perhaps the
complexity adds constraints that assist creation
(perhaps similar to a jigsaw puzzle with more
piece-shapes, where fit as well as image matching
constrain placement?)

Your previous posts on the subject imply that the
modest register count with semi-dedicated purposes
facilitated mental tracking of availability and
allocation.

Humans seem to need some complexity to find
intellectual enjoyment, but I suspect that much of
the bookkeeping aspects of a register rich ISA
could be handled with software assistance or by
iterative refinement.

Of course, the complexity of the microarchitecture
matters.

> The architecture is restrictive enough that you can often come up with
> algorithms which you know are more or less optimal, simply because of
> all the contraints.
>
> Sort of like poets claiming that having very strict rules about how to
> construct verses makes it possible to write better instead of worse poetry?

For poetry, constraints of meter, rhyme, etc. tend to force further thinking (related to your "know are more or less
optimal", knowing that something is not quite right, and
possibly related to complexity forcing actual thought) and
introduce "random" perturbations to move thinking off
regular courses. In my limited verse writing, sometimes
a rhyme (or even meter) requirement has introduced a
metaphor/association or word choice that would not come
otherwise.

BGB

unread,
May 28, 2018, 3:02:44 PM5/28/18
to
On 5/28/2018 10:47 AM, Paul A. Clayton wrote:
> On Monday, May 28, 2018 at 2:08:11 AM UTC-4, Terje Mathisen wrote:
> [snip]
>> <BG>
>>
>> I have been waiting to see this view offered, I do agree that x86 asm
>> was very pleasant indeed.
>
> I am not that familiar with x86, but it seems to me
> that the number of instructions is relatively high,
> that the operations are not especially orthogonal

As for x86 instruction counts being high:
Yep, pretty much. The x86 ISA has far more instruction forms than a
typical RISC or similar. Likewise, many opcodes are encoded through
layers of re-purposed prefixes.

The original ISA (8086):
Single-byte opcodes
Many with a Mod/RM byte
Many with a displacement
Many with an immediate
Some prefixes, like REP/REPNE, segment overrides, ...

The 286: Some bytes started being used to encode longer opcodes.
For example, "0F XX Mod/RM ..." vs just "XX Mod/RM", ...

The 386: Added a 32-bit mode, new Mod/RM scheme, and more prefix bytes.
Prefixes: address and data size overrides, FS/GS overrides, ...

By around the time MMX and SSE were being added, lacking much else to
do, they started tacking on the various prefix bytes to other operations
where they were previously not defined, which would give them new
meanings. This causes many SSE operations to effectively have a soup of
prefix bytes as part of their opcode field.

For x86-64, some less frequently used single-byte (INC/DEC) instruction
forms were dropped (forcing their two-byte encodings to be used), and
then were reused as the REX prefixes (needed for QWORD operations and to
access R8-R15).

By AVX, they took another instruction and redefined it so that certain
invalid encodings would be interpreted as a VEX prefix, with its Mod/RM
bits and similar encoding the equivalent of the chain of prefix bytes,
including the REX bits, and potentially an additional register argument, ...

How many instruction-forms exist? Several thousand last I checked...


So, now it is sort of a hairy mess on this front, and implementing a CPU
for x86 would probably be fairly non-trivial. If I were to do something
with x86 support, would probably just implement a RISC style core and
use an emulator to run any x86 code. Likely the emulator would decode
the ISA in software and then JIT compile it into the native ISA (with
maybe several MB or so for translated instruction traces).


However, from the perspective of someone writing ASM code, it isn't
nearly so bad. The assembler can deal with most of the encoding details,
and most instructions present a fairly consistent interface.

Similarly, most decoding for most operations is basically the same once
you get to the Mod/RM byte.


The situation is much less friendly on a typical RISC, where one might
battle with which instruction forms exist for which combinations of
parameters.

There might also be other issues, like needing to deal with delay slots
and other funkiness. For example, a memory load might not take effect
for several instructions, or the effects of executing a branch
instruction might not take effect until one or two instructions later
(causing instructions after the branch to be executed), ...

Similarly, many have a habit of requiring loading constants from memory,
which may need to be placed within a certain distance of the code being
executed, ...

Some of this is a lot harder to gloss over with an assembler, so writing
ASM code is a bit more painful if compared with x86.

However, for the CPU it is easier, given the instruction format itself
is typically fixed-width and fairly regular.

Some partial exceptions exist, like Thumb, where the layout of the bits
within the various instruction forms is a bit chaotic. Most other RISC's
are a bit more regular here.


> (perhaps particularly with condition code results?;
> not being able to avoid setting the condition code
> seems a relatively useless constraint other than for
> code density), and the encoding (which can matter
> for density or alignment optimizations) is somewhat
> complex.

IMO: x86 style condition codes are needlessly inconvenient in some areas.

One alternative that is nicer IMO is simply having a True/False status
code, with only a small subset of instructions affecting it. But in this
case, now one needs comparison operators that perform a specific
comparison, and there are fewer possible conditions to branch on.


> The complexity might make the discovery of
> a clever use more satisfying and perhaps the
> complexity adds constraints that assist creation
> (perhaps similar to a jigsaw puzzle with more
> piece-shapes, where fit as well as image matching
> constrain placement?)
>

This puzzle aspect is probably more true of dealing with a lot of the
small RISC ISAs than when dealing with x86 IMO.

Bigger RISC ISA's (with 32-bit instruction words) are typically a little
more regular here, but with 16-bit instruction words there is often a
need to fight with instruction coding a bit more to make everything fit
nicely.

There are tradeoffs here (code density, performance, complexity, ...).


> Your previous posts on the subject imply that the
> modest register count with semi-dedicated purposes
> facilitated mental tracking of availability and
> allocation.
>
> Humans seem to need some complexity to find
> intellectual enjoyment, but I suspect that much of
> the bookkeeping aspects of a register rich ISA
> could be handled with software assistance or by
> iterative refinement.
>

Having a lot of registers is not really an issue for writing ASM; if you
don't need them, you don't use them.

It isn't exactly hard to write comments to say which variable is in
which register, more so if there are enough registers that the same
variable can be kept in the same register the whole lifetime of the
function.

Likewise, most non-x86 archs simply name them by number, and use names
only for registers with special defined meanings. Even on x86-64, it may
make sense in some cases (such as compilers or JITs) to mostly abandon
the use of symbolic names in favor of identifying them as R0-R15.
Ex: R0=RAX, R1=RCX, R2=RDX, R3=RBX, ...


The tradeoff in an ISA would mostly relate to how one trades off the
usage of bits, vs how much of the time will be spent loading/storing
memory values, vs other issues.

But, as can be noted, 16 or 32 seem to be roughly optimal in most
cases.

MitchAlsup

unread,
May 28, 2018, 4:44:37 PM5/28/18
to
When I left AMD in 2006 there were at least 1500 instructions, many with
the same spelling for the opcode. And this is one thing I do with my RISC
ISA spellings:

ADD R7,R8,immed16
and
ADD R7,R8,R9

instead of
ADDI R7,R8,immed16
and
ADD R7,R8,R9

One can say this adds some uncertainty, but it comes in handy when
# define immed16 R9
>
>
> So, now it is sort of a hairy mess on this front, and implementing a CPU
> for x86 would probably be fairly non-trivial. If I were to do something
> with x86 support, would probably just implement a RISC style core and
> use an emulator to run any x86 code. Likely the emulator would decode
> the ISA in software and then JIT compile it into the native ISA (with
> maybe several MB or so for translated instruction traces).
>
>
> However, from the perspective of someone writing ASM code, it isn't
> nearly so bad. The assembler can deal with most of the encoding details,
> and most instructions present a fairly consistent interface.
>
> Similarly, most decoding for most operations is basically the same once
> you get to the Mod/RM byte.

SW decoding of x86 is actually pretty easy--you do it with 256 entry tables
(6 of them when I left AMD, probably 8 or 9 tables now) and each table entry
contains a termination bit, an opcode, and a carrier. In my AMD decoder
the termination bit was any positive table entry. If the table entry was zero
this was an Undefined inst, and if the table entry was negative, the negative
table entry was an index into an array of 256 entry tables. So the decoder was something like::

loop:
LDB R7,[Rpc]
LDW R8,[Rtable+R7<<2]
BZE R8,UNDEFINED
BGEZ R8,done
LDD Rtable,[R8+TABLEARRAY]
BA loop
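
Or, the same walk sketched in C (the table names here are invented; the
real tables were generated elsewhere):

#include <stdint.h>

/* Entry encoding as described above: positive = terminal entry
   (opcode + carrier), zero = undefined, negative = index of the next
   256-entry table. */
typedef struct { int32_t entry[256]; } DecodeTable;
extern const DecodeTable decode_tables[];   /* assumed built elsewhere */

int32_t decode(const uint8_t **pc)
{
    const DecodeTable *t = &decode_tables[0];
    for (;;) {
        int32_t e = t->entry[*(*pc)++];   /* consume one opcode byte */
        if (e > 0)
            return e;                     /* terminal: opcode + carrier */
        if (e == 0)
            return 0;                     /* undefined instruction */
        t = &decode_tables[-e];           /* negative: next table */
    }
}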


>
> The situation is much less friendly on a typical RISC, where one might
> battle with which instruction forms exist for which combinations of
> parameters.
>
> There might also be other issues, like needing to deal with delay slots
> and other funkiness. For example, a memory load might not take effect
> for several instructions, or the effects of executing a branch
> instruction might not take effect until one or two instructions later
> (causing instructions after the branch to be executed), ...

These should all have been eradicated by now.
>
> Similarly, many have a habit of requiring loading constants from memory,
> which may need to be placed within a certain distance of the code being
> executed, ...

These should have been, but seem not to have been, eradicated by now.
>
> Some of this is a lot harder to gloss over with an assembler, so writing
> ASM code is a bit more painful if compared with x86.
>
> However, for the CPU it is easier, given the instruction format itself
> is typically fixed-width and fairly regular.
>
> Some partial exceptions exist, like Thumb, where the layout of the bits
> within the various instruction forms is a bit chaotic. Most other RISC's
> are a bit more regular here.
>
>
> > (perhaps particularly with condition code results?;
> > not being able to avoid setting the condition code
> > seems a relatively useless constraint other than for
> > code density), and the encoding (which can matter
> > for density or alignment optimizations) is somewhat
> > complex.
>
> IMO: x86 style condition codes are needlessly inconvenient in some areas.
>
> One alternative that is nicer IMO is simply having a True/False status
> code, with only a small subset of instructions effecting it. But in this
> case, now one needs comparison operators that perform a specific
> comparison, and there are fewer possible conditions to branch on.

T/F is one way of addressing this; another is a bit vector of all possible
comparisons--a la M88K.

In my latest ISA, the compare instruction can compare integer or FP operands
and also sample whether memory interference has occurred (SW multi-instruction
ATOMIC stuff). One can also include 0<x<y (FORTRAN) or 0<=x<=y (C) for fast
boundary checks. All in all my CMP instruction delivers 20 individual bits.
I have also extended the Branch-on-comparison to have integer and FP forms
(things like isNAN(), isMINUSzero(),...)

BGB

unread,
May 28, 2018, 9:34:29 PM5/28/18
to
OK, IIRC I was going off a listing which also had a lot of AVX stuff and
similar.


> And this is one thing I do with my RISC
> ISA spellings:
>
> ADD R7,R8,immed16
> and
> ADD R7,R8,R9
>
> instead of
> ADDI R7,R8,immed16
> and
> ADD R7,R8,R9
>
> One can say this adds some uncertainty, but it comes in handy when
> # define immed16 R9


My BSR1 ISA is also doing something similar, as it is possible to tell
in most cases what the intended behavior is based on the parameters.

For example (vs the SH family):
BRA, BRA/N, BRAF, JMP -> BRA
BSR, BSR/N, BSRF, JSR -> BSR
MOV, LDC, STC, ... -> MOV
...

So, there are currently ~ 70 mnemonics, vs ~219 on the SH/BJX1
side of things (or ~150 if I exclude the BJX1 ops).


Quickly counting up from the listings from my ISAs:

BJX1-32 has ~ 400 I-forms in the 16-bit Base-ISA (mostly overlaps with
SH4), and ~ 180 in the 8Exx block.

BJX1-64C has ~ 336 I-forms in the 16-bit Base-ISA (superset of B64V),
and ~ 338 in the 8Exx/CExx block.

For B32V, has ~ 270 I-forms, more-or-less overlaps with the normal SH4
ISA, 16-bit only subset of BJX1-32, omits the FPU and various misc
I-forms from SH (such as MAC.W and MAC.L).

For B64V, has ~ 159 I-forms (B64V was a simplified and reorganized
version of the SH ISA, modified to be 64-bit).



For the BSR1 ISA, it currently has ~ 212 I-forms.

If I omit redundant I-forms which exist mostly for code-density reasons,
it drops to 147 I-forms. Could go lower, but this would start to come at
the cost of ISA features.


>>
>>
>> So, now it is sort of a hairy mess on this front, and implementing a CPU
>> for x86 would probably be fairly non-trivial. If I were to do something
>> with x86 support, would probably just implement a RISC style core and
>> use an emulator to run any x86 code. Likely the emulator would decode
>> the ISA in software and then JIT compile it into the native ISA (with
>> maybe several MB or so for translated instruction traces).
>>
>>
>> However, from the perspective of someone writing ASM code, it isn't
>> nearly so bad. The assembler can deal with most of the encoding details,
>> and most instructions present a fairly consistent interface.
>>
>> Similarly, most decoding for most operations is basically the same once
>> you get to the Mod/RM byte.
>
> SW decoding of x86 is actually pretty easy--you do it with 256 entry tables
> (6 of them when I left AMD, probably 8 or 9 tables now) and each table entry
> contains a termination bit, an opcode, and a carrier. In my AMD decoder
> the termination bit was any positive table entry. If the table entry was zero
> this was an Undefined inst, and if the table entry was negative, the negative
> table entry was an index into an array of 256 entry tables. So the decoder was something like::
>
> loop:
> LDB R7,[Rpc]
> LDW R8,[Rtable+R7<<2]
> BZE R8,UNDEFINED
> BGEZ R8,done
> LDD Rtable,[R8+TABLEARRAY]
> BA loop
>

In an x86 emulator I once wrote (for a 486-like subset), I used a lookup
table for the first byte into a table of pattern strings (adapted from
my assembler/disassembler). While not necessarily the "best" option, it
basically worked.

The thing wasn't particularly fast in retrospect, but it was the first
to use an interpretation strategy I had used in most of my later VMs:
Decoding sequences of instructions into a "trace" consisting of a series
of "opcode" structures with function pointers to the functions
implementing the opcode behavior.

Typically, executing a trace involves calling into a function pointer,
which will hold an unrolled loop for calling the other function pointers.

This also allows a relatively simple JIT strategy:
Walk the opcode list in the trace, either emitting behavioral logic for
the opcode (if it is recognized), or spit out a call into the associated
function pointer (essentially resulting in call-threaded code).

A pointer to the resulting JIT'ed function would be stored back into
the trace structure, and the emulator's main trampoline loop doesn't
need to care if it is dealing with interpreted or JIT compiled traces.
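
A stripped-down sketch of the scheme in C (all of the names here are
made up for illustration, not from my actual VMs):

typedef struct Opcode Opcode;
typedef struct Trace  Trace;

struct Opcode {
    void (*exec)(Opcode *op);   /* behavior for one decoded instruction */
    int operand;                /* decoded fields would live here */
};

struct Trace {
    void (*run)(Trace *t);      /* run_interp, or a JIT'ed function later */
    Opcode ops[16];
    int n_ops;
};

/* Interpreter fallback: call each opcode's function pointer in turn. */
static void run_interp(Trace *t)
{
    for (int i = 0; i < t->n_ops; i++)
        t->ops[i].exec(&t->ops[i]);
}

void trace_init(Trace *t) { t->run = run_interp; t->n_ops = 0; }

/* The trampoline cannot tell (and does not care) whether t->run points
   at run_interp or at JIT-compiled code stored back into the trace. */
void run_trace(Trace *t) { t->run(t); }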


>
>>
>> The situation is much less friendly on a typical RISC, where one might
>> battle with which instruction forms exist for which combinations of
>> parameters.
>>
>> There might also be other issues, like needing to deal with delay slots
>> and other funkiness. For example, a memory load might not take effect
>> for several instructions, or the effects of executing a branch
>> instruction might not take effect until one or two instructions later
>> (causing instructions after the branch to be executed), ...
>
> These should all have been eradicated by now.

I don't think many new ISA designs use delay slots.

Being based on SH, BJX1 had inherited the use of branch delay slots.

In the design of BSR1, I have dropped the use of delay-slots.


>>
>> Similarly, many have a habit of requiring loading constants from memory,
>> which may need to be placed within a certain distance of the code being
>> executed, ...
>
> These should have been, but seem not to have been, eradicated by now.

Likewise.

BSR1 also drops these in favor of using a load-shift mechanism.

There are still PC-rel addressing modes, but these are more for
accessing things like global variables and similar, sort of like
RIP-relative addressing in x86-64.

BSR1 stays with the use of a T/F flag (in the Status Register).


Basic compare ops:
CMPEQ, CMPGT, CMPHI
With branch ops:
BT, BF

So:
(a==b): CMPEQ a, b; BT lbl
(a!=b): CMPEQ a, b; BF lbl
(a> b): CMPGT b, a; BT lbl
(a< b): CMPGT a, b; BT lbl
(a>=b): CMPGT a, b; BF lbl
(a<=b): CMPGT b, a; BF lbl

CMPHI handles the unsigned (a>b) case, and is basically used in the same
way as CMPGT.

There are also CMPGE and CMPHS ops when dealing with DLR:
CMPEQ DLR, Rn
CMPGT DLR, Rn
CMPGE DLR, Rn
...

Mostly because flipping A and B is less of an option when dealing with
an immediate (and this is basically what the reused compiler logic had
expected).

There are MOVT and MOVNT instructions to copy the T/F status bit into a GPR.

thereis...@gmail.com

unread,
May 28, 2018, 10:55:54 PM5/28/18
to
On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > Similarly, many have a habit of requiring loading constants from memory,
> > which may need to be placed within a certain distance of the code being
> > executed, ...
>
> These should have been, but seem not to have been, eradicated by now.

Constants require either an absurdly long instruction, or variable-length instructions.

I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."

Sadly, I can't speak of the validity of the claim, but I take it you disagree?

> T/F is one way of addressing this; another is a bit vector of all possible
> comparisons--a la M88K.

I thought everyone hated condition codes these days? :P

BGB

unread,
May 29, 2018, 1:27:32 AM5/29/18
to
On 5/28/2018 9:55 PM, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
>> On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
>>> Similarly, many have a habit of requiring loading constants from memory,
>>> which may need to be placed within a certain distance of the code being
>>> executed, ...
>>
>> These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>

Or, a series of short instructions each extending the value a little bit
at a time.

The logic for the operation is fairly simple:
BSR_UCMD_ALU_LDISH: begin
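// DLR = (DLR << 8) | Imm8: shift the next immediate byte into DLR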
tCtlOutDlr = { ctlInDlr[23:0], immValRi[7:0] };
end


The load/shift mechanism can express an arbitrary sized constant via a
series of fixed-width instructions.

Even as such, given these instructions will take 1 cycle each, and I am
looking at 3 or 4 cycles for a typical memory access, on average I
expect the load/shift mechanism to work out faster than it would be to
fetch values from memory.


While these operations are limited to loading into a special register
(DLR), on the final operation the contents of DLR are "consumed" and
potentially transferred to another register (or used used to compute a
memory address or similar).


In my ISA:
MOV #0x1234, R9
MOV #0x123456, R10
MOV #0x12345678, R11
Can become instruction sequences:
A123 4894
A123 2645 48A6
A123 2645 2667 48B8

From the user and compiler perspective, it looks like a variable length
coding, but as far as the CPU is concerned, it is dealing with a series
of fixed-width instructions.

Ex:
26jj LDISH8 #imm8u //DLR=(DLR<<8)|Imm8u;
Ajjj LDIZ #imm12u //DLR=Imm12u;
Bjjj LDIN #imm12u //DLR=(~4095)|Imm12u;
48nj MOV DLR_i4, Rn //Rn=(DLR<<4)|Imm4u;
( ... in the current Verilog ... this is actually a LEA ... )
49nj ADD DLR_i4, Rn //Rn=Rn+((DLR<<4)|Imm4u);
( ... and so is this one ... )
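
As a sanity check, the listing above can be modeled in a few lines of C
(a toy model of the DLR semantics only, not the Verilog):

#include <stdint.h>
#include <stdio.h>

/* semantics copied from the listing above; everything else is made up */
static uint32_t dlr;

void LDIZ(uint32_t imm12)  { dlr = imm12; }              /* Ajjj */
void LDIN(uint32_t imm12)  { dlr = ~4095u | imm12; }     /* Bjjj */
void LDISH8(uint32_t imm8) { dlr = (dlr << 8) | imm8; }  /* 26jj */
uint32_t MOV_DLR_i4(uint32_t imm4)                       /* 48nj */
    { return (dlr << 4) | imm4; }

int main(void)
{
    /* "A123 2645 2667 48B8" => MOV #0x12345678, R11 */
    LDIZ(0x123);
    LDISH8(0x45);
    LDISH8(0x67);
    printf("R11 = 0x%08X\n", MOV_DLR_i4(0x8));   /* 0x12345678 */
    return 0;
}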


> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."
>

This is partly why I went back to fixed-width 16-bit instructions for my
newer ISA.

I did variable-width before, but it was kind of a pain to work with and
made latency issues harder. This combined with wanting to target a
smaller FPGA.

Also 16-bit because I wanted decent code density.


A bigger/fancier version would probably go back to variable-width, but
probably also add 64-bit support and go to 32 GPRs at roughly the same
time (hopefully limiting fragmentation this time).

already...@yahoo.com

unread,
May 29, 2018, 3:41:39 AM5/29/18
to
On Tuesday, May 29, 2018 at 5:55:54 AM UTC+3, thereis...@gmail.com wrote:
>
> I thought everyone hated condition codes these days? :P

Same as above (see "Developers loved and preferred the 68k ISA to ARM offerings"). A small minority hates condition codes, another small minority loves condition codes. An overwhelming majority does not care.

In the real world, outside of hate/love relationships, several ISAs with condition codes prosper. ISAs without condition codes are either dead or dying or not quite alive yet. But the reasons for that appear to have no relationship to presence/absence of condition codes.

The only area where ISAs without condition codes are dominant is commercial soft cores, but IMHO even that has nothing to do with presence/absence of condition codes and everything to do with poor understanding of this particular segment of the market by ARM Inc.

Megol

unread,
May 29, 2018, 8:00:46 AM5/29/18
to
On Tuesday, May 29, 2018 at 4:55:54 AM UTC+2, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> > On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > > Similarly, many have a habit of requiring loading constants from memory,
> > > which may need to be placed within a certain distance of the code being
> > > executed, ...
> >
> > These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>
> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."

For the most simplified processors perhaps; otherwise IMO it's just a matter of
design. RISC-V is a hardline RISC updated with some wisdom from the "recent"
past; variable-length instructions aren't a good fit for the design goal.

But they still have the compressed instruction extension which in fact gives
a simple variable length instruction set (16 and 32 bit instructions) and
they seem to expect high performance implementations to do instruction
fusion. Both of those complicate decode without providing all advantages of
a proper variable instruction length design.

Simple processors can have variable length without many problems but it
requires proper design.

> Sadly, I can't speak of the validity of the claim, but I take it you disagree?
>
> > T/F is one way of addressing this; another is a bit vector of all possible
> > comparisons--a la M88K.
>
> I thought everyone hated condition codes these days? :P

That is not a type of condition codes; it is a comparison instruction giving a binary encoded comparison result that can be tested.
Instructions don't set or read condition codes; the comparison instruction
is dependent only on the input registers/imm values and produces a bitmask
written to a normal register.

No bottleneck like that with real condition codes, no complexities from
selective updates of condition codes etc.

matt...@gmail.com

unread,
May 29, 2018, 12:08:54 PM5/29/18
to
On Monday, May 28, 2018 at 8:48:33 AM UTC-5, John Levine wrote:
> In article <4274b57d-9bbf-4035...@googlegroups.com>,
> <already> wrote:
> >On Monday, May 28, 2018 at 12:02:38 AM UTC+3, matt...@gmail.com wrote:
> >
> >> Sadly, I expect more developers like x86/x86_64 assembler than PPC assembler.
> >
> >Why sadly?
> >What could be wrong when developers like more pleasant asm coding experience better than less pleasant asm coding experience?
>
> Well, it is unless you are in an environment where you care about code
> size. Does anyone understand how all of the prefix byte and irregular
> instruction encodings interact? (Leaving the manual open on your
> screen all the time doesn't count.)

I expect there are a few x86/x86_64 developers who understand practically everything about the ISA due to the amount of money involved. In contrast, most experienced 68k asm programmers should be able to look at asm code and determine instruction sizes with no manual. Even immediate and displacement sizes are generally 8, 16 or 32 bits which is easy to remember and calculate (usually not true for RISC). An ISA with a variable length encoding should be able to provide a pleasant experience with minimal memorization and less complex logic puzzles.

MitchAlsup

unread,
May 29, 2018, 12:34:07 PM5/29/18
to
On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> > On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > > Similarly, many have a habit of requiring loading constants from memory,
> > > which may need to be placed within a certain distance of the code being
> > > executed, ...
> >
> > These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>
> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."
>
> Sadly, I can't speak of the validity of the claim, but I take it you disagree?

Yes, I disagree.
>
> > T/F is one way of addressing this; another is a bit vector of all possible
> > comparisons--a la M88K.
>
> I thought everyone hated condition codes these days? :P

I still do, but if I end up in a situation where some value needs to be
returned, and that value has lots of bits (32+) available, I will send back
basically every useful bit I can.

MitchAlsup

unread,
May 29, 2018, 12:41:01 PM5/29/18
to
On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com wrote:
> On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> > On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> > > Similarly, many have a habit of requiring loading constants from memory,
> > > which may need to be placed within a certain distance of the code being
> > > executed, ...
> >
> > These should have been, but seem not to have been, eradicated by now.
>
> Constants require either an absurdly long instruction, or variable-length instructions.
>
> I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."

I want to address this point::

These days even simple implementations are reading out (1/2) a cache line of
data in an instruction fetch. Once you do this, you have an instruction buffer.
Once you have an instruction buffer, you have everything necessary to provide
constants to the data path except the read out multiplexers. Thus, the cost
of delivering constants is insignificant.

When one merges the I and D caches into one, this argument becomes even
stronger. One needs a wide fetch so that the number of fetch cycles is minimized so the LD/ST stream gets the majority of the access cycles.

thereis...@gmail.com

unread,
May 29, 2018, 8:15:08 PM5/29/18
to
On Tuesday, May 29, 2018 at 1:27:32 AM UTC-4, BGB wrote:
> On 5/28/2018 9:55 PM, thereis...@gmail.com wrote:
> > On Monday, May 28, 2018 at 4:44:37 PM UTC-4, MitchAlsup wrote:
> >> On Monday, May 28, 2018 at 2:02:44 PM UTC-5, BGB wrote:
> >>> Similarly, many have a habit of requiring loading constants from memory,
> >>> which may need to be placed within a certain distance of the code being
> >>> executed, ...
> >>
> >> These should have been, but seem not to have been, eradicated by now.
> >
> > Constants require either an absurdly long instruction, or variable-length instructions.
> >
>
> Or, a series of short instructions each extending the value a little bit
> at a time.

From context, I thought that was already discarded as an option.

(It also requires a free register, adding to register pressure.)

thereis...@gmail.com

unread,
May 29, 2018, 8:18:42 PM5/29/18
to
On Tuesday, May 29, 2018 at 12:41:01 PM UTC-4, MitchAlsup wrote:
> On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com wrote:
> > Constants require either an absurdly long instruction, or variable-length instructions.
> >
> > I can't speak of other architects, but I recall the RISC-V ISA document's rationale being (paraphrased) "Variable-length instructions are too much complexity for simple implementations."
>
> I want to address this point::
>
> These days even simple implementations are reading out (1/2) a cache line of
> data in an instruction fetch. Once you do this, you have an instruction buffer.
> Once you have an instruction buffer, you have everything necessary to provide
> constants to the data path except the read out multiplexers. Thus, the cost
> of delivering constants is insignificant.

I suspect their idea of "simple implementation" was "student project," not "microcontroller in some product for money."

BGB

unread,
May 29, 2018, 8:44:53 PM5/29/18
to
It uses a hard-wired special-purpose register for this task, rather than
a GPR, and thus does not contribute to overall register pressure.

BGB

unread,
May 29, 2018, 11:41:18 PM5/29/18
to
If you mean what I am working on?...

It is more of a personal hobby project (as a possible option for the
motor controller in a one-off CNC lathe retrofit; connected via
SPI to a Raspberry Pi which would run the CNC control program).

As for use beyond this: Who knows?...


I am still waiting for the (manual) lathe to show up, which has been on
back-order for a while (several months). Once it arrives, I will
fabricate some mounting hardware and stick some motors and similar on it.


I can also drive the steppers directly off the RasPi, but the
timing I am getting is inconsistent and noisy enough (due to "Linux
stuff") that I am not particularly sure that it won't result in lost
steps (not a good thing for a CNC machine).

Ideally, I want motor timing which stays reliably within +/- 15us or so,
which as-is isn't really happening on the RasPi. While the "average
case" is pretty good (due to the RasPi being pretty fast), there are
still frequent spikes of 100us or more, and periodic spikes of well over
500us.

These spikes just aren't really acceptable for the use-case because of
things like inertia and similar.

I also have a RasPi2, but even with multiple cores, there is still an
issue with latency spikes. Though, the spikes I am seeing are still a
lot smaller than I am seeing on my main Windows PC (where I am seeing
frequent spikes into the millisecond range).
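
For what it's worth, the measurement loop behind numbers like these is
simple enough; roughly (a sketch, Linux-flavored):

#include <stdio.h>
#include <time.h>

int main(void)
{
    const long period_ns = 100000;            /* 100us nominal period */
    struct timespec prev, now;
    long worst = 0;
    clock_gettime(CLOCK_MONOTONIC, &prev);
    for (int i = 0; i < 100000; i++) {
        struct timespec d = { 0, period_ns };
        nanosleep(&d, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        long dt = (now.tv_sec - prev.tv_sec) * 1000000000L
                + (now.tv_nsec - prev.tv_nsec);
        if (dt - period_ns > worst)
            worst = dt - period_ns;           /* overshoot past deadline */
        prev = now;
    }
    printf("worst overshoot: %ld ns\n", worst);
    return 0;
}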


Other options are:
Try to make it work with an MSP430, but I am running into issues
regarding the performance and resource constraints on the MSP430;
Buy a slightly higher-end MSP430 (*1);
Buy a more appropriate 32-bit microcontroller (*2).


*1: The current MSP430 chips I have on-hand are MSP430G2232s, with 2kB
of ROM, 256B of RAM, and clocked up to 16MHz. Some higher-end MSP430's
have more ROM and RAM and are clocked at 25MHz.

I am basically out of ROM space in the 2232, so can't fit in the
ramp-up/ramp-down logic, and the speed I can get out of the SPI link
(~10kB/s) isn't really fast enough to allow driving motor ramping
directly off the RasPi either.

I have a device with an ATmega328P, which while it has a lot more ROM
(32kB), and thus can handle a lot more code, still leaves something to
be desired on the performance front (doesn't really perform all that
well at arithmetic-intensive code vs the 430).


*2: If I were doing this for a commercial project, and had more time
constraints, or sense, I would probably just go buy an MSP432 or some
other Cortex-M based dev-board (but this would make it too easy; and an
MSP432 Launchpad and a Mimas or similar are basically about the same
price; never mind that the MSP432 is more likely to be the
technically-superior option in this case...).


As-is, if my CPU core works at all, it is probably still going to
outperform the MSP430G2xxx and ATmega chips at this task. And, it will at
least be "some" sort of "practical" use for my screwing around with
FPGAs and ISA design.

Even if, yes, performance of this particular ISA design is likely to
be a little weak.


At the moment, I am still fiddling with it and working on debugging the
prototype core.

Granted... Haas apparently manages to run their CNC controllers pretty
well using 68EC020 CPUs (with battery-backed SRAM for non-volatile
storage), so maybe I am missing something?...

Bruce Hoult

unread,
May 30, 2018, 12:40:28 AM5/30/18
to
Student projects, definitely.

A lot of the tiniest microcontrollers are leaving out support for 16 bit instructions because the programs they will run are tiny and the space savings in the program can be less than the size of the decoder.

This is especially true for the people who just want a soft core to do some trivial task on an FPGA.

And for the people who are cramming a couple of thousand cores onto an FPGA.

I was surprised when some people complained bitterly that Fedora and Debian have "unilaterally" decided to use the 16/32 instruction set for the kernel, standard libraries, and tools in their RISC-V 64 distributions. The complainers are allegedly developing supercomputer or datacentre processors with big superscalar OOO cores and they claim they don't want to do variable-width instructions with wide dispatch. Of course that does make it harder, but I'd have assumed the benefits in getting more code into any given size of L1 cache were big enough to be worth it. Definitely for commercial/database kind of things .. maybe not so much for scientific.

The complainers have refused to identify themselves (stealth mode, don't you know?) and were unimpressed by my argument that someone with the budget to build any real silicon (let alone a supercomputer) could easily afford to compile their own distribution -- or throw $10k and a couple of build machines (PC class) at Fedora or the Debian maintainer. So I have my doubts that they are genuine.

Stephen Fuld

unread,
May 30, 2018, 2:20:08 AM5/30/18
to
Or use some more appropriate RTOS than Linux. I don't know what is
available for RasPi, but there might be something. A benefit if there
is, is that it will probably use fewer resources than a full-blown Linux.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

unread,
May 30, 2018, 3:31:01 AM5/30/18
to
There is FreeRTOS, but apparently this doesn't support networking or
similar (I was primarily controlling the RasPi via SSH).

To get an RTOS with network support, AFAICT I would need something more
like QNX, but QNX is proprietary (and doesn't seem to have a RasPi port
available).

There is FreeBSD, but I don't know if this would provide any real
advantage over Linux in this case.


If using the RasPi2, one option is to give a program exclusive control
over one of the cores and effectively disable preemptive multitasking
for it if possible. There were some system calls for this ("sched_*" and
such), but when I messed with them before they didn't appear to do all
that much. More so, I would prefer using the older RasPi B+ rather than my
newer RasPi2.
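
The usual knobs look something like the following (a sketch; SCHED_FIFO
needs root, and fully reserving a core would additionally need
something like the isolcpus= boot option):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                        /* pin to core 3 */
    if (sched_setaffinity(0, sizeof set, &set) != 0)
        perror("sched_setaffinity");

    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");        /* needs root */

    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");                  /* avoid paging hiccups */

    /* ... timing-critical loop would go here ... */
    return 0;
}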


Granted, there is always the "get a Cortex-M" device route, if it starts
to look like my current "do something full custom on an FPGA" route is
unworkable.

I might go this route if the lathe shows up and I still don't have my
Verilog code working to an acceptable degree.

Terje Mathisen

unread,
May 30, 2018, 4:00:54 AM5/30/18
to
MitchAlsup wrote:
> On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com
>> I can't speak of other architects, but I recall the RISC-V ISA
>> document's rationale being (paraphrased) "Variable-length
>> instructions are too much complexity for simple implementations."
>
> I want to address this point::
>
> THese days even simple implementations are reading out (1/2) a cache
> line of data in an instruction fetch. Once you do this, you have an
> instruction buffer. Once you have an instruction buffer, you have
> everything necessary to provide constants to the data path except the
> read out multiplexers. Thus, the cost of delivering constants is
> insignificant.

I agree.

IMHO immediates should always be a part of the instruction stream!

Loading a word or two extra from straight line code is _really_ cheap,
while anything that has to divert to anywhere else in memory provides an
opportunity for a cache miss.

As you note, the instruction fetch _will_, at one or more levels, work
on full cache lines, so grabbing one or two 32 or even 64 bit immediates
has an excellent chance of being effectively free. Even if you straddle
cache lines, streaming access to a series of lines is faster than
skipping around in memory.
>
> When one merges the I and D caches into one, this argument becomes
> even stronger. One needs a wide fetch so that the number of fetch
> cycles is minimized so the LD/ST stream gets the majority of the
> access cycles.

Fetching 32/64-bit immediates from a constant memory pool competes for
both RAM and cache bandwidth.

Terje Mathisen

unread,
May 30, 2018, 4:06:47 AM5/30/18
to
OTOH, it provides a single pressure point as soon as you need to
construct several constants at the same time. History has shown that any
instruction with fixed and/or hidden target registers ends up as a speed
limiter.

The high half of an N*N -> 2N multiply is a very common example.

already...@yahoo.com

unread,
May 30, 2018, 4:46:01 AM5/30/18
to
On Wednesday, May 30, 2018 at 11:00:54 AM UTC+3, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com
> >> I can't speak of other architects, but I recall the RISC-V ISA
> >> document's rationale being (paraphrased) "Variable-length
> >> instructions are too much complexity for simple implementations."
> >
> > I want to address this point::
> >
> > These days even simple implementations are reading out (1/2) a cache
> > line of data in an instruction fetch. Once you do this, you have an
> > instruction buffer. Once you have an instruction buffer, you have
> > everything necessary to provide constants to the data path except the
> > read out multiplexers. Thus, the cost of delivering constants is
> > insignificant.
>
> I agree.
>
> IMHO immediates should always be a part of the instruction stream!
>
> Loading a word or two extra from straight line code is _really_ cheap,
> while anything that has to divert to anywhere else in memory provides an
> opportunity for a cache miss.
>
> As you note, the instruction fetch _will_, at one or more levels, work
> on full cache lines, so grabbing one or two 32 or even 64 bit immediates
> has an excellent chance of being effectively free. Even if you straddle
> cache lines, streaming access to a series of lines is faster than
> skipping around in memory.


So, where would you stop?
8bit integer - Yes
16bit integer - on 16-bit CPU obviously yes, what about wider CPUs?
32bit integer - Yes
64bit integer - The same 'Yes' as 8b and 32b, or much more limited, i.e. only a GPR load?
32bit Floating point ?
64bit Floating point ?
128bit SIMD ?
256bit SIMD ?
512bit SIMD ?

already...@yahoo.com

unread,
May 30, 2018, 4:53:48 AM5/30/18
to
Methinks history has shown the opposite: a fixed register is not a problem, except for the decode bandwidth wasted by moves to/from it. Only true (RAW) dependencies are a problem, and fixed registers typically don't create excessive RAW dependencies. Unlike partially-accessed registers.
I didn't understand the "hidden target registers" part. Can you give an example?

>
> The high half of an N*N -> 2N multiply is a very common example.
>

Very common example of what?

Terje Mathisen

unread,
May 30, 2018, 6:08:14 AM5/30/18
to
already...@yahoo.com wrote:
> On Wednesday, May 30, 2018 at 11:00:54 AM UTC+3, Terje Mathisen wrote:

>> As you note, the instruction fetch _will_, at one or more levels,
>> work on full cache lines, so grabbing one or two 32 or even 64 bit
>> immediates has an excellent chance of being effectively free. Even
>> if you straddle cache lines, streaming access to a series of lines
>> is faster than skipping around in memory.
>
> So, where would you stop?
> 8bit integer - Yes
> 16bit integer - on 16-bit CPU obviously yes, what about wider CPUs?
> 32bit integer - Yes
> 64bit integer - The same 'Yes' as 8b and 32b, or much more limited, i.e. only a GPR load?
> 32bit Floating point ?
> 64bit Floating point ?
> 128bit SIMD ?
> 256bit SIMD ?
> 512bit SIMD ?

I would seriously consider all of these, even the last, which is a
Larrabee-style full cache line:

The decision comes down to what is the instruction encoding pressure
from allowing these vs the savings from not having to go off to the side?

Terje Mathisen

unread,
May 30, 2018, 6:16:43 AM5/30/18
to
already...@yahoo.com wrote:
> On Wednesday, May 30, 2018 at 11:06:47 AM UTC+3, Terje Mathisen wrote:
> Methinks history has shown the opposite: a fixed register is not a
> problem, except for the decode bandwidth wasted by moves to/from it.
> Only true (RAW) dependencies are a problem, and fixed registers
> typically don't create excessive RAW dependencies. Unlike
> partially-accessed registers.
> I didn't understand the "hidden target registers" part. Can you
> give an example?

This is more obscure, it is typically used to carry internal state for
operations that last across multiple instructions. It often means that
any kind of interrupt will force the entire sequence to be restarted.

I.e. LL/SC
>
>>
>> The high half of an N*N -> 2N multiply is a very common example.
>>
>
> Very common example of what?

If you need to synthesize a wide multiply using a series of narrower
operations, the partial MULs all have to use the same high target, so
they cannot overlap.

The only way to fix this is by virtualizing the target reg and depending
on OoO hardware to effectively make them all independent.

On x86 we have fixed target regs for the widening MUL, as well as the
shift count (CL) for a variable shift.
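
To make the serialization concrete, here is a minimal C sketch (the
function name is invented for illustration) of synthesizing a 64x64 ->
128 bit multiply from 32-bit halves. Each partial product below
corresponds to one widening MUL, and on x86 each of those writes its
high half into the same fixed EDX register:

#include <stdint.h>

/* Sketch only: 64x64 -> 128 via four 32x32 -> 64 partial products.
   On x86 each partial product is a widening MUL targeting EDX:EAX,
   so without register renaming the four MULs cannot overlap. */
void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;          /* four widening multiplies */
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* mid cannot overflow 64 bits: (2^32-1)^2 + 2*(2^32-1) < 2^64 */
    uint64_t mid = p1 + (p0 >> 32) + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p2 >> 32) + (mid >> 32);
}

With renaming, p0..p3 are independent and can issue back to back; on an
in-order core with a fixed high-half register they dribble out one at a
time.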

Bruce Hoult

unread,
May 30, 2018, 6:40:51 AM5/30/18
to
Sure, of course going elsewhere to a constant pool is bad.

But I can't see what's so bad about using two 32 bit instructions to assemble a 32 bit literal.

You can even, on a high-end implementation, recognize such a pair as if it were a single 64-bit instruction, while low-end implementations just run them serially.

already...@yahoo.com

unread,
May 30, 2018, 6:55:29 AM5/30/18
to
On Wednesday, May 30, 2018 at 1:16:43 PM UTC+3, Terje Mathisen wrote:
> already...@yahoo.com wrote:
> > On Wednesday, May 30, 2018 at 11:06:47 AM UTC+3, Terje Mathisen wrote:
> > Methinks history has shown the opposite: a fixed register is not a
> > problem, except for the decode bandwidth wasted by moves to/from it.
> > Only true (RAW) dependencies are a problem, and fixed registers
> > typically don't create excessive RAW dependencies. Unlike
> > partially-accessed registers.
> > I didn't understand the "hidden target registers" part. Can you
> > give an example?
>
> This is more obscure, it is typically used to carry internal state for
> operations that last across multiple instructions. It often means that
> any kind of interrupt will force the entire sequence to be restarted.
>
> I.e. LL/SC

O.k.
Like x86 REP STOS (without hidden states) vs ARM (but not aarch64) "store multiple" with hidden states. Or all AVX512 loads/stores (and arithmetic?) updating the mask register to avoid hidden states.
But I don't see the relationship to the trick proposed by BGB. Not that I understood his trick entirely...

> >
> >>
> >> The high half of an N*N -> 2N multiply is a very common example.
> >>
> >
> > Very common example of what?
>
> If you need to synthesize a wide multiply using a series of a narrower
> operations, the partial MULs all have to use the same high target, so
> they cannot overlap.
>
> The only way to fix this is by virtualizing the target reg and depend on
> OoO hardware to effectively make them all independent.

I wouldn't call it "virtualizing". It's bog-standard register renaming, and it tends to work very well.

>
> On x86 we have fixed target regs for the widening MUL, as well as the
> shift count (CL) for a variable shift.

Personally, I found the latter to be a bigger performance obstacle than the former.
I have seen cases where BMI (only this part, no other part of BMI was used) improved performance of the kernel by ~20%.
Of course, at the whole application level both of them are peanuts.

David Brown

unread,
May 30, 2018, 6:57:38 AM5/30/18
to
Linux by itself is not an RTOS. But by using the kernel preemption
options, a multi-core CPU and appropriate process control (using the
right scheduling policies, pinning to a CPU, pinning memory, etc.) you
can get very tight scheduling guarantees. You can't get quite the level
of repeatability that you would on a microcontroller (bare metal or with
a real RTOS), but you can get good enough for many purposes.
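
A minimal sketch of those knobs, assuming a core (here core 3) that has
been kept clear of other work (e.g. via the isolcpus= kernel
parameter); the priority value is arbitrary:

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>

int go_realtime(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                  /* pin this process to core 3 */
    if (sched_setaffinity(0, sizeof set, &set) != 0)
        return -1;

    struct sched_param sp = { .sched_priority = 80 };  /* arbitrary */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)   /* needs root */
        return -1;

    return mlockall(MCL_CURRENT | MCL_FUTURE);  /* no paging stalls */
}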

> There is FreeRTOS, but apparently this doesn't support networking or
> similar (I was primarily controlling the RasPi via SSH).

AFAIK there is no Pi port of FreeRTOS. It is aimed at microcontrollers
rather than bigger CPUs. And while it has full support for networking
(at least two different network stacks) and two or three SSL
implementations, a secure shell implementation does not make much sense
on a single-process (multi-threaded) OS for small embedded systems.

>
> To get an RTOS with network support, AFAICT I would need something more
> like QNX, but QNX is proprietary (and doesn't seem to have a RasPi port
> available).
>
> There is FreeBSD, but I don't know if this would provide any real
> advantage over Linux in this case.
>

FreeBSD does not provide as much real-time support as Linux.

>
> If using the RasPi2, figure out a way to give a program exclusive
> control over one of the cores and effectively disable preemptive
> multitasking for it if possible. There were some system calls for this
> ("sched_*" and such), but when I messed with them before they didn't
> appear to do all that much. Also, I would prefer using the older RasPi
> B+ rather than my newer RasPi2.
>
>
> Granted, there is always the "get a Cortex-M" device route, if it starts
> to look like my current "do something full custom on an FPGA" route is
> unworkable.
>
> I might go this route if the lathe shows up and I still don't have my
> Verilog code working to an acceptable degree.

FPGA for controlling a simple motor is /massive/ overkill. Even a
microcontroller is unnecessary. (Of course, these may be a lot of fun.)

I'd look at a dedicated motor controller chip (or module if you are
lazy) with an SPI or UART interface, and connect that to the Pi running
Linux.



already...@yahoo.com

unread,
May 30, 2018, 7:01:55 AM5/30/18
to
Can you provide a more extended answer?
Which of those immediates would you make first-class citizens, i.e. immediates that can serve as operands of arithmetic/logical instructions, and for which of the cases is second-class citizenship (i.e. the only supported operation is a register load) sufficient?

Terje Mathisen

unread,
May 30, 2018, 7:08:19 AM5/30/18
to
Bruce Hoult wrote:
> On Wednesday, May 30, 2018 at 8:00:54 PM UTC+12, Terje Mathisen
>> Fetching 32/64-bit immediates from a constant memory pool competes
>> for both RAM and cache bandwidth.
>
> Sure, of course going elsewhere to a constant pool is bad.
>
> But I can't see what's so bad about using two 32 bit instructions to
> assemble a 32 bit literal.
>
> You can even on a high end implementation recognize such a pair as if
> they were a single 64 bit instruction, while low end implementations
> just run them serially.

That is the easy case.

Try storing a 64-bit immediate to an absolute 64-bit address:

If you have to generate 4 32-bit instructions for each of those 64-bit
constants, then you are well past the point where using a few bits in
the main opcode to indicate that these immediates follow would be a big
saving.

Expecting the decoder to fuse 9 32-bit instructions (36 bytes) into
what could very easily have been a single 20-byte instruction is a
tough sell.

Bruce Hoult

unread,
May 30, 2018, 7:19:37 AM5/30/18
to
But how often do you actually need to do that, and how much will it affect the overall size and performance of your program?

I submit that this is a case so rare that it's completely pointless to optimize for it.

already...@yahoo.com

unread,
May 30, 2018, 7:36:21 AM5/30/18
to
The 64-bit case is rare, but a store of a 32-bit immediate to a 32-bit absolute address is very common in the embedded control field.

Paul A. Clayton

unread,
May 30, 2018, 7:43:43 AM5/30/18
to
On Wednesday, May 30, 2018 at 4:00:54 AM UTC-4, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Monday, May 28, 2018 at 9:55:54 PM UTC-5, thereis...@gmail.com
>>> I can't speak of other architects, but I recall the RISC-V ISA
>>> document's rationale being (paraphrased) "Variable-length
>>> instructions are too much complexity for simple implementations."
>>
>> I want to address this point::
>>
>> These days even simple implementations are reading out (1/2) a cache
>> line of data in an instruction fetch. Once you do this, you have an
>> instruction buffer. Once you have an instruction buffer, you have
>> everything necessary to provide constants to the data path except the
>> read out multiplexers. Thus, the cost of delivering constants is
>> insignificant.
>
> I agree.

Some really simple implementations do not have cache!
(Some microcontrollers have a buffer between slow
flash and the core which may have similar
considerations to an instruction buffer between
cache and the decoders.)

I favor VLE and I think RISC-V has made mistakes
not only in encoding the parcels (such that Compressed
instructions have few encoding similarities to
ordinary instructions) but also in de-emphasizing VLE
(the "base" ISA is fixed-length and traps on
misaligned instruction addresses [I wonder if the
"simple" implementations do this or just mask the
low bit]).

(Aligned 64-bit instructions have also been proposed,
which seems very odd.)

> IMHO immediates should always be a part of the instruction stream!
>
> Loading a word or two extra from straight line code is _really_ cheap,
> while anything that has to divert to anywhere else in memory provides an
> opportunity for a cache miss.

I agree that the prefetching benefit is significant.
Some of the prefetching benefit might be grasped by
a side channel mechanism similar to the Mill's side
channel for branch prediction metadata.

Two benefits of a side channel for immediates would be
that such could be used for non-constant data and that
redundant larger values would not take additional
storage.

Widening instruction fetch vs. providing a third
channel vs. widening data fetch seem to have various
tradeoffs in terms of utilization and variability of
use. Instruction buffering is somewhat natural, so
wider than necessary on average fetch is attractive.
However, I am not certain that such is the best
design for all systems (or even the best "average"
design).

[snip]
>> When one merges the I and D caches into one, this argument becomes
>> even stronger. One needs a wide fetch so that the number of fetch
>> cycles is minimized so the LD/ST stream gets the majority of the
>> access cycles.

I am not certain if intelligent banking/way selection
could not reduce the penalty for a shared capacity cache.
It is not something I have considered significantly, but
it seems plausible that banking could avoid conflict
issues. (This also gets into issues of natural banking
given cache size (and so SRAM array size). One does
not generally share L1 cache when it is large.)

> Fetching 32/64-bit immediates from a constant memory pool competes for
> both RAM and cache bandwidth.

The RAM bandwidth issue might be blamed on bad
memory system design (including commodity DRAM features),
though practicality might not allow good design.

The average cache bandwidth would be the same, of
course.

With respect to using a special register to form
immediates, this is mainly an encoding aspect. The
immediate loading instructions can be fused and
effectively provide longer direct immediates in a
wide fetch implementation; stripping two-bit opcodes
(like EISC) would be annoying but not, I suspect,
horrible.

MitchAlsup

unread,
May 30, 2018, 11:37:37 AM5/30/18
to
In my case, I stopped with 16-bit immediates, 32-bit immediates, and 64-bit
immediates. There are a couple of instructions that get 12-bit immediates
because I did not want to burn the 16-bit opcode space. Being a basic RISC
design point gave 16-bit immediates--not for free, but at low cost--mostly
in an entropy sense.

I did not do 128, 256 or 512 bit immediates, because I am working on a more
CRAY 1-like vector extension that does not need wide immediates, but can
still be processed multiple lanes wide as a vector stream. Thus, my goal is
real vectors, not fake vectors (a la SIMD).

MitchAlsup

unread,
May 30, 2018, 11:40:02 AM5/30/18
to
How many instructions does it take to assemble a 64-bit literal? 4 ?!?

Niklas Holsti

unread,
May 30, 2018, 12:17:39 PM5/30/18
to
Perhaps RTEMS? https://devel.rtems.org/wiki/TBR/UserManual/SupportedCPUs

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

BGB

unread,
May 30, 2018, 12:44:58 PM5/30/18
to
I tried this a little bit before, but it would require using my RasPi2
(multicore), rather than the original RasPi (single-core).

When I was messing with the "sched_*" system calls in the past, IIRC I
didn't see much change in behavior. Granted, it could just require more
fiddling to make it work.

Granted, "just use the RasPi2" could probably still be a viable
option... Or, maybe buy more RasPi2's, or a RasPi3.


>> There is FreeRTOS, but apparently this doesn't support networking or
>> similar (I was primarily controlling the RasPi via SSH).
>
> AFAIK there is no Pi port of FreeRTOS. It is aimed at microcontrollers
> rather than bigger cpus. And while it has full support for networking
> (at least too different network stacks) and two or three SSL
> implementations, a secure shell implementation does not make much sense
> on a single-process (multi-threaded) OS for small embedded systems.
>

OK. The stuff I read said that there was a port, but with no networking
interface. I was using SSH mostly because I can control stuff from a PC
via PuTTY (vs needing to connect a display and keyboard to the RPi).

OTOH, direct display could allow a fancier CNC UI than I can get with
ANSI codes and similar over PuTTY... (and could avoid needing to spend
some of the CPU time running sshd).


>>
>> To get an RTOS with network support, AFAICT I would need something more
>> like QNX, but QNX is proprietary (and doesn't seem to have a RasPi port
>> available).
>>
>> There is FreeBSD, but I don't know if this would provide any real
>> advantage over Linux in this case.
>>
>
> FreeBSD does not provide as much real-time support as Linux.
>
>>
>> If using the RasPi2, figure out a way to give a program exclusive
>> control over one of the cores and effectively disable preemptive
>> multitasking for it if possible. There were some system calls for this
>> ("sched_*" and such), but when I messed with them before they didn't
>> appear to do all that much. Also, I would prefer using the older RasPi
>> B+ rather than my newer RasPi2.
>>
>>
>> Granted, there is always the "get a Cortex-M" device route, if it starts
>> to look like my current "do something full custom on an FPGA" route is
>> unworkable.
>>
>> I might go this route if the lathe shows up and I still don't have my
>> Verilog code working to an acceptable degree.
>
> FPGA for controlling a simple motor is /massive/ overkill. Even a
> microcontroller is unnecessary. (Of course, these may be a lot of fun.)
>
> I'd look at a dedicated motor controller chip (or module if you are
> lazy) with an SPI or UART interface, and connect that to the Pi running
> Linux.
>

I am using motor driver modules, which accept a Pulse/Direction
interface. The main issue is that the pulses need to be accurately timed
(operating at ~5-30 kHz or so) or else the motor isn't happy.


The motors + motor drivers are actually one of the bigger costs of the
project (apart from the $1.5k or so for the lathe itself).

Basically, going to be using some 670 oz-in NEMA23's (w/ 48V 4A/ph motor
drivers) for X/Z axes, and a 1920 oz-in NEMA34 with a 100V 8A/ph driver
(1.5 kW peak) for the C axis/spindle (theoretically, should be able to
get ~ 800-900W on the spindle).

This is basically another ~ $1k or so for the motors and similar.


But, yeah, point being, the project is already expensive enough that
spending an additional $30 or so for a microcontroller board or
lower-end FPGA board in this case isn't a huge loss.


On the CNC Mill, I had ended up going with a more "off the shelf"
solution (a FlashCut CNC controller box), but still ended up needing to
get some external driver modules and similar because the box seemingly
doesn't have a powerful enough power-supply for me to get satisfactory
results with the built-in motor drivers.

Chris

unread,
May 30, 2018, 12:56:02 PM5/30/18
to
On 05/30/18 04:43, BGB wrote:

>> I suspect their idea of "simple implementation" was "student project,"
>> not "microcontroller in some product for money."
>>
>
> If you mean what I am working on?...
>
> It is more of a personal hobby project (as a possible option for the
> motor controller in doing a one-off CNC lathe retrofit; connected via
> SPI to a Raspberry Pi which would run the CNC control program).
>
> As for use beyond this: Who knows?...

Sounds like an interesting project, but for any sort of machine tool or
safety-critical project, there must be some sort of hardware + software
watchdog to stop the machine in abnormal situations. Also, there is a
real possibility of noise-based problems, so, for example, the controller
part needs to be isolated (opto-couplers etc.) from the power-handling
part. Separate PSUs and good line filtering for the control end. I have
seen quite a few disasters of that type in the past.

As for the software: if you don't need networking, use a state-driven
loop for the basic machine function control, plus perhaps a serial port
for external control from a PC for the higher-level functions.
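
A minimal sketch of such a state-driven superloop (every helper
function below is an invented placeholder for a machine-specific part):

/* Superloop sketch: states and helpers are hypothetical. */
enum state { IDLE, RAMP_UP, RUN, RAMP_DOWN, FAULT };

/* Hypothetical machine-specific stubs, defined elsewhere. */
void kick_watchdog(void);
int  fault_detected(void);
int  start_requested(void), at_speed(void), stop_requested(void), stopped(void);
void force_outputs_safe(void);
void poll_serial_commands(void);

int main(void)
{
    enum state s = IDLE;
    for (;;) {
        kick_watchdog();              /* hardware watchdog, as above */
        if (fault_detected())
            s = FAULT;
        switch (s) {
        case IDLE:      if (start_requested()) s = RAMP_UP;   break;
        case RAMP_UP:   if (at_speed())        s = RUN;       break;
        case RUN:       if (stop_requested())  s = RAMP_DOWN; break;
        case RAMP_DOWN: if (stopped())         s = IDLE;      break;
        case FAULT:     force_outputs_safe();                 break;
        }
        poll_serial_commands();       /* PC control link */
    }
}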


For what you are describing, even an old 8-bit micro like a
6502 should have more than enough throughput for that sort of
application, though my choice would be Cortex M3 or similar.
Keep it simple, think systems engineering, layers, etc...

Regards,

Chris

BGB

unread,
May 30, 2018, 1:07:37 PM5/30/18
to
FWIW: This is how many it would take in my BJX1-64C ISA...

Though, there was an upside:
Literals needing the full 64 bits were rare;
Most of these would include large runs of zeroes or ones, which could be
"compressed out" using shorter instructions.

If I do a 64-bit extended-BSR1 or "BJX2" ISA, it is likely it would be a
similar situation.

But, as noted, 4 instructions isn't really likely to be any worse than
using a memory fetch (the main common alternative). A memory fetch would
probably still be used if dealing with Abs64 addressing or similar
(rather than using 25-bit PC-rel or a similar scheme; but this would
probably only be used for things like DLL/SO linkage or similar).

BGB

unread,
May 30, 2018, 4:45:50 PM5/30/18
to
On 5/30/2018 5:16 AM, Terje Mathisen wrote:
> already...@yahoo.com wrote:
>> On Wednesday, May 30, 2018 at 11:06:47 AM UTC+3, Terje Mathisen wrote:
>> Methinks history has shown the opposite: a fixed register is not a
>> problem, except for the decode bandwidth wasted by moves to/from it.
>> Only true (RAW) dependencies are a problem, and fixed registers
>> typically don't create excessive RAW dependencies. Unlike
>> partially-accessed registers.
>> I didn't understand the "hidden target registers" part. Can you
>> give an example?
>
> This is more obscure, it is typically used to carry internal state for
> operations that last across multiple instructions. It often means that
> any kind of interrupt will force the entire sequence to be restarted.
>

In this ISA, the registers involved are part of the set which will be
bank-swapped on an interrupt:
R0..R7; PC, LR, SR, SP, DLR, DHR, GBR, TBR

In this sense, there are two copies of all these registers (the A and B
sets), and which is active depends on whether we are in an ISR.

The remaining registers are either callee save in the C ABI, or are
globally-used system registers.

This means that the operation should not be affected by an interrupt (in
the case of a simple core).


As for performance, this is intended mostly for a fairly simple core (no
superscalar or OoO), so shouldn't be too much of an issue.

Though, as one concession to performance, I made the state of DLR
become essentially "undefined" following any operation which consumes it.

They are preserved on interrupt, though, mostly because as-is this makes
it simpler for the implementation.


I may consider doing another more performance-oriented ISA design, but
this is looking like it would likely be its own thing.


> I.e. LL/SC
>>
>>>
>>> The high half of an N*N -> 2N multiply is a very common example.
>>>
>>
>> Very common example of what?
>
> If you need to synthesize a wide multiply using a series of a narrower
> operations, the partial MULs all have to use the same high target, so
> they cannot overlap.
>
> The only way to fix this is by virtualizing the target reg and depend on
> OoO hardware to effectively make them all independent.
>

Yeah, though probably not really an issue in this case.

Though, yes, BSR1 also uses fixed-register outputs for the multiply op,
among other things.


> On x86 we have fixed target regs for the widening MUL, as well as the
> shift count (CL) for a variable shift.
>

Likely, variable shift will be similar (also uses DLR as an input), if I
include a variable-shift op (this will mostly depend on LUT budget and
similar).

I have reserved the encoding space for it, and it does technically
already exist in the emulator. FWIW: I already have an FPU in the
emulator as well, but probably won't include one in the actual FPGA
implementation in this case.


But, yeah, either case is probably better than using a shift-slide...
SHAR R0
SHAR R0
SHAR R0
SHAR R0
...
SHAR R0
RTS

Gets back on shift-slide: "Whee!"

BGB

unread,
May 30, 2018, 4:55:30 PM5/30/18
to
On 5/30/2018 11:55 AM, Chris wrote:
> On 05/30/18 04:43, BGB wrote:
>
>>> I suspect their idea of "simple implementation" was "student project,"
>>> not "microcontroller in some product for money."
>>>
>>
>> If you mean what I am working on?...
>>
>> It is more of a personal hobby project (as a possible option for the
>> motor controller in doing a one-off CNC lathe retrofit; connected via
>> SPI to a Raspberry Pi which would run the CNC control program).
>>
>> As for use beyond this: Who knows?...
>
> Sounds like an interesting project, but for any sort of machine tool or
> safety-critical project, there must be some sort of hardware + software
> watchdog to stop the machine in abnormal situations. Also, there is a
> real possibility of noise-based problems, so, for example, the controller
> part needs to be isolated (opto-couplers etc.) from the power-handling
> part. Separate PSUs and good line filtering for the control end. I have
> seen quite a few disasters of that type in the past.
>

The stepper driver modules in this case are already opto-isolated.

I am using pre-made modules for the actual motor-driving parts in this
case (will be dealing with relatively large stepper motors; namely
NEMA23 and NEMA34).

But, as noted, the modules are controlled via a series of step pulses.


> As for the software: if you don't need networking, use a state-driven
> loop for the basic machine function control, plus perhaps a serial port
> for external control from a PC for the higher-level functions.
>

High-level functions were intended to run off a RasPi, generally with
the UI via SSH+PuTTY or similar.

Could connect up a monitor, just would need a preferably small/cheap one
with an HDMI input.

Could also use my RasPi2, which has multiple cores, so should
theoretically also be more usable for this.


>
> For what you are describing, even an old 8-bit micro like a
> 6502 should have more than enough throughput for that sort of
> application, though my choice would be Cortex M3 or similar.
> Keep it simple, think systems engineering, layers, etc...
>

Except that I need to drive the pulses at 10 to 30 kHz in
some cases (mostly for the spindle motor), for a particular number of
steps at a particular frequency.


Things like an ATmega can't really seem to keep up with this
effectively, so a 6502 most likely wouldn't either.

The MSP430 is at least fast enough to drive pulses at around 16kHz
(using the 32kHz internal timer), but, the ones I have don't have enough
ROM space to fit the speed ramping logic and similar, which is a problem
in this case.

The ATmega would probably be fast enough if I weren't planning on
dealing with the spindle. I also want a C-axis, which means either using
a stepper motor or a servo; servos are less sensitive to timing pulses,
but a 0.75 kW servo isn't cheap.


Pretty much any Cortex-M device would probably be sufficient though.

But, yeah, I was going the FPGA route partly because I felt like trying
to do so (and am otherwise basically just waiting for the lathe to stop
being on back-order / procrastinating from other stuff I could be doing
/ ...). Also allows hardware-accelerating the pulse-generation part and
thus can probably run this off of a 1MHz tick or similar (for extra
smooth pulse timing).


FWIW: I was originally going to go for a smaller/cheaper lathe as a base
(like one of the Harbor Freight / Central Machinery ones), but my dad
got involved and was like "No, those ones are too terrible to be worth
it", so got a bigger and more expensive lathe.

My actual "even more original" idea was to make something probably out
of some welded steel tubing and all-thread and similar, but project creep...

MitchAlsup

unread,
May 30, 2018, 6:50:17 PM5/30/18
to
On Wednesday, May 30, 2018 at 5:16:43 AM UTC-5, Terje Mathisen wrote:
> already...@yahoo.com wrote:
> > On Wednesday, May 30, 2018 at 11:06:47 AM UTC+3, Terje Mathisen wrote:
> > Methinks history has shown the opposite: a fixed register is not a
> > problem, except for the decode bandwidth wasted by moves to/from it.
> > Only true (RAW) dependencies are a problem, and fixed registers
> > typically don't create excessive RAW dependencies. Unlike
> > partially-accessed registers.
> > I didn't understand the "hidden target registers" part. Can you
> > give an example?
>
> This is more obscure, it is typically used to carry internal state for
> operations that last across multiple instructions. It often means that
> any kind of interrupt will force the entire sequence to be restarted.
>
> I.e. LL/SC

In MY 66000 ISA, it is possible to perform a series of instructions that
end up smelling like they were all executed in one step (as seen by
outside observers.) Up to 8 cache lines can participate in an ATOMIC event
so that one can move an element in a concurrent data structure from one
place in the CDS to another place in a single ATOMIC event--preventing
any outside observer from failing to find that element within the CDS.

BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );
    fp = fr->prev;
    esmLOCK( tn = to->next );
    esmLOCK( fn );
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCK( fr->next = tn );
        return TRUE;
    }
    return FALSE;
}


The esmINTERFERENCE() is a conditional branch instruction that takes a look
at the <not visible> interference bit and makes a decision to abandon or
complete the ATOMIC event.

This kind of utility is a good reason to abandon condition codes.

Bruce Hoult

unread,
May 30, 2018, 8:17:07 PM5/30/18
to
Four on Aarch64. RISC-V64 would need six (lui/addi/slli/lui/addi/or) but gcc 7.2.0 uses a constant pool instead.

long foo(long a){
    return a ^ 0xfedcba9876543210;
}

rv64:

foo:
lui a5,%hi(.LC0)
ld a5,%lo(.LC0)(a5)
xor a0,a0,a5
ret
.LC0:
.dword -81985529216486896

Arm64:

foo:
mov x1, 12816
movk x1, 0x7654, lsl 16
movk x1, 0xba98, lsl 32
movk x1, 0xfedc, lsl 48
eor x0, x0, x1
ret

I believe people who object to this do so on aesthetic grounds, not based on actual performance on real-world programs.

Ivan Godard

unread,
May 30, 2018, 8:43:29 PM5/30/18
to
foo:
    con(d(-81985529216486896)), xor(b0, b1), retn(b0);
// one instruction, one cycle.
// Same for quad (128 bit) except "q" for "d".
// ~50 bits exclusive of the constant, which depends on
// numeric value, not on final size e.g 128-bit -1 encodes in
// zero bits

Bruce Hoult

unread,
May 30, 2018, 9:56:50 PM5/30/18
to
Working hardware and benchmarks, please. Even an FPGA bitstream would be something. In my lifetime, preferably.

Bruce Hoult

unread,
May 30, 2018, 11:04:34 PM5/30/18
to
I guess x86 must rule in the embedded controller field then, because ARM (all flavours) sucks at doing that.

void regupdate(){
    *(int*)0x876543210 = 0xfedcba09;
}

Thumb2:

00000012 <regupdate>:
12: f243 2310 movw r3, #12816 ; 0x3210
16: f2c7 6354 movt r3, #30292 ; 0x7654
1a: f64b 2209 movw r2, #47625 ; 0xba09
1e: f6cf 62dc movt r2, #65244 ; 0xfedc
22: 601a str r2, [r3, #0]
24: 4770 bx lr

aarch64:

0000000000000018 <regupdate>:
18: d2864200 mov x0, #0x3210
1c: f2aeca80 movk x0, #0x7654, lsl #16
20: f2c00100 movk x0, #0x8, lsl #32
24: 52974121 mov w1, #0xba09
28: 72bfdb81 movk w1, #0xfedc, lsl #16
2c: b9000001 str w1, [x0]
30: d65f03c0 ret

rv32c:

0000000e <regupdate>:
e: 76543737 lui a4,0x76543
12: fedcc7b7 lui a5,0xfedcc
16: a0978793 addi a5,a5,-1527
1a: 20f72823 sw a5,528(a4)
1e: 8082 ret


i386:

0000000a <regupdate>:
a: c7 05 10 32 54 76 09 ba dc fe movl $0xfedcba09,0x76543210
14: c3 ret

Boom!

Two instructions and 11 bytes for x86 vs six instructions and 20 bytes for ARM Thumb2 or seven instructions and 28 bytes for ARM64. RISC-V somewhere in the middle ground (closer to Thumb) with five instructions and 18 bytes.

That will be why all the embedded stuff is x86, I guess.

Bruce Hoult

unread,
May 31, 2018, 1:49:46 AM5/31/18
to
On Thursday, May 31, 2018 at 3:04:34 PM UTC+12, Bruce Hoult wrote:
> On Wednesday, May 30, 2018 at 11:36:21 PM UTC+12, already...@yahoo.com wrote:
> > On Wednesday, May 30, 2018 at 2:19:37 PM UTC+3, Bruce Hoult wrote:
> > > On Wednesday, May 30, 2018 at 11:08:19 PM UTC+12, Terje Mathisen wrote:
> > > > Bruce Hoult wrote:
Oops! I screwed up my constants, which made aarch64 look worse than it should be:

0000000000000018 <regupdate>:
18: d2864200 mov x0, #0x3210
1c: f2aeca80 movk x0, #0x7654, lsl #16
20: 52975301 mov w1, #0xba98
24: 72bfdb81 movk w1, #0xfedc, lsl #16
28: b9000001 str w1, [x0]
2c: d65f03c0 ret

Six instructions and 24 bytes. Still the biggest.

Incidentally, PowerPC is the same in this (six and 24), while MIPS and Alpha are both five instructions and 20 bytes -- the same as RISC-V but with four bytes for the ret instead of two. m68000 can do it in two instructions and 12 bytes (one byte more than i386). sh4 uses four instructions and 8 bytes of code plus 8 bytes of constant pool.

BGB

unread,
May 31, 2018, 2:15:18 AM5/31/18
to
Hmm (for my ISA variants, hand-generated):

BJX1 (generic):
00: 8EFE_E1DC MOV #0xFEDC, R1
04: 8EBA_9109 LDSH16 #0xBA09, R1
08: 8E76_E054 MOV #0x7654, R0
0C: 8E32_9010 LDSH16 #0x3210, R0
10: 2012 MOV.L R1, @R0
12: 006B RTS/N

Cost: 20 bytes, 6 instructions.


BSR1:
00: BFED LDIN #0xFED, DLR
02: 26CB LDISH #0xCB, DLR
04: 26A0 LDISH #0xA0, DLR
06: 4819 MOV DLR_9, R1
08: A765 LDIN #0x765, DLR
0A: 2643 LDISH #0x43, DLR
0C: 2621 LDISH #0x21, DLR
0E: 4800 MOV DLR_0, R0
10: 0201 MOV R1, (R0)
12: 3010 RTS

Cost: 20 bytes, 10 instructions.

Could encode it in 18 bytes though if it were a PC-relative address (no
I-form for "MOV.L Rm, (DLR_i4)").

My C compiler is kind of dumb here though, so compiled code would end up
roughly twice this size due to prolog/epilog sequences.


In a very early spec for a hypothetical BSR1 based "BJX2" ISA:
00: FE10_FEDC MOV #0xFEDC, R1
04: FE12_BA09 LDISH16 #0xBA09, R1
08: FA76_5432 LDIZ #0x765432, DLR
0C: F004_A110 MOV R1, (DLR_x10)
10: 3010 RTS

Here, 18 bytes, 5 instructions.


For reference, B32V (B32V/B64V specific):
00: E0FE MOV #0xFE, R0
02: 87DC LDISH #0xDC, R0
04: 87BA LDISH #0xBA, R0
06: 8709 LDISH #0x09, R0
08: 6103 MOV R0, R1
0A: E076 MOV #0x76, R0
0C: 8754 LDISH #0x54, R0
0E: 8732 LDISH #0x32, R0
10: 8710 LDISH #0x10, R0
12: 2012 MOV.L R1, @R0
14: 006B RTS/N

So, 22 bytes, 11 instructions.


or, SH (canonical SH, also valid in B32V and BJX1):
00: D102 MOV.L (PC, 8), R1
02: D003 MOV.L (PC, 12), R0
04: 2012 MOV.L R1, @R0
06: 000B RTS
08: 0009 NOP //delay-slot
0A: 0009 NOP //pad (need 4B alignment)
0C: FEDC .short 0xFEDC
0E: BA09 .short 0xBA09
10: 7654 .short 0x7654
12: 3210 .short 0x3210

So 20 bytes, 6 instructions, but requires 3 memory accesses.


It is also possible to do a load-shift sequence in plain SH, but the
result would require 36 bytes and 18 instructions.


Hrmm...


Yeah, none of my stuff really competes well with x86 either it seems...

Bruce Hoult

unread,
May 31, 2018, 2:29:05 AM5/31/18
to
The sh4 compiler I tried gave...

0000000c <regupdate>:
c: 01 d1 mov.l 14 <regupdate+0x8>,r1 ! 76543210
e: 02 d2 mov.l 18 <regupdate+0xc>,r2 ! fedcba98
10: 0b 00 rts
12: 22 21 mov.l r2,@r1
14: 10 32
16: 54 76
18: 98 ba
1a: dc fe

So basically the same as yours, but with a branch delay slot. (I don't actually know the architecture, but I can type "sudo apt-get install gcc-sh4-linux-gnu" with the best of them...)

already...@yahoo.com

unread,
May 31, 2018, 3:48:33 AM5/31/18
to
According to the measurements (code size only) that I did 10 years ago on a real-world code base, x386 scores rather well.
And I would guess that efficient stores of immediates to absolute addresses are one of the main reasons for its good showing.
Still, overall Thumb2 is significantly denser, which probably means that while this case is important, other cases are also important.
Classic ARM32 scored surprisingly badly.

https://www.realworldtech.com/forum/?threadid=86001&curpostid=86094

As to who rules in embedded controllers, of course right now x386 sucks, but that has more to do with the fact that since the mid-90s AMD lost interest in the field.
And Intel is the usual Intel - internal politics prevail over anything else.
A couple of years ago they demonstrated the potential for a competent MCU (the Quark D2000; it sucks as a single product, but is not a bad starting point for a family) and killed it several months thereafter.

David Brown

unread,
May 31, 2018, 3:53:40 AM5/31/18
to
Advanced scheduling is easier, more efficient and more accurate when you
have multiple cores, but can in theory be done on one core.

A second Pi2 or Pi3 would be a fraction of the price of messing around
with FPGAs, which you suggested as a possible solution.

>
>>> There is FreeRTOS, but apparently this doesn't support networking or
>>> similar (I was primarily controlling the RasPi via SSH).
>>
>> AFAIK there is no Pi port of FreeRTOS. It is aimed at microcontrollers
>> rather than bigger CPUs. And while it has full support for networking
>> (at least two different network stacks) and two or three SSL
>> implementations, a secure shell implementation does not make much sense
>> on a single-process (multi-threaded) OS for small embedded systems.
>>
>
> OK. The stuff I read said that there was a port, but with no networking
> interface.

It is quite possible that someone has made a partial port, but there is
no official one. FreeRTOS is open source, so there are many bits and
pieces around if you look wide enough, but usually you want to stick to
the ports mentioned on the FreeRTOS website.

> I was using SSH mostly because I can control stuff from a PC
> via PuTTY (vs needing to connect a display and keyboard to the RPi).
>

FreeRTOS will not support a display or keyboard either. It is a small
embedded RTOS - it is not a large posix OS, and it does not have the
myriad of utilities, drivers, libraries, programs that are needed to
support a full ssh server or a display and keyboard.

You can, of course, /make/ display support on an FreeRTOS based system.
There are plenty of embedded systems with displays or screens, running
FreeRTOS. But it is a totally different world from using a display on a
Linux system.

Let me put it this way. If you want to use a Raspberry Pi, use Linux.
If you don't want to use Linux, don't use the Pi. This should be your
baseline.
There are lots of ways to do this. I'd recommend making a post in
comp.arch.embedded, which is a more suitable choice of group. These
kinds of timings are easily handled by simple microcontrollers, but you
may still want to get a ready-made motor controller chip with a serial
interface. You can then control it from the Pi, which no longer needs
precise timings, but which gives you a convenient control interface,
networking, ssh, screen, etc.
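
For the Pi side, a hedged sketch of opening such a serial link on
Linux; the device path and baud rate are assumptions, and the command
protocol would be whatever the chosen controller chip defines:

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Sketch: open a raw 8N1 serial port to a motor controller.
   "/dev/ttyUSB0" and 115200 baud are assumptions. */
int open_motor_port(void)
{
    int fd = open("/dev/ttyUSB0", O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;
    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);            /* raw mode: no line editing/echo */
    cfsetspeed(&tio, B115200);
    tcsetattr(fd, TCSANOW, &tio);
    return fd;                  /* then write() command bytes to it */
}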

>
> The motors + motor drivers are actually one of the bigger costs of the
> project (apart from the $1.5k or so for the lathe itself).
>
> Basically, going to be using some 670 oz-in NEMA23's (w/ 48V 4A/ph motor
> drivers) for X/Z axes, and a 1920 oz-in NEMA34 with a 100V 8A/ph driver
> (1.5 kW peak) for the C axis/spindle (theoretically, should be able to
> get ~ 800-900W on the spindle).
>
> This is basically another ~ $1k or so for the motors and similar.
>

If you want to discuss choice of motors and the ways to drive them, then
again comp.arch.embedded would be a better place to do it. The "right"
choice is going to depend highly on the requirements for power, speed,
stability, size, efficiency, control, accuracy, etc.

>
> But, yeah, point being, the project is already expensive enough that
> spending an additional $30 or so for a microcontroller board or
> lower-end FPGA board in this case isn't a huge loss.
>

Forget the FPGA. There are various choices of architecture for
controlling motors, but unless you already have an FPGA in the system,
or are coordinating dozens of motors with microsecond timing, or you are
a total FPGA addict and understand nothing else - then an FPGA is the
/wrong/ choice here.

BGB

unread,
May 31, 2018, 3:56:52 AM5/31/18
to
Yeah, that works.

I just forgot to make effective use of the delay-slot, and used
different registers.

So, it can be noted that SH can do it in 16 bytes in this case.


But, yeah, this does point out something:
My notation was in 16-bit words, but given the targets are all LE in
this case, the actual bytes are swapped around in memory.

Also noticed minor mistakes in some of the other listings, ...


Similarly, BJX1 and B32V are SH derived, but have a lot of tweaks as
well. And my B32V example is actually specific to just B32V (I noted
that a few of the I-forms are different in B64V; for example, MOV has
been moved), ...

( Note that while still SH-based, B64V and BJX1-64C were reorganized a
bit to allow for a "less terrible" 64-bit ISA variant than my first
attempt; which suffered nastiness in the form of needing to twiddle
mode-bits to bank-in and bank-out parts of the instruction set. Was able
to free up enough encoding space via the reorg to allow all the needed
instructions to exist at the same time; Though some less commonly used
parts of the original SH ISA were either faked via instruction-sequences
or dropped entirely as a result. )



However, BSR1 is not based on SH, hence why the instruction coding is
almost entirely different.

David Brown

unread,
May 31, 2018, 4:08:00 AM5/31/18
to
On 30/05/18 22:58, BGB wrote:
> On 5/30/2018 11:55 AM, Chris wrote:
<snip>
>>
>> For what you are describing, even an old 8-bit micro like a
>> 6502 should have more than enough throughput for that sort of
>> application, though my choice would be Cortex M3 or similar.
>> Keep it simple, think systems engineering, layers, etc...
>>
>
> Except that I am needing to drive the pulses often at 10 to 30 kHz in
> some cases (mostly for the spindle motor), for a particular number of
> steps at a particular frequency.
>
>
> Things like an ATmega can't really seem to keep up with this
> effectively, so a 6502 most likely wouldn't either.

An ATmega will handle this fine - for one or two motors, but not for
many at once due to the limited number of timers. It would be a poor
choice of microcontroller, compared to cheaper and better alternatives,
but it would work (and the Arduino boards are convenient for hobby
developers).

A 6502 would also work if it had the right peripherals, and might also
be a suitable choice - if you are still living in the previous century.

>
> The MSP430 is at least fast-enough to drive pulses at around 16kHz
> (using the 32kHz internal timer), but, the ones I have don't have enough
> ROM space to fit the speed ramping logic and similar, which is a problem
> in this case.

An msp430 will also work fine. If you can't fit the speed control and
ramping logic into an msp430, then you have either picked the wrong
model or you have misunderstood how to write such code on a microcontroller.

>
> The ATmega would probably be fast enough if I weren't planning on
> dealing with the spindle. I also want a C-axis; which means either using
> a stepper motor or a servo; servos are less sensitive to timing pulses,
> but an 0.75kW servo isn't cheap.
>
>
> Pretty much any Cortex-M device would probably be sufficient though.

Indeed.

But a better route for you would be dedicated motor controller chips.
TI has a selection to get you started, and there are other
manufacturers. These are nothing more sophisticated than a
microcontroller of a suitable style that comes with software in place -
they will save you an enormous amount of development time. And it means
you can pick the type of motor based on what is good for the purpose,
rather than based on what you think you can control. For example, BLDC
or PMSM take more effort to control but give much smoother and more
efficient motor drives than steppers.

>
> But, yeah, I was going the FPGA route partly because I felt like trying
> to do so (and am otherwise basically just waiting for the lathe to stop
> being on back-order / procrastinating from other stuff I could be doing
> / ...). Also allows hardware-accelerating the pulse-generation part and
> thus can probably run this off of a 1MHz tick or similar (for extra
> smooth pulse timing).

You are joking, right? You want to use an /FPGA/ so that you can get 1
MHz ticks? A $1 microcontroller will generate pulses with 20 ns
accuracy (50 MHz ticks). An FPGA can do it faster, and is useful when
you need to run lots of these in parallel.
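
A hedged sketch of what that looks like on a typical microcontroller;
every all-caps TIMER_*/STEP_* name and MMIO address below is a
hypothetical placeholder for the part's real timer and GPIO registers:

#include <stdint.h>

/* Hypothetical MMIO addresses -- substitute the real part's registers. */
#define TIMER_COMPARE       (*(volatile uint32_t *)0x40000024u)
#define STEP_PIN_WRITE(v)   (*(volatile uint32_t *)0x50000000u = (v))
#define TIMER_IRQ_DISABLE() (*(volatile uint32_t *)0x40000020u = 0)

volatile uint32_t steps_remaining;    /* loaded by the motion planner */
volatile uint32_t half_period_ticks;  /* timer ticks between pin edges */

/* Compare-match ISR: toggle the step pin and re-arm the compare
   register, so edge placement has timer-tick accuracy regardless of
   software jitter. */
void timer_compare_isr(void)
{
    static uint8_t level;
    TIMER_COMPARE += half_period_ticks;   /* schedule the next edge */
    level ^= 1;
    STEP_PIN_WRITE(level);
    if (level == 0 && steps_remaining != 0 && --steps_remaining == 0)
        TIMER_IRQ_DISABLE();              /* move complete */
}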

Bruce Hoult

unread,
May 31, 2018, 5:53:25 AM5/31/18
to
On Thursday, May 31, 2018 at 7:48:33 PM UTC+12, already...@yahoo.com wrote:
> On Thursday, May 31, 2018 at 6:04:34 AM UTC+3, Bruce Hoult wrote:
> > On Wednesday, May 30, 2018 at 11:36:21 PM UTC+12, already...@yahoo.com wrote:
> > > On Wednesday, May 30, 2018 at 2:19:37 PM UTC+3, Bruce Hoult wrote:
Here's current data (this year) on the total size of SPEC CPU2006 on various CPU architectures, with I believe everything compiled with the same current version of gcc with the same flags.

http://hoult.org/spec_size.png

Observations:

-- in 32 bit, Thumb2 and rv32c are neck and neck; x86 and MIPS16e are about 25% bigger, fixed-length 32 bit instruction ARM and RISC-V are about 10% - 15% bigger than x86, and MIPS32 is .. massive. PowerPC is not shown but I think from other experience would be maybe midway between ARM and MIPS.

-- in 64 bit 16/32 RISC-V (rv64c) is BY FAR the smallest code. Aarch64 is next, by a hair, with x86_64 pretty much the same and both of them 30% bigger than rv64c. Pure fixed length rv64 is about 10% bigger than x86_64 & arm64. MIPS64 again is massive. Not shown, PPC64 and Alpha (with the later added byte load/store etc) would I think again be half way between rv64 and MIPS, and ia64 would be much worse than MIPS.

ARM have done a very nice job with Aarch64 as a fixed length 32 bit instruction RISC. It soundly beats every other fixed 32 bit instruction length ISA for size (including ARM32), and even pips x86_64 which suffers pretty badly from all the extra low information-content REX prefix bytes.

ARM have, I believe, screwed up with Aarch64 by throwing away the lessons learned with Thumb2, namely that 16/32 instructions are a damned good idea enabling market-leading code size at very minimal added decode complexity.

Note 1: the scales on the 32 bit and 64 bit charts are not quite the same -- I suspect MIPS64 code is not in fact smaller than MIPS32 code. rv64c code is very slightly bigger than rv32c code mostly due to the loss of several opcodes from the 16 bit encodings, namely +/-2KB offset function call (little used outside of small embedded systems) and load/store single precision float to make way for addiw and load/store 64 bit int respectively.

Note 2: the RISC-V code uses (as noted) "millicode" functions to save/restore registers in function prologues/epilogues in lieu of the push/pop multiple instructions in the 32 bit ARM instruction sets. This saves about 4% in code size at a similar increase in execution time. The user can choose whether to make this trade-off or not (gcc has a flag). The result of turning this off would be to make rv32c code a little bigger than Thumb2, but still significantly smaller than everything else.

Disclaimer: several months ago I joined SiFive, who made this chart and manufacture RISC-V processors. I believe the results to be fair and honest, and earlier versions of this chart (from Berkeley) are one of the reasons I became interested in RISC-V and eventually joined the company.

already...@yahoo.com

unread,
May 31, 2018, 6:53:30 AM5/31/18
to
The differences vs my measurements are as expected. Workstation-style code contains fewer long immediates and almost no absolute addresses. That leads to a relatively better showing both for fixed-length ARM32 and for Thumb2.
The MIPS32-to-x386 ratio in their test is very close to the Nios2-to-x386 ratio in my test.

What I don't understand is why RV32 is doing so well. I would expect it to be a little worse than Nios2 or MIPS32r6. All 3 ISAs are very similar, but feature-for-feature, RV32 is either equal to the other two or worse than them, never better. Maybe the difference vs Nios2 is in the maturity of the compiler? The Nios2 compiler was pretty new back in 2008, and since then most of the investment didn't go into improving code density. As to the difference vs MIPS32, I am not sure. Maybe they didn't compile for Release 6?

already...@yahoo.com

unread,
May 31, 2018, 7:20:59 AM5/31/18
to
On Thursday, May 31, 2018 at 12:53:25 PM UTC+3, Bruce Hoult wrote:
>
> Disclaimer: several months ago I joined SiFive, who made this chart and manufacture RISC-V processors. I believe the results to be fair and honest, and earlier versions of this chart (from Berkeley) are one of the reasons I became interested in RISC-V and eventually joined the company.

So, by now you know the power characteristics of the SiFive U54-MC (now renamed to U540-C000?)?
7 months ago you said "we'll have to wait a month or two"...

Bruce Hoult

unread,
May 31, 2018, 8:26:40 AM5/31/18
to
Hmm .. well, I *have* one...

http://hoult.org/unleashed.jpg

I could say one or two things about the *performance* characteristics. In general, at 1.5 GHz it's noticeably faster than a 1.2 GHz A53, but noticeably slower than a 1.5 GHz A53. Pretty much splits the difference, depending on the exact program.

In general use as a Linux workstation it feels much more like an Odroid C2 (1.536 GHz A53) than like a Raspberry Pi 3 (1.2 GHz A53). No doubt the gig Ethernet (like the Odroid, unlike the Pi) and 8 GB of DDR4 (vs 2 GB on Odroid and 1 GB on Pi) help with that.

Given that it's only a single-issue CPU vs dual issue for the A53, that's pretty nice.

As I'm working from home in New Zealand I don't have anything to measure the power use. It doesn't get hot and the fan is almost certainly not actually needed -- it's just there on the initial run of boards to be really sure.

Anton Ertl

unread,
May 31, 2018, 9:52:29 AM5/31/18
to
Bruce Hoult <bruce...@gmail.com> writes:
>-- in 32 bit, Thumb2 and rv32c are neck and neck; x86 and MIPS16e are
>about 25% bigger, fixed-length 32 bit instruction ARM and RISC-V are
>about 10% - 15% bigger than x86, and MIPS32 is .. massive.

Quite a while ago (Sep 2 13:33:55 1999) I did some code size
measurements and posted them here:

|Here's a table of sizes of just the .text segment (strings seem to
|reside in .rdata, .rodata on all machines except HP/UX; there I
|subtracted the $LIT$ size from the text size):
|
| main.o engine.o gcc version
|IA32 Linux 4692 22192 2.7.2.1 -m486
|IA32 Linux 4321 19780 2.7.2.3
|MIPS Ultrix 4656 22928 2.4.5
|MIPS Irix 7504 25360 egcs-1.1.2?
|Alpha DU 7296 24368 2.7.2.2
|Alpha Linux 6976 24736 egcs-1.0.3
|SPARC Solaris 4908 21012 2.8.1
|HPPA HP/UX 5656 19000 2.8.1

MIPS Ultrix produces relatively small code for main.o, while MIPS IRIX
produces the biggest code for main.o. Possible causes:
Position-independent code on MIPS IRIX (but not on Ultrix),
differences in compiler settings that affect code size (e.g., when to
inline and how much to unroll).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

MitchAlsup

unread,
May 31, 2018, 12:57:33 PM5/31/18
to
MY 66000

foo: XOR R2,R2,0xfedcba9876543210
JMP R1

MitchAlsup

unread,
May 31, 2018, 12:58:49 PM5/31/18
to
MY 66000

regupdate:
STW 0xfedcba09,[0x876543210]
RET

matt...@gmail.com

unread,
May 31, 2018, 1:03:26 PM5/31/18
to
On Thursday, May 31, 2018 at 4:53:25 AM UTC-5, Bruce Hoult wrote:
> Here's current data (this year) on the total size of SPEC CPU2006 on various CPU architectures, with I believe everything compiled with the same current version of gcc with the same flags.
>
> http://hoult.org/spec_size.png
>
> Observations:
>
> -- in 32 bit, Thumb2 and rv32c are neck and neck; x86 and MIPS16e are about 25% bigger, fixed-length 32 bit instruction ARM and RISC-V are about 10% - 15% bigger than x86, and MIPS32 is .. massive. PowerPC is not shown but I think from other experience would be maybe midway between ARM and MIPS.

Vince Weaver's total executable size results put RV32IMC and x86 in a virtual tie. Thumb2 tries to compete with the 68k but can't.

1) 68k
2) Thumb2 +6%
3) Thumb1 +8%
4) RISCV32IMC +13%
5) x86 +13%
6) SH-3 +16%
7) RISCV64IMC +19%
8) x86_64 +20%
9) AArch64 +28%
10) ARM EABI +36%
11) PowerPC +36%
12) SPARC +43%
13) MIPS +54%

It's convenient how the 68k is excluded from most comparisons even in a thread about the 68k. The ISA, ABI and 68k compiler backends have plenty of room to improve too. The x86/x86_64 ISA started with an exhausted encoding space so adding new instructions decreases code density. The 68k has free room where adding instructions can increase code density.

> -- in 64 bit 16/32 RISC-V (rv64c) is BY FAR the smallest code. Aarch64 is next, by a hair, with x86_64 pretty much the same and both of them 30% bigger than rv64c. Pure fixed length rv64 is about 10% bigger than x86_64 & arm64. MIPS64 again is massive. Not shown, PPC64 and Alpha (with the later added byte load/store etc) would I think again be half way between rv64 and MIPS, and ia64 would be much worse than MIPS.

It looks like RISCV64IMC and x86_64 are in a virtual tie here.

1) RISCV64IMC
2) x86_64 +1%
3) AArch64 +7%

Maybe there is just a lack of 64 bit competition that even cares about code density. I expect there is substantial room for improvement here.

> ARM have done a very nice job with Aarch64 as a fixed length 32 bit instruction RISC. It soundly beats every other fixed 32 bit instruction length ISA for size (including ARM32), and even pips x86_64 which suffers pretty badly from all the extra low information-content REX prefix bytes.
>
> ARM have, I believe, screwed up with Aarch64 by throwing away the lessons learned with Thumb2, namely that 16/32 instructions are a damned good idea enabling market-leading code size at very minimal added decode complexity.

AArch64 does have good code density for a 32 bit fixed length encoding but it is less competitive for lower end embedded than Thumb2. Also, did the enhancements give enough single core performance boost to compete with x86_64 for the mid to high end CPU market?

Anton Ertl

unread,
May 31, 2018, 5:56:23 PM5/31/18
to
matt...@gmail.com writes:
>It's convenient how the 68k is excluded from most comparisons even in a
>thread about the 68k.

I have old results that include the 68k: From
<2002Dec1...@a0.complang.tuwien.ac.at>:

|I produced numbers for gzip and grep; for gzip I used the following
|commands:
|
|for i in alpha arm hppa i386 ia64 m68k mips mipsel powerpc s390 sparc; do wget http://ftp.<your_mirror>.debian.org/debian/pool/main/g/gzip/gzip_1.3.5-1_$i.deb; done
|for i in gzip_1.3.5-1_*.deb; do ar x $i; tar xfz data.tar.gz ./bin/gzip; echo "`size bin/gzip|cut -f 1-3`" $i|grep -v text; done
|
|(GNU size 2.9.5 on i386 did not work on alpha and ia64 binaries, but
|GNU size 2.10.91 on alpha did work).
|
|Here are the results:
|
| gzip_1.3.5-1 grep_2.4.2_3
| text data bss text data bss
| 73691 7760 329088 67542 3913 2152 alpha
| 55384 3012 332960 52012 1224 1896 arm
| 60045 2832 329000 60643 988 1956 hppa
| 44258 3100 329264 43772 1212 2068 i386
| 112417 7344 329144 109004 3768 2200 ia64
| 40566 3012 328956 39030 1212 1916 m68k
| 86846 3432 329088 78918 1212 2004 mips
| 86766 3432 329088 78886 1212 2004 mipsel
| 58312 3356 328992 55886 1504 1940 powerpc
| 54419 3012 329000 53238 1224 1948 s390
| 58228 2884 329024 58144 1036 1960 sparc
|
|And here's the same data sorted by gzip text size:
|
| 40566 3012 328956 39030 1212 1916 m68k
| 44258 3100 329264 43772 1212 2068 i386
| 54419 3012 329000 53238 1224 1948 s390
| 55384 3012 332960 52012 1224 1896 arm
| 58228 2884 329024 58144 1036 1960 sparc
| 58312 3356 328992 55886 1504 1940 powerpc
| 60045 2832 329000 60643 988 1956 hppa
| 73691 7760 329088 67542 3913 2152 alpha
| 86766 3432 329088 78886 1212 2004 mipsel
| 86846 3432 329088 78918 1212 2004 mips
| 112417 7344 329144 109004 3768 2200 ia64

Bruce Hoult

unread,
May 31, 2018, 7:55:59 PM5/31/18
to
On Friday, June 1, 2018 at 5:03:26 AM UTC+12, matt...@gmail.com wrote:
> On Thursday, May 31, 2018 at 4:53:25 AM UTC-5, Bruce Hoult wrote:
> > Here's current data (this year) on the total size of SPEC CPU2006 on various CPU architectures, with I believe everything compiled with the same current version of gcc with the same flags.
> >
> > http://hoult.org/spec_size.png
> >
> > Observations:
> >
> > -- in 32 bit, Thumb2 and rv32c are neck and neck; x86 and MIPS16e are about 25% bigger, fixed-length 32 bit instruction ARM and RISC-V are about 10% - 15% bigger than x86, and MIPS32 is .. massive. PowerPC is not shown but I think from other experience would be maybe midway between ARM and MIPS.
>
> Vince Weaver's total executable size results put RV32IMC and x86 in a virtual tie. Thumb2 tries to compete with the 68k but can't.
>
> 1) 68k
> 2) Thumb2 +6%
> 3) Thumb1 +8%
> 4) RISCV32IMC +13%
> 5) x86 +13%
> 6) SH-3 +16%
> 7) RISCV64IMC +19%
> 8) x86_64 +20%
> 9) AArch64 +28%
> 10) ARM EABI +36%
> 11) PowerPC +36%
> 12) SPARC +43%
> 13) MIPS +54%

While I find Vince's page fun and have myself contributed improvements to (IIRC) at least ARM, Thumb{2}, Aarch64, and RISC-V, it should be remembered that this is a very small program, is untypical (has an unusually high amount of byte manipulation), and people are using every hand-coded assembler trick in the book. Code produced by compilers (especially the same version of the same compiler) is much more representative of how people actually use and experience each ISA.

> It's convenient how the 68k is excluded from most comparisons even in a thread about the 68k.

A bit unfair when in a previous message I said "m68000 can do it in two instructions and 12 bytes (one more than i386)".

https://groups.google.com/d/msg/comp.arch/wzZW4Jo5tbM/W_1OalsXCwAJ

I even gave the 68k the benefit of the doubt by not counting the link and unlk instructions (two instructions and six bytes) I couldn't find a way to turn off.

Bruce Hoult

unread,
May 31, 2018, 8:55:56 PM5/31/18
to
The supported ISAs have changed since then, as has the version of gzip :-) Also, the archive format used is now .xz, not .gz

I just repeated your test using:

for i in armhf arm64 i386 amd64 mips mipsel powerpc ppc64 s390x; do wget http://ftp.debian.org/debian/pool/main/g/gzip/gzip_1.6-5+b1_$i.deb; done

In addition I fetched:

for i in alpha hppa ia64 m68k riscv64 ppc64 sh4 sparc64 x32; do wget http://ftp.ports.debian.org/debian-ports/pool-$i/main/g/gzip/gzip_1.6-5_$i.deb; done

Hopefully the difference between 1.6-5 and 1.6-5+b1 is trivial.

Results (text/data/bss):

111822 2408 329656 gzip_1.6-5_alpha.deb
92197 3720 330088 gzip_1.6-5+b1_amd64.deb
83760 3968 329664 gzip_1.6-5+b1_arm64.deb
62747 2308 329496 gzip_1.6-5+b1_armhf.deb
97895 2292 329864 gzip_1.6-5+b1_i386.deb
183584 3300 329672 gzip_1.6-5+b1_ia64.deb
98698 2524 329680 gzip_1.6-5+b1_mips.deb
98490 2524 329680 gzip_1.6-5+b1_mipsel.deb
89659 1452 329516 gzip_1.6-5+b1_powerpc.deb
99137 5560 331664 gzip_1.6-5+b1_ppc64.deb
72707 3900 329672 gzip_1.6-5+b1_riscv64.deb
117578 3848 329664 gzip_1.6-5+b1_s390x.deb
88951 1780 329536 gzip_1.6-5_hppa.deb
71140 1432 329480 gzip_1.6-5_m68k.deb
68552 1436 329488 gzip_1.6-5_sh4.deb
83336 1532 329712 gzip_1.6-5_sparc64.deb
84140 1936 330024 gzip_1.6-5_x32.deb

Sorted by text size:

62747 2308 329496 gzip_1.6-5+b1_armhf.deb
68552 1436 329488 gzip_1.6-5_sh4.deb
71140 1432 329480 gzip_1.6-5_m68k.deb
72707 3900 329672 gzip_1.6-5+b1_riscv64.deb
83336 1532 329712 gzip_1.6-5_sparc64.deb
83760 3968 329664 gzip_1.6-5+b1_arm64.deb
84140 1936 330024 gzip_1.6-5_x32.deb
88951 1780 329536 gzip_1.6-5_hppa.deb
89659 1452 329516 gzip_1.6-5+b1_powerpc.deb
92197 3720 330088 gzip_1.6-5+b1_amd64.deb
97895 2292 329864 gzip_1.6-5+b1_i386.deb
98490 2524 329680 gzip_1.6-5+b1_mipsel.deb
98698 2524 329680 gzip_1.6-5+b1_mips.deb
99137 5560 331664 gzip_1.6-5+b1_ppc64.deb
111822 2408 329656 gzip_1.6-5_alpha.deb
117578 3848 329664 gzip_1.6-5+b1_s390x.deb
183584 3300 329672 gzip_1.6-5+b1_ia64.deb

So i386 and s390 are now among the biggest instead of the smallest!

armhf is the clear winner, with sh4, m68k and riscv64 not far behind (especially note only 2% difference between m68k and riscv64 .. I suspect riscv32 might be smaller than m68k, but it's not supported in Linux yet).

sparc64, arm64 and x32 make another tight group.

The above results charted:

https://pbs.twimg.com/media/DekR00yVAAUgH8s.jpg

matt...@gmail.com

unread,
Jun 1, 2018, 12:56:10 AM6/1/18
to
On Thursday, May 31, 2018 at 6:55:59 PM UTC-5, Bruce Hoult wrote:
> On Friday, June 1, 2018 at 5:03:26 AM UTC+12, matt...@gmail.com wrote:
> > On Thursday, May 31, 2018 at 4:53:25 AM UTC-5, Bruce Hoult wrote:
> > > Here's current data (this year) on the total size of SPEC CPU2006 on various CPU architectures, with I believe everything compiled with the same current version of gcc with the same flags.
> > >
> > > http://hoult.org/spec_size.png
> > >
> > > Observations:
> > >
> > > -- in 32 bit, Thumb2 and rv32c are neck and neck; x86 and MIPS16e are about 25% bigger, fixed-length 32 bit instruction ARM and RISC-V are about 10% - 15% bigger than x86, and MIPS32 is .. massive. PowerPC is not shown but I think from other experience would be maybe midway between ARM and MIPS.
> >
> > Vince Weaver's total executable size results put RV32IMC and x86 in a virtual tie. Thumb2 tries to compete with the 68k but can't.
> >
> > 1) 68k
> > 2) Thumb2 +6%
> > 3) Thumb1 +8%
> > 4) RISCV32IMC +13%
> > 5) x86 +13%
> > 6) SH-3 +16%
> > 7) RISCV64IMC +19%
> > 8) x86_64 +20%
> > 9) AArch64 +28%
> > 10) ARM EABI +36%
> > 11) PowerPC +36%
> > 12) SPARC +43%
> > 13) MIPS +54%
>
> While I find Vince's page fun and have myself contributed improvements to (IIRC) at least ARM, Thumb{2}, Aarch64, and RISC-V, it should be remembered that this is a very small program, is untypical (has an unusually high amount of byte manipulation), and people are using every hand-coded assembler trick in the book. Code produced by compilers (especially the same version of the same compiler) is much more representative of how people actually use and experience each ISA.

Code produced by compilers is representative of how good the compiler support is and varies greatly by options used (different code models, optimization options, not omitting the frame pointer, etc.). Anton's old results before GCC went downhill are closer to Vince Weaver's results than your results. GCC 2.95.3 was probably the best version for 68k integer code quality.

> > It's convenient how the 68k is excluded from most comparisons even in a thread about the 68k.
>
> A bit unfair when in a previous message I said "m68000 can do it in two instructions and 12 bytes (one more than i386)".

You didn't show code like the other architectures, but it is trivial.

> https://groups.google.com/d/msg/comp.arch/wzZW4Jo5tbM/W_1OalsXCwAJ
>
> I even gave the 68k the benefit of the doubt by not counting the link and unlk instructions (two instructions and six bytes) I couldn't find a way to turn off.

For GCC, -fomit-frame-pointer is supposed to turn the frame pointer off, but some versions of GCC still sometimes generate LINK and UNLK instructions for the 68k target. GCC is bad about generating LINK and UNLK for functions with no variables too. The frame pointer should be turned off for performance benchmarks. The x86 target is severely degraded with only 8 GP registers, but the 68k should show some code density improvement. I believe the x86_64 target omits the frame pointer by default.
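
To make the LINK/UNLK overhead concrete, here is a hypothetical leaf function with the 68k code one would expect sketched in comments (my example; instruction sequences are illustrative, not compiler output):

    /* With the frame pointer omitted, a 68k compiler needs no frame
       at all for this, roughly:
           move.l 4(sp),d0
           add.l  8(sp),d0
           rts
       Versions that still emit a frame wrap the body in
           link a6,#0 ... unlk a6
       which is two extra instructions and six extra bytes for nothing. */
    int add2(int a, int b)
    {
        return a + b;
    }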

Bruce Hoult

unread,
Jun 1, 2018, 1:43:18 AM6/1/18
to
On Friday, June 1, 2018 at 4:56:10 PM UTC+12, matt...@gmail.com wrote:
> On Thursday, May 31, 2018 at 6:55:59 PM UTC-5, Bruce Hoult wrote:
> > On Friday, June 1, 2018 at 5:03:26 AM UTC+12, matt...@gmail.com wrote:
> > > On Thursday, May 31, 2018 at 4:53:25 AM UTC-5, Bruce Hoult wrote:
> > > > Here's current data (this year) on the total size of SPEC CPU2006 on various CPU architectures, with I believe everything compiled with the same current version of gcc with the same flags.
> > > >
> > > > http://hoult.org/spec_size.png
> > > >
> > > > Observations:
> > > >
> > > > -- in 32 bit, Thumb2 and rv32c are neck and neck; x86 and MIPS16e are about 25% bigger, fixed-length 32 bit instruction ARM and RISC-V are about 10% - 15% bigger than x86, and MIPS32 is .. massive. PowerPC is not shown but I think from other experience would be maybe midway between ARM and MIPS.
> > >
> > > Vince Weaver's total executable size results put RV32IMC and x86 in a virtual tie. Thumb2 tries to compete with the 68k but can't.
> > >
> > > 1) 68k
> > > 2) Thumb2 +6%
> > > 3) Thumb1 +8%
> > > 4) RISCV32IMC +13%
> > > 5) x86 +13%
> > > 6) SH-3 +16%
> > > 7) RISCV64IMC +19%
> > > 8) x86_64 +20%
> > > 9) AArch64 +28%
> > > 10) ARM EABI +36%
> > > 11) PowerPC +36%
> > > 12) SPARC +43%
> > > 13) MIPS +54%
> >
> > While I find Vince's page fun and have myself contributed improvements to (IIRC) at least ARM, Thumb{2}, Aarch64, and RISC-V, it should be remembered that this is a very small program, is untypical (has an unusually high amount of byte manipulation), and people are using every hand-coded assembler trick in the book. Code produced by compilers (especially the same version of the same compiler) is much more representative of how people actually use and experience each ISA.
>
> Code produced by compilers is representative of how good the compiler support is and varies greatly by options used (different code models, optimization options, not omitting the frame pointer, etc.).

Compiler quality is something that 99% of users simply have to live with. As a compiler engineer I believe that the user should be able to write simple -O and get not-stupid code, and -O2 should produce code that runs as fast as possible without bloating the program size. A user having to explicitly add -funroll-loops or -fomit-frame-pointer or alignment directives instead of just using -O2 or -O3 is rubbish design. At *most* have a flag to specify which micro-architecture to tune for (defaulting to the one you're running on).

>Anton's old results before GCC went downhill are closer to Vince Weaver's results than your results. GCC 2.95.3 was probably the best version for 68k integer code quality.

I remember those days. In 2002 and 2003 the place I was working at was clinging grimly to gcc 2.95 because the 3.0.x releases were absolute rubbish. I think we finally relented when 3.3 came out.

That was 15 years and about SIXTY versions ago.

Maybe time to let go? The current versions are actually quite good. If no one cares about current release code quality on m68k to improve it then .. I guess that means no one cares.


> > > It's convenient how the 68k is excluded from most comparisons even in a thread about the 68k.
> >
> > A bit unfair when in a previous message I said "m68000 can do it in two instructions and 12 bytes (one more than i386)".
>
> You didn't show code like the other architectures, but it is trivial.
>
> > https://groups.google.com/d/msg/comp.arch/wzZW4Jo5tbM/W_1OalsXCwAJ
> >
> > I even gave the 68k the benefit of the doubt by not counting the link and unlk instructions (two instructions and six bytes) I couldn't find a way to turn off.
>
> For GCC, -fomit-frame-pointer is supposed to turn the frame pointer off but some versions of GCC still sometimes generate LINK and UNLK instructions for the 68k target. GCC is bad about generating LINK and UNLK for functions with no variables too. The frame pointer should be turned off for performance benchmarks. The x86 target is severely degraded with only 8 GP registers but the 68k should show some code density improvement. I believe the x86_64 omits the frame pointer by default.

I tried -fomit-frame-pointer, of course. That used to be trained into my fingertips, but I haven't had to do that for many years!

I guess some people consider that you can't debug properly without "proper" stack frames with FP chaining and you must follow the ABI, no exceptions. No problem .. if they want bigger and slower code on their platform that's up to them.

BGB

unread,
Jun 1, 2018, 3:40:57 AM6/1/18
to
Yeah.

I remember back when I was trying to size-tune my compiler's output
(comparing against sh4-gcc; and x86/x64 for reference). Mostly I was
testing using Quake and a few random video codecs and similar.

At first I thought I had sh4-gcc beat, but then noted that the C
library was still being built with -O3, so a lot bigger than ideal. Then
had everything building with -Os, thought I had it beat again, but
discovered that GCC/ld was leaving a lot of padding and symbols still in
the binary, so used "strip", ...

( Using "strip" made it more fair as my compiler wasn't putting piles of
symbols in the binaries. ).

On the x86 side, I found that with size-optimizations enabled, VS2008
would produce the smallest binaries, VS2015 rather large binaries, and
i386-gcc would produce intermediate sizes (roughly comparable to the
Thumb2 sizes).

IIRC, in my tests I had failed to match up with the VS2008 binaries, but
was able to roughly catch up with stripped "-Os" Thumb2 binaries (which
were generally smaller than what I was getting from sh4-gcc).

Some of this involved using tricks like branching back to prior matching
function epilogs, which had a minor speed cost but did shave off some of
the code footprint.


But, somehow it did pretty ok despite being kind of naive/stupid
sometimes...


>> Anton's old results before GCC went downhill are closer to Vince Weaver's results than your results. GCC 2.95.3 was probably the best version for 68k integer code quality.
>
> I remember those days. In 2002 and 2003 the place I was working at was clinging grimly to gcc 2.95 because the 3.0.x releases were absolute rubbish. I think we finally relented when 3.3 came out.
>
> That was 15 years and about SIXTY versions ago.
>
> Maybe time to let go? The current versions are actually quite good. If no one cares about current release code quality on m68k to improve it then .. I guess that means no one cares.
>

I have reservations about m68k as it seems like it would be pretty steep
to do an efficient FPGA implementation of it.


Simpler RISC-style ISAs seem like they will cost less to implement in an
FPGA; likewise if one avoids (wherever possible) supporting features
which require internal state machines or would imply the use of
microcode.

Though, even with seemingly relatively simple CPU designs one may still
be using roughly 5 kLUT or so... (And a bit more if the design gets more
advanced).


Though, granted, BSR1 is now my first CPU core to actually get all the
way through a start-up program in simulation (though bugs and holes
likely remain, at least this means most of the basic parts of the ISA
are working).

Similarly, a more minimalist design would be possible if there were a
need for lower LUT cost. But then again, I am realizing Vivado doesn't
support Spartan-6 devices, so apparently I would also need to install
ISE to use these.


>
>>>> It's convenient how the 68k is excluded from most comparisons even in a thread about the 68k.
>>>
>>> A bit unfair when in a previous message I said "m68000 can do it in two instructions and 12 bytes (one more than i386)".
>>
>> You didn't show code like the other architectures, but it is trivial.
>>
>>> https://groups.google.com/d/msg/comp.arch/wzZW4Jo5tbM/W_1OalsXCwAJ
>>>
>>> I even gave the 68k the benefit of the doubt by not counting the link and unlk instructions (two instructions and six bytes) I couldn't find a way to turn off.
>>
>> For GCC, -fomit-frame-pointer is supposed to turn the frame pointer off but some versions of GCC still sometimes generate LINK and UNLK instructions for the 68k target. GCC is bad about generating LINK and UNLK for functions with no variables too. The frame pointer should be turned off for performance benchmarks. The x86 target is severely degraded with only 8 GP registers but the 68k should show some code density improvement. I believe the x86_64 omits the frame pointer by default.
>
> I tried -fomit-frame-pointer, of course. That used to be trained into my fingertips, but I haven't had to do that for many years!
>
> I guess some people consider that you can't debug properly without "proper" stack frames with FP chaining and you must follow the ABI, no exceptions. No problem .. if they want bigger and slower code on their platform that's up to them.
>

Frame pointers seem like they are mostly rendered moot if one has
debug-info which knows about stack frame layout and similar (which is
sort of needed to inspect variables). But, I guess proper debug info
would also need to know about register allocation, ...

Bruce Hoult

unread,
Jun 1, 2018, 6:48:32 AM6/1/18
to
On Friday, June 1, 2018 at 7:40:57 PM UTC+12, BGB wrote:
> > I guess some people consider that you can't debug properly without "proper" stack frames with FP chaining and you must follow the ABI, no exceptions. No problem .. if they want bigger and slower code on their platform that's up to them.
>
> Frame pointers seem like they are mostly rendered moot if one has
> debug-info which knows about stack frame layout and similar (which is
> sort of needed to inspect variables). But, I guess proper debug info
> would also need to know about register allocation, ...

Exactly.

Thinking back into the mists of time, frame pointers seem to be associated with CPUs/ABIs where all arguments were passed on the stack -- and pre-ANSI C where everything was effectively varargs and functions were supposed to work even if more arguments had been passed than expected. You also had things constantly being pushed onto the stack or popped off (one variable at a time) and so the offset from the stack pointer to any given variable is different at different times. Not too hard for the compiler to keep track of, but makes the code harder to read for a human.
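
A concrete picture of why the frame pointer helped there (i386-style, hypothetical function; the offsets are the usual ones but given only for illustration):

    int f(int a)
    {
        /* Frame-pointer form: 'a' is 8(%ebp) for the whole function,
           no matter what has been pushed since the prologue.
           Frameless form: 'a' starts at 4(%esp), but becomes 8(%esp)
           after one push, 12(%esp) after two, and so on, so every
           reference must track the current stack depth. */
        return a + 1;
    }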

Anton Ertl

unread,
Jun 1, 2018, 10:10:32 AM6/1/18
to
Bruce Hoult <bruce...@gmail.com> writes:
>On Friday, June 1, 2018 at 4:56:10 PM UTC+12, matt...@gmail.com wrote:
>>Anton's old results before GCC went downhill are closer to Vince Weaver's results than your results. GCC 2.95.3 was probably the best version for 68k integer code quality.
>
>I remember those days. In 2002 and 2003 the place I was working at was clinging grimly to gcc 2.95 because the 3.0.x releases were absolute rubbish. I think we finally relented when 3.3 came out.
>
>That was 15 years and about SIXTY versions ago.
>
>Maybe time to let go? The current versions are actually quite good.

Let's see:

Here's the disassembly of the primitive + of gforth-fast (from Gforth
0.7.0) on IA-32 (unfortunately, no AMD64 version of gcc-2.95):

gcc-2.95 gcc-7.2.0
mov eax , dword ptr 4 [esi] add dword ptr 114 [esp] , # 4
add esi , # 4 add ebp , # 4
add ecx , eax mov edi , dword ptr 114 [esp]
add ebx , # 4 mov esi , dword ptr [edi]
mov eax , dword ptr -4 [ebx] add dword ptr 28 [esp] , esi
jmp eax \ $FF $E0 mov ecx , dword ptr -4 [ebp]
jmp ecx
mov ecx , dword ptr -4 [ebp]
jmp ecx
jmp ecx

Gcc-7.2 does not manage to produce a good register allocation by
itself (the same holds for gcc-2.95), but unfortunately, the explicit
register allocations that we have in Gforth-0.7 don't work. I tried
to transplant the explicit register allocation from the latest
version, but no success, either.

The other new thing in the new gcc is that it duplicates the code
after labels, leading to the repetitions of the last few instructions.
This disables the "dynamic superinstruction" optimization of Gforth,
which would give a speedup by a factor of about 2.
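
For anyone who hasn't seen the technique, here is a minimal direct-threaded dispatch loop in GNU C (a toy sketch, not Gforth's actual source; the primitive set and names are made up). Dynamic superinstructions copy the machine code of consecutive primitives back to back so the indirect jumps between them disappear; duplicating the code after the dispatch labels defeats that concatenation:

    #include <stdio.h>
    #include <stdint.h>

    typedef void *Inst;

    /* ip: threaded code (label addresses plus inline operands),
       sp: data stack growing downward, labels: label-export mode. */
    static long run(Inst *ip, long *sp, Inst *labels)
    {
        if (labels) {   /* first call: export label addresses only */
            labels[0] = &&lit; labels[1] = &&plus; labels[2] = &&halt;
            return 0;
        }
        goto *(*ip++);
    lit:  *--sp = (long)(intptr_t)*ip++; goto *(*ip++);
    plus: sp[1] += sp[0]; sp++;          goto *(*ip++);
    halt: return sp[0];
    }

    int main(void)
    {
        Inst lab[3];
        long stack[16];
        run(NULL, NULL, lab);
        Inst prog[] = { lab[0], (Inst)(intptr_t)2, lab[0],
                        (Inst)(intptr_t)3, lab[1], lab[2] };
        printf("%ld\n", run(prog, stack + 16, NULL)); /* prints 5 */
        return 0;
    }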

I have seen the same effect on RV64G, but not for AMD64; for RV64G,
the only way to disable it I have found is to use -O1 (instead of the
usual -O2). I played around with the flags that are different between
-O2 and -O1 (as output by one of gcc's debugging options), but even if
I put in flags such that the flags shown by the debugging output are
the same, the difference between -O1 and -O2 remains. Unfortunately,
-O1 produces a different issue on AMD64, so a general setting of -O1
is not the solution, either.

Are the current versions quite good? Not really.

- anton

Stefan Monnier

unread,
Jun 1, 2018, 11:45:56 AM6/1/18
to
> So, where would you stop?
> 8bit integer - Yes
> 16bit integer - on 16-bit CPU obviously yes, what about wider CPUs?
> 32bit integer - Yes
> 64bit integer - The same 'Yes" as as 8b and 32b or much more limited, as only GPR load?
> 32bit Floating point ?
> 64bit Floating point ?
> 128bit SIMD ?
> 256bit SIMD ?
> 512bit SIMD ?

Probably depends on expected frequency of occurrence and expected
cost of the implementation alternatives.
64b immediates are sufficiently common and at the same time sufficiently
small (compared to a typical instruction fetch buffer) that they're
likely a good idea. I don't have enough experience with SIMD code to
judge whether it's worth it for them.
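
To make the trade-off concrete (a hypothetical example; the instruction counts are rough assumptions that vary by ISA):

    #include <stdint.h>

    /* With full-width immediates this constant travels inline with a
       single instruction; on a fixed 32-bit RISC it becomes a
       multi-instruction synthesis (lui/addi/shift style) or a load
       from a nearby constant pool. */
    uint64_t magic(void)
    {
        return 0x123456789ABCDEF0ULL;
    }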


Stefan

matt...@gmail.com

unread,
Jun 1, 2018, 12:45:41 PM6/1/18
to
On Friday, June 1, 2018 at 12:43:18 AM UTC-5, Bruce Hoult wrote:
> Compiler quality is something that 99% of users simply have to live with. As a compiler engineer I believe that the user should be able to write simple -O and get not-stupid code, and -O2 should produce code that runs as fast as possible without bloating the program size. A user having to explicitly add -funroll-loops or -fomit-frame-pointer or alignment directives instead of just using -O2 or -O3 is rubbish design. At *most* have a flag to specify which micro-architecture to tune for (defaulting to the one you're running on).

Users may have to live with whatever inferior code and bloat developers and compilers produce, but that does not mean the result is a good comparison of ISAs. Only the best code produced should be compared, and that includes whatever options are necessary: disregarding ABIs, tweaking compiler options, and even using old compilers which generated better code. Specifying a micro-architecture on x86/x86_64 will give alignment and instruction scheduling for a particular CPU, but for the 68k it will do absolutely nothing. That is why I said, "Code produced by compilers is representative of how good the compiler support is and varies greatly by options used (different code models, optimization options, not omitting the frame pointer, etc.)." Vince Weaver's code density comparison is too small, but it does measure ISA code density better than your compiled results.

> >Anton's old results before GCC went downhill are closer to Vince Weavers's results than your results. GCC 2.95.3 was probably the best version for 68k integer code quality.
>
> I remember those days. In 2002 and 2003 the place I was working at was clinging grimly to gcc 2.95 because the 3.0.x releases were absolute rubbish. I think we finally relented when 3.3 came out.
>
> That was 15 years and about SIXTY versions ago.
>
> Maybe time to let go? The current versions are actually quite good. If no one cares about current release code quality on m68k to improve it then .. I guess that means no one cares.

A compiler is a tool. Many old tools are better quality than new ones (old tools at auctions often sell for more than new made-in-China junk). Should I throw away my old tools and buy inferior-quality ones?

The new GCC wastes more of my time compiling, uses more memory, and generates code of inferior quality for my target. Maybe the right combination of fancy compiler options can make it come close to the code quality of the old GCC, but I expect they won't be turned on when specifying a 68k micro-architecture to tune for. Maybe the new versions of GCC are good for high-end x86_64 systems, and I can't deny that GCCisms are the new C standard, but I'm not a fan.

> I tried -fomit-frame-pointer, of course. That used to be trained into my fingertips, but I haven't had to do that for many years!
>
> I guess some people consider that you can't debug properly without "proper" stack frames with FP chaining and you must follow the ABI, no exceptions. No problem .. if they want bigger and slower code on their platform that's up to them.

The vbcc compiler does *not* use stack frames by default for the 68k and it usually does try to follow ABI standards. I prefer to turn off stack frames when debugging on the 68k as it greatly reduces the amount of bloat I have to look at. My debuggers work fine without stack frames. There may be some times when they are useful for debugging though.

matt...@gmail.com

unread,
Jun 1, 2018, 1:24:13 PM6/1/18
to
On Friday, June 1, 2018 at 2:40:57 AM UTC-5, BGB wrote:
> I have reservations about m68k as it seems like it would be pretty steep
> to do an efficient FPGA implementation of it.
>
>
> Simpler RISC style ISA's seem like they will have less cost for
> implementing them in an FPGA. similarly if one avoids (wherever
> possible) supporting features which require internal state machines or
> would imply the use of microcode.
>
> Though, even with seemingly relatively simple CPU designs one may still
> be using roughly 5 kLUT or so... (And a bit more if the design gets more
> advanced).

FPGA devices are cheap now. The FleaFPGA Ohm board (RPi form factor) with Lattice ECP5 FPGA was just $45 at an Indiegogo. Here are some old videos of it including in operation.

https://www.youtube.com/watch?v=mPIfhLXsYkQ
https://www.youtube.com/watch?v=6PU7TAN40Jk

The 68020 ISA with Amiga ECS custom chips (including gfx and 4-voice stereo sound) used the following resources in the ECP5 FPGA.

Resource usage:
------------------
LUT4 count: 13776 out of 24288 (56.72%)
Block RAM: 21 out of 56 2KByte-blocks (37.5%)
Slices count: 9465 out of 12144 (77.94%)
PLL count: 2 out of 2 (100%)

There is room for Amiga AGA, RTG, etc., which are already in more expensive FPGA boards. The CPU performance is 68030 level (quite usable with an efficient OS) even though the CPU core is minimal. There are a lot of capabilities for the resources used. It is mostly open source and has docs for programming the FPGA.

http://www.fleasystems.com/fleaFPGA_Ohm.html

BGB

unread,
Jun 1, 2018, 1:47:33 PM6/1/18
to
Yep.

Varargs get fun with register ABIs, as the va_list structure needs to
store all the registers which may potentially be used for arguments, in
addition to the initial stack pointer.
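
For example, the x86-64 System V va_list carries exactly that information (reproduced here as a sketch from memory; consult the ABI document for the authoritative layout):

    /* The prologue of a variadic function spills its argument
       registers into reg_save_area; va_arg consumes register slots
       first, then falls back to the stack. */
    typedef struct {
        unsigned int gp_offset;    /* offset of next unread int register */
        unsigned int fp_offset;    /* offset of next unread FP/SSE register */
        void *overflow_arg_area;   /* stack-passed arguments */
        void *reg_save_area;       /* spilled argument registers */
    } sysv_va_list[1];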

Similarly, mostly using fixed-size stack frames, and allocating
variables mostly in registers (given memory accesses are slow and thus
best avoided whenever possible).

PUSH/POP sequences are used to save/restore registers, ...

TBD: Maybe do a proper debugger for some of my stuff (vs just dumping a
disassembly of whatever code was recently executed (in my emulators) and
the current contents of the various registers).


But, for example, my compiler ends up with a funky/unresolved limitation
that currently a given va_list may only be used once in a function (via
a single va_start/va_end pair), because essentially the 'va_start' logic
ends up lifted into the prologue. Making this work would likely mean
detecting the case, and then either internally renaming the va_list, or
doing a hidden "true" 'va_start' and transforming the others into
'va_copy' operations from this first one.

I considered this limitation "acceptable", but this had come up as a
point of argument before (the whole "there may be code /somewhere/ which
depends on this" argument).
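
The usage that limitation rules out, and the va_copy rewrite just described, look like this in conforming C (hypothetical example):

    #include <stdarg.h>

    /* Two passes over the same arguments: ISO C requires the second
       pass to use a va_copy of the list; a compiler that lifts
       va_start into the prologue must synthesize this copy itself. */
    int sum_twice(int n, ...)
    {
        va_list ap, ap2;
        int i, s = 0;
        va_start(ap, n);
        va_copy(ap2, ap);
        for (i = 0; i < n; i++) s += va_arg(ap, int);
        for (i = 0; i < n; i++) s += va_arg(ap2, int);
        va_end(ap2);
        va_end(ap);
        return s;
    }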


OTOH, my compiler isn't meant to replace GCC or similar (to target
everything or accept any code it may encounter), but rather exists
because hacking on something like GCC to implement small
experimental/one-off targets is a slow/painful experience (*1). I am not
inclined to even try LLVM; its build times are far too long for me to
really consider it a plausible option for this.

As-is, my compiler is basically a small but monolithic tool, and as-is
doesn't even use object files (anymore, *2), but can sort of mimic the
".o"/".a" use-case by storing a blob of intermediate bytecode.
Essentially a stack-oriented bytecode along vaguely similar lines to
MSIL / CIL. Things like ASM code are preprocessed but then passed along
as big text blobs in this bytecode format.

Generally, this is managed by a "middle section" which as-is basically
takes ASTs as input, and produces Three-Address-Code as its output
format (which is currently what the backend receives).


*1: GCC both takes a while to build, and typically all the 'stuff' for a
given target is spread out all over the place (so, adding a new target
would involve touching code and build files all over the toolchain). GCC
seems better suited for "stable" targets which are expected to exist for
a while.

It is easier, say, when the backend can implement something analogous to
a COM interface, and pretty much everything else in the pipeline
configured via this interface, with archs identified via FOURCC pairs,
... Ex: It will ask each backend "Hey, do you understand this FOURCC
pair?", and if yes, the frontend will be configured via this backend.

TODO/TBD: add a query interface and allow frontends and backends to be
implemented via DLL plugins or similar, probably with an interface
analogous to 'DriverProc'. Less certain, as this would add some
complexity as well (vs building everything as a monolithic binary).

*2: Earlier versions/backends generally used separated stages and COFF
objects, but in more recent backends it was easier (less code) to just
use a conjoined codegen/assembler/linker process (if relevant, ASM and
COFF can be treated simply as inputs or potentially as outputs; though
my current backends are limited mostly to PE/COFF and flat-ROM outputs
and similar).


This thing has basically been beating around in my projects in various
forms for about a decade now (originally it was a fork off of an earlier
script-language interpreter of mine to attempt to make it parse C and
use C as a script-lang; but originally this didn't really go well, and
it spent years mostly being used to parse metadata from headers, then
later reused again as an actual compiler, ...).

Possibly a crap design, but works for my uses.

Terje Mathisen

unread,
Jun 1, 2018, 1:57:31 PM6/1/18
to
I would probably limit SIMD to load ops only, and only for single-use
constants.

I have some experience with using SIMD for serious code (audio codecs)
and there it actually made sense to effectively have a constant pool
since I would be using the same constants in several functions.

(One example is an array of alternating unset/set sign bits which I
would use with XOR to negate odd operands.)
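
In SSE intrinsics, that constant and its use look roughly like this (a minimal sketch; assuming 4-lane single precision):

    #include <xmmintrin.h>

    /* XOR with -0.0f flips only the sign bit. _mm_set_ps takes lanes
       in high-to-low order, so this negates lanes 1 and 3. */
    static inline __m128 negate_odd(__m128 v)
    {
        const __m128 sign_mask = _mm_set_ps(-0.0f, 0.0f, -0.0f, 0.0f);
        return _mm_xor_ps(v, sign_mask);
    }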

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"