
Tradeoffs in ISA design


Kyle Hayes

Nov 8, 2019, 6:12:26 PM
This came out long...

In the discussions around Mitch's My 66000 a couple of interesting points have come up. At the same time there is an interesting discussion about RISC-V going on on Realworldtech.com.

One of the points made recently on Realworldtech.com that made me think a bit is that we now have so many transistors at our disposal that what used to be the bottleneck in decades past (e.g. ALU) isn't any more, and thus the tradeoffs of a decade or two ago may need to be revisited.

Both of these bring up a number of tradeoffs that were made in ISA design over the years such as the following three things.

1) single instruction length vs. multiple lengths.

2) immediates in instructions vs. PC-relative loads

3) few addressing modes vs. more addressing modes (and a similar axis on addressing mode complexity).

Taking these issues one at a time:

The pendulum of instruction length has swung from very granular such as x86, Transputer etc. toward the extreme of single length instructions and is now heading back toward more lengths.

Single length instructions have some pros and cons:

Pro:

* if you align your instructions right, then you will not get instructions split across a page boundary.

Con:

* lower overall code density.
* more difficult to make ISA additions using more registers or additional fields. At some point you run out of bits.

RISC-V and My 66000 allow for several lengths of instruction. Perhaps not entirely for the same reasons. ARM uses smaller instructions for code density in order to maximize the use of cache at a cost in dynamic instruction counts. My 66000 goes for larger instructions to reduce certain types of common code, i.e. constant pool handling. RISC-V seems to do both with some of the 48-bit and 64-bit instruction forms as well as the compressed 16-bit form.

Compressing an ISA gives you fewer bits for everything. Many compressed forms drop 3-register operations and go with destructive 2-register ops instead to save a register field. Compressing an ISA removes a lot of your immediate bits. That means that your constant range drops and thus solutions like PC-relative loads start looking a lot nicer. Which means that you really may not want to skimp on your addressing modes. You will likely want to have many kinds of scaling or other manipulation of your immediate fields just to make them more useful.
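A tiny sketch of the destructive-operand point above (my illustration, not any particular ISA's actual encoding rules): a 16-bit compressed form typically has room for only two register fields, so a 3-register operation compresses only when the destination aliases one of the sources.

```python
# Hypothetical compressed-encoding check: "rd = rs1 op rs2" fits a
# destructive 2-register form only when rd doubles as a source,
# freeing up one 5-bit register field.
def can_compress(rd, rs1, rs2):
    return rd == rs1 or rd == rs2

print(can_compress(3, 3, 7))   # True: r3 = r3 + r7 needs only 2 fields
print(can_compress(3, 5, 7))   # False: needs a full 3-register encoding
```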

So the instruction length, in the case of compression, seems to have an impact on the decision whether to have large immediates vs. PC-relative loads depending on how an ISA supports multiple lengths. The decision is not exclusive.

Having support for full-size constants in an ISA seems like it would have the following impacts:

* fewer non-sequential fetches since the immediates are inline.
* possibly earlier availability of the immediate, since it avoids the PC-dependent load that would be needed with PC-relative loads.
* a small negative impact on decoding if you have multiple supported immediate widths. I think this would be fairly minor though.

It is not clear to me that having wide immediate support has a direct impact on the optimum number of addressing modes. As mentioned above, having almost no immediate fields does seem like it would change the balance toward more complicated addressing modes.

The increasing number of transistors allows more addressing mode complexity to be supported without much impact to base clock rates. But the more complicated your addressing modes are, the more bits you need to encode them.

Best,
Kyle

Brett

Nov 8, 2019, 10:20:21 PM
Kyle Hayes <kyle....@gmail.com> wrote:
> This came out long...
>
> In the discussions around Mitch's My 66000 a couple of interesting points
> have come up. At the same time there is an interesting discussion about
> RISC-V going on on Realworldtech.com.

The RISC-V discussion on RWT is silly.
There is only room for one high end RISC, ARM64 is it.
RISC-V is free low end, free low end has different constraints, which
RISC-V matches nearly perfectly.

Of course in another decade free low end will look like ARM64, which RISC-V
is not.
The size difference between a super low end ARM64 and RISC-V is maybe 20%,
a rounding error when you look at the size of the cache to make such a core
useful. And ARM64 gives more performance to pay that cost.

I think this is the fundamental complaint over on RWT.

Note that I am NOT saying that a high end ARM64 will be free, that is a
different market.
ARM will eventually give away low end ARM64 cores to fight RISC-V, and will
design a core for this market.
This will be sold to investors as feeding the market for upgrades, while
the real reason is keeping RISC-V out of the high end for longer.

> One of the points made recently on Realworldtech.com that made me think a
> bit is that we now have so many transistors at our disposal that what
> used to be the bottleneck in decades past (e.g. ALU) isn't any more, and
> thus the tradeoffs of a decade or two ago may need to be revisited.
>
> Both of these bring up a number of tradeoffs that were made in ISA design
> over the years such as the following three things.
>
> 1) single instruction length vs. multiple lengths.
>
> 2) immediates in instructions vs. PC-relative loads

PC relative loads are evil.
Means you have to track the PC pointer in the pipeline.
Pollutes the L1 data cache with L1 instruction data, wasting capacity.
It’s an extra load, when you may be load limited.
The load may delay the pipeline.
The load adds to your load queue which may be your limit.
The load is more crap to track which reduces the effective size of your OoO
window.
Maximum stupid.

Even an ordinary load from a generic pointer is better than PC loads.

Ivan Godard

Nov 9, 2019, 1:44:12 AM
On 11/8/2019 7:20 PM, Brett wrote:
> Kyle Hayes <kyle....@gmail.com> wrote:
>> This came out long...
>>
>> In the discussions around Mitch's My 66000 a couple of interesting points
>> have come up. At the same time there is an interesting discussion about
>> RISC-V going on on Realworldtech.com.
>
> The RISC-V discussion on RWT is silly.
> There is only room for one high end RISC, ARM64 is it.
> RISC-V is free low end, free low end has different constraints, which
> RISC-V matches nearly perfectly.
>
> Of course in another decade free low end will look like ARM64, which RISC-V
> is not.
> The size difference between a super low end ARM64 and RISC-V is maybe 20%,
> a rounding error when you look at the size of the cache to make such a core
> useful. And ARM64 gives more performance to pay that cost.

Not so fast.

Yes, size as such doesn't matter: what matters is yield at the fab.
Those caches are huge, but they get built with spares in the arrays so
that a defective cell can be replaced by a spare in the fab.

That works for regularly-structured things like memories and caches. But
cores are random logic, with no regular structure suitable for sparing
except the whole core (which is why those six-core chips are bin-selects
from eight-core layouts). So it's the core that has the bulk of the
impact on yield, and a 20% difference in size, leading to a 20%
difference in yield, is real bucks.


<snip>

>> Both of these bring up a number of tradeoffs that were made in ISA design
>> over the years such as the following three things.
>>
>> 1) single instruction length vs. multiple lengths.
>>
>> 2) immediates in instructions vs. PC-relative loads
>
> PC relative loads are evil.

Amen!

Bruce Hoult

Nov 9, 2019, 2:11:50 AM
On Friday, November 8, 2019 at 3:12:26 PM UTC-8, Kyle Hayes wrote:
> This came out long...
>
> In the discussions around Mitch's My 66000 a couple of interesting points have come up. At the same time there is an interesting discussion about RISC-V going on on Realworldtech.com.

omg a lot of rubbish there. And a discussion in which Linus Torvalds is the voice of reason???? I actually agree with most of what he says there.

Not however his contention that things such as base+index addressing are completely antithetical to the purity of the RISC-V philosophy. That's just rubbish. The ISA manual commentary explicitly says that loads with base+(maybe scaled)index addressing (a 2R1W instruction) and stores with effective address writeback like PowerPC and ARM (also 2R1W) were actively considered but didn't make the cut for the base ISA. It's entirely possible or even probable they'll make an appearance in future. I've spoken to the (according to that discussion) BDFLs about this and they're fine with that if quantitative data says they are useful.


> One of the points made recently on Realworldtech.com that made me think a bit is that we now have so many transistors at our disposal that what used to be the bottleneck in decades past (e.g. ALU) isn't any more, and thus the tradeoffs of a decade or two ago may need to be revisited.

It's true that the ALU probably isn't the critical path in cycle time at the moment. Certainly our core designers have told me that when I've asked about the possible impact of including some more complex operations in the BitManip extensions (for example generalized reverse and generalized OR-combine).


> Both of these bring up a number of tradeoffs that were made in ISA design over the years such as the following three things.
>
> 1) single instruction length vs. multiple lengths.
>
> 2) immediates in instructions vs. PC-relative loads
>
> 3) few addressing modes vs. more addressing modes (and a similar axis on addressing mode complexity).
>
> Taking these issues one at a time:
>
> The pendulum of instruction length has swung from very granular such as x86, Transputer etc. toward the extreme of single length instructions and is now heading back toward more lengths.
>
> Single length instructions have some pros and cons:
>
> Pro:
>
> * if you align your instructions right, then you will not get instructions split across a page boundary.


It's impossible to *not* get that property with a fixed instruction length. Well -- unless it's not a power of two (assuming your pages are).
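Bruce's power-of-two caveat can be checked mechanically (my sketch, assuming 4 KB pages): if the fixed instruction length divides the page size and instructions are naturally aligned, none can straddle a page boundary, while a non-power-of-two length eventually does.

```python
PAGE = 4096  # assumed page size

def straddles_page(addr, length):
    # True if an instruction at addr spills onto the next page.
    return addr // PAGE != (addr + length - 1) // PAGE

# 4-byte instructions at 4-byte-aligned addresses never straddle:
assert not any(straddles_page(a, 4) for a in range(0, 3 * PAGE, 4))
# A 6-byte instruction stream eventually does (e.g. at address 4092):
assert any(straddles_page(a, 6) for a in range(0, 3 * PAGE, 6))
```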

> Con:
>
> * lower overall code density.

That's the mantra, but it assumes the variable-length instruction designers do a good job. Aarch64 uses fixed-length 32-bit instructions and yet is usually almost exactly the same density as x86_64 code.

All the most dense instruction sets I know of are either fixed-length 16-bit or else in small multiples of 16 bits: typically just 16 and 32, possibly 16, 32, and 48.


> * more difficult to make ISA additions using more registers or additional fields. At some point you run out of bits.

This is true.


> RISC-V and My 66000 allow for several lengths of instruction. Perhaps not entirely for the same reasons. ARM uses smaller instructions for code density in order to maximize the use of cache at a cost in dynamic instruction counts.

Thumb1 costs dynamic instruction count, but it is fixed-length 16-bit. Thumb2, which is 16 and 32 bit, doesn't cost any appreciable amount of dynamic instruction count. The only exception I know of is comparing it to original ARM, where if the ARM code uses predication then Thumb2 needs an extra IT* instruction to cast a predication shadow over up to 4 following instructions. Most other ISAs don't have predication at all, so this doesn't arise in comparisons with them.


>My 66000 goes for larger instructions to reduce certain types of common code, i.e. constant pool handling. RISC-V seems to do both with some of the 48-bit and 64-bit instruction forms as well as the compressed 16-bit form.

RISC-V has provision for future 48-bit, 64-bit and longer instructions (up to 200 bits or so -- much longer than x86 allows) but no one has implemented any as yet.

There were suggestions to use some 48 bit or 64 bit instructions in the Vector extension but this was quite forcefully vetoed by representatives from many organisations who insisted that the base vector ISA be fully functional with 32 bit instructions only.

Nevertheless, the assembly language syntax for the Vector instructions has been kept more general than can be used by the initial extension, with current instructions considered as a "compressed" subset of later 48 or 64 bit instructions.

For example later longer instructions are anticipated to have:
- vector shape optionally explicitly included in each instruction instead of using values in a CSR set up by a preceding VSETVL instruction. (or use the CSR, in the same manner as FP rounding modes)
- ability to not only use masking from v0 or not, but to negate the mask or to use another register as the source of the mask.
- ability to make the destination of FMA instructions not be the same as one of the source operands. Three sources plus a destination simply didn't fit in 32 bits along with everything else needed.


> Compressing an ISA gives you fewer bits for everything. Many compressed forms drop 3-register operations and go with destructive 2-register ops instead to save a register field. Compressing an ISA removes a lot of your immediate bits. That means that your constant range drops and thus solutions like PC-relative loads start looking a lot nicer. Which means that you really may not want to skimp on your addressing modes. You will likely want to have many kinds of scaling or other manipulation of your immediate fields just to make them more useful.

Compressed instructions also don't have room to specify different addressing modes!


> Having support for full-size constants in an ISA seems like it would have the following impacts:
>
> * fewer non-sequential fetches since the immediates are inline.
> * possibly earlier availability of the immediate, since it avoids the PC-dependent load that would be needed with PC-relative loads.
> * a small negative impact on decoding if you have multiple supported immediate widths. I think this would be fairly minor though.

All true, except:
- putting together literals using multiple instructions also uses sequential fetches.
- if you use a load from a constant pool it's not necessarily PC-relative. Relative to the Zero register or Global Pointer register makes sense too -- or perhaps from a general-purpose register set up at the start of the function.

Note that IBM 360, which also had 12-bit offsets in loads, automatically set up multiple general-purpose registers as base registers for global data/constants.


> The increasing number of transistors allows more addressing mode complexity to be supported without much impact to base clock rates. But the more complicated your addressing modes are, the more bits you need to encode them.

Encoding bits and register ports are the biggies. Cycle time to do an add or shift or both is usually trivial, as is the transistor count. "Adders are free"

Bruce Hoult

Nov 9, 2019, 2:26:10 AM
On Friday, November 8, 2019 at 7:20:21 PM UTC-8, Brett wrote:
> Kyle Hayes <kyle....@gmail.com> wrote:
> > This came out long...
> >
> > In the discussions around Mitch's My 66000 a couple of interesting points
> > have come up. At the same time there is an interesting discussion about
> > RISC-V going on on Realworldtech.com.
>
> The RISC-V discussion on RWT is silly.

Agreed.

> There is only room for one high end RISC, ARM64 is it.

Don't agree on that. Aarch64 has some brilliant bits, but it also has some decisions which I think will prove to be very short-sighted.


> RISC-V is free low end, free low end has different constraints, which RISC-V matches nearly perfectly.

RISC-V certainly makes a very good low end processor, *especially* a low end 64 bit processor which Aarch64 can never do as long as they keep their strict "no subsetting" rule.

As for high end .. we shall see. The U84 is only four years behind the A72. Last year's U74 was six years behind the A53. That looks like rapidly catching up to me.


> Of course in another decade free low end will look like ARM64, which RISC-V
> is not.
> The size difference between a super low end ARM64 and RISC-V is maybe 20%,
> a rounding error when you look at the size of the cache to make such a core
> useful. And ARM64 gives more performance to pay that cost.

It doesn't, actually :-) And if you've got a cache (and branch prediction) then you are *not* low end. There are a ton of applications that can use 64 bit addressing but don't need huge performance. And a lot of them want predictable execution times.

Anton Ertl

Nov 9, 2019, 4:39:29 AM
Bruce Hoult <bruce...@gmail.com> writes:
>On Friday, November 8, 2019 at 7:20:21 PM UTC-8, Brett wrote:
>> There is only room for one high end RISC, ARM64 is it.

True, ISAs are often winner-takes-all. So for every market, there is
a dominating ISA. But RISC is no market, it's a design style.
Currently, we have Aarch64 dominating in mobile phones, and AMD64
dominating in laptops, desktops and servers. For stuff like NAS and
modems/routers, the network effects seem to be smaller, and there were
also MIPS and PowerPC there last I looked.

Anyway, Trump uses ARM as a stick to hit Huawei, ARM has not resisted
this in any visible way, and we will see how much that damages the
stick; I think that, in the long term, it will be disastrous.
Potential customers will avoid ARM whenever they can (but also avoid
SiFive, instead preferring to use implementations without USA
technology).

>Don't agree on that. Aarch64 has some brilliant bits, but it also has some decisions which I think will prove to be very short-sighted.

Such as?

>RISC-V certainly makes a very good low end processor, *especially* a low end 64 bit processor which Aarch64 can never do as long as they keep their strict "no subsetting" rule.

Why would one want a 64-bit core that is lower-end than a Cortex-A34?
The silicon (and thus cost) of the A34 is tiny compared to the silicon
of 4GB of DRAM or even flash, so having something even smaller does
not save much money.

And the other competition is a 32-bit core (say, Cortex-M0) with some
extra stuff to allow accessing more memory.

>As for high end .. we shall see. The U84 is only four years behind the A72. Last year's U74 was six years behind the A53. That looks like rapidly catching up to me.

ARM has been trying to play catch-up with Intel/AMD in the server
market for quite a while. For now they have not succeeded.

> There are a ton of applications that can use 64 bit addressing but don't need huge performance.

Such as? And why would a 32-bit core with some address-extension
extra not be a good fit?

> And a lot of them want predictable execution times.

Actually the worst-case execution time of caches is predictable and
better than that of main memory if the cache is designed appropriately
(true LRU replacement is good). I expect that one can also predict
the worst-case with dynamic branch predictors, but I don't know if the
result is better than with static branch prediction.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Niklas Holsti

Nov 9, 2019, 5:00:05 AM
On 2019-11-09 10:49, Anton Ertl wrote:

> Actually the worst-case execution time of caches is predictable and
> better than that of main memory if the cache is designed appropriately
> (true LRU replacement is good).

Yes, for instruction caches and sequential execution on a single core.
And LRU is indeed easiest.

Not so good for data caches, or united I+D caches, or multi-threading,
or multi-core execution.

> I expect that one can also predict
> the worst-case with dynamic branch predictors,

There is lots of work on WCET analysis of branch prediction schemes, but
I don't know enough about it to comment.

--
Niklas Holsti
niklas holsti tidorum fi
. @ .

Bruce Hoult

Nov 9, 2019, 9:47:12 AM
On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >Don't agree on that. Aarch64 has some brilliant bits, but it also has some decisions which I think will prove to be very short-sighted.
>
> Such as?

Such as the fixed size 32 bit instructions. ARM invented the modern RISC with both 16 bit and 32 bit instructions, beating every other ISA style for code density on compiled C code. It's what propelled them to superstardom. Astounding that they abandoned it for 64 bit at the cost of 30% code density compared to their own 32 bit ISA -- and to our 64 bit one. They did a darn good job for a fixed-length ISA, basically matching x86_64 for code size, and I guess they thought that was good enough at the time.

At the same time, they're likely to run out of opcode space for extensions in the foreseeable future, as MIPS did.

There's basically no way to retrofit either 16 bit instructions or longer ones to the current Aarch64 instruction encoding. If they want to do it they'll either need to add a mode (ala Thumb1) or a completely new encoding. Maybe that's what ARMv9 will be.


> >RISC-V certainly makes a very good low end processor, *especially* a low end 64 bit processor which Aarch64 can never do as long as they keep their strict "no subsetting" rule.
>
> Why would one want a 64-bit core that is lower-end than a Cortex-A34?
> The silicon (and thus cost) of the A34 is tiny compared to the silicon
> of 4GB of DRAM or even flash, so having something even smaller does
> not save much money.

The only justification I *need* to give is that customers are lining up to write six figure cheques to license such cores. There is a *ton* of customer demand for cores with basically a Cortex M0 microarchitecture and code size but with 64 bit registers.

It's easy to do and people will pay for it and no one else has it. Boom.


> And the other competition is a 32-bit core (say, Cortex-M0) with some
> extra stuff to allow accessing more memory.

Extra stuff such as bank switching and address translation, sure. It's not a lot smaller than just making the damn registers 64 bits wide. Plus you can't just compile standard C code for it but have to muck about with that bank switching. It's awful. Surely the 8086-286 taught us that much?

Some people want 64 bit for the address space but it's not necessarily RAM. It might be sparse. Or you might have other bigger 64 bit applications processors in the system but don't want to bother them with taking interrupts or transferring data between peripherals and memory. But you want something a little more than a dumb state machine DMA.

You might not want just one such core. Maybe you want dozens of them. Size starts to count.

ARM won't sell you a 64 bit core without SIMD.
ARM won't sell you a 64 bit core without FP.
ARM won't sell you a 64 bit core without multiply&divide.
ARM won't sell you a 64 bit core with 16 registers.
ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.
ARM won't let you add custom instructions -- historically. They're starting next year, with a single core. Guess why.

Interestingly the one thing no one has asked us to remove so far is the expander for 16 bit instructions. From memory it's about 300 LUTs or 800 gates. You make that back (at ~1 byte per instruction on average) in smaller SRAM or icache if your code is more than a few hundred instructions.
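A rough break-even check on Bruce's figures (the transistor-per-gate and per-SRAM-bit ratios below are my assumptions, not from the post): at ~800 gates for the expander and ~1 byte saved per instruction, the expander pays for itself after only a few dozen instructions of code, comfortably under the "few hundred" threshold he states.

```python
# Assumptions (mine): ~4 transistors per logic gate, 6T SRAM cells.
EXPANDER_TRANSISTORS = 800 * 4        # ~800-gate expander, per the post
SRAM_TRANSISTORS_PER_BYTE = 6 * 8     # one saved byte of SRAM/icache

# Instructions needed before saved SRAM outweighs the expander:
break_even_instructions = EXPANDER_TRANSISTORS / SRAM_TRANSISTORS_PER_BYTE
print(round(break_even_instructions))  # ~67 instructions
```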

It's notable that where a particular 32 bit ARM core has 32 KB of icache, the comparable 64 bit core has 48 KB. That's expensive.

Some people who want a 64 bit core don't care about large memory addressing at all. They just know that a significant part of energy use goes into instruction fetch and decode and control. Doubling the datapath width can, depending on your workload, nearly halve the number of instructions you have to fetch and decode and execute.


> >As for high end .. we shall see. The U84 is only four years behind the A72. Last year's U74 was six years behind the A53. That looks like rapidly catching up to me.
>
> ARM has been trying to play catch-up with Intel/AMD in the server
> market for quite a while. For now they have not succeeded.

It's very different. That market has historically wanted to run legacy x86 applications.

To whatever extent ARM does succeed in making the server market ISA-agnostic, we benefit from that too.


> > There are a ton of applications that can use 64 bit addressing but don't need huge performance.
>
> Such as? And why would a 32-bit core with some address-extension
> extra not be a good fit?

See above.


> > And a lot of them want predictable execution times.
>
> Actually the worst-case execution time of caches is predictable and
> better than that of main memory if the cache is designed appropriately
> (true LRU replacement is good). I expect that one can also predict
> the worst-case with dynamic branch predictors, but I don't know if the
> result is better than with static branch prediction.

It's not caches vs dynamic RAM. It's caches vs about the same size of static RAM. In many applications a couple of KB is enough.

MitchAlsup

Nov 9, 2019, 12:14:39 PM
On Friday, November 8, 2019 at 5:12:26 PM UTC-6, Kyle Hayes wrote:
> This came out long...
>
> In the discussions around Mitch's My 66000 a couple of interesting points have come up. At the same time there is an interesting discussion about RISC-V going on on Realworldtech.com.
>
> One of the points made recently on Realworldtech.com that made me think a bit is that we now have so many transistors at our disposal that what used to be the bottleneck in decades past (e.g. ALU) isn't any more, and thus the tradeoffs of a decade or two ago may need to be revisited.

ALUs have not been a bottleneck since 1990. Feeding them has.
>
> Both of these bring up a number of tradeoffs that were made in ISA design over the years such as the following three things.
>
> 1) single instruction length vs. multiple lengths.

I vote FOR multiple instruction lengths with one enormous caveat: everything
but constants is available in the first decode parcel.
>
> 2) immediates in instructions vs. PC-relative loads

My 66000 has both. IP-relative access is for library-relative data, not
for application-addressable data. In My 66000 the compiler/linker/loader
do not use IP-relative memory refs to access constants.
>
> 3) few addressing modes vs. more addressing modes (and a similar axis on addressing mode complexity).

My 66000 has 2:
MEM Rd,[Rb+DISP16]
MEM Rd,[Rb+Ri<<s+DISP]

When Rb=R0, IP is used
When Ri=R0, no indexing is performed.
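The two modes and the R0 conventions Mitch describes can be sketched as a single effective-address computation (an illustrative model, not his hardware):

```python
# EA = base + (index << scale) + displacement, where register 0 is
# special-cased: Rb=R0 substitutes IP, Ri=R0 disables indexing.
def effective_address(regs, ip, rb, ri, scale, disp):
    base = ip if rb == 0 else regs[rb]
    index = 0 if ri == 0 else regs[ri] << scale
    return base + index + disp

regs = {1: 0x1000, 2: 5}
# MEM Rd,[R1+R2<<3+0x10]:
assert effective_address(regs, 0x4000, 1, 2, 3, 0x10) == 0x1000 + (5 << 3) + 0x10
# Rb=R0, Ri=R0: plain IP-relative with a 16-bit displacement:
assert effective_address(regs, 0x4000, 0, 0, 0, 0x100) == 0x4100
```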
>
> Taking these issues one at a time:
>
> The pendulum of instruction length has swung from very granular such as x86, Transputer etc. toward the extreme of single length instructions and is now heading back toward more lengths.
>
> Single length instructions have some pros and cons:
>
> Pro:
>
> * if you align your instructions right, then you will not get instructions split across a page boundary.
>
> Con:
>
> * lower overall code density.
> * more difficult to make ISA additions using more registers or additional fields. At some point you run out of bits.
>
> RISC-V and My 66000 allow for several lengths of instruction. Perhaps not entirely for the same reasons. ARM uses smaller instructions for code density in order to maximize the use of cache at a cost in dynamic instruction counts. My 66000 goes for larger instructions to reduce certain types of common code, i.e. constant pool handling. RISC-V seems to do both with some of the 48-bit and 64-bit instruction forms as well as the compressed 16-bit form.
>
> Compressing an ISA gives you fewer bits for everything. Many compressed forms drop 3-register operations and go with destructive 2-register ops instead to save a register field. Compressing an ISA removes a lot of your immediate bits. That means that your constant range drops and thus solutions like PC-relative loads start looking a lot nicer. Which means that you really may not want to skimp on your addressing modes. You will likely want to have many kinds of scaling or other manipulation of your immediate fields just to make them more useful.

In the 32-bit RISC implementations, pasting bits together was sufficient.
In the 64-bit world it is not--you simply can't afford to paste together
a 64-bit operand by using 4 instructions with 16-bit immediates. Thus,
the switch to memory-resident compile-time constants, and the pollution
of the HARDER cache resource (D$).
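The pasting cost is easy to see in miniature (my illustration of the generic shift/OR idiom, not any specific ISA's sequence): a 64-bit constant decomposes into four 16-bit immediates, one instruction each.

```python
# Reassemble a 64-bit constant from 16-bit pieces, most significant
# first -- one shift/OR step per "instruction" with a 16-bit immediate.
def paste64(chunks):
    value = 0
    for c in chunks:
        value = (value << 16) | (c & 0xFFFF)
    return value

k = 0x0123_4567_89AB_CDEF
chunks = [(k >> s) & 0xFFFF for s in (48, 32, 16, 0)]
assert len(chunks) == 4      # four instructions' worth of immediates
assert paste64(chunks) == k
```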
>
> So the instruction length, in the case of compression, seems to have an impact on the decision whether to have large immediates vs. PC-relative loads depending on how an ISA supports multiple lengths. The decision is not exclusive.

My 66000 has both, but does not use IP relative memory refs to acquire
constants.
>
> Having support for full-size constants in an ISA seems like it would have the following impacts:
>
> * fewer non-sequential fetches since the immediates are inline.
Check
> * possibly earlier availability of the immediate, since it avoids the PC-dependent load that would be needed with PC-relative loads.
Check
> * a small negative impact on decoding if you have multiple supported immediate widths. I think this would be fairly minor though.
Completely trivial, when the encoding is done well.
>
> It is not clear to me that having wide immediate support has a direct impact on the optimum number of addressing modes. As mentioned above, having almost no immediate fields does seem like it would change the balance toward more complicated addressing modes.
>
> The increasing number of transistors allows more addressing mode complexity to be supported without much impact to base clock rates. But the more complicated your addressing modes are, the more bits you need to encode them.

Also note:

My 66000 has a Table Transfer instruction, where IP+(DISP16<<2) is used to
address memory for Execute permission, and the value fetched is either
the instruction to be executed or the address of the instruction to be
executed. In the former case, the data is fed into PARSE and DECODE; in the
latter, the data becomes IP. Both JUMPs and CALLs can go through such a table.

I used DISP16<<2; because there was already HW to do this (conditional
branches use DISP16), and because one can place the indirection table
anywhere within 128 KBytes of the address of the TT instruction.
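The 128 KByte reach follows directly from the encoding: a signed 16-bit displacement spans ±2^15 entries, and the <<2 scaling turns each entry into 4 bytes.

```python
# Reach of a signed DISP16 scaled by 4 (word-aligned targets):
reach_bytes = (2 ** 15) * 4
assert reach_bytes == 128 * 1024   # 128 KBytes either side of IP
print(reach_bytes)  # 131072
```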
>
> Best,
> Kyle

MitchAlsup

Nov 9, 2019, 12:19:34 PM
Are we free to disagree?

> Means you have to track the PC pointer in the pipeline.

You have to do this anyway (exception recovery.)

> Pollutes the L1 data cache with L1 instruction data, wasting capacity.

The uses My 66000 performs do not have these properties.

> It’s an extra load, when you may be load limited.

IP-relative loads of constants do what you state. IP-relative loads of
real data (library-resident data) are different: here IP-relative accesses
enable position-independent library placement along with its data.

> The load may delay the pipeline.
> The load adds to your load queue which may be your limit.
> The load is more crap to track which reduces the effective size of your OoO
> window.
> Maximum stupid.

When used incorrectly. You DON'T have to use them incorrectly.

MitchAlsup

Nov 9, 2019, 12:22:19 PM
Where "automatically" means instructions were used to put particular data
in particular registers so that the rest of the subroutine was made easier.

EricP

Nov 9, 2019, 1:40:22 PM
Brett wrote:
>
> PC relative loads are evil.

You seem to be under the impression that PC relative addressing
is only used for loading program constants. It is not.
It can be used to access variable data as PC relative too.

> Means you have to track the PC pointer in the pipeline.

The PC is already flowing through the pipeline from Fetch
as you need it for branch relative & exceptions anyway.

No special tracking is required if it copies the PC in as an
operand when it queues the instruction at the front of the EX pipeline.
Which it knows to do because you used PC-rel addressing.

> Pollutes the L1 data cache with L1 instruction data, wasting capacity.

Sort of. On, say, Alpha, the coding standard had the entry point for a
routine start on a cache line boundary. Just prior to the entry was the table
of constants, loaded using what was effectively PC-rel addressing.

The constants cache line(s) should not wind up resident in both
I$L1 and D$L1, so you don't get a tug-of-war or ping-pong for ownership.

But yes, the constant line does have to be loaded into D$L1.

Note though that some of those constants could be 64-bits,
including linkage addresses, which on many RISC ISA's would have
taken 4 instructions to "load immediate" rather than 1 LD PC-rel.

Also, I have not seen a RISC ISA with an equivalent of float load immediate,
so that may be the only way to load float constants.

> It’s an extra load, when you may be load limited.
> The load may delay the pipeline.
> The load adds to your load queue which may be your limit.
> The load is more crap to track which reduces the effective size of your OoO
> window.
>
> Even an ordinary load from a generic pointer is better than PC loads.

An ISA needs PC-rel for position independent, self relative data.
On a fixed-length RISC, it may be better than 4 load-integer-immediates.
On a fixed-length RISC, it's probably the only way to load float immediates.

Anton Ertl

Nov 9, 2019, 1:52:40 PM
EricP <ThatWould...@thevillage.com> writes:
>The constants cache line(s) should not wind up resident in both
>I$L1 and D$L1, so you don't get a tug-of-war or ping-pong for ownership.

Both constants and instructions are read-only, so both caches can
contain them (and have them both in shared state if they do). IA-32
and AMD64 CPUs starting with the P6 and K7 have no problem with
read-only data and instruction accesses to the same cache line.
However, the P5 and K6 actually do have cache ping-ponging slowdowns
in that case, for whatever reason.

Most other architectures do not keep the I-cache and the D-cache
coherent in hardware, and therefore are unlikely to ping-pong even
when there is written data in the same cache line as executed code.

MitchAlsup

Nov 9, 2019, 2:00:12 PM
On Saturday, November 9, 2019 at 12:40:22 PM UTC-6, EricP wrote:
> Brett wrote:
> >
> > PC relative loads are evil.
>
> You seem to be under the impression that PC relative addressing
> is only used for loading program constants. It is not.
> It can be used to access variable data as PC relative too.
>
> > Means you have to track the PC pointer in the pipeline.
>
> The PC is already flowing through the pipeline from Fetch
> as you need it for branch relative & exceptions anyway.
>
> No special tracking is required if it copies the PC in as an
> operand when it queues the instruction at the front of the EX pipeline.
> Which it knows to do because you used PC-rel addressing.
>
> > Pollutes the L1 data cache with L1 instruction data, wasting capacity.
>
> Sort of. On say Alpha the coding standard had the entry point for a
> routine start on a cache boundary. Prior to the entry was the table
> of constants loaded using what was effectively PC-rel addressing.
>
> The constants cache line(s) should not wind up resident in both
> I$L1 and D$L1, so you don't get a tug-of-war or ping-pong for ownership.
>
> But yes, the constant line does have to be loaded into D$L1.
>
> Note though that some of those constants could be 64-bits,
> including linkage addresses, which on many RISC ISA's would have
> taken 4 instructions to "load immediate" rather than 1 LD PC-rel.
>
> Also I have not seen a RISC ISA with equivalent for float load immediate.
> So that may be the only way to load float constants.

My 66000 has floating point immediates::

FMAC R7,R9,3.14159265358979,R13

Eliminating that trouble.
>
> > It’s an extra load, when you may be load limited.
> > The load may delay the pipeline.
> > The load adds to your load queue which may be your limit.
> > The load is more crap to track which reduces the effective size of your OoO
> > window.
> >
> > Even an ordinary load from a generic pointer is better than PC loads.
>
> An ISA needs PC-rel for position independent, self relative data.
> On fixed length RISC, it may be better than 4 load integer immediates.
> On fixed length RISC, its probably the only way to load float immediates.

And these last 2 points are why it is better to simply provide constants
to instructions using variable length encodings.

MitchAlsup

Nov 9, 2019, 2:04:02 PM
On Saturday, November 9, 2019 at 12:52:40 PM UTC-6, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >The constants cache line(s) should not wind up resident in both
> >I$L1 and D$L1, so you don't get a tug-of-war or ping-pong for ownership.
>
> Both constants and instructions are read-only, so both caches can
> contain them (and have them both in shared state if they do). IA-32
> and AMD64 CPUs starting with the P6 and K7 have no problem with
> read-only data and instruction accesses to the same cache line.
> However, the P5 and K6 actually do have cache ping-ponging slowdowns
> in that case, for whatever reason.

Putting constants in the I-stream solves both issues.
>
> Most other architectures do not keep the I-cache and the D-cache
> coherent in hardware, and therefore are unlikely to ping-pong even
> when there is written data in the same cache line as executed code.

I think what you want to say is::

Most other architectures do not keep the I-cache and D-cache
SUFFICIENTLY coherent in hardware,.....

The word "sufficiently" enables the nearly infinite lookahead the FETCH/DECODE processes are allowed.

But, sooner or later, the I-cache will catch up to the D-cache snoops
and invalidates.

Ivan Godard

Nov 9, 2019, 2:26:01 PM
Not if you use result-replay instead of issue-replay.

>> Pollutes the L1 data cache with L1 instruction data, wasting capacity.
>
> the uses My 66000 performs do not have these properties.
>
>> It’s an extra load, when you may be load limited.
>
> IP-relative loads of constants does what you state. IP-relative loads of
> real data (library resident data) is different, here IP-relative accesses
> enable position independent library placement along with its data.

Or you can just have another base specReg. Trying to jam that extra
functionality onto the PC is a false economy; code gets simpler and
safer with specRegs for constant pool, frame locals, static writable
globals, code (*not* PC), and TLS. BTW, these are not genRegs and should
not be writable as registers.

BGB

Nov 9, 2019, 2:38:44 PM
On 11/9/2019 1:11 AM, Bruce Hoult wrote:
> On Friday, November 8, 2019 at 3:12:26 PM UTC-8, Kyle Hayes wrote:
>> This came out long...
>>
>> In the discussions around Mitch's My 66000 a couple of interesting points have come up. At the same time there is an interesting discussion about RISC-V going on on Realworldtech.com.
>
> omg a lot of rubbish there. And a discussion in which Linus Torvalds is the voice of reason???? I actually agree with most of what he says there.
>
> Not however his contention that things such as base+index addressing is completely antiethical to the purity of RISC-V philosophy. That's just rubbish. The ISA manual commentary explicitly says that loads with base+(maybe scaled)index addressing (a 2R1W instruction) and stores with effective address writeback like PowerPC and ARM (also 2R1W) were actively considered but didn't make the cut for the base ISA. It's entirely possible or even probable they'll make an appearance in future. I've spoken to the (according to that discussion) BDFLs about this and they're fine with that if quantitative data says they are useful.
>
>
>> One of the points made recently on Realworldtech.com that made me think a bit is that we now have so many transistors at our disposal that what used to be the bottleneck in decades past (e.g. ALU) isn't any more, and thus the tradeoffs of a decade or two ago may need to be revisited.
>
> It's true that the ALU probably isn't the critical path in cycle time at the moment. Certainly our core designers have told me that when I've asked about the possible impact of including some more complex operations in the BitManip extensions (for example generalized reverse and generalized OR-combine).
>

IME, it appears that shoveling stuff around in memory is a big cost
for a lot of the code I have looked at. Optimizing around other stuff
mostly ends up amounting, in effect, to limiting the number of
clock-cycles spent not shoveling stuff around in memory...


>
>> Both of these bring up a number of tradeoffs that were made in ISA design over the years such as the following three things.
>>
>> 1) single instruction length vs. multiple lengths.
>>
>> 2) immediates in instructions vs. PC-relative loads
>>
>> 3) few addressing modes vs. more addressing modes (and a similar axis on addressing mode complexity).
>>
>> Taking these issues one at a time:
>>
>> The pendulum of instruction length has swung from very granular such as x86, Transputer etc. toward the extreme of single length instructions and is now heading back toward more lengths.
>>
>> Single length instructions have some pros and cons:
>>
>> Pro:
>>
>> * if you align your instructions right, then you will not get instructions split across a page boundary.
>
>
> It's impossible to *not* get that property with a fixed instruction length. Well -- unless it's not a power of two (assuming your pages are).
>

Pretty much.

>> Con:
>>
>> * lower overall code density.
>
> That's the mantra, but that assumes the variable length instruction designers do a good job. Aarch64 code is fixed length 32 bit instructions and yet is usually almost exactly the same density as x86_64 code.
>

I would not consider x86-64 to be the front-runner in terms of code
density (particularly with typical "optimize for speed" settings).

32-bit x86 does a bit better, at least given an appropriate choice in
terms of compiler and settings (IOW: size optimization; and avoiding
newer MSVC versions which seem mostly incapable of producing small
binaries).


> All the most dense instruction sets I know of are either 16 bit fixed opcode or else in small multiples of 16 bits. typically just 16 and 32, possibly 16, 32, and 48.
>

Yes, my experience agrees.

My BJX2 ISA also falls into this category for scalar code, though in its
VLIW-like "WEX" mode, it uses solely 32-bit instructions.


I have left 48-bits as optional, mostly as:
* The decoding logic for them isn't entirely free;
* The gains in terms of code density and performance are fairly small.

So, for smaller implementations (of the No-FPU / No-MMU variety), it
makes sense to leave them out to save some cost. Granted, there are
still cheaper options here.


The relatively lackluster gains from 48-bit instructions led me to "not
bother" with 64-bit instructions. Any gains from their existence were
likely to be negligible.

One merit of 48-bit instructions, is that (if used economically), they
have a fairly large encoding space.


>
>> * more difficult to make ISA additions using more registers or additional fields. At some point you run out of bits.
>
> This is true.
>
>
>> RISC-V and My 66000 allow for several lengths of instruction. Perhaps not entirely for the same reasons. ARM uses smaller instructions for code density in order to maximize the use of cache at a cost in dynamic instruction counts.
>
> Thumb1 cost dynamic instruction count, but it is fixed length 16 bit. Thumb2 which is 16 and 32 bit doesn't cost any appreciable amount of dynamic instruction count. The only exception I know of is comparing it to original ARM, where if the ARM code uses predication then Thumb2 needs an extra IT* instruction to cast a predication shadow over up to 4 following instructions. Most other ISAs don't have predication at all, so this doesn't arise in comparisons with them.
>

Thumb1 is borderline unusable IMO.

Thumb2 fares better at least, and it is pros/cons between this and the
original ARM32. In my own codegens for ARM, I mostly ended up going for
Thumb2.


>
>> My 66000 goes for larger instructions to reduce certain types of common code, i.e. constant pool handling. RISC-V seems to do both with some of the 48-bit and 64-bit instruction forms as well as the compressed 16-bit form.
>
> RISC-V has provision for future 48-bit, 64-bit and longer instructions (up to 200 bits or so -- much longer than x86 allows) but no one has implemented any as yet.
>
> There were suggestions to use some 48 bit or 64 bit instructions in the Vector extension but this was quite forcefully vetoed by representatives from many organisations who insisted that the base vector ISA be fully functional with 32 bit instructions only.
>

My own experience also creates pressure to try to fit everything in
32-bit ops, but partly it is that anything which is 48-bit is
(currently) automatically unusable in wide-execute code.

I could, possibly, consider making provision for a 2x Op48 case,
allowing, eg:
MOV #imm32, R4 | MOV (PC, disp24), R5

But, this would introduce some new problems (eg: all Op32 cases would
also need to be able to be encoded as an Op48, ...).


> Nevertheless, the assembly language syntax for the Vector instructions has been kept more general than can be used by the initial extension, with current instructions considered as a "compressed" subset of later 48 or 64 bit instructions.
>

I don't really like regarding them as "compressed" personally, as this
doesn't really accurately reflect what is going on IMO.

A "compressed" ISA would imply that there is a simple
mostly-instruction-independent mechanism to unpack the smaller ops into
their larger equivalents, which are then decoded as their larger
equivalent.

Generally, this is not how it works in practice; the smaller ops
generally have their own decoders, and need not map exactly 1:1, ...


> For example later longer instructions are anticipated to have:
> - vector shape optionally explicitly included in each instruction instead of using values in a CSR set up by a preceding VSETVL instruction. (or use the CSR, in the same manner as FP rounding modes)
> - ability to not only use masking from v0 or not, but to negate the mask or to use another register as the source of the mask.
> - ability to make the destination of FMA instructions not be the same as one of the source operands. Three sources plus a destination simply didn't fit in 32 bits along with everything else needed.
>
>
>> Compressing an ISA gives you fewer bits for everything. Many compressed forms drop 3-register operations and go with destructive 2-register ops instead to save a register field. Compressing an ISA removes a lot of your immediate bits. That means that your constant range drops and thus solutions like PC-relative loads start looking a lot nicer. Which means that you really may not want to skimp on your addressing modes. You will likely want to have many kinds of scaling or other manipulation of your immediate fields just to make them more useful.
>
> Compressed instructions also don't have room to specify different addressing modes!
>

Yep.

>
>> Having support for full-size constants in an ISA seems like it would have the following impacts:
>>
>> * fewer non-sequential fetches since the immediates are inline.
>> * possible sooner completion of immediate availability because of the PC-dependent load that would need to happen if you use PC-relative loads.
>> * a small negative impact on decoding if you have multiple supported immediate widths. I think this would be fairly minor though.
>
> All true, except:
> - putting together literals using multiple instructions also uses sequential fetches.

Though, if the number of instructions is short, it may still work out
cheaper-on-average, and is a lot less annoying to deal with IMO.


> - if you use a load from a constant pool it's not necessarily PC-relative. Relative to the Zero register or Global Pointer register makes sense too -- or perhaps a from a general purpose register set up at the start of the function.
>
> Note that IBM 360 which also had 12 bit offsets in loads automatically set up multiple general purpose registers as base registers for global data/constants.
>

In my Thumb codegens, I had generally found it often easier to use a GPR
as a literal-table index vs mess around with PC-relative loads...


My ISA design has a GBR register, but I am mostly using it for accessing
global variables (".data"/".bss" sections) in my PBO ABI (as opposed to
using PC-relative addressing to access global variables).

The ISA design is generally effective enough at composing constants
inline such that loading them from memory is mostly unnecessary.


>
>> The increasing number of transistors allows more addressing mode complexity to be supported without much impact to base clock rates. But the more complicated your addressing modes are, the more bits you need to encode them.
>
> Encoding bits and register ports are the biggies. Cycle time to do an add or shift or both is usually trivial, as is the transistor count. "Adders are free"
>

Partially true, on my FPGA core prototypes, adder latency is still kind
of an issue, but I can throw small adders all over the place since the
FPGA has about as many CARRY4 units as LUT6's...

The limitation with overly complex addressing modes is more that they
tend to be "not particularly useful" in practice. Even if one has the
modes, and the compiler is able to use them, code that can make
effective use of them doesn't come up all that often. It may work out
similarly effective to use multiple instructions in the rare few cases
where they might otherwise have been usable.

So, basically, while (Reg, Disp) or (Reg, Index) are used pretty much
everywhere (or variants with PC or GBR or similar as a base-register),
much beyond this things start to drop off pretty rapidly.


For example, I left out auto-increment addressing from BJX2 mostly
because it was used too infrequently to really justify the added
cost/complexity of performing an additional write-back.

PUSH/POP remain, but mostly because SP ended up being treated more like
an SPR, and operations like PUSH/POP/etc generally access SP via a
side-channel, rather than via a GPR port (and accessing SP via a GPR
port following an operation which modified it via its side-channel is
accomplished via an interlock rather than via register forwarding, ...).

The least cost option here, however, would be to leave out both cases,
and perform prolog/epilog sequences using SP-relative loads/stores instead.


Ivan Godard

Nov 9, 2019, 2:52:27 PM
On 11/9/2019 9:14 AM, MitchAlsup wrote:

> My 66000 has a Table Transfer instruction, where IP+(DISP16<<2) is used to
> address memory for Execute permission, and the value fetched is either
> the instruction to be executed or the address of the instruction to be
> executed. In the former case, the data is fed into PARSE and DECODE; in the
> latter, the data becomes the IP. Both JUMPs and CALLs can go through such a table.

For a switch, Mill can use a jump table in the constant pool instead of
in the code, but from the compiler that's the same. However, it sounds
like your TT instruction can also be used as an EXEC, which I hadn't
heard of before. Some questions:

1) how do you handle EXECing variable length instructions? Padding to
largest?

2) What happens if the EXEC'd operation faults? Do you replay the TTI or
the execee; how?

3) What happens if the execee is a branch?

4) what if the execee is a call? What's the return address? What does a
debugger see in backtrace?

5) What C source constructs will get an exec TTI from your compiler?
Maybe a switch all of whose targets are single ops (plus the break branch)?

6) how do you single step through a TTI? Can I place a hardware
breakpoint on an execee instruction?

7) How does your compiler decide whether to generate a branch-style TTI
through a table of addresses vs. an exec-style TTI to a table of branch
instructions?

MitchAlsup

Nov 9, 2019, 3:23:16 PM
I was speaking of delivering the instruction causing a fault to the exception
handler. This value is the IP at the time the instruction is decoded. So, you
have to drag it through the pipe anyway.
>
> >> Pollutes the L1 data cache with L1 instruction data, wasting capacity.
> >
> > the uses My 66000 performs do not have these properties.
> >
> >> It’s an extra load, when you may be load limited.
> >
> > IP-relative loads of constants does what you state. IP-relative loads of
> > real data (library resident data) is different, here IP-relative accesses
> > enable position independent library placement along with its data.
>
> Or you can just have another base specReg. Trying to jam that extra
> functionality onto the PC is a false economy; code gets simpler and
> safer with specRegs for constant pool, frame locals, static writable
> globals, code (*not* PC), and TLS. BTW, these are not genRegs and should
> not be writable as registers.

While I can be agreeable with the false economy of using IP, I don't have
a specReg thing in my architecture. So the addition of another place to
store stuff seems unnecessary.

BGB

Nov 9, 2019, 3:36:48 PM
On 11/9/2019 11:14 AM, MitchAlsup wrote:
> On Friday, November 8, 2019 at 5:12:26 PM UTC-6, Kyle Hayes wrote:
>> This came out long...
>>
>> In the discussions around Mitch's My 66000 a couple of interesting points have come up. At the same time there is an interesting discussion about RISC-V going on on Realworldtech.com.
>>
>> One of the points made recently on Realworldtech.com that made me think a bit is that we now have so many transistors at our disposal that what used to be the bottleneck in decades past (e.g. ALU) isn't any more, and thus the tradeoffs of a decade or two ago may need to be revisited.
>
> ALU has not been a bottleneck since 1990. Feeding them has.
>>
>> Both of these bring up a number of tradeoffs that were made in ISA design over the years such as the following three things.
>>
>> 1) single instruction length vs. multiple lengths.
>
> I vote FOR multiple instruction lengths with one enormous caveat: everything
> but constants is available in the first decode parcel.
>>
>> 2) immediates in instructions vs. PC-relative loads
>
> My 66000 has both. IP relative access is for library relative data not
> for application addressable data. In My 66000 the compiler/linker/loader
> do not use IP relative memory refs to access constants.
>>
>> 3) few addressing modes vs. more addressing modes (and a similar axis on addressing mode complexity).
>
> My 66000 has 2::
> MEM Rd,[Rb+DISP16]
> MEM Rd,[Rb+Ri<<s+DISP]
>
> When Rb=R0, IP is used
> When Ri=R0, no indexing is performed.

Similar here, though it is (Rb+Disp9) and (Rb+(Ri<<Sc)).

Rb:
R0: PC
R1: GBR (Global Pointer)

Ri:
R0: DLR/R0, Scaled if Rb>=R2
R1: DLR/R0, Unscaled if Rb>=R2
Rb=R0, (R0)
Rb=R1, (TBR, R0)
SP/R15: No Index (ZZR / Zero Register)

Where TBR serves a similar role to FS/GS registers in x86, namely for
accessing a Thread/Process state structure, and for accessing TLS variables.


It was originally planned that TBR would be used for accessing PBO index
addresses, but this was changed such that GBR was used for this (in
effect daisy-chaining is used to move from one GBR to another).

I generally consider my PBO design to be a bit more efficient for NoMMU
stuff than the ELF FDPIC ABI design, which generally requires a
considerable amount of GOT wrangling and a lot of additional indirect
memory access, ... Namely, I can use PBO without an obvious/significant
performance penalty, unlike FDPIC.


>>
>> Taking these issues one at a time:
>>
>> The pendulum of instruction length has swung from very granular such as x86, Transputer etc. toward the extreme of single length instructions and is now heading back toward more lengths.
>>
>> Single length instructions have some pros and cons:
>>
>> Pro:
>>
>> * if you align your instructions right, then you will not get instructions split across a page boundary.
>>
>> Con:
>>
>> * lower overall code density.
>> * more difficult to make ISA additions using more registers or additional fields. At some point you run out of bits.
>>
>> RISC-V and My 66000 allow for several lengths of instruction. Perhaps not entirely for the same reasons. ARM uses smaller instructions for code density in order to maximize the use of cache at a cost in dynamic instruction counts. My 66000 goes for larger instructions to reduce certain types of common code, i.e. constant pool handling. RISC-V seems to do both with some of the 48-bit and 64-bit instruction forms as well as the compressed 16-bit form.
>>
>> Compressing an ISA gives you fewer bits for everything. Many compressed forms drop 3-register operations and go with destructive 2-register ops instead to save a register field. Compressing an ISA removes a lot of your immediate bits. That means that your constant range drops and thus solutions like PC-relative loads start looking a lot nicer. Which means that you really may not want to skimp on your addressing modes. You will likely want to have many kinds of scaling or other manipulation of your immediate fields just to make them more useful.
>
> In the 32-bit RISC implementations, pasting bits together was sufficient.
> In the 64-bit world it is not--you simply can't afford to past together
> a 64-bit operand by using 4 instructions with 16-bit immediates. Thus,
> the switch to memory resident compile time constants, and the pollution
> of the HARDER cache resource (D$).

While 4 is a worst-case, one doesn't necessarily need this many; if a
majority of 64-bit constants can still be composed in 2 or 3
instructions, pasting may remain acceptable in the average case
(particularly if the CPU core in question requires multiple clock cycles
to access memory and it is possible to execute the bit-pasting
operations in-parallel with other instructions).

MitchAlsup

Nov 9, 2019, 3:40:51 PM
On Saturday, November 9, 2019 at 1:52:27 PM UTC-6, Ivan Godard wrote:
> On 11/9/2019 9:14 AM, MitchAlsup wrote:
>
> > My 66000 has a Table Transfer instruction, where IP+(DISP16<<2) is used to
> > address memory for Execute permission, and the value fetched is either
> > the instruction to be executed or the address of the instruction to be
> > executed. In the former case, the data is fed into PARSE and DECODE; in the
> > latter, the data becomes the IP. Both JUMPs and CALLs can go through such a table.
>
> For a switch Mill can use a jump table in the constant pool instead of
> in the code, but from the compier that's the same. However it sounds
> like your TT instruction can also be used as an EXEC, which I hadn't
> heard of before. Some questions:

It is NOT an EXEC! It only transfers control, either to the calculated address
or to the value in memory at the calculated address. It contains no semantic
of returning to the original code stream after one instruction is EXECed.
That is, it is a branch (of some sort) and not an EXECUTE.
>
> 1) how do you handle EXECing variable length instructions? Padding to
> largest?

There is a size parameter in the instruction {W,D,Q,Oct} defining the span
between small switch blocks.
>
> 2) What happens if the EXEC'd operation faults? Do you replay the TTI or
> the execee; how?

IP points at the Table Transfer instruction
>
> 3) What happens if the execee is a branch?

It can be whatever.
>
> 4) what if the execee is a call? What's the return address? What does a
> debugger see in backtrace?

There is a TT which is an indirect call (method call), return address is
the instruction following the TT which did the call indirectly.
>
> 5) What C source constructs will get an exec TTI from your compiler?

C++ Method calls, and C indirect calls.

*(p->array[i])(arg1, arg2, arg3); // used a lot in my simulators

> Maybe a switch all of whose targets are single ops (plus the break branch)?

Dense switches with short clauses. If any clause is too long, then a dense
switch with an indirect branch.
>
> 6) how do you single step through a TTI? Can I place a hardware
> breakpoint on an execee instruction?

A direct TT instruction transfers control to the calculated address--that is
it reads memory at calculated address and stuffs it into DECODE.
An indirect TT instruction accesses memory at calculated address
and takes the value and stuffs it into IP.

Single stepping a direct TT will see control arrive in the tabularized <short> array.
Single stepping an indirect TT will see control arrive at the value found
in the tabularized array--but not the array access itself.
>
> 7) Hw does your compiler decide whether to generate a branch-style TTI
> through a table of addresses vs. an exec-style TTI to a table of branch
> instructions?

{Assuming the switch cases are dense:}
It starts with the assumption that the clauses are too big and builds the
indirect version. If all the clauses are small (16 words or less) then
the clauses become the array and the indirect bit is turned off.

MitchAlsup

Nov 9, 2019, 3:45:30 PM
Compared to not having to paste anything at all ?

> (particularly if the CPU core in question requires multiple clock cycles
> to access memory and it is possible to execute the bit-pasting
> operations in-parallel with other instructions).

In the case where the I$ is multiple clock cycles in latency, you REALLY want
to access multiple words per fetch. Once you are accessing multiple
words per fetch, an instruction buffer is de rigueur, which brings
branch look-ahead, and constants, at very low cost. Most of the time, using
a 32-bit or 64-bit constant costs zero additional cycles of post-FETCH
latency.

Bruce Hoult

Nov 9, 2019, 4:16:42 PM
Yes, system software conspired to do this for you e.g. the compiler and linker. IIRC assembly language programmers had to use a special directive to tell the linker how many and which registers to use as global base registers, but the linker automatically did the calculations and adjusted instructions to use the appropriate base register and offset.

If you have enough registers then this approach lets you scale a little bit past the limits of your 12 bit offset -- effectively another one or two bits, for example. But not unlimited :-)

Bruce Hoult

Nov 9, 2019, 4:45:43 PM
On Saturday, November 9, 2019 at 11:38:44 AM UTC-8, BGB wrote:
> On 11/9/2019 1:11 AM, Bruce Hoult wrote:
> >> Con:
> >>
> >> * lower overall code density.
> >
> > That's the mantra, but that assumes the variable length instruction designers do a good job. Aarch64 code is fixed length 32 bit instructions and yet is usually almost exactly the same density as x86_64 code.
> >
>
> I would not consider x86-64 to be the front-runner in terms of code
> density (particularly with typical "optimize for speed" settings).

In 64-bit land, AMD64 set the standard for code density when it came out around 2000, handily beating MIPS64, Alpha, and IA64.

ARM64 essentially matches AMD64, making them the co-leaders until RV64 came along a couple of years later.

(RISC-V was already in development when ARM64 was announced, but ARM64 had obviously been in the pipeline for a considerable time .. some people I know who are ex-ARM tell me that ARM64 was started pretty much as soon as AMD64 was announced (making it about 13 years until first commercial availability in the iPhone 5s) but the project was apparently in hiatus for some years due to lack of customer interest)


> 32-bit x86 does a bit better, at least given an appropriate choice in
> terms of compiler and settings (IOW: size optimization; and avoiding
> newer MSVC versions which seem mostly incapable of producing small
> binaries).

But that's not 64 bit.

There are a ton of 32 bit ISAs with far better code density than any 64 bit ISA except RISC-V. Not only x86 but m68k, Thumb, SuperH.

MitchAlsup

Nov 9, 2019, 4:46:14 PM11/9/19
On Saturday, November 9, 2019 at 3:16:42 PM UTC-6, Bruce Hoult wrote:
> On Saturday, November 9, 2019 at 9:22:19 AM UTC-8, MitchAlsup wrote:
> > On Saturday, November 9, 2019 at 1:11:50 AM UTC-6, Bruce Hoult wrote:
> > > - if you use a load from a constant pool it's not necessarily PC-relative. Relative to the Zero register or Global Pointer register makes sense too -- or perhaps from a general purpose register set up at the start of the function.
> > >
> > > Note that IBM 360 which also had 12 bit offsets in loads automatically set up multiple general purpose registers as base registers for global data/constants.
> >
> > Where "automatically" means instructions were used to put particular data
> > in particular registers so that the rest of the subroutine was made easier.
>
> Yes, system software conspired to do this for you e.g. the compiler and linker. IIRC assembly language programmers had to use a special directive to tell the linker how many and which registers to use as global base registers, but the linker automatically did the calculations and adjusted instructions to use the appropriate base register and offset.

In order to get a pointer to the subroutine just called, 360 performs a BALR
to the next instruction, and then starts using it as a base register:

Entry:
STM R14,R12,12(R13)
BALR R14,0 // branching to R0 means no branch is taken; this falls
// through and delivers the address of the next instruction into R14
USING *,R14 // this is an assembly directive telling the assembler
// that R14 may be used as a base register for addressing
>
> If you have enough registers then this approach lets you scale a little bit past the limits of your 12 bit offset -- effectively another one or two bits, for example. But not unlimited :-)

It is still software cruft making up for lack of ISA functionality. In the
case of System 360, the designers can be forgiven (1960 design ships in 1964).

Terje Mathisen

Nov 9, 2019, 5:37:00 PM11/9/19
MitchAlsup wrote:
> In order to get a pointer to the subroutine just called, 360 performs a BALR
> to the next instruction, and then starts using it as a base register:
>
> Entry:
> STM R14,R12,12(R13)
> BALR R14,0 // branching to R0 means no branch is taken; this falls
> // through and delivers the address of the next instruction into R14
> USING *,R14 // this is an assembly directive telling the assembler
> // that R14 may be used as a base register for addressing

This used to be the classic way to do PIC code in 16-bit x86 asm, the
pattern was most often used in virus code:

call next
next:
pop bx

It was of course also used in more kosher PIC code/libraries, but there
it crashed (performance-wise) into the first return predictors, so it
had to be replaced with a real function call:

call get_pic ;; returns with address of next in AX
next:

get_pic:
pop ax ;; return address
push ax ;; push it back so the RET will work
ret

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Bruce Hoult

Nov 9, 2019, 6:05:07 PM11/9/19
On Saturday, November 9, 2019 at 11:38:44 AM UTC-8, BGB wrote:
> On 11/9/2019 1:11 AM, Bruce Hoult wrote:
> > Nevertheless, the assembly language syntax for the Vector instructions has been kept more general than can be used by the initial extension, with current instructions considered as a "compressed" subset of later 48 or 64 bit instructions.
>
> I don't really like regarding them as "compressed" personally, as this
> doesn't really accurately reflect what is going on IMO.
>
> A "compressed" ISA would imply that there is a simple
> mostly-instruction-independent mechanism to unpack the smaller ops into
> their larger equivalents, which are then decoded as their larger
> equivalent.
>
> Generally, this is not how it works in practice; the smaller ops
> generally have their own decoders, and need not map exactly 1:1, ...

I agree, the "compressed" terminology can be misleading. It certainly was for me when I first read about Thumb in the mid 90s. I had all kinds of visions of Huffman encoding, complex hardware decoders etc. But it's nothing like that.

If someone out there hasn't looked into Thumb or RISC-V RVC or others in detail, "compressed" instructions simply mean providing an alternative instruction encoding for the same operation but using fewer bits by one or more of:

- reusing one of the operands as the destination

- making an operand or destination implicit (i.e. fixed)

- giving a smaller choice of registers

- giving a smaller choice of addressing modes

- using smaller literals/offsets

- not supporting optional features

Every "compressed" instruction has an exact mapping to a full size instruction. The assembly-language programmer (or compiler) writes code for the full size instruction set and the assembler (and/or linker) selects the smaller encoding when it can.
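
As a concrete sketch of such a mapping (field layouts per the RISC-V spec; the Python helper itself is mine), expanding RVC's C.ADDI into the full 32-bit ADDI is pure bit rearrangement:

```python
def expand_c_addi(insn16):
    """Expand 16-bit C.ADDI (000 imm[5] rd imm[4:0] 01) into 32-bit ADDI."""
    assert insn16 & 0b11 == 0b01 and (insn16 >> 13) == 0b000, "not a C.ADDI"
    rd = (insn16 >> 7) & 0x1F                      # rd doubles as rs1
    imm = (((insn16 >> 12) & 1) << 5) | ((insn16 >> 2) & 0x1F)
    if imm & 0x20:                                 # sign-extend 6-bit immediate
        imm -= 64
    # ADDI layout: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011
    return ((imm & 0xFFF) << 20) | (rd << 15) | (rd << 7) | 0b0010011

# C.ADDI x8, 1 (0x0405) expands to ADDI x8, x8, 1 (0x00140413)
```

Note the destructive two-operand form: rd supplies rs1 as well, which is exactly the "reusing one of the operands as the destination" trick.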

The only exceptions to every compressed instruction mapping exactly to a full size instruction are:

Thumb1: the apparently 32-bit "jump and link" instruction is in fact two 16-bit instructions 11110nnnnnnnnnnn and 11111nnnnnnnnnnn, the first of which does LR <- PC+imm11<<12 and the second does LR <- PC+2 and PC <- LR+imm11<<1 in parallel. Neither of these can be recoded as standard 32 bit instructions. The first because the immediate has too many significant bits to fit "ADD LR,PC,imm8 rol (imm4<<1)" (though the output from the instruction decoder will be fine). The second because "BLX LR" doesn't allow an offset. In Thumb2 these two instructions are a single 32 bit instruction and therefore must appear together and in order. In Thumb1 you can separate them by other instructions or even use one independently of the other. But you'd have to code them by hand using .word because the assembler will never do such a thing :-)

RISC-V: C.JAL and C.JALR do LR <- PC+2 not LR <- PC+4. (As of course does the 2nd 16 bit instruction in Thumb1 BL.)
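
The Thumb1 BL pair described above can be sketched in a few lines (semantics as described in the paragraph above, per the classic ARM7TDMI behaviour; the helper function and its names are mine, not anything from ARM):

```python
def thumb1_bl(pc, hi, lo):
    """Combine a Thumb-1 BL pair; pc is the address of the first half.
    hi = 11110:imm11, lo = 11111:imm11; PC reads as instruction address + 4."""
    off = hi & 0x7FF
    if off & 0x400:                        # sign-extend the high 11 bits
        off -= 0x800
    lr = (pc + 4) + (off << 12)            # first half: LR <- PC + imm11<<12
    target = lr + ((lo & 0x7FF) << 1)      # second half: PC <- LR + imm11<<1
    ret = (pc + 4) | 1                     # return address: after the pair, Thumb bit set
    return target, ret
```

Two independently executable 16-bit halves thus give BL roughly a +/-4 MB range.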

Bruce Hoult

Nov 9, 2019, 6:31:48 PM11/9/19
On Saturday, November 9, 2019 at 1:46:14 PM UTC-8, MitchAlsup wrote:
> In order to get a pointer to the subroutine just called, 360 performs a BALR
> to the next instruction, and then starts using it as a base register:
>
> Entry:
> STM R14,R12,12(R13)
> BALR R14,0 // branching to R0 means no branch is taken; this falls
> // through and delivers the address of the next instruction into R14
> USING *,R14 // this is an assembly directive telling the assembler
> // that R14 may be used as a base register for addressing
> >
> > If you have enough registers then this approach lets you scale a little bit past the limits of your 12 bit offset -- effectively another one or two bits, for example. But not unlimited :-)
>
> It is still software cruft making up for lack of ISA functionality. In the
> case of System 360, the designers can be forgiven (1960 design ships in 1964).

I consider this a perfectly acceptable trade-off, no matter what the year.

That's exactly how you did it on MIPS until they added AUIPC in MIPSr6! It also works in RISC-V, but is discouraged due to potentially confusing return address cache hardware.

Plus, on both MIPS and RISC-V the AUIPC lets you offset the resulting base address +/- 2GB from the PC. As does ARM64's ADRP instruction, but +/- 4GB and (perversely) clearing the 12 LSBs thus making the code NOT PIC unless you relocate only by multiples of 4K.
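
The hi/lo split those AUIPC-style instructions rely on can be sketched as follows (names are mine; the +0x800 rounding compensates for the consuming instruction sign-extending its 12-bit offset):

```python
def auipc_split(offset):
    """Split a signed PC-relative offset (within +/-2GB) into the upper-20 /
    lower-12 pair: hi goes in the AUIPC immediate, lo in the following
    load/store/add. The low part is sign-extended by its consumer, so the
    high part must absorb the carry."""
    hi = (offset + 0x800) >> 12        # carry-adjusted upper 20 bits
    lo = offset - (hi << 12)           # signed 12-bit remainder
    assert -2048 <= lo <= 2047
    return hi, lo
```

ARM64's ADRP differs in that it also drops the low 12 bits of the PC first, which is the 4K-relocation wrinkle mentioned above.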

Have you *SEEN* how PIC code works on 32-bit x86 even in 2019? (x86_64 at least has RIP-relative addressing.) The function calls a special runtime function that copies the return address off the stack into the function result register, and then returns. Three instructions including two control flow transfers and all three touching memory. FFS. Vastly worse than BALR R14,0.

Kyle Hayes

Nov 9, 2019, 6:35:56 PM11/9/19
On Friday, November 8, 2019 at 11:11:50 PM UTC-8, Bruce Hoult wrote:
> On Friday, November 8, 2019 at 3:12:26 PM UTC-8, Kyle Hayes wrote:
> > This came out long...
> >
> > In the discussions around Mitch's My 66000 a couple of interesting points have come up. At the same time there is an interesting discussion about RISC-V going on on Realworldtech.com.
>
> omg a lot of rubbish there. And a discussion in which Linus Torvalds is the voice of reason???? I actually agree with most of what he says there.

It is the Internet. The signal to noise ratio is low. Better than Slashdot or Reddit (but those do not set a high bar).

[snip]
> > One of the points made recently on Realworldtech.com that made me think a bit is that we now have so many transistors at our disposal that what used to be the bottleneck in decades past (e.g. ALU) isn't any more, and thus the tradeoffs of a decade or two ago may need to be revisited.
>
> It's true that the ALU probably isn't the critical path in cycle time at the moment. Certainly our core designers have told me that when I've asked about the possible impact of including some more complex operations in the BitManip extensions (for example generalized reverse and generalized OR-combine).
>

That's the point. If it is not in the critical path, what does it make sense to do in that stage? Of course, adding more functionality in each stage because it is "free" today creates the risk of tripping over some type of linkage between ISA and implementation. Think delayed branch slots. They were "free" too.

> > Single length instructions have some pros and cons:
> >
> > Pro:
> >
> > * if you align your instructions right, then you will not get instructions split across a page boundary.
>
>
> It's impossible to *not* get that property with a fixed instruction length. Well -- unless it's not a power of two (assuming your pages are).

Seems trivial: if all instructions are 4 bytes but they do not need to be aligned, then I could start an instruction at 0xFFFF and the last three bytes would be on a different page of memory. Am I missing something?
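
For reference, the arithmetic can be checked with a quick sketch (assuming 4 KB pages; the helper is hypothetical):

```python
PAGE = 4096

def crosses_page(addr, length=4):
    """True if an instruction of the given byte length straddles a page boundary."""
    return addr // PAGE != (addr + length - 1) // PAGE

# An unaligned 4-byte instruction at 0xFFFF spans two pages;
# any 4-byte-aligned one never can.
```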

> > Con:
> >
> > * lower overall code density.
>
> That's the mantra, but that assumes the variable length instruction designers do a good job. Aarch64 code is fixed length 32 bit instructions and yet is usually almost exactly the same density as x86_64 code.

As was stated by others, AMD64 code is not necessarily the best example of code density.

> All the most dense instruction sets I know of are either 16 bit fixed opcode or else in small multiples of 16 bits, typically just 16 and 32, possibly 16, 32, and 48.

This is my experience too.

[snip]

> > RISC-V and My 66000 allow for several lengths of instruction. Perhaps not entirely for the same reasons. ARM uses smaller instructions for code density in order to maximize the use of cache at a cost in dynamic instruction counts.
>
> Thumb1 cost dynamic instruction count, but it is fixed length 16 bit. Thumb2 which is 16 and 32 bit doesn't cost any appreciable amount of dynamic instruction count. The only exception I know of is comparing it to original ARM, where if the ARM code uses predication then Thumb2 needs an extra IT* instruction to cast a predication shadow over up to 4 following instructions. Most other ISAs don't have predication at all, so this doesn't arise in comparisons with them.

I will have to read up on Thumb2 again. It has been years. I had remembered that there still was a dynamic instruction count cost (though lower than Thumb1).

[snip interesting points on RISC-V Vector extension and 48/64 instructions]

> > Compressing an ISA gives you fewer bits for everything. Many compressed forms drop 3-register operations and go with destructive 2-register ops instead to save a register field. Compressing an ISA removes a lot of your immediate bits. That means that your constant range drops and thus solutions like PC-relative loads start looking a lot nicer. Which means that you really may not want to skimp on your addressing modes. You will likely want to have many kinds of scaling or other manipulation of your immediate fields just to make them more useful.
>
> Compressed instructions also don't have room to specify different addressing modes!

Yes. Fewer bits all around.

> > Having support for full-size constants in an ISA seems like it would have the following impacts:
> >
> > * fewer non-sequential fetches since the immediates are inline.
> > * possible sooner completion of immediate availability because of the PC-dependent load that would need to happen if you use PC-relative loads.
> > * a small negative impact on decoding if you have multiple supported immediate widths. I think this would be fairly minor though.
>
> All true, except:
> - putting together literals using multiple instructions also uses sequential fetches.

Good point. These are also possible candidates for instruction fusion, though the number of wasted bits in that case is high.

> - if you use a load from a constant pool it's not necessarily PC-relative. Relative to the Zero register or Global Pointer register makes sense too -- or perhaps from a general purpose register set up at the start of the function.

I was specifically calling out PC-relative as contrasted to inline constants. There are all kinds of things that could use a GP or special register like thread local storage, the GOT...

> Note that IBM 360 which also had 12 bit offsets in loads automatically set up multiple general purpose registers as base registers for global data/constants.

Interesting. I really need to learn more about it.

[snip]

Best,
Kyle

MitchAlsup

Nov 9, 2019, 6:36:46 PM11/9/19
On Saturday, November 9, 2019 at 5:05:07 PM UTC-6, Bruce Hoult wrote:
> On Saturday, November 9, 2019 at 11:38:44 AM UTC-8, BGB wrote:
> > On 11/9/2019 1:11 AM, Bruce Hoult wrote:
> > > Nevertheless, the assembly language syntax for the Vector instructions has been kept more general than can be used by the initial extension, with current instructions considered as a "compressed" subset of later 48 or 64 bit instructions.
> >
> > I don't really like regarding them as "compressed" personally, as this
> > doesn't really accurately reflect what is going on IMO.
> >
> > A "compressed" ISA would imply that there is a simple
> > mostly-instruction-independent mechanism to unpack the smaller ops into
> > their larger equivalents, which are then decoded as their larger
> > equivalent.
> >
> > Generally, this is not how it works in practice; the smaller ops
> > generally have their own decoders, and need not map exactly 1:1, ...
>
> I agree, the "compressed" terminology can be misleading. It certainly was for me when I first read about Thumb in the mid 90s. I had all kinds of visions of Huffman encoding, complex hardware decoders etc. But it's nothing like that.
>
> If someone out there hasn't looked into Thumb or RISC-V RVC or others in detail, "compressed" instructions simply mean providing an alternative instruction encoding for the same operation but using fewer bits by one or more of:
>
> - reusing one of the operands as the destination

Several CISC machines recognize::

MOV R5,R7
OP R5,R3
as:
OP R5,R7,R3

But, still, to me this seems to be a step backward.
>
> - making an operand or destination implicit (i.e. fixed)

<Shivering> I just don't know what to say.......
>
> - giving a smaller choice of registers

<Shivering>
>
> - giving a smaller choice of addressing modes

Since I only have 2, this bothers me little.
>
> - using smaller literals/offsets

Completely justified when you can get away with it.
>
> - not supporting optional features
>
> Every "compressed" instruction has an exact mapping to a full size instruction.

Where 'exact' means equal to or having more features available.

>The assembly-language programmer (or compiler) writes code for the full size instruction set and the assembler (and/or linker) selects the smaller encoding when it can.

Much preferable to having the compiler perform the choice.

MitchAlsup

Nov 9, 2019, 6:39:25 PM11/9/19
On Saturday, November 9, 2019 at 5:31:48 PM UTC-6, Bruce Hoult wrote:
> On Saturday, November 9, 2019 at 1:46:14 PM UTC-8, MitchAlsup wrote:
> > In order to get a pointer to the subroutine just called, 360 performs a BALR
> > to the next instruction, and then starts using it as a base register:
> >
> > Entry:
> > STM R14,R12,12(R13)
> > BALR R14,0 // branching to R0 means no branch is taken; this falls
> > // through and delivers the address of the next instruction into R14
> > USING *,R14 // this is an assembly directive telling the assembler
> > // that R14 may be used as a base register for addressing
> > >
> > > If you have enough registers then this approach lets you scale a little bit past the limits of your 12 bit offset -- effectively another one or two bits, for example. But not unlimited :-)
> >
> > It is still software cruft making up for lack of ISA functionality. In the
> > case of System 360, the designers can be forgiven (1960 design ships in 1964).
>
> I consider this a perfectly acceptable trade-off, no matter what the year.

The reason I do not consider this acceptable is that when one has access
to the IP and can use it as a base register, there is no need.
>
> That's exactly how you did it on MIPS until they added AUIPC in MIPSr6! It also works in RISC-V, but is discouraged due to potentially confusing return address cache hardware.
>
> Plus, on both MIPS and RISC-V the AUIPC lets you offset the resulting base address +/- 2GB from the PC. As does ARM64's ADRP instruction, but +/- 4GB and (perversely) clearing the 12 LSBs thus making the code NOT PIC unless you relocate only by multiples of 4K.
>
> Have you *SEEN* how PIC code works on x86_64 in 2019? The function calls a special runtime function that copies the return address off the stack into the function result register, and then returns. Three instructions including two control flow transfers and all three touching memory. FFS. Vastly worse than BALR R14,0(R13).

Probably true. But there are ways to get PIC without having to jump through
those kinds of hoops.

Kyle Hayes

Nov 9, 2019, 7:05:17 PM11/9/19
On Saturday, November 9, 2019 at 6:47:12 AM UTC-8, Bruce Hoult wrote:
> On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
> > Bruce Hoult <bruce...@gmail.com> writes:
> > >Don't agree on that. Aarch64 has some brilliant bits, but it also has some decisions which I think will prove to be very short-sighted.
> >
> > Such as?
>
> Such as the fixed size 32 bit instructions. ARM invented the modern RISC with both 16 bit and 32 bit instructions, beating every other ISA style for code density on compiled C code. It's what propelled them to superstardom. Astounding that they abandoned it for 64 bit at the cost of 30% code density compared to their own 32 bit ISA -- and to our 64 bit one. They did a darn good job for a fixed-length ISA, basically matching x86_64 for code size, and I guess they thought that was good enough at the time.

I have to agree. When I first read about AArch64, my first thought on seeing the lack of Thumb-like instructions was the same as your note above.

Tacking on a compressed version of an ISA later is never going to be quite as clean as planning for one in advance, but ARM _knew_ that Thumb(2) was a winner and still did not include it. Perhaps this was due to the long lead time on AArch64 (discussed later in the thread), when it was not yet clear how valuable such code density advantages would be even on large systems.

> At the same time, they're likely to run out of opcode space for extensions in the foreseeable future, as MIPS did.

I think all ISAs are going to have a problem with this unless they embrace the x86 model of prefixes. Modes may be a feasible way around this though the approach with Thumb1 did not impress me. Another possibility is splitting out some of the bits to extra registers or flags set in blocks like the predication in Thumb2 or My 66000, or the vectorization "prefix" in My 66000.

> There's basically no way to retrofit either 16 bit instructions or longer ones to the current Aarch64 instruction encoding. If they want to do it they'll either need to add a mode (ala Thumb1) or a completely new encoding. Maybe that's what ARMv9 will be.

It does seem weird that ARM did this.

> > >RISC-V certainly makes a very good low end processor, *especially* a low end 64 bit processor which Aarch64 can never do as long as they keep their strict "no subsetting" rule.
> >
> > Why would one want a 64-bit core that is lower-end than a Cortex-A34?
> > The silicon (and thus cost) of the A34 is tiny compared to the silicon
> > of 4GB of DRAM or even flash, so having something even smaller does
> > not save much money.
>
The only justification I *need* to give is that customers are lining up to write six figure cheques to license such cores. There is a *ton* of customer demand for cores with basically a Cortex M0 microarchitecture and code size but with 64 bit registers.

"People pay for it" is a pretty good justification. But people pay for fads all the time. Whether they will keep paying for it is the question.

That said, if people are requesting 64-bit registers, they probably have an actual need. Having 64-bit code usually has a pretty dire effect on code size and data use compared to the usual tiny embedded MCU. The tradeoffs must be worth it.

Can you share any of the needs? For deeply embedded chips, I can't see many obvious reasons for this. Those chips do not have lots of RAM so even 32-bit addresses should be more than enough. If they are using some sort of sparse addressing and virtual memory, then they are not what I think of as deeply embedded.

> > And the other competition is a 32-bit core (say, Cortex-M0) with some
> > extra stuff to allow accessing more memory.
>
> Extra stuff such as bank switching and address translation, sure. It's not a lot smaller than just making the damn registers 64 bits wide. Plus you can't just compile standard C code for it but have to muck about with that bank switching. It's awful. Surely the 8086-286 taught us that much?
>
> Some people want 64 bit for the address space but it's not necessarily RAM. It might be sparse. Or you might have other bigger 64 bit applications processors in the system but don't want to bother them with taking interrupts or transferring data between peripherals and memory. But you want something a little more than a dumb state machine DMA.

Does your argument here boil down to using the same ISA across all parts of the system? If so, I am not sure I buy it. The advantage of having a single ISA seems to be fairly minimal compared to the other factors like special instructions, code size etc.

> You might not want just one such core. Maybe you want dozens of them. Size starts to count.

As does code size.

> ARM won't sell you a 64 bit core without SIMD.
> ARM won't sell you a 64 bit core without FP.
> ARM won't sell you a 64 bit core without multiply&divide.
> ARM won't sell you a 64 bit core with 16 registers.
> ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.
> ARM won't let you add custom instructions -- historically. They're starting next year, with a single core. Guess why.

This points out the biggest risk to RISC-V in my opinion: ARM can always lower prices on some parts of the market and kill off RISC-V at the low end. ARM might lose a lot of money doing that though.

> Interestingly the one thing no one has asked us to remove so far is the expander for 16 bit instructions. From memory it's about 300 LUTs or 800 gates. You make that back (at ~1 byte per instruction on average) in smaller SRAM or icache if your code is more than a few hundred instructions.
>
> It's notable that where a particular 32 bit ARM core has 32 KB of icache, the comparable 64 bit core has 48 KB. That's expensive.
>
> Some people who want a 64 bit core don't care about large memory addressing at all. They just know that a significant part of energy use goes into instruction fetch and decode and control. Doubling the datapath width can, depending on your workload, nearly halve the number of instructions you have to fetch and decode and execute.

This seems like more DSP or SIMD use here and not generally 32 vs. 64 bit. Unless I am missing my guesses at what these processors do.

> > >As for high end .. we shall see. The U84 is only four years behind the A72. Last year's U74 was six years behind the A53. That looks like rapidly catching up to me.
> >
> > ARM has been trying to play catch-up with Intel/AMD in the server
> > market for quite a while. For now they have not succeeded.
>
> It's very different. That market has historically wanted to run legacy x86 applications.
>
> To whatever extent ARM does succeed in making the server market ISA-agnostic, we benefit from that too.

That's a good point.

> > > There are a ton of applications that can use 64 bit addressing but don't need huge performance.
> >
> > Such as? And why would a 32-bit core with some address-extension
> > extra not be a good fit?
>
> See above.
>
>
> > > And a lot of them want predictable execution times.
> >
> > Actually the worst-case execution time of caches is predictable and
> > better than that of main memory if the cache is designed appropriately
> > (true LRU replacement is good). I expect that one can also predict
> > the worst-case with dynamic branch predictors, but I don't know if the
> > result is better than with static branch prediction.
>
> It's not caches vs dynamic RAM. It's caches vs about the same size of static RAM. In many applications a couple of KB is enough.

And this is where I am confused. If you only have a few KB of RAM, then using a 64-bit processor would eat that up quickly. It is all about data density in RAM when you have so little.

If your algorithms profit from 64-bit registers, then it seems more likely that you are doing something DSP-like where either specialized instructions or SIMD might be a better way to do things.

Best,
Kyle

Ivan Godard

Nov 9, 2019, 7:44:36 PM11/9/19
On 11/9/2019 4:05 PM, Kyle Hayes wrote:
> On Saturday, November 9, 2019 at 6:47:12 AM UTC-8, Bruce Hoult wrote:
>> On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
>>> Bruce Hoult <bruce...@gmail.com> writes:
>>>> Don't agree on that. Aarch64 has some brilliant bits, but it also has some decisions which I think will prove to be very short-sighted.
>>>
>>> Such as?
>>
>> Such as the fixed size 32 bit instructions. ARM invented the modern RISC with both 16 bit and 32 bit instructions, beating every other ISA style for code density on compiled C code. It's what propelled them to superstardom. Astounding that they abandoned it for 64 bit at the cost of 30% code density compared to their own 32 bit ISA -- and to our 64 bit one. They did a darn good job for a fixed-length ISA, basically matching x86_64 for code size, and I guess they thought that was good enough at the time.
>
> I have to agree. When I first read about AArch64, my first thought on seeing the lack of Thumb-like instructions was the same as your note above.
>
> Tacking on a compressed version of an ISA later is never going to be quite as clean as planning for one in advance, but ARM _knew_ that Thumb(2) was a winner and still did not include it. Perhaps this was due to the long lead time on AArch64 (discussed later in the thread) when it was not as clear how such code density advantages could be even on large systems.
>
>> At the same time, they're likely to run out of opcode space for extensions in the foreseeable future, as MIPS did.
>
> I think all ISAs are going to have a problem with this unless they embrace the x86 model of prefixes. Modes may be a feasible way around this though the approach with Thumb1 did not impress me. Another possibility is splitting out some of the bits to extra registers or flags set in blocks like the predication in Thumb2 or My 66000, or the vectorization "prefix" in My 66000.

Or avoid the whole problem, by distinguishing the distribution ISA from
the execution ISA. After all, that's what microcode cores do, and the
same works one step up, too.

>> ARM won't sell you a 64 bit core without SIMD.
>> ARM won't sell you a 64 bit core without FP.
>> ARM won't sell you a 64 bit core without multiply&divide.
>> ARM won't sell you a 64 bit core with 16 registers.
>> ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.
>> ARM won't let you add custom instructions -- historically. They're starting next year, with a single core. Guess why.

Mill will.

You know, that's not a bad marketing slogan. Better than "Intel inside"
anyway.


Kyle Hayes

Nov 9, 2019, 7:58:53 PM11/9/19
On Saturday, November 9, 2019 at 4:44:36 PM UTC-8, Ivan Godard wrote:
> On 11/9/2019 4:05 PM, Kyle Hayes wrote:
> > On Saturday, November 9, 2019 at 6:47:12 AM UTC-8, Bruce Hoult wrote:

[snip]

> >> At the same time, they're [ARM] likely to run out of opcode space for extensions in the foreseeable future, as MIPS did.
> >
> > I think all ISAs are going to have a problem with this unless they embrace the x86 model of prefixes. Modes may be a feasible way around this though the approach with Thumb1 did not impress me. Another possibility is splitting out some of the bits to extra registers or flags set in blocks like the predication in Thumb2 or My 66000, or the vectorization "prefix" in My 66000.
>
> Or avoid the whole problem, by distinguishing the distribution ISA from
> the execution ISA. After all, that's what microcode cores do, and the
> same works one step up, too.

There is a long history of people trying to have one higher level ISA and compiling down to an execution ISA. Not a lot of successes along that path.

Notable exceptions include Java and IBM's i-series.

Now these are not necessarily what Mill is trying to do, so there may be an apples-to-oranges comparison problem here.

In the Mill's case you are designing both ISAs so you can do a pretty good job of matching up your code. The semantic gap is probably not going to be as large as that of Java to a "real" CPU or trying to map one "real" ISA to another "real" ISA.

It is the top-level ISA that will need to have instructions added over time.

> >> ARM won't sell you a 64 bit core without SIMD.
> >> ARM won't sell you a 64 bit core without FP.
> >> ARM won't sell you a 64 bit core without multiply&divide.
> >> ARM won't sell you a 64 bit core with 16 registers.
> >> ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.
> >> ARM won't let you add custom instructions -- historically. They're starting next year, with a single core. Guess why.
>
> Mill will.
>
> You know, that's not a bad marketing slogan. Better than "Intel inside"
> anyway.

Nice!

Best,
Kyle

Bruce Hoult

Nov 9, 2019, 8:12:22 PM11/9/19
On Saturday, November 9, 2019 at 3:36:46 PM UTC-8, MitchAlsup wrote:
> On Saturday, November 9, 2019 at 5:05:07 PM UTC-6, Bruce Hoult wrote:
> > On Saturday, November 9, 2019 at 11:38:44 AM UTC-8, BGB wrote:
> > > On 11/9/2019 1:11 AM, Bruce Hoult wrote:
> > > > Nevertheless, the assembly language syntax for the Vector instructions has been kept more general than can be used by the initial extension, with current instructions considered as a "compressed" subset of later 48 or 64 bit instructions.
> > >
> > > I don't really like regarding them as "compressed" personally, as this
> > > doesn't really accurately reflect what is going on IMO.
> > >
> > > A "compressed" ISA would imply that there is a simple
> > > mostly-instruction-independent mechanism to unpack the smaller ops into
> > > their larger equivalents, which are then decoded as their larger
> > > equivalent.
> > >
> > > Generally, this is not how it works in practice; the smaller ops
> > > generally have their own decoders, and need not map exactly 1:1, ...
> >
> > I agree, the "compressed" terminology can be misleading. It certainly was for me when I first read about Thumb in the mid 90s. I had all kinds of visions of Huffman encoding, complex hardware decoders etc. But it's nothing like that.
> >
> > If someone out there hasn't looked into Thumb or RISC-V RVC or others in detail, "compressed" instructions simply mean providing an alternative instruction encoding for the same operation but using fewer bits by one or more of:
> >
> > - reusing one of the operands as the destination
>
> Several CISC machines recognize::
>
> MOV R5,R7
> OP R5,R3
> as:
> OP R5,R7,R3
>
> But, still, to me this seems to be a step backward.

Certainly you might want to do that on x86, Thumb1, SuperH, PDP11, M68K and others where you *only* have two-address instructions.

This is not the case on what we are discussing, 32 bit opcode 3-address ISAs such as Thumb2 and RISC-V which have "compressed" 16 bit opcodes duplicating some of the functionality. If you want to do A = B + C then you use a single 32-bit instruction, not two 16-bit instructions (even though you *could* do that too).
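To make concrete how mechanical the 16-bit-to-32-bit mapping is in the simple cases, here is a sketch using the RISC-V C.ADD encoding (the function name is mine; real decoders, as BGB notes, handle the 16-bit forms directly rather than literally expanding them first):

```c
#include <stdint.h>

/* Sketch: expanding RISC-V C.ADD (16-bit) into the 32-bit ADD it
   duplicates.  C.ADD is funct4=1001, rd/rs1 in bits [11:7], rs2 in
   bits [6:2], op=10; ADD is funct7=0000000, rs2, rs1, funct3=000,
   rd, opcode 0110011. */
uint32_t expand_c_add(uint16_t insn)
{
    uint32_t rd  = (insn >> 7) & 0x1f;  /* destination, also first source */
    uint32_t rs2 = (insn >> 2) & 0x1f;  /* second source */
    return (rs2 << 20) | (rd << 15) | (rd << 7) | 0x33u;
}
```

For example, c.add a0,a1 (0x952E) expands to add a0,a0,a1 (0x00B50533): the same register field is wired to both rd and rs1, which is exactly the "reusing one of the operands as the destination" trick above.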


> > - making an operand or destination implicit (i.e. fixed)
>
> <Shivering> I just don't know what to say.......

In My 66000, may I refer you to:

BSR JSR r31 is an implicit destination!!!

BB BC BR BSR use r0 as an implicit base register for the offset.

All the above use r0 as an implicit destination.

The use of the PC as an implicit destination for jumps/branches and as an implicit source for relative branches is of course common to most ISAs, including RISC-V. It is, however, a choice and you *could* do things in another way.

What happens in My 66000 if you do ADD r0,r0,#48 ?

Using an implicit Link Register is also common in ISAs, but less universal. PDP11, for example, allowed you to specify any register as the Link Register (weirdly by modern standards, it pushed the old value on to the stack).

The base RISC-V 32 bit opcodes require you to say explicitly which register to use as a Link Register (and this is used for compiler intrinsics to use an alternate Link Register). It is only the 16 bit opcodes that implicitly assume x1 is the Link Register.

I'm happy that in My 66000 you:

- don't use an implicit stack pointer
- don't use an implicit condition code register
- don't use an implicit register for subroutine return (even though you do for call!)

The base RISC-V 32 bit opcodes also don't use implicit registers for any of those things. However, the "compressed" 16 bit opcodes do assume x1 is the Link Register (we call it RA Return Address) for calls and returns, and assume x2 is the stack pointer for SP-relative loads and stores, and for adjusting the SP on function entry and exit.


> > - giving a smaller choice of registers
>
> <Shivering>

I don't know why <Shivering> here. If you want to use a register not in the supported subset you just use a full size instruction instead.

EricP

Nov 9, 2019, 8:51:42 PM
Bruce Hoult wrote:
>
> Have you *SEEN* how PIC code works on x86_64 in 2019? The function calls a special runtime function that copies the return address off the stack into the function result register, and then returns. Three instructions including two control flow transfers and all three touching memory. FFS. Vastly worse than BALR R14,0(R13).

Is there something wrong with
LEA rax, [RIP]

Chris M. Thomasson

Nov 9, 2019, 9:13:24 PM
On 11/9/2019 4:44 PM, Ivan Godard wrote:
> On 11/9/2019 4:05 PM, Kyle Hayes wrote:
>> On Saturday, November 9, 2019 at 6:47:12 AM UTC-8, Bruce Hoult wrote:
>>> On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
>>>> Bruce Hoult <bruce...@gmail.com> writes:
[...]
> Mill will.

Nice!

>
> You know, that's not a bad marketing slogan. Better than "Intel inside"
> anyway.


Fwiw, Joe Seigh had a nice signature:

When you get lemons, you make lemonade.
When you get hardware, you make software.

Always made me laugh.


Bruce Hoult

Nov 9, 2019, 9:16:56 PM
You'd think. And I have seen that in generated code. But I've also seen the runtime library used, and recently. This year.

Compiling this...

double lookup(int n){
static double tab[] = {3.14,2.7};
return tab[n];
}

... with gcc 7.4.0 (hmm a bit old .. this is Ubuntu 18.04) I get an LEA for 64 bit code:

$ gcc -O1 -o pic pic.c && objdump -D pic
...
00000000000005fa <lookup>:
5fa: 48 63 ff movslq %edi,%rdi
5fd: 48 8d 05 9c 00 00 00 lea 0x9c(%rip),%rax # 6a0 <tab.1795>
604: f2 0f 10 04 f8 movsd (%rax,%rdi,8),%xmm0
609: c3 retq

But with 32 bit I get the thunk:

$ gcc -O1 -o pic pic.c -m32 && objdump -D pic
...
000004ed <lookup>:
4ed: e8 17 00 00 00 call 509 <__x86.get_pc_thunk.ax>
4f2: 05 ea 1a 00 00 add $0x1aea,%eax
4f7: 8b 54 24 04 mov 0x4(%esp),%edx
4fb: dd 84 d0 b4 e5 ff ff fldl -0x1a4c(%eax,%edx,8)
502: c3 ret

...

00000509 <__x86.get_pc_thunk.ax>:
509: 8b 04 24 mov (%esp),%eax
50c: c3 ret

I'm *sure* I've seen that in 64 bit code as well, but I don't recall the circumstances. Maybe it was some program that came with the OS or from elsewhere that I was debugging. I don't know.

Bruce Hoult

Nov 9, 2019, 9:33:11 PM
On Saturday, November 9, 2019 at 6:13:24 PM UTC-8, Chris M. Thomasson wrote:
> On 11/9/2019 4:44 PM, Ivan Godard wrote:
> > On 11/9/2019 4:05 PM, Kyle Hayes wrote:
> >> On Saturday, November 9, 2019 at 6:47:12 AM UTC-8, Bruce Hoult wrote:
> >>> On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
> >>>> Bruce Hoult <bruce...@gmail.com> writes:
> [...]
> > Mill will.
>
> Nice!

I'll be 1000x more impressed when it's "Mill Does"

Doesn't rhyme, but.

Mill has got a ton of very nice looking ideas, and I'd love to try it and use it. But I can't.

Same goes for My 66000. Very nice design. But it's paper. I'd buy a dev board with one in a metric heartbeat if it was under $1000.

I paid $999 of my own money for a HiFive Unleashed (1.5 GHz 4x U54 RV64IMAFDC, 1x RV64IMAC, 8 GB DDR4 2400, gigE) way back before I joined SiFive. Nice board. Tends to need about 1.3 GHz to match a 1.2 GHz Raspberry Pi 3 (ARM A53, 1 GB DDR3, 100baseT) on stuff that fits in cache (it's only single-issue vs dual issue), but creams it on anything that needs actual RAM or network.

I have a pretty good collection of Raspberry Pis (including the new A72 Pi4), Odroids (A15 and A53), Arduinos, and cheap RISC-V boards out of China (e.g. the new Longan Nano arrived a couple of weeks ago).

If it exists and looks interesting I'll pretty much buy one.

So c'mon guys...

Even a freaking FPGA bitstream for an Arty will do...

MitchAlsup

Nov 9, 2019, 10:41:31 PM
On Saturday, November 9, 2019 at 6:05:17 PM UTC-6, Kyle Hayes wrote:
> On Saturday, November 9, 2019 at 6:47:12 AM UTC-8, Bruce Hoult wrote:
> > On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
> > > Bruce Hoult <bruce...@gmail.com> writes:
> > > >Don't agree on that. Aarch64 has some brilliant bits, but it also has some decisions which I think will prove to be very short-sighted.
> > >
> > > Such as?
> >
> > Such as the fixed size 32 bit instructions. ARM invented the modern RISC with both 16 bit and 32 bit instructions, beating every other ISA style for code density on compiled C code. It's what propelled them to superstardom. Astounding that they abandoned it for 64 bit at the cost of 30% code density compared to their own 32 bit ISA -- and to our 64 bit one. They did a darn good job for a fixed-length ISA, basically matching x86_64 for code size, and I guess they thought that was good enough at the time.
>
> I have to agree. When I first read about AArch64, my first thought on seeing the lack of Thumb-like instructions was the same as your note above.
>
> Tacking on a compressed version of an ISA later is never going to be quite as clean as planning for one in advance, but ARM _knew_ that Thumb(2) was a winner and still did not include it. Perhaps this was due to the long lead time on AArch64 (discussed later in the thread) when it was not as clear how such code density advantages could be even on large systems.
>
> > At the same time, they're likely to run out of opcode space for extensions in the foreseeable future, as MIPS did.
>
> I think all ISAs are going to have a problem with this unless they embrace the x86 model of prefixes. Modes may be a feasible way around this though the approach with Thumb1 did not impress me. Another possibility is splitting out some of the bits to extra registers or flags set in blocks like the predication in Thumb2 or My 66000, or the vectorization "prefix" in My 66000.

Thank you, but My 66000 has plenty of room for future expansion. The Major OpCode field is only 70% filled, the 2-op group is only 44% filled, and the 1-op group is less than 5% filled. Only the 3-op group is over 50% filled, and I can add multiple 3-op groups as needed.
>
> > There's basically no way to retrofit either 16 bit instructions or longer ones to the current Aarch64 instruction encoding. If they want to do it they'll either need to add a mode (ala Thumb1) or a completely new encoding. Maybe that's what ARMv9 will be.
>
> It does seem weird that ARM did this.
>
> > > >RISC-V certainly makes a very good low end processor, *especially* a low end 64 bit processor which Aarch64 can never do as long as they keep their strict "no subsetting" rule.
> > >
> > > Why would one want a 64-bit core that is lower-end than a Cortex-A34?
> > > The silicon (and thus cost) of the A34 is tiny compared to the silicon
> > > of 4GB of DRAM or even flash, so having something even smaller does
> > > not save much money.
> >
> > The only justification I *need* to give is that customers are lining up to write six figure cheques to license such cores. There is a *ton* of customer demand for cores with basically a Cortex M0 microarchitecture and code size but with 64 bit registers.
>
> "People pay for it" is a pretty good justification. But people pay for fads all the time. Whether they will keep paying for it is the question.
>
> That said, if people are requesting 64-bit registers, they probably have an actual need. Having 64-bit code usually has a pretty dire effect on code size and data use compared to the usual tiny embedded MCU. The tradeoffs must be worth it.

64-bit code is no bigger than 32-bit code in My 66000, nor is it any smaller.
>
> Can you share any of the needs? For deeply embedded chips, I can't see many obvious reasons for this. Those chips do not have lots of RAM so even 32-bit addresses should be more than enough. If they are using some sort of sparse addressing and virtual memory, then they are not what I think of as deeply embedded.
>
> > > And the other competition is a 32-bit core (say, Cortex-M0) with some
> > > extra stuff to allow accessing more memory.
> >
> > Extra stuff such as bank switching and address translation, sure. It's not a lot smaller than just making the damn registers 64 bits wide. Plus you can't just compile standard C code for it but have to muck about with that bank switching. It's awful. Surely the 8086-286 taught us that much?
> >
> > Some people want 64 bit for the address space but it's not necessarily RAM. It might be sparse. Or you might have other bigger 64 bit applications processors in the system but don't want to bother them with taking interrupts or transferring data between peripherals and memory. But you want something a little more than a dumb state machine DMA.

This is one of the reasons to have a 2-level virtual-to-physical translation mechanism. The smaller applications can have their compressed 32-bit encoding,
which is then mapped into the real, much-larger-than-32-bit physical space.
>
> Does your argument here boil down to using the same ISA across all parts of the system? If so, I am not sure I buy it. The advantage of having a single ISA seems to be fairly minimal compared to the other factors like special instructions, code size etc.

For example, My 66000 has no (zero) instructions with privilege.
>
> > You might not want just one such core. Maybe you want dozens of them. Size starts to count.
>
> As does code size.
>
> > ARM won't sell you a 64 bit core without SIMD.
> > ARM won't sell you a 64 bit core without FP.
> > ARM won't sell you a 64 bit core without multiply&divide.
> > ARM won't sell you a 64 bit core with 16 registers.
> > ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.
> > ARM won't let you add custom instructions -- historically. They're starting next year, with a single core. Guess why.
>
> This points out the biggest risk to RISC-V in my opinion: ARM can always lower prices on some parts of the market and kill off RISC-V at the low end. ARM might lose a lot of money doing that though.

As long as they win, they won't care.

MitchAlsup

Nov 9, 2019, 10:48:35 PM
Actually it's R0 that is implicit.
>
> BB BC BR BSR use r0 as an implicit base register for the offset.

It uses IP, which is not a GPR.
>
> All the above use r0 as an implicit destination.
>
> The use of the PC as an implicit destination for jumps/branches and as an implicit source for relative branches is of course common to most ISAs, including RISC-V. It is, however, a choice and you *could* do things in another way.
>
> What happens in My 66000 if you do ADD r0,r0,#48 ?

R0 is not in any way weird, R0 holds values just like any other register.
There really ARE 32 64-bit registers. R0 gets the return address on CALLs,
and R0 gets substituted on certain instruction decodings.

Now, in certain instructions, specifying R0 causes the HW to substitute some other reasonable value: Rb = R0 causes IP to be substituted, and Ri = R0 causes ZERO to be substituted.
>
> Using an implicit Link Register is also common in ISAs, but less universal. PDP11, for example, allowed you to specify any register as the Link Register (weirdly by modern standards, it pushed the old value on to the stack).
>
> The base RISC-V 32 bit opcodes require you to say explicitly which register to use as a Link Register (and this is used for compiler intrinsics to use an alternate Link Register). It is only the 16 bit opcodes that implicitly assume x1 is the Link Register.
>
> I'm happy that in My 66000 you:
>
> - don't use an implicit stack pointer
> - don't use an implicit condition code register
> - don't use an implicit register for subroutine return (even though you do for call!)

It's a bit-count issue.
>
> The base RISC-V 32 bit opcodes also don't use implicit registers for any of those things. However, the "compressed" 16 bit opcodes do assume x1 is the Link Register (we call it RA Return Address) for calls and returns, and assume x2 is the stack pointer for SP-relative loads and stores, and for adjusting the SP on function entry and exit.
>
>
> > > - giving a smaller choice of registers
> >
> > <Shivering>
>
> I don't know why <Shivering> here. If you want to use a register not in the supported subset you just use a full size instruction instead.

Maybe I have listened to too many compiler writers (or wrongers!)

MitchAlsup

Nov 9, 2019, 10:49:30 PM
Not if you want to load the bit pattern immediately following the current
instruction.

MitchAlsup

Nov 9, 2019, 10:53:24 PM
On Saturday, November 9, 2019 at 8:16:56 PM UTC-6, Bruce Hoult wrote:
> On Saturday, November 9, 2019 at 5:51:42 PM UTC-8, EricP wrote:
> > Bruce Hoult wrote:
> > >
> > > Have you *SEEN* how PIC code works on x86_64 in 2019? The function calls a special runtime function that copies the return address off the stack into the function result register, and then returns. Three instructions including two control flow transfers and all three touching memory. FFS. Vastly worse than BALR R14,0(R13).
> >
> > Is there something wrong with
> > LEA rax, [RIP]
>
> You'd think. And I have seen that in generated code. But I've also seen the runtime library used, and recently. This year.
>
> Compiling this...
>
> double lookup(int n){
> static double tab[] = {3.14,2.7};
> return tab[n];
> }

SECT DATA
Label1:
DWORD 3.14
DWORD 2.7

SECT CODE
lookup:
LDD R1,[IP+R0<<3+Label1]
RET R0

Bruce Hoult

Nov 9, 2019, 11:11:44 PM
On Saturday, November 9, 2019 at 4:05:17 PM UTC-8, Kyle Hayes wrote:
> On Saturday, November 9, 2019 at 6:47:12 AM UTC-8, Bruce Hoult wrote:
> > At the same time, they're likely to run out of opcode space for extensions in the foreseeable future, as MIPS did.
>
> I think all ISAs are going to have a problem with this unless they embrace the x86 model of prefixes. Modes may be a feasible way around this though the approach with Thumb1 did not impress me. Another possibility is splitting out some of the bits to extra registers or flags set in blocks like the predication in Thumb2 or My 66000, or the vectorization "prefix" in My 66000.

RISC-V has defined a system for encoding the length of instructions up to 176 bits (22 bytes) long within the first 16 bit parcel of the instruction, while keeping the destination register and up to three source registers in the first 32 bits of the instruction.

I'm not happy with the currently proposed scheme as I think it leaves too few free bits in the first 32 bits for 80 bit instructions, in particular. I'd like to be able to add a few instructions containing a 64 bit literal (both integer and fp) in an 80 bit instruction. The good news is the length encoding for instructions longer than 32 bits has not yet been ratified, and as far as we know no one has built any hardware using them yet. And the BDFL(s) seem open to changing the scheme.
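For readers who haven't seen it, that draft length scheme works roughly as follows. This is a sketch of the unratified encoding from the current spec draft (exactly the part that may still change, as noted above), and the function name is mine:

```c
#include <stdint.h>

/* Sketch of the *draft* RISC-V length encoding: the low bits of the
   first 16-bit parcel give the total instruction length in bits.
   Unratified, so subject to change. */
int insn_len_bits(uint16_t parcel)
{
    if ((parcel & 0x03) != 0x03) return 16;  /* compressed (RVC) */
    if ((parcel & 0x1c) != 0x1c) return 32;  /* standard 32-bit */
    if ((parcel & 0x3f) == 0x1f) return 48;
    if ((parcel & 0x7f) == 0x3f) return 64;
    /* bits [6:0] == 1111111: length is 80 + 16*nnn, nnn in bits [14:12] */
    int nnn = (parcel >> 12) & 0x7;
    if (nnn != 7) return 80 + 16 * nnn;      /* 80 .. 176 bits */
    return -1;                               /* reserved for >= 192 bits */
}
```

Note how an 80-bit instruction burns seven bits of the first parcel plus the three-bit nnn field, which is the "too few free bits left in the first 32 bits" complaint above.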


> That said, if people are requesting 64-bit registers, they probably have an actual need. Having 64-bit code usually has a pretty dire effect on code size and data use compared to the usual tiny embedded MCU. The tradeoffs must be worth it.

Could you perhaps explain what you mean by "dire effect on code size and data use"?

I'm not aware of any such effect, not even in a minor way. Code compiled for RV64 is within epsilon of the size of code doing the same computation in any 8, 16, or 32 bit ISA of which I am aware. Certainly for compiled code.

Some 8 or 16 bit ISAs (including x86) are susceptible to tricky stuff in hand-written code for micro-benchmarks but that tends not to scale. For example you can code a very compact and reasonable performance memcpy() or strlen() in x86 assembler, but it's not smaller than a function call, so once your program is significantly bigger than the total of a handful of library routines it just doesn't matter any more.

Stack use grows a bit because of saving and restoring 64 bit registers instead of 32 bit. If your code calls functions 10 deep and each function saves on average 4 registers then that's an extra 160 bytes of stack. Unlikely to be a deal-breaker.

Data structures full of pointers get bigger if you store raw 64 bit pointers. If you've only got a few KB of RAM then you can store 16 bit offsets and add them to a base register.
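A minimal sketch of that offset trick in C (names and sizes are illustrative):

```c
#include <stdint.h>

/* Sketch: on a small-RAM system, link data structures with 16-bit
   pool indices instead of raw 64-bit pointers.  Slot 0 is reserved
   as the "null" index. */
typedef struct Node {
    uint16_t next;    /* index of the next node in the pool; 0 = end */
    uint16_t value;
} Node;

static Node pool[1024];   /* 4 KB arena holding every node */

/* Walk a list linked by indices rather than pointers, summing values. */
unsigned sum_list(uint16_t head)
{
    unsigned total = 0;
    for (uint16_t i = head; i != 0; i = pool[i].next)
        total += pool[i].value;
    return total;
}
```

Each link costs 2 bytes instead of 8, at the price of one indexed-address computation per dereference.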


> Can you share any of the needs? For deeply embedded chips, I can't see many obvious reasons for this. Those chips do not have lots of RAM so even 32-bit addresses should be more than enough. If they are using some sort of sparse addressing and virtual memory, then they are not what I think of as deeply embedded.

I really can't. If customers want it known what they are doing then they make a press release :-) Probably a joint one. S2 (Cortex M0-like 64 bit) was announced in April and no customers have gone public on it yet. A number of customers have gone public on S5/E5 use (the much larger 64 bit embedded core with a 5-stage pipeline).


> > > And the other competition is a 32-bit core (say, Cortex-M0) with some
> > > extra stuff to allow accessing more memory.
> >
> > Extra stuff such as bank switching and address translation, sure. It's not a lot smaller than just making the damn registers 64 bits wide. Plus you can't just compile standard C code for it but have to muck about with that bank switching. It's awful. Surely the 8086-286 taught us that much?
> >
> > Some people want 64 bit for the address space but it's not necessarily RAM. It might be sparse. Or you might have other bigger 64 bit applications processors in the system but don't want to bother them with taking interrupts or transferring data between peripherals and memory. But you want something a little more than a dumb state machine DMA.
>
> Does your argument here boil down to using the same ISA across all parts of the system? If so, I am not sure I buy it. The advantage of having a single ISA seems to be fairly minimal compared to the other factors like special instructions, code size etc.

We have among the best code size, period, on any ISA. Indisputably the best for 64 bit ISAs. We can add special instructions. Having a widely-used base instruction set and being able to share the costs of compiler maintenance with a large community tends to give much better results than using your own private ISA or even a minor one such as MIPS, ARC, Xtensa etc. Our tooling might not be quite up to x86 and ARM maturity yet, but it's already better than everything *else*.


> > You might not want just one such core. Maybe you want dozens of them. Size starts to count.
>
> As does code size.

Definitely. See above.


> > ARM won't sell you a 64 bit core without SIMD.
> > ARM won't sell you a 64 bit core without FP.
> > ARM won't sell you a 64 bit core without multiply&divide.
> > ARM won't sell you a 64 bit core with 16 registers.
> > ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.
> > ARM won't let you add custom instructions -- historically. They're starting next year, with a single core. Guess why.
>
> This points out the biggest risk to RISC-V in my opinion: ARM can always lower prices on some parts of the market and kill off RISC-V at the low end. ARM might lose a lot of money doing that though.

It's mostly not about price. We charge good money for our cores. As long as ARM doesn't change the things in the above list we'd still get most of the same customers even if ARM cores were free. After all, we're already competing against a number of pretty good RISC-V cores available for free on github, including Rocket (which we maintain), Western Digital's SweRV, and others.

ARM could change any of the things in the above list labelled "won't" pretty much with the stroke of a pen. Well, plus quite a bit of RTL configuration work and compiler and runtime library work (that we've already done). That would certainly make things a bit more difficult for us.


> And this is where I am confused. If you only have a few KB of RAM, then using a 64-bit processor would eat that up quickly. It is all about data density in RAM when you have so little.

The data layout in RAM doesn't need to change just because you're using a 64 bit CPU.
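One way to see this: with fixed-width types, a structure's in-memory layout is identical under the common 32- and 64-bit ABIs. A sketch (assuming the usual ILP32/LP64 conventions, where only pointers and `long` widen):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: this struct occupies 8 bytes whether compiled for RV32 or
   RV64 -- the RAM layout doesn't change just because the registers
   are 64 bits wide. */
struct sample {
    uint16_t id;
    uint16_t flags;
    uint32_t count;
};

size_t sample_size(void)
{
    return sizeof(struct sample);
}
```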


> If your algorithms profit from 64-bit registers, then it seems more likely that you are doing something DSP-like where either specialized instructions or SIMD might be a better way to do things.

Perhaps. In which case you can add in SIMD (maybe just the instructions you'll actually use). You don't need to add in FP and SHA and goodness knows what else at the same time, if you won't use them.

Bruce Hoult

Nov 9, 2019, 11:34:03 PM
On Saturday, November 9, 2019 at 7:48:35 PM UTC-8, MitchAlsup wrote:
> On Saturday, November 9, 2019 at 7:12:22 PM UTC-6, Bruce Hoult wrote:
> > In My 66000, may I refer you to:
> >
> > BSR JSR r31 is an implicit destination!!!
>
> Actually its R0 that i implicit
> >
> > BB BC BR BSR use r0 as an implicit base register for the offset.
>
> It uses IP which is not an GPR.

Hmm .. maybe I have misinterpreted the following.

http://hoult.org/my66000_regs.png
http://hoult.org/my66000_reg_org.png


> > All the above use r0 as an implicit destination.
> >
> > The use of the PC as an implicit destination for jumps/branches and as an implicit source for relative branches is of course common to most ISAs, including RISC-V. It is, however, a choice and you *could* do things in another way.
> >
> > What happens in My 66000 if you do ADD r0,r0,#48 ?
>
> R0 is not in any way weird, R0 holds values just like any other register.
> There really ARE 32 64-bit registers. R0 gets the return address on CALLs,
> and R0 gets substituted on certain instruction decodings.

The document I have shows r31 getting the return address. See both the above screenshots.

Feel free to shoot me a newer document at br...@hoult.org


> Now, in certain instructions, specifying R0 cause the HW to substitute some other reasonable value, Rb = R0 cause IP to be substituted, Ri causes ZERO
> to be substituted.

Ok, I see. So in other contexts .. in particular arithmetic, r0 is a regular register. It's only in certain contexts that IP or ZERO is substituted, and the original data value is still there for arithmetic and other things afterwards.

That seems like .. a lot of complexity .. also for compiler and other tool writers .. just to get sorta kinda 34 registers instead of 32 :-)

Terje Mathisen

Nov 10, 2019, 5:29:42 AM
Bruce Hoult wrote:
> On Saturday, November 9, 2019 at 4:05:17 PM UTC-8, Kyle Hayes wrote:
>> That said, if people are requesting 64-bit registers, they probably
>> have an actual need. Having 64-bit code usually has a pretty dire
>> effect on code size and data use compared to the usual tiny
>> embedded MCU. The tradeoffs must be worth it.
>
> Could you perhaps explain what you mean by "dire effect on code size
> and data use"?
>
> I'm not aware of any such effect, not even in a minor way. Code
> compiled for RV64 is within epsilon of the size of code doing the
> same computation in any 8, 16, or 32 bit ISA of which I am aware.
> Certainly for compiled code.
>
> Some 8 or 16 bit ISAs (includng x86) are susceptible to tricky stuff
> in hand-written code for micro-benchmarks but that tends not to
> scale. For example you can code a very compact and reasonable
> performance memcpy() or strlen() in x86 assembler, but it's not
> smaller than a function call, so once your program is significantly
> bigger than the total of a handful of library routines it just
> doesn't matter any more.

Let's see:

memcpy(dest, sorc, len) in 16 or 32-bit x86 should just have been

rep movsb ;; 2 bytes, vs 3-5 bytes for a CALL

as long as the compiler can use the fact that it needs a memcpy to steer
register allocation.

Worst case you need three reg-reg moves up front to get everything setup
right.

strlen(src)

mov cx,-1
next:
lodsb
test al,al ; lodsb sets no flags, so test AL for the NUL
loopnz next
mov ax,-2
sub ax,cx ; AX = -2 - CX = length, excluding the NUL

so this is 13 bytes, i.e. the CALL is significantly shorter and could
use much higher performance code.

already...@yahoo.com

Nov 10, 2019, 5:31:05 AM
If I read your docs/slides correctly then:
- Mill will not sell you a Cortex-M0-sized 64-bit core. Even a minimally configured Tin is much bigger than a Cortex-M3.
- Mill will not sell you a 64-bit core with code density comparable to Thumb2. At best, comparable to AArch64, and even that is not proven.

Anton Ertl

Nov 10, 2019, 5:35:22 AM
Bruce Hoult <bruce...@gmail.com> writes:
>On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
>> Bruce Hoult <bruce...@gmail.com> writes:
>> >Don't agree on that. Aarch64 has some brilliant bits, but it also has some decisions which I think will prove to be very short-sighted.
>>
>> Such as?
>
>Such as the fixed size 32 bit instructions. ARM invented the modern RISC with both 16 bit and 32 bit instructions, beating every other ISA style for code density on compiled C code. It's what propelled them to superstardom.

Doubtful. I think that ARM succeeded by focussing on the embedded and
the low-power market rather than trying to compete at the high end
during the GHz race years. And they succeeded by developing a
business model for this market: selling design and architecture
licenses rather than selling chips.

>Astounding that they abandoned it for 64 bit at the cost of 30% code density compared to their own 32 bit ISA -- and to our 64 bit one.

They obviously don't consider it worth the complication for 64-bit
systems. Makes sense to me. They have an ISA for low-end systems:
T32. No need to complicate A64 (Aarch64) with that. Note that
ARMv8-M supports "a variant of" T32, not A64.

>At the same time, they're likely to run out of opcode space for extensions in the foreseeable future, as MIPS did.

Can you make the "foreseeable" more concrete? Which year? Or what
ISA extension do you foresee that will make them run into this
problem?

>> Why would one want a 64-bit core that is lower-end than a Cortex-A34?
>> The silicon (and thus cost) of the A34 is tiny compared to the silicon
>> of 4GB of DRAM or even flash, so having something even smaller does
>> not save much money.
>
>The only justification I *need* to give is that customers are lining up to write six figure cheques to license such cores.

That does not answer my question at all. And it might be that your
license is just cheaper than whatever ARM is charging (and as for the
customers paying you more than for free cores, they may not be
comfortable with free stuff for CYA reasons).

>There is a *ton* of customer demand for cores with basically a Cortex M0 microarchitecture and code size but with 64 bit registers.

Apparently not enough for ARM to actually do that. Such a core should
be relatively easy for ARM to do.

Interestingly, what ARM offers is the Cortex-A32, a 32-bit (Aarch32)
only ARMv8-A implementation. Not sure why one would want that,
either.

>> And the other competition is a 32-bit core (say, Cortex-M0) with some
>> extra stuff to allow accessing more memory.
>
>Extra stuff such as bank switching and address translation, sure. It's not a lot smaller than just making the damn registers 64 bits wide. Plus you can't just compile standard C code for it but have to muck about with that bank switching. It's awful. Surely the 8086-286 taught us that much?

You cannot use a regular ARM compiler for these accesses, yes. As for
standard C code, a lot of the nasal demons in the C standard come from
architectures like the 8086, so if you have standard C code (a big if)
and you have a compiler that maps C to this extended architecture, you
can just use it.

However, my idea was that the 64-bit accesses would only be used in a
few routines, and that the programmer would muck around explicitly
with the banks there.

>Some people want 64 bit for the address space but it's not necessarily RAM. It might be sparse.

For what?

>Or you might have other bigger 64 bit applications processors in the system but don't want to bother them with taking interrupts or transferring data between peripherals and memory. But you want something a little more than a dumb state machine DMA.

I.e., maybe a processor on a PCIe card, or something like Intel's
Management Engine or AMDs PSP. Yes, that would be an application that
makes sense.

>You might not want just one such core. Maybe you want dozens of them.

For what?

>ARM won't sell you a 64 bit core without SIMD.
>ARM won't sell you a 64 bit core without FP.
>ARM won't sell you a 64 bit core without multiply&divide.
>ARM won't sell you a 64 bit core with 16 registers.

Good. The 32-bit ISAs of ARM are a mess. If you look up a
description of an instruction, you find that it's optional, and you
have no idea how much using that instruction will limit the reach of
your program. Conversely, the fact that AMD64 guarantees SSE2 was a
big advantage over IA-32 which did not: programs compiled for AMD64
could just use SSE2 for FP instead of having to resort to the 387
instruction set.

So the fact that Aarch64 guarantees all these features is an
advantage, and RISC-V will feel the disadvantage of having so little
guaranteed once it tries to expand into the general-purpose computing
space.

>ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.

IMO Code density is overrated in this newsgroup.

>It's notable that where a particular 32 bit ARM core has 32 KB of icache, the comparable 64 bit core has 48 KB. That's expensive.

Which cores do you compare?

Looking at the data sheet of the RK3399
<http://www.rockchip.fr/RK3399%20datasheet%20V1.8.pdf>, I see the following:

2 Cortex-A72s: 48KB I-cache, 32KB 4-way set-associative D-cache, 1MB
L2 shared between these cores.

4 Cortex-A53: 32KB I-cache, 32KB 4-way set-associative D-cache, 512KB
L2 shared between the cores.

Do they use 8KB pages?

The data sheet of the AmLogic S922X (with Cortex-A73) does not specify
the cache sizes.

In any case, Intel has lived with 32KB I-caches on their
high-performance cores for many years and continues to do so in Sunny
Cove (while increasing the D-cache size), and according to you the
code density is the same as that of Aarch64, so the 48KB of the A72 in
the RK3399 is probably just because they could make it that large
without incurring another cycle of latency.

>Some people who want a 64 bit core don't care about large memory
>addressing at all. They just know that a significant part of energy use
>goes into instruction fetch and decode and control. Doubling the
>datapath width can, depending on your workload, nearly halve the number
>of instructions you have to fetch and decode and execute.

A good reason to use SIMD instructions, then, not a 64-bit core.
That's if you have SIMD instructions.

>> >As for high end .. we shall see. The U84 is only four years behind
>> >the A72. Last year's U74 was six years behind the A53. That looks
>> >like rapidly catching up to me.
>>
>> ARM has been trying to play catch-up with Intel/AMD in the server
>> market for quite a while. For now they have not succeeded.
>
>It's very different. That market has historically wanted to run legacy
>x86 applications.

Not really. There are lots of servers running only free software that
is also available for, e.g., Aarch64.

The issues are not in the applications, but in the fringes:

Booting and therefore installing an OS on an IA-32/AMD64 system is
more standardized and better streamlined than for Aarch64; e.g., I can
boot all kinds of systems from the same USB stick or HDD (the
BIOS->UEFI transition introduced a little fragmentation, however); by
contrast, I cannot use the SD cards of the Odroid C2, Odroid N2,
RockPro64, and Raspi4 interchangeably, even though they all support
Aarch64 and Aarch32.

I have to invest extra work to get at the performance counters of
these Aarch64 machines, and have not succeeded for the Odroid N2, and
only partially for the RockPro64 (and have not tried on the Raspi4),
because the custom kernels they come with are supplied without perf,
and getting a stock kernel to work on them requires extra work. Of
course, not that many people need performance counters, but many need
some other fringe specialty that works out of the box on AMD64 boxes,
but acts up on Aarch64.

The booting issue is also a problem on Alpha and PowerPC (and probably
others), just read the relevant FAQs. Apparently the natural state of
booting is fragmented, and it requires dedication to get to a common
booting system (or now two) as present on AMD64.

>To whatever extent ARM does succeed in making the server market
>ISA-agnostic, we benefit from that too.

Only if you have an ISA rather than a fragmented mess of options.

Fragmentation hurts in general-purpose computing, big time.

>> > And a lot of them want predictable execution times.
>>
>> Actually the worst-case execution time of caches is predictable and
>> better than that of main memory if the cache is designed appropriately
>> (true LRU replacement is good). I expect that one can also predict
>> the worst-case with dynamic branch predictors, but I don't know if the
>> result is better than with static branch prediction.
>
>It's not caches vs dynamic RAM. It's caches vs about the same size of
>static RAM. In many applications a couple of KB is enough.

Then of course you don't need caches. But you also don't need 64-bit
addressing.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

Nov 10, 2019, 5:55:48 AM
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Looking at the data sheet of the RK3399
><http://www.rockchip.fr/RK3399%20datasheet%20V1.8.pdf>, I see: that the
>
>2 Cortex-A72s: 48KB I-cache, 32KB 4-way set-associative D-cache, 1MB
>L2 shared between these cores.
>
>4 Cortex-A53: 32KB I-cache, 32KB 4-way set-associative D-cache, 512KB
>L2 shared between the cores.
>
>Do they use 8KB pages?

getconf PAGESIZE

on the Rockpro64 outputs "4096". So I wonder why they are using 8KB
ways, which means they have extra complications when implementing the
D-caches as virtually indexed, physically tagged.

Or maybe the data sheet is incorrect about the associativity.
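The complication alluded to here can be made concrete with a little arithmetic. The sizes below come from the data sheet and getconf; treating the way size as the virtually-indexed span is a standard VIPT argument, not something the data sheet states:

```python
# Why an 8KB way complicates a virtually-indexed, physically-tagged
# D-cache: the set index is taken from the virtual address, but only
# the page-offset bits are guaranteed identical in the virtual and
# physical addresses.
cache_size = 32 * 1024                   # 32KB D-cache (data sheet)
ways       = 4                           # 4-way set-associative
page_size  = 4 * 1024                    # getconf PAGESIZE -> 4096

way_size   = cache_size // ways          # bytes indexed within one way
index_bits = way_size.bit_length() - 1   # address bits used to pick a line
page_bits  = page_size.bit_length() - 1  # bits shared by VA and PA

alias_bits = index_bits - page_bits      # >0 means possible synonyms
print(way_size, alias_bits)              # 8192 1
```

One aliasing bit means the same physical line can live in two different sets unless the hardware (or OS page coloring) deals with it.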

Ivan Godard

Nov 10, 2019, 6:27:52 AM
to
I'm not familiar with the internals of Cortex, but the Mill's
configuration tools are very flexible and I suspect we can reach
anything that any other 64-bit ISA can reach. Tin is not the smallest
configurable Mill; it's the smallest we want to bother selling.

In particular, it is possible to configure a Mill with no caches, TLB,
floating point, scratchpad, or predictors, and with only single issue
per side. The tools do presently require the two sides, but there's
nothing to prevent enhancing that to use single-stream single issue;
split stream and bundle encoding really is only necessary to deal with
wide Mills.

For the data paths, the belt can be any size big enough to hold the
arguments of the widest operation. At present the widest ops are
triadic, so a 4-wide belt is configurable, although the smallest we have
tested is 8-wide. I'm not aware of any 64-bit chip with ISAs with only
four registers.

For the FUs, you do need to configure one each load, store, branch, and
ALU. Can you send me a link to the ISA description of something that
gets by with fewer?

If you can live within a 4GB address space Mill can be configured as a
32-bit core. I'm sure someone will try to do a 16-bit Mill someday too,
although I doubt its usefulness.

We can and will sell such sub-Tin configs to those who want them and are
willing to pay for them.

> - Mill will not sell you a 64 bit core with code density comparable to Thumb2. At best, comparable to aarch64, but even that not proven.

Again, I've never used Thumb2 and have never tried to run
program-to-program comparisons. We have done comparisons against 286
code and come out roughly similar when compiling -Os, depending on the
particular test code and config. How does Thumb compare to 286?

I do wonder how you got the measurements you used for the Thumb2
comparison you cite. Perhaps we have a virus?

already...@yahoo.com

Nov 10, 2019, 6:57:39 AM
to
On Sunday, November 10, 2019 at 1:27:52 PM UTC+2, Ivan Godard wrote:
> On 11/10/2019 2:31 AM, already...@yahoo.com wrote:
> > On Sunday, November 10, 2019 at 2:44:36 AM UTC+2, Ivan Godard wrote:
>
> >>>> ARM won't sell you a 64 bit core without SIMD.
> >>>> ARM won't sell you a 64 bit core without FP.
> >>>> ARM won't sell you a 64 bit core without multiply&divide.
> >>>> ARM won't sell you a 64 bit core with 16 registers.
> >>>> ARM can't sell you a 64 bit core with comparable code density to their own 32 bit cores.
> >>>> ARM won't let you add custom instructions -- historically. They're starting next year, with a single core. Guess why.
> >>
> >> Mill will.
> >>
> >
> > If I read your docs/slides correctly then:
> > - Mill will not sell you Cortex-M0 sized 64-bit core. Even minimally configured Tin is much bigger than Cortex-M3.
>
>
> I'm not familiar with the internals of Cortex, but the Mill's
> configuration tools are very flexible and I suspect we can reach
> anything that any other 64-bit ISA can reach. Tin is not the smallest
> configurable Mill;


The M3 has a 3-stage pipeline, is 1-wide, and has no caches, FPU, or address translation. The memory protection unit is optional. For a 1-wide core it has pretty high IPC (unlike the M0), but it's still a rather small core.

> it's the smallest we want to bother selling.

We were talking about things that you plan to sell rather than something that could exist in theory.

>
> In particular, it is possible to configure a Mill with no caches, TLB,
> floating point, scratchpad, or predictors, and with only single issue
> per side. The tools do presently require the two sides, but there's
> nothing to prevent enhancing that to use single-stream single issue;
> split stream and bundle encoding really is only necessary to deal with
> wide Mills.
>
> For the data paths, the belt can be any size big enough to hold the
> arguments of the widest operation. At present the widest ops are
> triadic, so a 4-wide belt is configurable, although the smallest we have
> tested is 8-wide. I'm not aware of any 64-bit chip with ISAs with only
> four registers.
>
> For the FUs, you do need to configure one each load, store, branch, and
> ALU. Can you send me a link to the ISA description of something that
> gets by with fewer?
>
> If you can live within a 4GB address space Mill can be configured as a
> 32-bit core. I'm sure someone will try to do a 16-bit Mill someday too,
> although I doubt its usefulness.
>
> We can and will sell such sub-Tin configs to those who want them and are
> willing to pay for them.
>
> > - Mill will not sell you a 64 bit core with code density comparable to Thumb2. At best, comparable to aarch64, but even that not proven.
>
> Again, I've never used Thumb2 and have never tried to run
> program-to-program comparisons. We have done comparisons against 286
> code and come out roughly similar when compiling -Os, depending on the
> particular test code and config. How does Thumb compare to 286?

Thumb2 is a 32-bit ISA. I am not sure how to compare it to a 16-bit ISA like the 286.
Results would depend very strongly on the presence or absence of 32-bit arithmetic and of >64KB addressing in the test case.
Also, the 286 has a very non-orthogonal register set. That means an asm coder or a very good compiler can beat a less good compiler by a huge margin. What compiler were you using? -Os sounds like something from the gcc world, but gcc does not support a 286 target.

>
> I do wonder how you got the measurements you used for the the Thumb2
> comparison you cite. Perhaps we have a virus?

Thumb2 is ~20% better than i386.
In the past you claimed near-parity with i386, but I would think that was for 32-bit code. 64-bit code is necessarily slightly less dense than 32-bit. Or maybe not, if the code is small and the tools are smart.

Terje Mathisen

Nov 10, 2019, 7:06:14 AM
to
Bruce Hoult wrote:
> But with 32 bit I get the thunk:
>
> $ gcc -O1 -o pic pic.c -m32 && objdump -D pic
> ...
> 000004ed <lookup>:
>  4ed:  e8 17 00 00 00         call   509 <__x86.get_pc_thunk.ax>
>  4f2:  05 ea 1a 00 00         add    $0x1aea,%eax
>  4f7:  8b 54 24 04            mov    0x4(%esp),%edx
>  4fb:  dd 84 d0 b4 e5 ff ff   fldl   -0x1a4c(%eax,%edx,8)
>  502:  c3                     ret
>
> ...
>
> 00000509 <__x86.get_pc_thunk.ax>:
>  509:  8b 04 24               mov    (%esp),%eax
>  50c:  c3                     ret
>
> I'm *sure* I've seen that in 64 bit code as well, but I don't recall
> the circumstances. Maybe it was some program that came with the OS or
> from elsewhere that I was debugging. I don't know.
>
I'm pretty sure that was a simple compiler/rtl performance bug in the
case where you saw it, since RIP-relative addressing was one of the key
additions to the AMD64 extension.

Terje Mathisen

Nov 10, 2019, 7:55:21 AM
to
Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
>> On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
>>> And the other competition is a 32-bit core (say, Cortex-M0) with some
>>> extra stuff to allow accessing more memory.
>>
>> Extra stuff such as bank switching and address translation, sure.
>> It's not a lot smaller than just making the damn registers 64 bits
>> wide. Plus you can't just compile standard C code for it but have to
>> muck about with that bank switching. It's awful. Surely the 8086-286
>> taught us that much?
>
> You cannot use a regular ARM compiler for these accesses, yes. As for
> standard C code, a lot of the nasal demons in the C standard come from
> architectures like the 8086, so if you have standard C code (a big if)
> and you have a compiler that maps C to this extended architecture, you
> can just use it.

Huh?

Anton, you're usually a very reasonable guy, but blaming x86 for C
compiler nasal demons is _really_ unfair: This is solely a result of
compiler guys (have to be guys, I can't imagine a woman being that
obtuse) having the bright idea that they could use any borderline (or by
some kind of logic, over the border) code as a reason to "optimize" it
by removing it, including all code that it would touch.

Create warnings, yes. Even stop compiling (but allow a #pragma/option to
override), but don't simply remove code.

already...@yahoo.com

Nov 10, 2019, 8:01:07 AM
to
On Sunday, November 10, 2019 at 1:27:52 PM UTC+2, Ivan Godard wrote:
On the M3, load and store share an AGU. On the M0 I am sure that even more unit sharing is going on, at a cost in IPC but giving a smaller area.

M3 has no branch predictor. Due to short pipeline the impact is not that bad.

Personally I played with implementing a MIPS-like ISA with a register file that has only 1 read port and 1 write port. It's doable, but IPC is not great. IIRC the best I got was slightly worse than 1/3, but I am not a pro in the field. I'd guess that a seasoned pro can get close to 0.4.
Now, the Cortex-M3 is *not* like that. I would think that it has a register file with 3 read ports. I have no idea about the Cortex-M0.


>
> If you can live within a 4GB address space Mill can be configured as a
> 32-bit core. I'm sure someone will try to do a 16-bit Mill someday too,
> although I doubt its usefulness.
>

Bruce specified 64-bit as a hard requirement.

Anton Ertl

Nov 10, 2019, 9:40:07 AM
to
Terje Mathisen <terje.m...@tmsw.no> writes:
>Anton Ertl wrote:
>> Bruce Hoult <bruce...@gmail.com> writes:
>>> On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
>>>> And the other competition is a 32-bit core (say, Cortex-M0) with some
>>>> extra stuff to allow accessing more memory.
>>>
>>> Extra stuff such as bank switching and address translation, sure. It's not =
>>> a lot smaller than just making the damn registers 64 bits wide. Plus you ca=
>>> n't just compile standard C code for it but have to muck about with that ba=
>>> nk switching. It's awful. Surely the 8086-286 taught us that much?
>>
>> You cannot use a regular ARM compiler for these accesses, yes. As for
>> standard C code, a lot of the nasal demons in the C standard come from
>> architectures like the 8086, so if you have standard C code (a big if)
>> and you have a compiler that maps C to this extended architecture, you
>> can just use it.
>
>Huh?
>
>Anton, you're usually a very reasonable guy, but blaming x86 for C
>compiler nasal demons is _really_ unfair: This is solely a result of
>compiler guys (have to be guys, I can't imagine a woman being that
>obtuse) having the bright idea that they could use any borderline (or by
>some kind of logic, over the border) code as a reason to "optimize" it
>by removing it, including all code that it would touch.

That's a good point. Architectures like the 8086 (there is no x86
architecture) are the reason for certain undefined behaviour in the C
standard, but it takes *censored* compiler maintainers to turn this
undefined behaviour into nasal demons on architectures like AMD64
64-bit mode that do not have the architectural properties that
inspired this undefined behaviour.

Anyway, my point was that the C standard is prepared for such
architectures. A lot of C code isn't, though, and I wanted to point
out that difference.

BGB

Nov 10, 2019, 10:18:40 AM
to
On 11/9/2019 12:38 PM, EricP wrote:
> Brett wrote:
>>
>> PC relative loads are evil.
>
> You seem to be under the impression that PC relative addressing
> is only used for loading program constants. It is not.
> It can be used to access variable data as PC relative too.
>

Yeah.

Doing a 24-bit displacement or similar relative to PC is usable for
accessing global variables in most cases. The main exception is if one
needs a NoMMU ABI, where, in order to have multiple instances of a given
program binary, it is necessary to have a means of accessing globals
which is not tied to the loaded address of the program's binary (such as
accessing data sections via a "Global Register" or similar).


One traditional solution to this was "FDPIC", where the GOT is used to
access all global variables and call all functions, and GOT entries
contain the GOT to load for called functions (function pointers are
essentially handled as pointers into the GOT; so function-address and
target GOT are loaded indirectly), ...

The problem here is that this comes with a bit of a performance overhead
relative to using direct base-relative loads and "bare" function calls.
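That overhead can be sketched in a few lines. This is a toy model of the convention described above, with all names and addresses made up; it only shows the double indirection and the per-instance globals, not any real FDPIC ABI layout:

```python
# Model of an FDPIC-style call: a "function pointer" is really a pointer
# to a descriptor holding (code address, callee's GOT), so each NoMMU
# instance of one binary gets its own data even though code is shared.
def make_instance(got_base):
    got = {"counter_slot": got_base + 0x10}    # GOT: name -> data address
    memory = {got["counter_slot"]: 0}          # this instance's data section

    def bump():                                # callee reaches data via its GOT
        memory[got["counter_slot"]] += 1
        return memory[got["counter_slot"]]

    return (bump, got)                         # the function descriptor

desc_a = make_instance(0x1000)                 # two instances of one binary
desc_b = make_instance(0x8000)
code, callee_got = desc_a                      # a call must load BOTH fields
assert code() == 1 and code() == 2
assert desc_b[0]() == 1                        # instances don't share globals
```

Every indirect call pays two loads (code address and GOT) before the jump, which is the performance overhead versus a bare PC-relative call.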


>> Means you have to track the PC pointer in the pipeline.
>
> The PC is already flowing through the pipeline from Fetch
> as you need it for branch relative & exceptions anyway.
>
> No special tracking is required if it copies the PC in as an
> operand when it queues the instruction at the front of the EX pipeline.
> Which it knows to do because you used PC-rel addressing.
>

Generally true (ignoring cases where people try to use the PC being used
for instruction-fetch for this).


>> Pollutes the L1 data cache with L1 instruction data, wasting capacity.
>
> Sort of. On say Alpha the coding standard had the entry point for a
> routine start on a cache boundary. Prior to the entry was the table
> of constants loaded using what was effectively PC-rel addressing.
>
> The constants cache line(s) should not wind up resident in both
> I$L1 and D$L1, so you don't get a tug-of-war or ping-pong for ownership.
>
> But yes, the constant line does have to be loaded into D$L1.
>
> Note though that some of those constants could be 64-bits,
> including linkage addresses, which on many RISC ISA's would have
> taken 4 instructions to "load immediate" rather than 1 LD PC-rel.
>

There is a trade-off here: while 4 > 1, the actual difference in clock
cycles and memory use is smaller than this.

More so, the actual need to load full-width 64-bit constants seems
to be rare in practice.


A possible ugly hack could be a "jumbo-load" in my ISA, eg:
F4jj_Cjjj F4jj_Cjjj FAjj_jjjj

Which is an otherwise illegal encoding, but could be interpreted as a
single-cycle 64-bit load in 96 bits. There would be a few bits left per
instruction-word in the opcode bits, which could be used to encode a
destination register. This "instruction" would essentially hijack all 3
execute lanes at the same time.

Implicitly, it will require 3-wide WEX support in order to work.

So, a way of encoding:
MOV #imm64, Rn

The open question here, is if there is really enough need for this to
justify using such an ugly hack...
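The bit budget of the sketched encoding does work out, treating each 'j' nibble of the three instruction words as immediate payload (this just counts bits in the encoding as written above):

```python
# The three 32-bit words of the proposed jumbo-load, with 'j' marking
# immediate nibbles: 16 j-nibbles supply exactly the 64 immediate bits,
# leaving the remaining nibbles for opcode and destination-register bits.
words = ["F4jjCjjj", "F4jjCjjj", "FAjjjjjj"]
imm_bits   = sum(w.count("j") for w in words) * 4
total_bits = sum(len(w) for w in words) * 4
print(imm_bits, total_bits)   # 64 96
```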


Went and added this as an experimental feature, somehow still manages to
pass timing. Does increase CPU core cost by around 2k LUTs though, so it
seems to be a relatively expensive "instruction"...


> Also I have not seen a RISC ISA with equivalent for float load immediate.
> So that may be the only way to load float constants.
>

My BJX1 ISA did, sorta, in the form of a direct-immediate Half-Float load.

In BJX2, many "typical" floating-point constants end up being loaded as
a two-op sequence, frequently:
LDIZ imm24, DLR
SHLD DLR, #40, Rn

Though, with GPRs being used for FPU ops.


>> It’s an extra load, when you may be load limited.
>> The load may delay the pipeline.
>> The load adds to your load queue which may be your limit.
>> The load is more crap to track which reduces the effective size of
>> your OoO
>> window.
>> Even an ordinary load from a generic pointer is better than PC loads.
>
> An ISA needs PC-rel for position independent, self relative data.
> On fixed length RISC, it may be better than 4 load integer immediates.
> On fixed length RISC, its probably the only way to load float immediates.
>

Yeah.

Also doing stuff in 64-bit land via 64-bit absolute addressing is kinda
stupid and expensive...

Better to do relative branches/calls whenever possible, and access
globals/etc relative to either PC or a base pointer.

already...@yahoo.com

Nov 10, 2019, 10:34:30 AM
to
8086, 8088 and 80186 are sufficiently similar to each other to be called a single architecture.

> are the reason for certain undefined behaviour in the C
> standard,

You mean, like pointers and 'unsigned int' not guaranteed to be of the same size and data pointers not guaranteed to be of the same size as function pointers?
I think, from "nasal demons" perspective those are the least problematic UBs in the 'C' standard.

EricP

Nov 10, 2019, 12:04:39 PM
to
BGB wrote:
> On 11/9/2019 12:38 PM, EricP wrote:
>> Brett wrote:
>>>
>>> PC relative loads are evil.
>>
>> You seem to be under the impression that PC relative addressing
>> is only used for loading program constants. It is not.
>> It can be used to access variable data as PC relative too.
>
> Yeah.
>
> Doing a 24-bit displacement or similar relative to PC is usable for
> accessing global variables in most cases. Main exception is if one needs
> a NoMMU ABI, where in order to have multiple instances of a given
> program binary, it is necessary to have a means of accessing globals
> which is not tied to the loaded address of the program's binary (such as
> accessing data sections via a "Global Register" or similar).
>
>
> One traditional solution to this was "FDPIC", where the GOT is used to
> access all global variables and call all functions, and GOT entries
> contain the GOT to load for called functions (function pointers are
> essentially handled as pointers into the GOT; so function-address and
> target GOT are loaded indirectly), ...
>
> The problem here is that this comes with a bit of a performance overhead
> relative to using direct base-relative loads and "bare" function calls.

Using PC-rel efficiently, in other words without all the GOT bullshit,
requires that the OS, when loading/mapping an image file,
relocate the memory sections as a single unit.
That is, that rx-code, ro-data, rw-init, rw-zero
memory sections all maintain their relative positions.

I call such a unit a Region: a contiguous set of one or more memory
sections (aka segments) of different sizes and protections.
Each Region has a flag to enable/disable its relocation.
If Region relocation is allowed and needed, all its sections/segments
are relocated together, maintaining their relative offsets.
The image file specifies a set of one or more memory regions,
which specifies contiguous sets of sections/segments.

The OS image mapper looks for a memory hole big enough for the
whole region, then creates the set of section/segments in it.

It would be a minor tweak to retrofit the above behavior into existing OS's.

The one issue that comes up is the size of the offset,
which will not be known until linkage.
Ideally the compiler assumes a 64-bit offset, the linker
discovers it is actually smaller, and compacts the code.
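The key invariant can be shown in a toy loader. Section names and sizes here are made up; the point is only that one base relocation preserves all intra-region offsets:

```python
# Sketch of the "Region" idea: the loader picks one base for the whole
# region, so every section keeps its region-relative offset and PC-rel
# references from code to data remain valid wherever the region lands.
sections = [  # (name, region-relative offset, size)
    ("rx-code", 0x0000, 0x4000),
    ("ro-data", 0x4000, 0x1000),
    ("rw-init", 0x5000, 0x0800),
    ("rw-zero", 0x5800, 0x0800),
]

def map_region(base):
    """Place every section of the region at base + its fixed offset."""
    return {name: base + off for name, off, _ in sections}

a = map_region(0x10_0000)
b = map_region(0x80_0000)    # relocated elsewhere: same internal layout
# PC-relative distance from code to data is invariant under relocation:
assert a["ro-data"] - a["rx-code"] == b["ro-data"] - b["rx-code"] == 0x4000
```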

>>> Means you have to track the PC pointer in the pipeline.
>>
>> The PC is already flowing through the pipeline from Fetch
>> as you need it for branch relative & exceptions anyway.
>>
>> No special tracking is required if it copies the PC in as an
>> operand when it queues the instruction at the front of the EX pipeline.
>> Which it knows to do because you used PC-rel addressing.
>>
>
> Generally true (ignoring cases where people try to use the PC being used
> for instruction-fetch for this).

I don't understand that statement.

> There is a trade-off here, while 4>1, the actual difference in clock
> cycles and memory use is smaller than this.
>
> More so, the actual need to load full width 64-bit bit constants seems
> to be rare in practice.
>
>
> A possible ugly hack could be a "jumbo-load" in my ISA, eg:
> F4jj_Cjjj F4jj_Cjjj FAjj_jjjj
>
> Which is an otherwise illegal encoding, but could be interpreted as a
> single-cycle 64-bit load in 96 bits. There would be a few bits left per
> instruction-word in the opcode bits, which could be used to encode a
> destination register. This "instruction" would essentially hijack all 3
> execute lanes at the same time.

I had a similar method in my gedanken-design RISC ISA.
Except I called it a feature :-)

I tossed out one of the dogma tenets of RISC,
that register fields must be single purpose, dest or source,
and had a register field that was both source and dest.

RISC-ish 32 bit fixed size instruction, 6 bit major opcode,
5 bit register number.

3 instructions for load immediate, with 6-bit opcode, 5-bit reg,
21-bit immediate: the opcode specifies whether to 1 or 0 fill
the high bits on the left, providing the crucial extra bit
so 64 bits can be loaded in 3 instructions as 1+21+21+21 bits.

LDI0 reg, imm21 // Load reg = 21-bit immediate with 0 filled on left
LDI1 reg, imm21 // Load reg = 21-bit immediate with 1 filled on left
LDIS reg, imm21 // Load Scaled reg = (reg << 21) | imm21

These are fetched as normal, fixed size instructions,
and it costs 12 bytes and a register to load a 64-bit immediate.
(People often forget that it costs a register too.)
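The 1+21+21+21 split can be checked executably. This models only the semantics described above (fill choice plus two shift-in steps), not any real instruction encoding:

```python
# LDI0/LDI1 supply bit 63 via the fill choice plus the top 21 payload
# bits; two LDIS instructions shift in 21 bits each: 1+21+21+21 = 64.
MASK21 = (1 << 21) - 1
M64 = (1 << 64) - 1

def ldi0(imm21):      return imm21 & MASK21                      # 0-fill left
def ldi1(imm21):      return (M64 & ~MASK21) | (imm21 & MASK21)  # 1-fill left
def ldis(reg, imm21): return ((reg << 21) | (imm21 & MASK21)) & M64

def load_imm64(x):
    x &= M64
    hi, mid, lo = (x >> 42) & MASK21, (x >> 21) & MASK21, x & MASK21
    reg = ldi1(hi) if (x >> 63) else ldi0(hi)   # opcode choice = bit 63
    reg = ldis(reg, mid)
    return ldis(reg, lo)

for v in (0, 1, 0xDEADBEEFCAFEF00D, M64, 1 << 63):
    assert load_imm64(v) == v
```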

> Implicitly, it will require 3-wide WEX support in order to work.
>
> So, a way of encoding:
> MOV #imm64, Rn
>
> The open question here, is if there is really enough need for this to
> justify using such an ugly hack...
>
>
> Went and added this as an experimental feature, somehow still manages to
> pass timing. Does increase CPU core cost by around 2k LUTs though, so it
> seems to be a relatively expensive "instruction"...

The other thing that immediates can do is operate directly,
skipping loading into a register first.
reg = reg op imm

Eliminates instruction(s) and saves a register to load the immediate.

>>> It’s an extra load, when you may be load limited.
>>> The load may delay the pipeline.
>>> The load adds to your load queue which may be your limit.
>>> The load is more crap to track which reduces the effective size of
>>> your OoO
>>> window.
>>> Even an ordinary load from a generic pointer is better than PC loads.
>>
>> An ISA needs PC-rel for position independent, self relative data.
>> On fixed length RISC, it may be better than 4 load integer immediates.
>> On fixed length RISC, its probably the only way to load float immediates.
>
> Yeah.
>
> Also doing stuff in 64-bit land via 64-bit absolute addressing is kinda
> stupid and expensive...
>
> Better to do relative branches/calls whenever possible, and access
> globals/etc relative to either PC or a base pointer.

I just didn't want anyone thinking they can decommission
PC-rel addressing, especially on philosophical or moral grounds.



MitchAlsup

Nov 10, 2019, 12:06:35 PM
to
On Saturday, November 9, 2019 at 10:34:03 PM UTC-6, Bruce Hoult wrote:
> On Saturday, November 9, 2019 at 7:48:35 PM UTC-8, MitchAlsup wrote:
> > On Saturday, November 9, 2019 at 7:12:22 PM UTC-6, Bruce Hoult wrote:
> > > In My 66000, may I refer you to:
> > >
> > > BSR JSR r31 is an implicit destination!!!
> >
> > Actually it's R0 that is implicit
> > >
> > > BB BC BR BSR use r0 as an implicit base register for the offset.
> >
> > It uses IP which is not an GPR.
>
> Hmm .. maybe I have misinterpreted the following.
>
> http://hoult.org/my66000_regs.png
> http://hoult.org/my66000_reg_org.png

The text remains correct, the picture has changed.
R0 gets the value of the return address
R0 is a proxy for IP when used as a base register
R0 is a proxy for 0x0 when used as an index register
IP moved to a different control block.
>
>
> > > All the above use r0 as an implicit destination.
> > >
> > > The use of the PC as an implicit destination for jumps/branches and as an implicit source for relative branches is of course common to most ISAs, including RISC-V. It is, however, a choice and you *could* do things in another way.
> > >
> > > What happens in My 66000 if you do ADD r0,r0,#48 ?
> >
> > R0 is not in any way weird, R0 holds values just like any other register.
> > There really ARE 32 64-bit registers. R0 gets the return address on CALLs,
> > and R0 gets substituted on certain instruction decodings.
>
> The document I have shows r31 getting the return address. See both the above screenshots.

You have an old copy.

>
> Feel free to shoot me a newer document at br...@hoult.org
>
>
> > Now, in certain instructions, specifying R0 cause the HW to substitute some other reasonable value, Rb = R0 cause IP to be substituted, Ri causes ZERO
> > to be substituted.
>
> Ok, I see. So in other contexts .. in particular arithmetic, r0 is a regular register. It's only in certain contexts that IP or ZERO is substituted, and the original data value is still there for arithmetic and other things afterwards.

Yes, an encoding trick.
>
> That seems like .. a lot of complexity .. also for compiler and other tool writers .. just to get sorta kinda 34 registers instead of 32 :-)

33 plus zero.

MitchAlsup

Nov 10, 2019, 12:11:07 PM
to
On Sunday, November 10, 2019 at 4:35:22 AM UTC-6, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
> >> Bruce Hoult <bruce...@gmail.com> writes:
> >> >Don't agree on that. Aarch64 has some brilliant bits, but it also
> >> >has some decisions which I think will prove to be very
> >> >short-sighted.
> >>
> >> Such as?
> >
> >Such as the fixed size 32 bit instructions. ARM invented the modern
> >RISC with both 16 bit and 32 bit instructions, beating every other
> >ISA style for code density on compiled C code. It's what propelled
> >them to superstardom.

Err, no. Clipper had 16-bit and 32-bit instructions.
>
> Doubtful. I think that ARM succeeded by focussing on the embedded and
> the low-power market rather than trying to compete at the high end
> during the GHz race years. And they succeeded by developing a
> business model for this market: selling design and architecture
> licenses rather than selling chips.
>
> >Astounding that they abandoned it for 64 bit at the cost of 30% code
> >density compared to their own 32 bit ISA -- and to our 64 bit one.
>
> They obviously don't consider it worth the complication for 64-bit
> systems. Makes sense to me. They have an ISA for low-end systems:
> T32. No need to complicate A64 (Aarch64) with that. Note that
> ARMv8-M supports "a variant of" T32, not A64.
>
> >At the same time, they're likely to run out of opcode space for
> >extensions in the foreseeable future, as MIPS did.
>
> Can you make the "foreseeable" more concrete? Which year? Or what
> ISA extension do you foresee that will make them run into this
> problem?
>
> >> Why would one want a 64-bit core that is lower-end than a Cortex-A34?
> >> The silicon (and thus cost) of the A34 is tiny compared to the silicon
> >> of 4GB of DRAM or even flash, so having something even smaller does
> >> not save much money.
> >
> >The only justification I *need* to give is that customers are lining
> >up to write six figure cheques to license such cores.

6 figure checks keep a CPU design team busy for 1 month.

Anton Ertl

Nov 10, 2019, 1:06:43 PM
to
already...@yahoo.com writes:
>On Sunday, November 10, 2019 at 4:40:07 PM UTC+2, Anton Ertl wrote:
>> That's a good point. Architectures like the 8086 (there is no x86
>> architecture)
>
>8086, 8088 and 80186 are sufficiently similar to each other to be called a single architecture.

Yes, I call it the 8086 architecture. And the 286 real mode is an
extension of this architecture. But the "x86" term has been used for
the 16-bit, 32-bit and 64-bit instruction sets (all incompatible), as
if they were one ISA.

>> are the reason for certain undefined behaviour in the C
>> standard,
>
>You mean, like pointers and 'unsigned int' not guaranteed to be of the same size and data pointers not guaranteed to be of the same size as function pointers?

The concrete thing I was thinking about was that comparing pointers to
different objects with <, <=, >, >= is undefined. But sure, this is
probably not the only undefined behaviour that was introduced to
accommodate the 8086 (or maybe earlier architectures with similar
characteristics).

>I think, from "nasal demons" perspective those are the least problematic UBs in the 'C' standard.

Why do you think so?

1) Because C compilers do not use these UBs for nasal demons yet?

2) Because you don't know programs that exhibit these UBs?

There are >200 UBs in C11, and probably quite a few that fall under
1).

BGB

Nov 10, 2019, 1:44:06 PM
I am gradually making progress towards something usable on a Nexys A7;
As-is, I can boot to a command-line, type commands, and get Doom or
Quake to start loading...

Currently though, Doom just sort of reaches the title screen and
crashes. Quake gets to the console, tries to start playing the E1M3
demo, and triggers a breakpoint.


I can see from debug messages and sanity-check functions that the
"math.h" functions are currently generally producing garbage, so
debugging is still a bit of an ongoing process...

I find and fix bugs, but then find that things are still broken. It is
at the stage of things where the bugs that remain are generally off in
edge cases, like instructions failing with specific inputs, or
discovering an edge case where MMIO ops failed if the load/store
instructions directly followed each other, ...


Code density is generally similar to SuperH for size-optimized binaries.
This will come at a performance cost, but makes sense for things where
code-density matters more than speed.

Enabling WEX comes with a roughly 38% code density penalty (as the
generated program then consists mostly of 32-bit instructions, rather
than mostly of 16-bit instructions in the size-optimized case).

Accurate comparison with other architectures is made harder by lacking a
good way to do "apples for apples" comparisons (need to control for
things like which C library is used and how it is linked, ...).


Generally, it seems "acceptable" though.

Niklas Holsti

Nov 10, 2019, 1:54:18 PM
On 2019-11-10 13:27, Ivan Godard wrote:

> If you can live within a 4GB address space Mill can be configured as a
> 32-bit core.

I am surprised -- I seem to remember statements that the Mill needs 64
bits because pointers need some flag/control bits in addition to the
virtual address bits, and also because large areas of the 64-bit virtual
address space are assigned to specific uses. Could you describe the
possible 32-bit Mill in a little more detail, especially on these
points: the structure of pointers, and the allocation of virtual address
space?

--
Niklas Holsti
niklas holsti tidorum fi
. @ .

Ivan Godard

Nov 10, 2019, 4:20:58 PM
The three GC support bits might be omitted, and would be if the customer
decided that a conventional GC (or none) would do. Likewise bounded
memory descriptors could be omitted; any array bounds checking would
then be in software, as in non-Mill ISAs.

The minimal Mill described has no TLB and (like its big brethren) uses a
static address space; consequently all processes must share the 4GB
space. The number of supported threads and the amount of space reserved
for stack fragments is part of the config. We have never played with a
32-bit config to tune those numbers, but as a wild guess for a tuning
starting point I'd start with 256 threads and 1MB. However this
*heavily* depends on what the customer was going to use it for. A
dedicated IOT config might have only 64k physical RAM, and would config
as 1 thread and 4k stack.

We are certainly not going to make anything like these ourselves.
However, the config system would handle this and NRE talks...

Bruce Hoult

Nov 10, 2019, 4:25:41 PM
On Sunday, November 10, 2019 at 2:35:22 AM UTC-8, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >Such as the fixed size 32 bit instructions. ARM invented the modern
> >RISC with both 16 bit and 32 bit instructions, beating every other ISA
> >style for code density on compiled C code. It's what propelled them to
> >superstardom.
>
> Doubtful. I think that ARM succeeded by focussing on the embedded and
> the low-power market rather than trying to compete at the high end
> during the GHz race years. And they succeeded by developing a
> business model for this market: selling design and architecture
> licenses rather than selling chips.

They did that also, of course.

But Thumb got ARM into mobile devices (especially ones with 16 bit memory bus), while Thumb2 let them *own* mobile.


> >Astounding that they abandoned it for 64 bit at the cost of 30% code
> >density compared to their own 32 bit ISA -- and to our 64 bit one.
>
> They obviously don't consider it worth the complication for 64-bit
> systems. Makes sense to me. They have an ISA for low-end systems:
> T32. No need to complicate A64 (Aarch64) with that. Note that
> ARMv8-M supports "a variant of" T32, not A64.

Again the assumption, contradicted by what we see, that there is no demand for 64 bit at the low end.


> >At the same time, they're likely to run out of opcode space for
> >extensions in the foreseeable future, as MIPS did.
>
> Can you make the "foreseeable" more concrete? Which year? Or what
> ISA extension do you foresee that will make them run into this
> problem?

It's unknowable.

Do you want to use the same basic ISA forever? The part that compilers use to generate code from medium to high level programming languages. Is it practical to think you could?

We believe that enough has been learned about computer ISA design in the last 70 years to answer "yes" to those questions.


> >> Why would one want a 64-bit core that is lower-end than a Cortex-A34?
> >> The silicon (and thus cost) of the A34 is tiny compared to the silicon
> >> of 4GB of DRAM or even flash, so having something even smaller does
> >> not save much money.
> >
> >The only justification I *need* to give is that customers are lining up
> >to write six figure cheques to license such cores.
>
> That does not answer my question at all. And it might be that your
> license is just cheaper than whatever ARM is charging (and as for the
> customers paying you more than for free cores, they may not be
> comfortable with free stuff for CYA reasons).

That might explain how we sell low end 32 bit cores or high end 64 bit ones. It doesn't explain why we're selling a lot of low end 64 bit cores, which is the current subject. ARM doesn't compete in that space at all.


> >There is a *ton* of customer demand for cores with basically a Cortex
> >M0 microarchitecture and code size but with 64 bit registers.
>
> Apparently not enough for ARM to actually do that. Such a core should
> be relatively easy for ARM to do.

You'd think so, yes. Except they, so far, refuse to subset their 64 bit ISA. And are completely hosed on code density even if they do.


> >> And the other competition is a 32-bit core (say, Cortex-M0) with some
> >> extra stuff to allow accessing more memory.
> >
> >Extra stuff such as bank switching and address translation, sure. It's
> >not a lot smaller than just making the damn registers 64 bits wide. Plus
> >you can't just compile standard C code for it but have to muck about
> >with that bank switching. It's awful. Surely the 8086-286 taught us that
> >much?
>
> You cannot use a regular ARM compiler for these accesses, yes. As for
> standard C code, a lot of the nasal demons in the C standard come from
> architectures like the 8086, so if you have standard C code (a big if)
> and you have a compiler that maps C to this extended architecture, you
> can just use it.

You could make it work, sure, at a pretty big cost in speed and code size reloading segment registers for every "far pointer" dereference. We've been there. We know it sucks. No one missed the 286 once the 386 was out and NT/Linux appeared (which admittedly took a *long* time).


> However, my idea was that the 64-bit accesses would only be used in a
> few routines, and that the programmer would muck around explicitly
> with the banks there.

You could make it work, yes.


> >ARM won't sell you a 64 bit core without SIMD.
> >ARM won't sell you a 64 bit core without FP.
> >ARM won't sell you a 64 bit core without multiply&divide.
> >ARM won't sell you a 64 bit core with 16 registers.
>
> Good. The 32-bit ISAs of ARM are a mess. If you look up a
> description of an instruction, you find that it's optional, and you
> have no idea how much using that instruction will limit the reach of
> your program. Conversely, the fact that AMD64 guarantees SSE2 was a
> big advantage over IA-32 which did not: programs compiled for AMD64
> could just use SSE2 for FP instead of having to resort to the 387
> instruction set.
>
> So the fact that Aarch64 guarantees all these features is an
> advantage, and RISC-V will feel the disadvantage of having so little
> guaranteed once it tries to expand into the general-purpose computing
> space.

Application processors in servers / desktops / laptops / smartphones are utterly different to deeply embedded tiny processors.

In embedded you know exactly what code you are going to run and can select your ISA features to suit.

When you're running standard software packages that can't and will not happen. *Everyone* concerned agreed years ago that RISC-V processors in general purpose Linux systems will support RV64IMAFDC and software will assume that support. The Debian and Fedora maintainers are onboard with that, along with all the core vendors (SiFive, Andes, Syntacore, lowRISC, Codasip, Esperanto, CloudBEAR, Shakti, the pan-European Supercomputer Initiative etc).

People using Buildroot or OpenEmbedded or something Linux in a situation with a fixed or controlled software load can of course use any ISA subset they want because they're compiling everything themselves.


> >ARM can't sell you a 64 bit core with comparable code density to their
> >own 32 bit cores.
>
> IMO Code density is overrated in this newsgroup.

You obviously haven't seen how actual customers complain if code size for RISC-V comes out 3% or 5% bigger than Thumb2 on some benchmark or inner loop! Even though other benchmarks or loops work out with the advantage in the other direction by a similar amount.


> >It's notable that where a particular 32 bit ARM core has 32 KB of
> >icache, the comparable 64 bit core has 48 KB. That's expensive.
>
> Which cores do you compare?

A15 vs A57 and A72.


> In any case, Intel has lived with 32KB I-caches on their
> high-performance cores for many years and continues to do so in Sunny
> Cove (while increasing the D-cache size), and according to you the
> code density is the same as that of Aarch64, so the 48KB of the A72 in
> the RK3399 is probably just because they could make it that large
> without incurring another cycle of latency.

It's hard to compare with Intel, because they have not only I-cache but also uop cache. It also makes a difference how quickly you can service an L1 cache miss from L2. If the L2 is relatively fast then you can make L1 smaller. A lot of trade-offs there.


> >Some people who want a 64 bit core don't care about large memory
> >addressing at all. They just know that a significant part of energy use
> >goes into instruction fetch and decode and control. Doubling the
> >datapath width can, depending on your workload, nearly halve the number
> >of instructions you have to fetch and decode and execute.
>
> A good reason to use SIMD instructions, then, not a 64-bit core.
> That's if you have SIMD instructions.

SIMD helps certain workloads, of course. And simply having larger registers helps others -- for example you might basically be doing large memcpy, in which case, sure, it's better to have 512 or 256 bit registers than 32 bit registers. That's a heck of a lot of extra silicon though, and you were just trying to tell me that no one would want 64 bit registers instead of 32 bit in deeply embedded :-)


> >> >As for high end .. we shall see. The U84 is only four years behind
> >> >the A72. Last year's U74 was six years behind the A53. That looks
> >> >like rapidly catching up to me.
> >>
> >> ARM has been trying to play catch-up with Intel/AMD in the server
> >> market for quite a while. For now they have not succeeded.
> >
> >It's very different. That market has historically wanted to run legacy
> >x86 applications.
>
> Not really. There are lots of servers running only free software that
> is also available for, e.g., Aarch64.
>
> The issues are not in the applications, but in the fringes:
>
> Booting and therefore installing an OS on an IA-32/AMD64 system is
> more standardized and better streamlined than for Aarch64; e.g., I can
> boot all kinds of systems from the same USB stick or HDD (the
> BIOS->UEFI transition introduced a little fragmentation, however); by
> contrast, I cannot use the SD cards of the Odroid C2, Odroid N2,
> RockPro64, and Raspi4 interchangeably, even though they all support
> Aarch64 and Aarch32.

Agreed. And we're working very hard to make sure booting is standardized across all RISC-V machines.

You inevitably need a little bit of board manufacturer FSBL in ROM just to get clocks set up and DRAM controllers initialized, but then everyone is working on the next stage (such as Das U-boot) being standard, followed by a standard hardware abstraction layer with things such as OpenSBI which the Linux (or other) kernel talks to.


> I have to invest extra work to get at the performance counters of
> these Aarch64 machines, and have not succeeded for the Odroid N2, and
> only partially for the RockPro64 (and have not tried on the Raspi4),
> because the custom kernels they come with are supplied without perf,
> and getting a stock kernel to work on them requires extra work. Of
> course, not that many people need performance counters, but many need
> some other fringe specialty that works out of the box on AMD64 boxes,
> but acts up on Aarch64.

I've been there. I built my own kernel for the Odroid C2 with perf enabled. Wasn't a big deal as Odroid supply the source for their customized kernel.


> The booting issue is also a problem on Alpha and PowerPC (and probably
> others), just read the relevant FAQs. Apparently the natural state of
> booting is fragmented, and it requires dedication to get to a common
> booting system (or now two) as present on AMD64.

We have this dedication :-)


> >To whatever extent ARM does succeed in making the server market
> >ISA-agnostic, we benefit from that too.
>
> Only if you have an ISA rather than a fragmented mess of options.
>
> Fragmentation hurts in general-purpose computing, big time.

At the smartphone through to supercomputer level of machine there is / will be no fragmentation in the base ISA that packaged software expects.

The only issue coming down the road is that there will soon be the Vector extension and older machines in the field (and some new machines) won't support it.

The good news is there will only be *one* standard vector extension and opcode set covering all vector lengths. If someone wants to make a CPU where the vectors are just one element long that will be relatively cheap to do in hardware (the vector register file will be the same size as the scalar FP register file), but exactly the same code will run as on machines with 512 bit or indeed 65536 bit vector registers. Just not any faster than scalar code :-)

But the only fragmentation will be "vector support or not", not N different vector lengths. That's manageable.
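That length-agnostic property can be made concrete with a strip-mined loop. This sketch uses mnemonics from the later-ratified V-extension spec, so treat the exact spellings as an assumption; the point is that `vsetvli` hands back whatever vector length the hardware actually has, so the same binary runs on any implementation:

    # dst[i] = src[i] + 1, for i in 0..n-1
    # a0 = n (elements), a1 = src, a2 = dst
    loop:
        vsetvli t0, a0, e32, m1   # t0 = min(n, this core's vector length)
        vle32.v v0, (a1)          # load t0 32-bit elements
        vadd.vi v0, v0, 1
        vse32.v v0, (a2)          # store t0 elements
        slli    t1, t0, 2         # t1 = t0 * sizeof(int32)
        add     a1, a1, t1
        add     a2, a2, t1
        sub     a0, a0, t0
        bnez    a0, loop          # same code on VLEN=32 or VLEN=65536

On a one-element-per-vector machine `vsetvli` simply returns 1 each trip; the loop degenerates to scalar speed but still computes the same result.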


> >> > And a lot of them want predictable execution times.
> >>
> >> Actually the worst-case execution time of caches is predictable and
> >> better than that of main memory if the cache is designed appropriately
> >> (true LRU replacement is good). I expect that one can also predict
> >> the worst-case with dynamic branch predictors, but I don't know if the
> >> result is better than with static branch prediction.
> >
> >It's not caches vs dynamic RAM. It's caches vs about the same size of
> >static RAM. In many applications a couple of KB is enough.
>
> Then of course you don't need caches. But you also don't need 64-bit
> addressing.

Again, 64 bit addressing is not the only reason to provide 64 bit registers. See your own argument about SIMD registers.

BGB

Nov 10, 2019, 4:35:16 PM
This is generally how it works with PE/COFF, though it also has a
base-relocation table (for any absolute-address fix-ups or similar).

In my current PBO ABI, the Base-Reloc table is used mostly to fixup PBO
indices and similar.


In PBO though, rather than being a single big region, the PE image is
essentially split into two sub-regions:
One which includes ".text" and any constant data sections;
One which includes ".data" and ".bss" and similar.

When a new program instance is created, space is allocated for the data
sections, and the initial contents are copied over by the loader
(excluding ".bss"). This region has a small table of pointers glued on
the front, which are used so that DLLs can get back to their own data
sections.

Plus side: No GOT necessary, so overhead is lower.



> The one issue that comes up is the size of the offset
> which will not be known until linkage.
> Ideally the compiler assumes a 64-bit offset, the linker
> discovers it is actually smaller, and compacts the code.
>

Slightly less of an issue in my case, since my compiler does all the
code-generation all at once; so it can effectively determine in advance
what the worst-case distances are.

The main alternative I would think is to use a "memory model", and
potentially require the user to intervene (via command-line options) if
the linked binary turns out to be much larger than expected.

For function imports/exports, it is generally possible to call through a
stub.

This doesn't cover global variables shared across multiple libraries,
but at least in PE/COFF land this generally isn't allowed as-such.


>>>> Means you have to track the PC pointer in the pipeline.
>>>
>>> The PC is already flowing through the pipeline from Fetch
>>> as you need it for branch relative & exceptions anyway.
>>>
>>> No special tracking is required if it copies the PC in as an
>>> operand when it queues the instruction at the front of the EX pipeline.
>>> Which it knows to do because you used PC-rel addressing.
>>>
>>
>> Generally true (ignoring cases where people try to use the PC being
>> used for instruction-fetch for this).
>
> I don't understand that statement.
>

A few RISC's, rather than capture PC and forward it through the
pipeline, used the PC value which was currently being fed into the IF
stage for address calculations.

The result was that for things like branches, rather than pointing at
the next instruction or similar, the PC used for branch calculations
pointed several words ahead of the current instruction (typically at the
instruction directly following the branch delay slot).


>> There is a trade-off here, while 4>1, the actual difference in clock
>> cycles and memory use is smaller than this.
>>
>> More so, the actual need to load full width 64-bit bit constants seems
>> to be rare in practice.
>>
>>
>> A possible ugly hack could be a "jumbo-load" in my ISA, eg:
>>   F4jj_Cjjj F4jj_Cjjj FAjj_jjjj
>>
>> Which is an otherwise illegal encoding, but could be interpreted as a
>> single-cycle 64-bit load in 96 bits. There would be a few bits left
>> per instruction-word in the opcode bits, which could be used to encode
>> a destination register. This "instruction" would essentially hijack
>> all 3 execute lanes at the same time.
>
> I had a similar method in my gedanken-design RISC ISA.
> Except I called it a feature :-)
>

OK.

I realized I could generalize the design slightly, and it is possible it
could later be used for a few other 64 or 96 bit instruction-forms.

First Word:
* F4zz_Czzz..F4zz_Fzzz: "Jumbo Prefix"

Second Word:
* F0zz_zzzz..F3zz_zzzz: Jumbo64
** Possible 64-bit encoding space
* F4zz_zzzz..F7zz_zzzz: Jumbo96
** Serves as a data word.

Third Word:
* F0zz_zzzz..F3zz_zzzz: Possible 96-bit encoding space
* FAzz_zzzz/FBzz_zzzz: JLDI (Imm64)


The "Jumbo Prefix" here being an otherwise illegal instruction:
A wide-execute branch (Branch ops are only allowed as the last
instruction in a bundle).

The drawback is that making this work required some pretty ugly hacks at
various stages along the pipeline.


> I tossed out one of the dogmatic tenets of RISC,
> that register fields must be single purpose dest or source
> and had a register field that was both source and dest.
>

Most ops in my case use 1 or 2 source registers, and a destination
register. A few ops read from all three registers, but have some special
restrictions when executed in parallel (so that all the instructions can
be handled individually as 2R+1W), mostly by sharing some of the
register ports between execute lanes.


> RISC-ish 32 bit fixed size instruction, 6 bit major opcode,
> 5 bit register number.
>
> 3 instructions for load immediate, with 6-bit opcode, 5-bit reg,
> 21-bit immediate: the opcode specifies whether to 1 or 0 fill
> the high bits on the left, providing the crutial extra bit
> so 64 bits can be loaded in 3 instruction as 1+21+21+21 bits.
>
> LDI0 reg, imm21  // Load reg = 21-bit immediate with 0 filled on left
> LDI1 reg, imm21  // Load reg = 21-bit immediate with 1 filled on left
> LDIS reg, imm21  // Load Scaled reg = (reg << 21) | imm21
>
> These are fetched as normal, fixed size instructions,
> and it costs 12 bytes and a register to load a 64-bit immediate.
> (People often forget that it costs a register too.)
>

My case, the 32-bit LDI* ops mostly have 16-bit immediate values, apart
from a special case which loads a 24 bit value into a fixed register.


>> Implicitly, it will require 3-wide WEX support in order to work.
>>
>> So, a way of encoding:
>>   MOV #imm64, Rn
>>
>> The open question here, is if there is really enough need for this to
>> justify using such an ugly hack...
>>
>>
>> Went and added this as an experimental feature, somehow still manages
>> to pass timing. Does increase CPU core cost by around 2k LUTs though,
>> so it seems to be a relatively expensive "instruction"...
>
> The other thing that immediates can do is operate directly,
> skipping loading into a register first.
>   reg = reg op imm
>
> Eliminates instruction(s) and saves a register to load the immediate.
>

I have a lot of ops like this, though in the 32-bit instructions most
are limited to 9 or 10 bits:
OP Rm, Imm9, Rn
OP Imm10, Rn
With some as 16 bits:
OP Imm16, Rn


>>>> It’s an extra load, when you may be load limited.
>>>> The load may delay the pipeline.
>>>> The load adds to your load queue which may be your limit.
>>>> The load is more crap to track which reduces the effective size of
>>>> your OoO
>>>> window.
>>>> Even an ordinary load from a generic pointer is better than PC loads.
>>>
>>> An ISA needs PC-rel for position independent, self relative data.
>>> On fixed length RISC, it may be better than 4 load integer immediates.
>>> On fixed length RISC, its probably the only way to load float
>>> immediates.
>>
>> Yeah.
>>
>> Also doing stuff in 64-bit land via 64-bit absolute addressing is
>> kinda stupid and expensive...
>>
>> Better to do relative branches/calls whenever possible, and access
>> globals/etc relative to either PC or a base pointer.
>
> I just didn't want anyone thinking they can decommission
> PC-rel addressing, especially on philosophical or moral grounds.
>

Yeah.

>
>

already...@yahoo.com

Nov 10, 2019, 4:45:06 PM
By my measurements (gcc8 -Os), ~5%. Not great, but not catastrophic.

Bruce Hoult

Nov 10, 2019, 5:24:54 PM
On Sunday, November 10, 2019 at 9:11:07 AM UTC-8, MitchAlsup wrote:
> On Sunday, November 10, 2019 at 4:35:22 AM UTC-6, Anton Ertl wrote:
> > Bruce Hoult <bruce...@gmail.com> writes:
> > >On Saturday, November 9, 2019 at 1:39:29 AM UTC-8, Anton Ertl wrote:
> > >> Bruce Hoult <bruce...@gmail.com> writes:
> > >Don't agree on that. Aarch64 has some brilliant bits, but it also has
> > >some decisions which I think will prove to be very short-sighted.
> > >>
> > >> Such as?
> > >
> > >Such as the fixed size 32 bit instructions. ARM invented the modern
> > >RISC with both 16 bit and 32 bit instructions, beating every other ISA
> > >style for code density on compiled C code. It's what propelled them to
> > >superstardom.
>
> Err., no. Clipper had 16-bit and 32-bit instructions.

As did both IBM 801 and RISC-I. But then it was forgotten for twenty or twenty five years. Maybe it even made sense to do so given the combination of market and technology of the time, I don't know.

Let's say "rediscovered".


> > >> Why would one want a 64-bit core that is lower-end than a Cortex-A34?
> > >> The silicon (and thus cost) of the A34 is tiny compared to the silicon
> > >> of 4GB of DRAM or even flash, so having something even smaller does
> > >> not save much money.
> > >
> > >The only justification I *need* to give is that customers are lining
> > >up to write six figure cheques to license such cores.
>
> 6 figure checks keep a CPU design team busy for 1 month.

Something like that. "Six figures" covers a 10:1 range. Passing 100 design wins was announced in a press release on June 6. I *thought* a considerably higher number had been announced recently but I can't find it (the announcement, not the number) right now, so I won't give it.

You do get to sell the same general CPU design to more than one customer :-) Building automated systems to minimize engineer time in customizing cores and SoCs by letting the customer do that directly is something we've put a lot of effort into.

https://scs.sifive.com/core-designer/

Right now we have:

- 2 series. 2-3 stage pipe, usually no cache, no branch prediction
- 3/5 series. The original 5 stage pipe classic RISC
- 7 series. Dual issue in order
- 8 series. Out of Order

All are available in both 32 and 64 bit, with or without FPU, with or without MMU etc. [3 series was 32 bit, 5 series 64 bit, but for more recent cores we instead distinguish by a letter prefix: E = 32 bit embedded, S = 64 bit embedded, U = 64 bit Linux (etc) capable.]

Bruce Hoult

Nov 10, 2019, 5:31:58 PM
On Sunday, November 10, 2019 at 1:45:06 PM UTC-8, already...@yahoo.com wrote:
> On Sunday, November 10, 2019 at 11:25:41 PM UTC+2, Bruce Hoult wrote:
> > On Sunday, November 10, 2019 at 2:35:22 AM UTC-8, Anton Ertl wrote:
> > > Bruce Hoult <bruce...@gmail.com> writes:
> > > >There is a *ton* of customer demand for cores with basically a
> > > >Cortex M0 microarchitecture and code size but with 64 bit registers.
> > >
> > > Apparently not enough for ARM to actually do that. Such a core should
> > > be relatively easy for ARM to do.
> >
> > You'd think so, yes. Except they, so far, refuse to subset their 64 bit ISA. And are completely hosed on code density even if they do.
>
> By my measurements (gcc8 -Os), ~5%. Not great, but not catastrophic.

On what code?

already...@yahoo.com

Nov 10, 2019, 5:53:51 PM
strtod().
Implementation geared toward small code size. Not as precise as yours, but smaller, esp. tables are much smaller. And probably somewhat faster. The code does not use FP.

Of course, I realize that my test case is too tiny, but it gives an order of magnitude.


Bruce Hoult

Nov 10, 2019, 6:25:20 PM
Yeah, far too small.

Anton came up with what I think is a pretty good methodology.

https://groups.google.com/d/msg/comp.arch/_htwHlLQU54/3jkBnP5dAAAJ

That showed ARM64 39.7%, 27.6%, and 26.5% bigger than ARMHF.

It would be interesting to repeat the analysis on current code and architectures.

already...@yahoo.com

Nov 11, 2019, 3:56:17 AM
I am o.k. with his choice of bash, grep and gzip, but don't like the rest of his method.
I'd prefer that we compile it from source ourselves. Preferably with clang -Os. gcc -Os is unreasonably underoptimized, while gcc -O2 is subject to seemingly random optimization bloat.
If compiling bash from source is too complicated then it can be replaced by simpler common utility.

But I am quite bad at compiling ancient OSS programs (and hate autoconf), so for me it would be a lot of work. I'd think for you it would take a few minutes. I mean, only aarch64 vs RV64C, we don't need the rest.

I would love to add nanomips to the test, but it looks like it's still not supported by clang. I am not sure, but it looks like it's not supported even by gcc trunk.

Terje Mathisen

Nov 11, 2019, 4:44:15 AM
strtod() is actually a very interesting algorithm, particularly when
doing it with integer code only. If nothing else, it would give you a fairly
deep appreciation for what goes into the 754 spec. :-)

Making the code both small and fast, while also being bit-accurate (i.e.
always returning the binary number which is a correctly rounded version
of the infinitely precise decimal value) is NOT trivial. :-)

In the beginning it looks easy:

Just convert the decimal value to an unsigned integer, i.e. put the
decimal place after the last digit, and adjust the exponent by the same
amount.

Now you just scale this exact value by the requisite power of ten, right?

The problems start to occur when that power of ten is largish, either
positive or negative, because you don't want to round any intermediate
results...

Another issue: What if the decimal value has 50 digits (160+ bits),
selected so as to place the final (exponent-adjusted) value very close
to a binary rounding point? In this case there might not be any good
solution except to first try reasonably fast code, and then test the
trailing (rounding) bits and check if we're too close to such a rounding
point, in which case we might have to fall back on arbitrary precision
library code. :-(
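That exact fallback path can be prototyped in a few lines with rationals (Python's `fractions` standing in for the arbitrary-precision library; this is a sketch of the idea, not a production strtod):

```python
from fractions import Fraction

def exact_strtod(digits: str, exp10: int) -> float:
    """Correctly rounded decimal -> binary, the slow-but-exact way.

    Exactly the recipe above: treat the digit string as an integer
    significand with the decimal point moved past the last digit,
    then scale by the (exact) power of ten.  No intermediate result
    is ever rounded; the single rounding happens in the final integer
    division, which CPython rounds to nearest-even."""
    significand = int(digits)
    scaled = Fraction(significand) * Fraction(10) ** exp10
    return scaled.numerator / scaled.denominator

# The naive route rounds twice and can be off by one ulp:
#   float(int(digits)) * 10.0 ** exp10
```

A fast path would detect the dangerous near-halfway inputs by comparing `scaled` against the exact midpoint between the candidate double and its neighbor (via math.nextafter) before trusting its own rounding.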

Niklas Holsti

Nov 11, 2019, 4:48:04 AM
Thanks for your informative answer, Ivan.

So it seems, AIUI, that the Mill architecture can indeed be squeezed
down to 32-bit addresses, although one may lose some of the functions
(GC, protection) and may have to accept some constraints on HW support
for threads and stacks. Perhaps one must also supply a somewhat larger
physical memory than for architectures where the stacks are SW-defined.

I assume that even a minimal Mill would have caches, right? Is there
some minimal size for the caches?

Would you consider a "minimal Mill" targeted for FPGA implementation,
where the users could define the config and thus choose the number of
threads and the stack fragment sizes according to their applications?

Another approach to flexibility could be to configure the number of
threads and the size and mapping of stack fragments at processor reset,
from say some small boot NVM, instead of in the hard config, unless
those numbers have a more profound impact on the rest of the processor
than seems likely to me.

Bruce Hoult

Nov 11, 2019, 5:28:20 AM
I developed small and exactly precise code for conversion to and from strings for the AlcheMo Java to native compiler for pre-Android/iPhone phones back in 2006/7. It actually produces exact round trip results and minimal length string representations, unlike the official Java libraries.

You don't need arbitrary precision code. For IEEE double it is always enough to have 1024+54*2 bits (or some figure very near to that). I don't remember the exact number of bits now, but I do 100% remember that 144 bytes was the next higher multiple of 32 bits.

Ivan Godard

unread,
Nov 11, 2019, 5:33:57 AM11/11/19
to
Note, all these answers are speculative and subject to marketing
considerations.

> So it seems, AIUI, that the Mill architecture can indeed be squeezed
> down to 32-bit addresses, although one may lose some of the functions
> (GC, protection) and may have to accept some constraints on HW support
> for threads and stacks. Perhaps one must also supply a somewhat larger
> physical memory than for architectures where the stacks are SW-defined.
>
> I assume that even a minimal Mill would have caches, right? Is there
> some minimal size for the caches?

No, cache size is configurable and can be zero.

> Would you consider a "minimal Mill" targeted for FPGA implementation,
> where the users could define the config and thus choose the number of
> threads and the stack fragment sizes according to their applications?

We do not expect to productize an FPGA implementation. We will do one,
but for internal use only.

> Another approach to flexibility could be to configure the number of
> threads and the size and mapping of stack fragments at processor reset,
> from say some small boot NVM, instead of in the hard config, unless
> those numbers have a more profound impact on the rest of the processor
> than seems likely to me.

Unclear whether that would be practical. We're unlikely to put any time
into looking further without active customer interest.

already...@yahoo.com

unread,
Nov 11, 2019, 5:41:52 AM11/11/19
to
On Monday, November 11, 2019 at 11:44:15 AM UTC+2, Terje Mathisen wrote:
> already...@yahoo.com wrote:
> > On Monday, November 11, 2019 at 12:31:58 AM UTC+2, Bruce Hoult
> > wrote:
> >> On Sunday, November 10, 2019 at 1:45:06 PM UTC-8,
> >> already...@yahoo.com wrote:
> >>> By my measurements (gcc8 -Os), ~5%. Not great, but not
> >>> catastrophic.
> >>
> >> On what code?
> >
> > strtod(). Implementation geared toward small code size. Not as
> > precise as yours, but smaller, esp. tables are much smaller. And
> > probably somewhat faster. The code does not use FP.
> >
> > Of course, I realize that my test case is too tiny, but it gives an
> > order of magnitude.
>
> strtod() is actually a very interesting algorithm, particularly when
> doing it with integer code only. If nothing else, it would give you a
> fairly deep appreciation for what goes into the 754 spec. :-)
>
> Making the code both small and fast, while also being bit-accurate
> (i.e. always returning the binary number which is a correctly rounded
> version of the infinitely precise decimal value) is NOT trivial. :-)
>

My requirements were not only small code, but also small tables.
As for speed, it ended up faster than most libraries around, but that's only because I am undisciplined. In practice, slow is good enough.
Since I knew up front that bit-accurate is hard, it was not a requirement.
Only relatively short strings (slightly more than 20 significant digits; I don't remember the exact figure) are guaranteed to be rounded correctly. Plus those that consist of a relatively short sequence of significant digits followed by a long sequence of zeros and then another sequence of significant digits. IIRC, another class of inputs that is rounded exactly is inputs with a small absolute value of decimal scale.
Another corner cut: I ignore the current rounding settings.



> In the beginning it looks easy:
>
> Just convert the decimal value to an unsigned integer, i.e. put the
> decimal place after the last digit, and adjust the exponent by the same
> amount.
>
> Now you just scale this exact value by the requisite power of ten, right?
>
> The problems start to occur when that power of ten is largish, either
> positive or negative, because you don't want to round any intermediate
> results...
>
> Another issue: What if the decimal value has 50 digits (160+ bits),
> selected so as to place the final (exponent-adjusted) value very close
> to a binary rounding point? In this case there might not be any good
> solution except to first try reasonably fast code, and then test the
> trailing (rounding) bits and check if we're too close to such a rounding
> point, in which case we might have to fall back on arbitrary precision
> library code. :-(
>

Bruce invented a different approach. According to my understanding, he utilizes existing double-to-string library routines (probably slightly modified to support one extra bit of input) and compares the midpoint between candidate results, converted to a string, against the original input.
This approach looks great when double-to-string is already part of the library, so all the tables are there anyway. My case didn't have that luxury.

Bruce Hoult

unread,
Nov 11, 2019, 6:11:31 AM11/11/19
to
On Monday, November 11, 2019 at 12:56:17 AM UTC-8, already...@yahoo.com wrote:
> On Monday, November 11, 2019 at 1:25:20 AM UTC+2, Bruce Hoult wrote:
> > Anton came up with what I think is a pretty good methodology.
> >
> > https://groups.google.com/d/msg/comp.arch/_htwHlLQU54/3jkBnP5dAAAJ
> >
> > That showed ARM64 39.7% 27.6% 26.5% bigger than ARMHF.
> >
> > It would be interesting to repeat the analysis on current code, and architectures.
>
> I am o.k. with his choice of bash, grep and gzip, but don't like the rest of his method.
> I'd prefer that we compile it from source ourselves, preferably with clang -Os. gcc -Os is unreasonably underoptimized, while gcc -O2 is subject to seemingly random optimization bloat.

While you have a point, the code in the Debian repo represents what large numbers of people actually run on their machines.

Also, gcc for RISC-V is still quite a bit more mature than Clang. Clang is now at a "pretty much feature-complete, everything produces working code" stage, but little or no effort has been put into optimizations -- other than the generic machine-independent ones, of course.


> I would love to add nanomips to the test, but it looks like it's still not supported by clang. I am not sure, but it looks like it's not supported even by gcc trunk.

NanoMIPS looks like a really, really nice ISA effort. Sadly, so far they've only announced one core (I7200, May 2018), it's only 32-bit, and I can't immediately find any evidence that it has ever actually been taped out in a chip.

MIPS looks dead. We've picked up a number of refugees from them, including their former 2011-2017 "Principal Computer Architect Engineer" Andy "crazy" Glew (with a 21-month stint at NVIDIA in between). I expect Andy knows a thing or two about NanoMIPS ... I should ask him. Art Swift left Esperanto (Dave Ditzel's company developing high-performance RISC-V processors) to spearhead and champion the MIPS open-sourcing effort as CEO of Wave Computing. As of a couple of months ago he's back as president of Esperanto, which may not bode well for the open-sourcing of MIPS technologies.

Terje Mathisen

unread,
Nov 11, 2019, 7:31:17 AM11/11/19
to
Obviously not, arbitrary was just my shorthand way of stating that you
might need a lot of bits. :-)

> enough to have 1024+54*2 bits (or some figure very near to that). I
> don't remember the exact number of bits now, but I do 100% remember
> that 144 bytes was the next higher multiple of 32 bits.

OK, that is good to know, thanks!

Did you in fact do the Q&D first approximation and then fall back on
more complicated (144 byte) code only when needed?

Quadibloc

unread,
Nov 11, 2019, 7:56:46 AM11/11/19
to
Some time back, I tried to address this issue. How can one have the benefits of
all the instructions being the same length, and the benefits of variable-length
instructions?

Particularly, I knew Mitch was right, and having full-length immediates in
instructions was beneficial, but that would require, in my opinion a complex
variable-length instruction format instead of just a very simple one.

What I came up with was this:

Fetch 256 bits of instructions at a time. The instruction block contains eight
32-bit instructions all the same length. But there is an indicator that only the
first N instruction slots contain instructions.

Relative pointers that are short, because they *only point within the 256-bit
instruction block* are used for immediates.

This is partly inspired, of course, by the heads and tails instruction format
idea.

John Savard

Bruce Hoult

unread,
Nov 11, 2019, 9:28:55 AM11/11/19
to
On Monday, November 11, 2019 at 4:31:17 AM UTC-8, Terje Mathisen wrote:
> Bruce Hoult wrote:
> > You don't need arbitrary precision code. For IEEE double it is always
>
> Obviously not, arbitrary was just my shorthand way of stating that you
> might need a lot of bits. :-)
>
> > enough to have 1024+54*2 bits (or some figure very near to that). I
> > don't remember the exact number of bits now, but I do 100% remember
> > that 144 bytes was the next higher multiple of 32 bits.
>
> OK, that is good to know, thanks!
>
> Did you in fact do the Q&D first approximation and then fall back on
> more complicated (144 byte) code only when needed?

Yes. If the number being parsed has 15 significant digits or fewer (or even up to 16 digits if less than or equal to 9007199254740991 = 2^53-1) then you can simply convert that integer to an exact double and then multiply or divide by the appropriate power of ten in FP from a table. If your concern is to make commonly used numbers go quickly then the table doesn't have to be very large -- and in fact can't be large because the FP representation of the power of ten should be exact. Note that as powers of 10 end up with the same number of trailing 0s in binary as in decimal you can actually store up to 10^22 exactly as a double.
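A minimal C sketch of the fast path Bruce describes above. The function name, table, and calling convention are my illustrative assumptions, not his actual code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Powers of ten that are exactly representable as IEEE doubles.
   Beyond 1e22 the significand no longer fits in 53 bits. */
static const double pow10_exact[23] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
    1e8,  1e9,  1e10, 1e11, 1e12, 1e13, 1e14, 1e15,
    1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22
};

/* Fast path: when the parsed integer significand is at most 2^53 - 1
   (hence exact as a double) and the decimal exponent is within +/-22,
   a single FP multiply or divide by an exact power of ten yields the
   correctly rounded result.  Returns false when the slow (wide
   integer) path is needed instead. */
static bool strtod_fast_path(uint64_t significand, int dec_exp, double *out)
{
    if (significand > 9007199254740991ULL)   /* 2^53 - 1 */
        return false;
    if (dec_exp >= 0 && dec_exp <= 22) {
        *out = (double)significand * pow10_exact[dec_exp];
        return true;
    }
    if (dec_exp >= -22) {                    /* here dec_exp < 0 */
        *out = (double)significand / pow10_exact[-dec_exp];
        return true;
    }
    return false;
}
```

When this returns false, the caller falls back on the wide-integer code discussed earlier in the thread.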

I presented an email thread from 2007 explaining the idea here in July last year.

https://groups.google.com/d/msg/comp.arch/YxsFvk2ek_s/FbWSE74SCwAJ

Kyle Hayes

unread,
Nov 11, 2019, 1:21:34 PM11/11/19
to
This is an interesting idea. It seems like it is a form of bundling. Where do the bits that are the indicator live? Another variation on this might be to have a set of bits that indicated where instructions started within the cache line or block of data.
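A sketch of the second variation: a side-band mask marking where instructions start within a block of sixteen 16-bit parcels. All names and widths here are my illustrative assumptions:

```c
#include <stdint.h>

/* One 16-bit "start mask" accompanies a 256-bit block of sixteen
   16-bit parcels; bit i set means an instruction begins at parcel i.
   A decoder can use the mask to find every instruction boundary
   without scanning the instruction bytes serially. */
static int instruction_starts(uint16_t start_mask, int starts[16])
{
    int n = 0;
    for (int i = 0; i < 16; i++)
        if (start_mask & (1u << i))
            starts[n++] = i;   /* parcel index where an insn begins */
    return n;                  /* number of instructions in the block */
}
```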

Best,
Kyle

Anton Ertl

unread,
Nov 11, 2019, 1:34:48 PM11/11/19
to
Bruce Hoult <bruce...@gmail.com> writes:
>On Sunday, November 10, 2019 at 2:35:22 AM UTC-8, Anton Ertl wrote:
>> Bruce Hoult <bruce...@gmail.com> writes:
>> >At the same time, they're likely to run out of opcode space for
>> >extensions in the foreseeable future, as MIPS did.
>>
>> Can you make the "foreseeable" more concrete? Which year? Or what
>> ISA extension do you foresee that will make them run into this
>> problem?
>
>It's unknowable.

Then it's obviously not foreseeable.

Maybe ARM are not expecting as many extensions as you do. Or, given
their experience with adding alternative encodings, they plan to add
another encoding when they run out of opcodes.

>Do you want to use the same basic ISA forever? The part that compilers
>use to generate code from medium to high level programming languages.
>Is it practical to think you could?
>
>We believe that enough has been learned about computer ISA design in
>the last 70 years to answer "yes" to those questions.

And?

>> >There is a *ton* of customer demand for cores with basically a
>> >Cortex M0 microarchitecture and code size but with 64 bit registers.
>>
>> Apparently not enough for ARM to actually do that. Such a core should
>> be relatively easy for ARM to do.
>
>You'd think so, yes. Except they, so far, refuse to subset their 64 bit ISA.

A Cortex-M0 with 64-bit registers would not implement a subset of the
A64 instruction set, but a 64-bit variant of the T32 instruction set.

>> So the fact that Aarch64 guarantees all these features is an
>> advantage, and RISC-V will feel the disadvantage of having so little
>> guaranteed once it tries to expand into the general-purpose computing
>> space.
>
>Application processors in servers / desktops / laptops / smartphones
>are utterly different to deeply embedded tiny processors.

Yes. ARM's architecture for "deeply embedded systems" are the
M-profile architectures, such as Armv8-M, which supports a subset of
T32, not A64.

>> IMO Code density is overrated in this newsgroup.
>
>You obviously haven't seen how actual customers complain if code size
>for RISC-V comes out 3% or 5% bigger than Thumb2 on some benchmark or
>inner loop! Even though other benchmarks or loops work out with the
>advantage in the other direction by a similar amount.

If you sell RISC-V on code size, and customers fall for it, they
obviously will complain if you don't deliver.

And sure, for a tiny embedded CPU it is relevant. But for a 64-bit
architecture? Overrated.

>> In any case, Intel has lived with 32KB I-caches on their
>> high-performance cores for many years and continues to do so in Sunny
>> Cove (while increasing the D-cache size), and according to you the
>> code density is the same as that of Aarch64, so the 48KB of the A72 in
>> the RK3399 is probably just because they could make it that large
>> without incurring another cycle of latency.
>
>It's hard to compare with Intel, because they have not only I-cache but
>also uop cache.

They only have that since the Sandy Bridge (2011), but the 32KB
I-cache was already there in the Conroe (2006).

Also, given that the uop cache makes the I-cache second-level, one
would expect a larger I-cache, but this has not happened (and in Zen2
the uop cache has grown, while the I-cache has shrunk).

>> A good reason to use SIMD instructions, then, not a 64-bit core.
>> That's if you have SIMD instructions.
>
>SIMD helps certain workloads, of course. And simply having larger
>registers helps others -- for example you might basically be doing
>large memcpy, in which case, sure, it's better to have 512 or 256 bit
>registers than 32 bit registers.

Neon has 128-bit registers. It's not obvious what Helium (Armv8.1-M
vector extension) has.

>That's a heck of a lot of extra silicon though, and you were just
>trying to tell me that no one would want 64 bit registers instead of 32
>bit in deeply embedded :-)

No, I was making a point about 64-bit architectures, i.e., 64-bit
addresses.

For memcpy, T32 already has LDM and STM, no need for 64-bit registers
or SIMD. But if you want to do more than memcpy, and want to save on
instruction processing power, SIMD might help; and ARMv8.1-M has that
as an option.

>> I have to invest extra work to get at the performance counters of
>> these Aarch64 machines, and have not succeeded for the Odroid N2, and
>> only partially for the RockPro64 (and have not tried on the Raspi4),
>> because the custom kernels they come with are supplied without perf,
>> and getting a stock kernel to work on them requires extra work. Of
>> course, not that many people need performance counters, but many need
>> some other fringe specialty that works out of the box on AMD64 boxes,
>> but acts up on Aarch64.
>
>I've been there. I built my own kernel for the Odroid C2 with perf
>enabled. Wasn't a big deal as Odroid supply the source for their
>customized kernel.

It was certainly a lot more action than needed on any AMD64 system.

Terje Mathisen

unread,
Nov 11, 2019, 3:58:15 PM11/11/19
to
Bruce Hoult wrote:
> On Monday, November 11, 2019 at 4:31:17 AM UTC-8, Terje Mathisen
> wrote:
>> Bruce Hoult wrote:
>>> You don't need arbitrary precision code. For IEEE double it is
>>> always
>>
>> Obviously not, arbitrary was just my shorthand way of stating that
>> you might need a lot of bits. :-)
>>
>>> enough to have 1024+54*2 bits (or some figure very near to that).
>>> I don't remember the exact number of bits now, but I do 100%
>>> remember that 144 bytes was the next higher multiple of 32 bits.
>>
>> OK, that is good to know, thanks!
>>
>> Did you in fact do the Q&D first approximation and then fall back
>> on more complicated (144 byte) code only when needed?
>
> Yes. If the number being parsed has 15 significant digits or fewer
> (or even up to 16 digits if less than or equal to 9007199254740991 =
> 2^53-1) then you can simply convert that integer to an exact double
> and then multiply or divide by the appropriate power of ten in FP
> from a table. If your concern is to make commonly used numbers go
> quickly then the table doesn't have to be very large -- and in fact
> can't be large because the FP representation of the power of ten
> should be exact. Note that as powers of 10 end up with the same
> number of trailing 0s in binary as in decimal you can actually store
> up to 10^22 exactly as a double.

Very nice indeed, seems like a good solution.

>
> I presented an email string from 2007 explaining the idea here in
> July last year.
>
> https://groups.google.com/d/msg/comp.arch/YxsFvk2ek_s/FbWSE74SCwAJ
>
noted.

BGB

unread,
Nov 11, 2019, 10:22:28 PM11/11/19
to
On 11/11/2019 11:38 AM, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
>> On Sunday, November 10, 2019 at 2:35:22 AM UTC-8, Anton Ertl wrote:
>>> Bruce Hoult <bruce...@gmail.com> writes:
>>>> At the same time, they're likely to run out of opcode space for
>>>> extensions in the foreseeable future, as MIPS did.
>>>
>>> Can you make the "foreseeable" more concrete? Which year? Or what
>>> ISA extension do you foresee that will make them run into this
>>> problem?
>>
>> It's unknowable.
>
> Then it's obviously not foreseeable.
>
> Maybe ARM are not expecting as many extensions as you do. Or, given
> their experience with adding alternative encodings, they plan to add
> another encoding when they run out of opcodes.
>

I would not be entirely surprised if 64-bit ARM later grows some new
appendages to allow both smaller and larger instruction encodings...


>> Do you want to use the same basic ISA forever? The part that compilers
>> use to generate code from medium to high level programming languages.
>> Is it practical to think you could?
>>
>> We believe that enough has been learned about computer ISA design in
>> the last 70 years to answer "yes" to those questions.
>
> And?
>

People could probably make x86 last indefinitely, but would you want to?...


>>>> There is a *ton* of customer demand for cores with basically a
>>>> Cortex M0 microarchitecture and code size but with 64 bit registers.
>>>
>>> Apparently not enough for ARM to actually do that. Such a core should
>>> be relatively easy for ARM to do.
>>
>> You'd think so, yes. Except they, so far, refuse to subset their 64 bit ISA.
>
> A Cortex-M0 with 64-bit registers would not implement a subset of the
> A64 instruction set, but a 64-bit variant of the T32 instruction set.
>

Yeah.

It is pretty doable, though from my own experience, a fair bit of "hair"
emerges if one takes a 32-bit ISA, extends the GPRs to 64 bit, and then
tries to use it like a 64-bit ISA.

It works out a bit cleaner to use an ISA "actually designed" for the
word size and data types it is being used to operate on.

Either way, such a drastic change is liable to break binary
compatibility even if the basic ISA is otherwise the same.


>>> So the fact that Aarch64 guarantees all these features is an
>>> advantage, and RISC-V will feel the disadvantage of having so little
>>> guaranteed once it tries to expand into the general-purpose computing
>>> space.
>>
>> Application processors in servers / desktops / laptops / smartphones
>> are utterly different to deeply embedded tiny processors.
>
> Yes. ARM's architecture for "deeply embedded systems" are the
> M-profile architectures, such as Armv8-M, which supports a subset of
> T32, not A64.
>

This is true in this case; but I don't think it is such a sharp split,
nor that "deeply embedded" is so easily classified.


One can have traditional microcontroller applications, where performance
doesn't matter that much. Something like a Cortex-M holds up well here:
one mostly needs something that is reasonably cheap and energy efficient
and has sufficient capabilities for the task at hand.

There are also real-time DSP use-cases and similar, where the timing and
performance characteristics are fairly important; however, the wider
software ecosystem is largely irrelevant. The processor is too busy
processing camera/microphone/... data, trying to identify patterns, to
care whether or not it can run Linux or similar.

One use case may be happy with 16 or 32 bits.

The other may well want 64 bits, albeit for potentially somewhat
different reasons than a user-facing / application processor would.

The application processor may want 64-bit for dealing with lots of RAM
and large address spaces; the DSP may want 64-bit mostly for
manipulating data and quickly moving it from one place to another, ...


>>> IMO Code density is overrated in this newsgroup.
>>
>> You obviously haven't seen how actual customers complain if code size
>> for RISC-V comes out 3% or 5% bigger than Thumb2 on some benchmark or
>> inner loop! Even though other benchmarks or loops work out with the
>> advantage in the other direction by a similar amount.
>
> If you sell RISC-V on code size, and customers fall for it, they
> obviously will complain if you don't deliver.
>
> And sure, for a tiny embedded CPU it is relevant. But for a 64-bit
> architecture? Overrated.
>

This does seem to be the case, if one sees "64-bit architecture" as
synonymous with "user-oriented application processor", or "something one
might plausibly want to run in a web-server", or similar...


One possible area of debate is an architecture where the GPRs and
similar are 64-bit, but the address space is 32-bit (and where the
entire physical address space fits in the low 28 bits, ...).

...

MitchAlsup

unread,
Nov 11, 2019, 10:30:49 PM11/11/19
to
One could volunteer the UNIVAC 1108 and the CDC 6600

Pedro Pereira

unread,
Nov 12, 2019, 11:08:38 AM11/12/19
to
On Fri, 8 Nov 2019 23:11:48 -0800 (PST)
Bruce Hoult <bruce...@gmail.com> wrote:

> It's true that the ALU probably isn't the critical path in cycle time
> at the moment. Certainly our core designers have told me that when I've
> asked about the possible impact of including some more complex operations
> in the BitManip extensions (for example generalized reverse and
> generalized OR-combine).

In the latest bitmanip extension document, the popcount opcode is defined as:

rd = pcnt(rs)

a more useful primitive would be:

rd = pcnt(rs1 ^ rs2)

Since RISC-V has a zero register (x0), the suggested version could
encode the first one as "rd = pcnt(rs ^ x0)".

I don't imagine that reading one extra register and
performing a xor would make the instruction need more cycles.
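For illustration, the fused form is exactly a Hamming-distance primitive. A C sketch, where the loop stands in for the hardware pcnt and the function names are mine:

```c
#include <stdint.h>

/* Software model of the pcnt opcode: count the set bits of a word. */
static int pcnt(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* The fused pcnt(rs1 ^ rs2) form is the Hamming distance between two
   words -- a common inner-loop primitive in error-correcting codes,
   similarity hashing, and sequence comparison. */
static int hamming_distance(uint64_t rs1, uint64_t rs2)
{
    return pcnt(rs1 ^ rs2);
}
```

With rs2 tied to x0 (zero), it degenerates to the plain popcount, as noted above.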

Pedro Pereira

Paul A. Clayton

unread,
Nov 13, 2019, 1:51:33 PM11/13/19
to
On Saturday, November 9, 2019 at 6:36:46 PM UTC-5, MitchAlsup wrote:
> On Saturday, November 9, 2019 at 5:05:07 PM UTC-6, Bruce Hoult wrote:
[snip]
>> - reusing one of the operands as the destination
>
> Several CISC machines recognize::
>
> MOV R5,R7
> OP R5,R3
> as:
> OP R5,R7,R3
>
> But, still, to me this seems to be a step backward.

Including two address *and* three address forms, even
though redundant, can result in better code density.

>> - making an operand or destination implicit (i.e. fixed)
>
> <Shivering> I just don't know what to say.......

Making the destination of a less common operation
fixed (function call is a common example) seems
unlikely to cause major headaches for the compiler.
(Function calls are a somewhat special case because
ABIs typically constrain register allocation at
function boundaries. I think I would prefer function
specific interfaces as adding such to linking and
debug information would not seem to be a big deal.
I doubt this would be a significant benefit —
possibly converting abstractions like this pointer
to values — but I believe in the principle of
don't put off to runtime what can be done at
compile time (when such does not introduce excessive
communication overhead).)

>> - giving a smaller choice of registers
>
> <Shivering>

In many cases, it seems that having fewer source
options would have little impact. If an operation
is less common, such restrictions would not seem
to be especially burdensome. Yes, the compiler
would be more complicated, but that is mainly a
one-time cost.

>> - giving a smaller choice of addressing modes
>
> Since I only have 2, this bothers me little.

Your addressing modes are typically supersets of
the desired modes. I.e., other modes are special
cases of existing modes.

[snip]

>> The assembly-language programmer (or compiler) writes
>> code for the full size instruction set and the assembler
>> (and/or linker) selects the smaller encoding when it can.
>
> Much preferable to having the compiler perform the choice.

If the compiler and assembler are integrated, the difference
may not be significant.

For certain embedded systems where code storage is a
major cost, sacrificing performance-oriented (or even
energy-efficiency-oriented) ISA design in favor of
code density may be justified. One issue seems to be
that people want one interface to meet all possible
uses. While considering commonalities often helps
improve even localized optimizations and provides
economies of scale, specialization does have actual
value.

MitchAlsup

unread,
Nov 13, 2019, 2:46:22 PM11/13/19
to
Which is why I can get away with having only 2.....
>
> [snip]
>
> >> The assembly-language programmer (or compiler) writes
> >> code for the full size instruction set and the assembler
> >> (and/or linker) selects the smaller encoding when it can.
> >
> > Much preferable to having the compiler perform the choice.
>
> If the compiler and assembler are integrated, the difference
> may not be significant.

In Mc 88K days, we had the linker choose the small forms and
had the compiler/assembler express in the larger form. The
reason is that when one shrinks the code one has a linear
problem in optimization, whereas when one expands the code
one has an NP-complete problem in optimization.

Paul A. Clayton

unread,
Nov 13, 2019, 5:36:38 PM11/13/19
to
On Wednesday, November 13, 2019 at 2:46:22 PM UTC-5, MitchAlsup wrote:
> On Wednesday, November 13, 2019 at 12:51:33 PM UTC-6, Paul A. Clayton wrote:
[snip]
>> Your addressing modes are typically supersets of
>> the desired modes. I.e., other modes are special
>> cases of existing modes.
>
> Which is why I can get away with having only 2.....

Which points out the futility of counting addressing
modes (similar to counting the number of instructions),
especially when using such a number for a component in
a measure of ISA "goodness". [Diversion available at end
of message.]


[snip]
> In Mc 88K days, we had the linker choose the small forms and
> had the compiler/assembler express in the larger form. The
> reason is that when one shrinks the code one has a linear
> problem in optimization, whereas when one expands the code
> one has an NP-complete problem in optimization.

That makes sense, though I suspect expansion rarely has
horrendous performance. (Avoiding the bad cases can also
justify choices even if the averages are not very different
and some other reason advantages an otherwise slightly
worse choice. I suspect expansion could have some very bad
cases even if branch ranges and such are usually generous.)

=== DIVERSION ===

A memory access instruction with a dereference count and
N immediates could be counted as many addressing modes
if each dereference count was treated as a distinct
mode — or as one addressing mode. (If zero dereference
count meant simple register access, one could "beat"
ISAs that have a distinction for simple register access
versus memory access.)

The implementation effects of addressing complexity
matter more than mode count. Furthermore target uses
influence the value of addressing modes. Bit reverse
and circular buffer addressing modes seem to be justified
for DSPs (with heavy use of FFT and simple buffers),
but they have not been adopted outside of DSPs (AFAIK).

Changes in process technology, implementation techniques,
microarchitecture, and targeted workloads can change
the quality of a design choice. Economics and social
factors can also make a technically correct design choice
impractical. (If one has very little ability to
manage the addition of instructions, having call take
up a quarter of the encoding space may be justified to
avoid ISA bloat. Engineering around human factors can
be easier than re-engineering humans and human societies.)

A design choice can also be justified/criticized by
assuming the most simple-minded implementation. Although
sophisticated implementations generally increase complexity,
a mature technique can have many of the costs already
paid.

MitchAlsup

unread,
Nov 13, 2019, 6:35:39 PM11/13/19
to
On Wednesday, November 13, 2019 at 4:36:38 PM UTC-6, Paul A. Clayton wrote:
> On Wednesday, November 13, 2019 at 2:46:22 PM UTC-5, MitchAlsup wrote:
> > On Wednesday, November 13, 2019 at 12:51:33 PM UTC-6, Paul A. Clayton wrote:
> [snip]
> >> Your addressing modes are typically supersets of
> >> the desired modes. I.e., other modes are special
> >> cases of existing modes.
> >
> > Which is why I can get away with having only 2.....
>
> Which points out the futility of counting addressing
> modes (similar to counting the number of instructions),
> especially when using such a number for a component in
> a measure of ISA "goodness". [Diversion available at end
> of message.]

Note, I don't use number of address modes as a sign of goodness.

>
>
> [snip]
> > In Mc 88K days, we had the linker choose the small forms and
> > had the compiler/assembler express in the larger form. The
> > reason is that when one shrinks the code one has a linear
> > problem in optimization, whereas when one expands the code
> > one has an NP-complete problem in optimization.
>
> That makes sense, though I suspect expansion rarely has
> horrendous performance.

It was not performance at all: consider::

You have already resolved "LABEL_916" to be at some location k,
and have resolved "BR LABEL_916" to contain the address (IP-LABEL_916).

Now, you get to the point and have to expand an instruction, and this
currently resolved LABEL can no longer be reached by BR LABEL_916.
Making the object larger can ruin already completed work.

Making the object smaller can never ruin already completed work, just
make the already completed work less optimal.

> (Avoiding the bad cases can also
> justify choices even if the averages are not very different
> and some other reason advantages an otherwise slightly
> worse choice. I suspect expansion could have some very bad
> cases even if branch ranges and such are usually generous.)

Exactly.
>
> === DIVERSION ===
>
> A memory access instruction with a dereference count and
> N immediates could be counted as many addressing modes
> if each dereference count was treated as a distinct
> mode — or as one addressing mode. (If zero dereference
> count meant simple register access, one could "beat"
> ISAs that have a distinction for simple register access
> versus memory access.)

Side-diversion::
I state that My 66000 has 2 address modes, the ISA also has exactly
2 forms in the instruction set to represent those 2 addressing
modes. However, the second addressing mode can manifest itself in
3 sizes {No DISP, DISP32, DISP64}.
>
> The implementation effects of addressing complexity
> matter more than mode count. Furthermore target uses
> influence the value of addressing modes. Bit reverse
> and circular buffer addressing modes seem to be justified
> for DSPs (with heavy use of FFT and simple buffers),
> but they have not been adopted outside of DSPs (AFAIK).

I use circular buffers all the time in my simulators::

for( i = 0; ; i = (i+1) & (MAX-1) )
{ }
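That masking idiom generalizes to a simple power-of-two ring buffer; a minimal C sketch (names are illustrative):

```c
#define MAX 8   /* must be a power of two for the mask trick to work */

struct ring {
    int buf[MAX];
    unsigned head, tail;   /* free-running indices, masked on access */
};

static void ring_push(struct ring *r, int v)
{
    r->buf[r->head & (MAX - 1)] = v;   /* wrap via AND, no branch */
    r->head++;
}

static int ring_pop(struct ring *r)
{
    int v = r->buf[r->tail & (MAX - 1)];
    r->tail++;
    return v;
}
```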
>
> Changes in process technology, implementation techniques,
> microarchitecture, and targeted workloads can change
> the quality of a design choice. Economics and social
> factors can also make a technically correct design choice
> impractical. (If one has very little ability to
> manage the addition of instructions, having call take
> up a quarter of the encoding space may be justified to
> avoid ISA bloat. Engineering around human factors can
> be easier than re-engineering humans and human societies.)
>
> A design choice can also be justified/criticized by
> assuming the most simple-minded implementation. Although
> sophisticated implementations generally increase complexity,
> a mature technique can have many of the costs already
> paid.

I took particular care in my predication instructions to balance GBOoO
design points against LBIO design points.

I think I took particular care in my variable length encoding to balance
GBOoO with LBIO, but this did end up costing an Instruction Buffer in the
smaller but not smallest machines.

Quadibloc

unread,
Nov 13, 2019, 8:25:34 PM11/13/19
to
On Monday, November 11, 2019 at 11:21:34 AM UTC-7, Kyle Hayes wrote:

> This is an interesting idea. It seems like it is a form of bundling. Where
> do the bits that are the indicator live?

That was the pitfall. I had several instruction formats that had three spare
bits for the indicator, so the rule was that each block had to have one such
instruction at least.

That was messy and awkward, though. Making all the instructions 31 bits long,
and using the first bit for the indicator would be neater...

but I didn't like _that_ either, since it invited a serial implementation, and I
wanted the number of instructions in the block to be determined ahead of time,
so decoding would always be parallel and fast.

On further thought, maybe there *is* a way out. Make the instructions 31 bits
long, but demand that there be at least three instructions in each block. Then
the vacant bits in the first three give the number of instructions.
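
One way to read that scheme (my sketch of a hypothetical encoding, not
John's actual one): if each 31-bit instruction occupies a 32-bit slot
with the top bit vacant, the vacant bits of the first three slots
concatenate into a 3-bit count that a decoder can pull out in parallel
before looking at anything else:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block format: each 31-bit instruction sits in a
   32-bit slot with bit 31 vacant.  The vacant bits of the first
   three slots, read most-significant first, give the number of
   instructions in the block. */
static unsigned block_count(const uint32_t slot[3])
{
    return ((slot[0] >> 31) << 2) |
           ((slot[1] >> 31) << 1) |
            (slot[2] >> 31);
}
```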

John Savard

Brett

unread,
Nov 13, 2019, 11:08:52 PM11/13/19
to
Just use a bundle like Itanic and others; note that there is no rule that
you can’t branch to inside a bundle.
Most bundle systems suck because they pad with no-ops.

Note that with small bundles you have significant waste when the last
instruction wants to be big and won’t fit; an instruction re-arrange pass
will fix many of these.


Terje Mathisen

unread,
Nov 14, 2019, 1:46:06 AM11/14/19
to
MitchAlsup wrote:
> On Wednesday, November 13, 2019 at 12:51:33 PM UTC-6, Paul A. Clayton wrote:
>> On Saturday, November 9, 2019 at 6:36:46 PM UTC-5, MitchAlsup wrote:
>>>> The assembly-language programmer (or compiler) writes
>>>> code for the full size instruction set and the assembler
>>>> (and/or linker) selects the smaller encoding when it can.
>>>
>>> Much preferable to having the compiler perform the choice.
>>
>> If the compiler and assembler are integrated, the difference
>> may not be significant.
>
> In Mc 88K days, we had the linker choose the small forms and
> had the compiler/assembler express in the larger form. The
> reason is that when one shrinks the code one has a linear
> problem in optimization, whereas when one expands the code
> one has an NP-complete problem in optimization.

Not quite: It is possible to have a set of interlinked branches, so that
every one of them is longer than the maximum range for a short form, but
as soon as you shrink the right one anyway, all or most of the rest
will also become short enough.

I.e. an x86 single-byte relative branch offset (a 2-byte instruction)
that fell out of range had to be replaced with a branch past an immediate
jump (2 instr., 5 bytes), so it did not take too many of those long forms
to stretch the distance of a branch that went past them.

You could either start with all assumed short, or use a range for every
one and run a relaxation process where more and more of them would
settle down as either long or short.

I.e. starting with all long is easy, but quite likely to end up at a
very local optimum. Finding the globally best (or near-optimal)
solution is still O(n!) in the worst case, but that is extremely
unlikely in real life.

Terje Mathisen

unread,
Nov 14, 2019, 1:55:04 AM11/14/19
to
Paul A. Clayton wrote:
> === DIVERSION ===
>
> The implementation effects of addressing complexity
> matter more than mode count. Furthermore target uses
> influence the value of addressing modes. Bit reverse
> and circular buffer addressing modes seem to be justified
> for DSPs (with heavy use of FFT and simple buffers),
> but they have not been adopted outside of DSPs (AFAIK).

The last time I looked, my FFT-like codes using bit reversal could all
be optimized to the point where the reversing cost pretty much disappeared.

I don't remember the details now, but one of the ideas was to block by
16 and note that for such a block the low bits of the result would stay
constant and the top 4 bits could come from a tiny table, or they could
be unrolled all the way.
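
A minimal sketch of the tiny-table idea (my reconstruction under stated
assumptions, not Terje's actual code): a 16-entry table reverses one
nibble at a time, so reversing, say, a 12-bit FFT index costs three
table lookups and a couple of shifts:

```c
#include <assert.h>
#include <stdint.h>

/* rev4[x] is the 4-bit reversal of x, e.g. rev4[0b0001] == 0b1000. */
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse the low 12 bits of i, three nibbles at a time:
   the low nibble of the input becomes the high nibble of the result. */
static unsigned rev12(unsigned i)
{
    return (rev4[i & 0xF] << 8) |
           (rev4[(i >> 4) & 0xF] << 4) |
            rev4[(i >> 8) & 0xF];
}
```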

Anton Ertl

unread,
Nov 14, 2019, 4:21:38 AM11/14/19
to
MitchAlsup <Mitch...@aol.com> writes:
>In Mc 88K days, we had the linker choose the small forms and
>had the compiler/assembler express in the larger form. The
>reason is that when one shrinks the code one has a linear
>problem in optimization, whereas when one expands the code
>one has an NP-complete problem in optimization.

No.

If you start small and only ever expand, for a program of size n, you
expand at most n instructions, at least one per pass, so you need at
most n passes, and the overall worst-case complexity is quadratic. In
practice, you typically need maybe 2 or 3 linear passes. And except
for esoteric cases (see below), the result is optimal (while the
start-big-and-shrink algorithm can produce a suboptimal result that
will stay suboptimal even with additional passes, and may fail in the
esoteric case, see below).

There is an NP-completeness result for a variant of this problem, but
as always with stuff coming from theoreticians, you need to look at
the exact details of the problem they discuss. In this case, they
deal with a variant where expanding one immediate value can result in
making it possible to shrink another immediate value (and vice versa),
and they want to solve that problem optimally.

If you have such a problem, the start-small-and-expand algorithm can
produce a suboptimal result, but it will still be polynomial.

Note that if you have such a problem, the start-big-and-shrink
algorithm can fail: After a pass, an immediate may have to grow again,
but the algorithm cannot do that.

I have discussed this misconception repeatedly, including not long
ago, here <2018Jan2...@mips.complang.tuwien.ac.at>. Makes me
wonder what makes this misconception so popular.
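
A toy C sketch of the start-small-and-expand algorithm described above
(the constants, struct layout, and branch-distance model are mine, not
from any real assembler): every branch starts in short form, and each
pass expands any short branch that cannot reach its target, repeating
until a pass changes nothing.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of start-small-and-expand branch relaxation.  Each
   instruction has a size; branches also have a target index.  A
   short branch (2 bytes) whose target is more than SHORT_RANGE
   bytes from the end of the branch is expanded to long form
   (6 bytes).  Expansion only ever grows distances, so iterating
   to a fixpoint needs at most one pass per branch. */

#define SHORT_RANGE 128
#define LONG_SIZE   6

struct insn {
    int size;    /* current encoded size in bytes */
    int target;  /* branch target (index into the array), or -1 */
};

/* Relax in place; returns the number of passes used. */
static int relax(struct insn *code, int n)
{
    int passes = 0, changed = 1;
    int *addr = malloc((n + 1) * sizeof *addr);

    while (changed) {
        changed = 0;
        passes++;
        addr[0] = 0;                     /* recompute all addresses */
        for (int i = 0; i < n; i++)
            addr[i + 1] = addr[i] + code[i].size;
        for (int i = 0; i < n; i++) {
            if (code[i].target < 0 || code[i].size == LONG_SIZE)
                continue;
            /* distance measured from the end of the branch */
            int dist = addr[code[i].target] - addr[i + 1];
            if (dist > SHORT_RANGE || dist < -SHORT_RANGE) {
                code[i].size = LONG_SIZE;  /* expand; never shrink */
                changed = 1;
            }
        }
    }
    free(addr);
    return passes;
}
```

In the test below, expanding the backward branch at the end pushes the
forward branch at the front just past its range, illustrating the
cascade that forces multiple passes.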

Quadibloc

unread,
Nov 14, 2019, 5:04:09 AM11/14/19
to
On Wednesday, November 13, 2019 at 9:08:52 PM UTC-7, Brett wrote:

> Just use a bundle like Itanic and others,

The immediate values will be nice neat multiples of 32 bits. If I use a bundle
to have an instruction length different from 32 bits, then I get wasted space
for some numbers of instructions in the bundle. If I do use instructions that
are 32 bits long, I have no extra space for the indicator showing the count of
instructions in the bundle.

So I don't view a bundle as a simple solution.

Of course, if I made my immediates all multiples of *12* bits instead, so that
there is now a mismatch between the basic unit of length and the hardware, then
I could use a bundle size where, say, a variable number of 36-bit instructions
and immediates that have lengths like 12, 24, 36, 48, and 60 bits coexist with a
small number of extra bits.

However, while that would be a "solution", it then restricts the architecture to
these data types, which I would rather leave as a rarely-used option for special
purposes.

John Savard

EricP

unread,
Nov 14, 2019, 9:12:56 AM11/14/19
to
MitchAlsup wrote:
>
> I use circular buffers all the time in my simulators::
>
> for( i = 0; ; i = (i+1) & (MAX-1) )
> { }

Did you build a cycle simulator for MY66000?
If so, did you use SimpleScalar or something else?

I'm building a cycle simulator right now in C for my uArch ideas.
From scratch, not using SimpleScalar, though I have looked at it.
I use circular buffers with an extra wrap bit in the msb
for easy buffer empty and full detection.
I didn't want to use compacting queues (bleech).
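
A minimal sketch of the wrap-bit trick (my own illustration, assuming a
power-of-two buffer; EricP's actual code may differ): head and tail each
carry one bit above the index bits, so equal values mean empty while
equal indices with opposite wrap bits mean full, with no wasted slot and
no separate count.

```c
#include <assert.h>
#include <stdbool.h>

#define QSIZE 8                    /* must be a power of two */
#define QMASK (QSIZE - 1)

struct queue {
    unsigned head, tail;           /* range 0 .. 2*QSIZE-1 */
    int slot[QSIZE];
};

static bool q_empty(const struct queue *q) { return q->head == q->tail; }
/* Same index bits, different wrap bit => exactly QSIZE entries in flight. */
static bool q_full (const struct queue *q) { return (q->head ^ q->tail) == QSIZE; }

static void q_push(struct queue *q, int v)
{
    q->slot[q->tail & QMASK] = v;
    q->tail = (q->tail + 1) & (2 * QSIZE - 1);   /* wrap bit included */
}

static int q_pop(struct queue *q)
{
    int v = q->slot[q->head & QMASK];
    q->head = (q->head + 1) & (2 * QSIZE - 1);
    return v;
}
```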

I just forwarded my first ALU result to a reservation station yesterday.
Includes cache fetch, parse, decode, rename, RR & dispatch,
instruction queue, EU (execute unit) schedule, RS issue, execute,
result bus arbitrate, forwarding tag match,
result write back and forwarding, and retire.
Now for the load/store queue...

It's my first cycle simulator sooo it's not a pretty sight.
Mostly held together with chewing gum and twine.



Ivan Godard

unread,
Nov 14, 2019, 9:40:43 AM11/14/19
to
The quantity of stuff you put together, and the speed at which you do
so, never fails to amaze me; I suspect most here who have ever actually
built this kind of stuff are equally amazed. Kudos.

Sims aren't small. Ours is 30K lines of C++, not counting the common
libraries (27k lines) and the machine-generated parts (themselves from
21k lines of specification logic). Of course, that's for well over a
thousand different kinds of instructions plus all the rest of the machine.

EricP

unread,
Nov 14, 2019, 10:14:08 AM11/14/19
to
Well, I didn't mean I wrote it yesterday, it worked first time yesterday.
It's taken a few weeks. Getting the clocking and pipeline stalls
to work was a bitch until I came to terms with the fact that it
has to clock the stages in reverse order, and how signals have
to propagate across stages (they get posted for deferred updates).
And it only does two things: Nop's and a few kinds of Add's.
But it does those Add's oh so well, with grace and elegance.
So I'm a bit chuffed.
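
The reverse-order clocking idea can be sketched in a few lines (a toy of
my own, not EricP's simulator): evaluating writeback before execute and
execute before decode means each stage consumes its input latch before
the upstream stage overwrites it, so one loop iteration models one clock
without shadow copies of every latch.

```c
#include <assert.h>

#define BUBBLE (-1)

/* Toy 3-stage pipeline clocked back-to-front. */
struct pipe {
    int id_ex;   /* latch between decode and execute */
    int ex_wb;   /* latch between execute and writeback */
    int retired; /* count of values written back */
};

static void clock_pipe(struct pipe *p, int incoming)
{
    /* Stage 3 first: consume the EX/WB latch. */
    if (p->ex_wb != BUBBLE)
        p->retired++;
    /* Stage 2: move ID/EX into EX/WB ("execute" is trivially +1 here). */
    p->ex_wb = (p->id_ex == BUBBLE) ? BUBBLE : p->id_ex + 1;
    /* Stage 1 last: accept the incoming instruction. */
    p->id_ex = incoming;
}
```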

