
branch avoidance on orthodox Stanford RISC


already...@yahoo.com

Dec 23, 2017, 1:07:40 PM

First, let's define orthodox Stanford RISC (OSR):
1. No arithmetic flags
2. Instructions have at most 2 register inputs and one register output.
3. Instructions that have a relatively long immediate field (say, >6 bits) have at most 1 register input.

The rules above apply both to GPRs and to special-purpose registers, if the architecture has such registers. But true orthodoxy suggests that special-purpose registers can't serve as a source or destination for integer arithmetic and are accessed exclusively through moves to/from GPRs.

Original MIPS is not an OSR.
Alpha is not an OSR.
MIPS Release 6 is close enough to OSR.
RISC-V looks like OSR. But it is a moving target; I don't follow it daily, so I'd leave the judgment to RISC-V experts.
Altera Nios2 is OSR. And it is the only one in the list that I really care about.

Nios2 also happens to be the one with especially weak support for branch avoidance, if not to say no support at all.
In general, the Nios2 compiler (gcc is the only one available) never generates branchless code for statements like 'r = x < y ? a : b;'.
And if I want to write it manually, I can't come up with anything shorter than 5 instructions:
sel_a = x < y;
seln_a = 0 - sel_a;
axb = a ^ b;
axbm = axb & seln_a;
r = b ^ axbm;
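
For concreteness, the same sequence as compilable C (a sketch; the function wrapper is mine, not anything from the Nios2 toolchain):

/* branchless r = x < y ? a : b, mirroring the 5-instruction sequence */
unsigned sel_xor(unsigned x, unsigned y, unsigned a, unsigned b)
{
    unsigned sel_a  = x < y;        /* 0 or 1 */
    unsigned seln_a = 0 - sel_a;    /* 0 or an all-ones mask */
    unsigned axbm   = (a ^ b) & seln_a;
    return b ^ axbm;                /* b ^ (a^b) == a when the mask is all ones */
}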

That does not compare too well with two instructions on something like aarch64 and, depending on context, 2, 3 or 4 instructions on x86.

MIPSr6 is just a little better:
sel_a = x < y;
am = sel_a ? a : 0;
bm = sel_a ? 0 : b;
r = am | bm;
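
In C, again as a sketch (the SELNEZ/SELEQZ names in the comments are my reading of the r6 select-or-zero instructions):

/* MIPSr6-style select: two select-or-zero ops plus an OR */
unsigned sel_r6(unsigned x, unsigned y, unsigned a, unsigned b)
{
    unsigned sel_a = x < y;       /* SLT(U) */
    unsigned am = sel_a ? a : 0;  /* SELNEZ am, a, sel_a */
    unsigned bm = sel_a ? 0 : b;  /* SELEQZ bm, b, sel_a */
    return am | bm;               /* OR */
}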

RISC-V, according to my understanding, is the same as Nios2. But I am not an expert.

So, what are possible conclusions?
A. Branch avoidance is a stupid idea.
B. Orthodox Stanford RISC is a stupid idea.
C. Michael_S is stupid, because he doesn't see how branch avoidance can be implemented nicely and efficiently on OSR.
D. ?????
E. The post is off topic, because it is not about religion (with which I would violently disagree, because Stanford RISC is certainly a form of religion).

Bruce Hoult

Dec 23, 2017, 2:36:59 PM
On Saturday, December 23, 2017 at 9:07:40 PM UTC+3, already...@yahoo.com wrote:
> First, let's define orthodox Stanford RISC (OSR):
> 1. No arithmetic flags
> 2. instructions have at most 2 register inputs and one register output.
> 3. Instructions that have relatively long immediate field (say, >6 bits) have at most 1 register input.
>
> The rules above apply both to GPRs and to special-purpose registers, if architecture has such registers. But true orthodoxy suggest that the special-purpose registers can't serve as a source or destination for integer arithmetic and accessed exclusively through moves o/from GPRs.
>
> Original MIPS is not an OSR
> Alpha is not an OSR.
> MIPS Release 6 is close enough to OSR.
> RISC-V looks like OSR. But it is a moving target, I don't follow it daily, so I'd leave the judgment to RISC-V experts.

It is. The base integer instruction set has been frozen forever.

> Altera Nios2 is OSR. And the only one in the list that I really care about.
>
> Nios2 also happens be the one which have especially weak support for branch avoidance, if not to say, no support.
> In general, Nios2 compiler (gcc is the only one available) never generates branchless code for statements like 'r = x < y ? a : b;'.
> And if I would want to generate it manually, I can't come with anything shorter then 5 instructions:
> sel_a = x < y;
> seln_a = 0 - sel_a;
> axb = a ^ b;
> axbm = axb & seln_a;
> r = b ^ axbm;
>
> That does not compare too well with two instruction on something like aarch64 and, depending on context, 2, 3 or 4 instructions on x86.
>
> MIPSr6 is just a little better:
> sel_a = x < y;
> am = sel_a ? a : 0;
> bm = sel_a ? 0 : b;
> r = am | bm;
>
> RISC-V, according to my understanding, is the same, as Nios2. But I am not an expert.

The usual code generated in RISC-V (RV64GC) for...

int avoid(int x, int y, int a, int b){
return x < y ? a : b;
}

... is ...

With -O1, -O2, -O3

00000000000101b2 <avoid>:
101b2: 00b54463 blt a0,a1,101ba <avoid+0x8>
101b6: 8536 mv a0,a3
101b8: 8082 ret
101ba: 8532 mv a0,a2
101bc: 8082 ret

With -Os

00000000000101b2 <avoid>:
101b2: 00b54363 blt a0,a1,101b8 <avoid+0x6>
101b6: 8636 mv a2,a3
101b8: 8532 mv a0,a2
101ba: 8082 ret

If you really want to avoid branches, you can do:

00000000000101b2 <avoid>:
101b2: 00b52533 slt a0,a0,a1
101b6: 8e35 xor a2,a2,a3
101b8: 40a0053b negw a0,a0
101bc: 8d71 and a0,a0,a2
101be: 8d35 xor a0,a0,a3
101c0: 8082 ret

So, yes, three extra instructions. That's pretty borderline to avoid a branch some fraction of the time.

> So, what are possible conclusions?
> A. Branch avoidance is a stupid idea.

With modern branch predictors (Pentium MMX, Pentium Pro and newer) it's a pretty stupid idea.

The RISCV designers assumed that all CPUs that want it will have excellent branch prediction. The microcontroller-class SiFive FE310-G000 does (it's the only RISCV CPU you can currently buy off the shelf).

They also believe that conditional select is a very bad idea in out-of-order microarchitectures.

> B. Orthodox Stanford RISC is a stupid idea.

It's a pretty good idea for minimising the overall cost of a CPU that performs well and has excellent performance-power-area. Integer instructions that need three read ports are not common enough to be worth the cost of providing them. The resulting increase in area (power consumption) and cycle time outweighs some micro-benchmarks taking a couple of cycles less.

I note that a conditional select like sel_a?a:0 needs three read ports, in contravention of OSR, unless there is some special "select x or 0" instruction.

One slight improvement for this case would be a conditional set instruction that produces 0 or -1, not 0 or 1. This would not fit C semantics as well, but would work out better for a lot of bit manipulation tasks.
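
As a sketch, with -(unsigned)(x < y) standing in for that hypothetical 0/-1 compare, the select collapses to four operations (the separate negate disappears):

/* four ops: compare-to-mask, xor, and, xor */
unsigned sel_mask(unsigned x, unsigned y, unsigned a, unsigned b)
{
    unsigned mask = -(unsigned)(x < y);  /* 0 or all ones; one instruction if the ISA has it */
    return b ^ ((a ^ b) & mask);
}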

If you have a BIC instruction (also known as ANDNOT or ANDN) then you can avoid the XOR stuff, but it doesn't save any instructions:

sel_a = x < y;
seln_a = -sel_a;
am = a & seln_a;
bm = b & ~seln_a;
r = am | bm;
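
In C, as a sketch (with -(unsigned)(x < y) standing in for the compare-plus-negate; the b & ~seln_a term is the BIC/ANDN):

/* and / andnot / or select; the two AND terms are independent */
unsigned sel_andn(unsigned x, unsigned y, unsigned a, unsigned b)
{
    unsigned seln_a = -(unsigned)(x < y);  /* 0 or all ones */
    return (a & seln_a) | (b & ~seln_a);
}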

Ivan Godard

Dec 23, 2017, 2:37:12 PM
#E: Ain't it the truth!

lsss(bX, bY) %SEL, pick(bSEL, bA, bB) %r;

"%N" is a pseudo-identifier for the result, "bN" is a reference to that
value. Two operations, less than one instruction, one cycle.

BUT - not orthodox. Bad ISA! Heathen! Apostate!

Bruce Hoult

Dec 23, 2017, 2:53:47 PM
That's really great, if it proves to work well. Right now, we have no way of knowing if it's gonna be great or gonna suck.

Ship it! Benchmark it! Please...

already...@yahoo.com

Dec 23, 2017, 3:36:14 PM
Ivan, Mill is *really* off topic here.
At least as much off topic as x86 or ARM or SPARC or PPC or IPF or zArch. More off topic than original MIPS and Alpha, both of which are also off topic.
And this time I am not joking.

MitchAlsup

Dec 23, 2017, 3:57:30 PM
On Saturday, December 23, 2017 at 12:07:40 PM UTC-6, already...@yahoo.com wrote:

> In general, Nios2 compiler (gcc is the only one available) never generates branchless code for statements like 'r = x < y ? a : b;'.

My ISA would encode this as:

CMP Rt,Rx,Ry
PLT Rt,0110 // predicate on less than
MOV Rr,Ra
MOV Rr,Rb

The PLT (Predicate on less than) casts a predicate shadow forward over
(up to) the next 8 instructions (using the 16-bit offset that would have
been used as a branch offset if it were a branch instruction). The encodings
provide for:

00 execute
10 do not execute
01 execute if condition false
11 execute if condition true

Predicates can be combined so that fairly complicated branchless code
paths can be encoded. "Execute" allows a subsequent predicate instruction
to turn off execution, or marks an instruction common to both paths; "do not execute"
allows a subsequent predicate to enable execution.

> First, let's define orthodox Stanford RISC (OSR):
> 1. No arithmetic flags
> 2. instructions have at most 2 register inputs and one register output.
> 3. Instructions that have relatively long immediate field (say, >6 bits) have at most 1 register input.

1. no arithmetic flags -- check
2. 2-operand 1 result -- makes it hard to encode FMADD, which is a required IEEE FP instruction. In any event, I found it necessary to have an encoding space for eight 3-operand 1-result instructions; 3 are used.
3. long immediates -- check: 16-bit, 32-bit and 64-bit are available; integer, floating point, memory refs, and flow control. No instructions are
ever needed to paste constants together.

Bruce Hoult

Dec 23, 2017, 4:10:47 PM
Integer and FP have different requirements. If there are two different register files they can have different numbers of read ports/operands.

It could maybe be interesting -- assuming an in-order, single dispatch machine -- to have the forwarding path (only) be able to transmit the full unrounded result of a multiply to an immediately-following add.

already...@yahoo.com

Dec 23, 2017, 4:15:15 PM
Branch predictors are great when they work. Which is most of the time.
But sometimes they just do not work.
Even then they can be better than branch avoidance if at least one of the sources *and the destination* are on the critical latency path, esp. one that is likely to contain an L2 cache miss. Because even a non-working branch predictor is right 50% of the time.
But it is not that uncommon (pulling a number out of thin air, I'd say >0.5% of the time) that both branch prediction does not work and the whole thing is not in the middle of a problematic latency path.
And then branch avoidance is handy.
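
(A back-of-the-envelope illustration, with numbers assumed purely for the example: with a 10-cycle mispredict penalty, a branch that the predictor gets right only 50% of the time costs 0.5 * 10 = 5 cycles on average, while the 5-instruction branchless sequence above costs a fixed ~4 extra cycles on a scalar machine.)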

And sometimes, esp. in microcontroller workloads, worst-case execution time is the only execution time that matters. And branch prediction is not the best at achieving that goal.

In the case that triggered this thread it was not even the worst-case time itself, but the simple fact that without branches (except for the branch at the end of the loop) it is so much easier to measure the worst-case execution time.




> The RISCV designers assumed that all CPUs that want it will have excellent branch prediction. The microcontroller-class SiFive FE310-G000 does (it's the only RISCV CPU you can currently buy off the shelf).
>
> They also believe that conditional select is a very bad idea in out-of-order microarchitectures.
>

The designers of Alpha were generally very orthodox. But even they chose an unorthodox way here.
And yes, in EV6/7 it was a special case that had to be cracked (decomposed by the Ibox pipeline, as they politely call it in the 21264 HRM). But it sounds as if it was worth it.


> > B. Orthodox Stanford RISC is a stupid idea.
>
> It's a pretty good idea for minimising the overall cost of a CPU that performs well and has excellent performance-power-area. Integer instructions that need three read ports are not common enough to be worth the cost of providing them. The resulting increase in area (power consumption) and cycle time outweigh some micro-benchmarks taking a couple of cycles less.

My personal opinion:
I have little doubt that OSR is a bad idea on the high end, where things are either cracked according to need or, sometimes, your uArch just handles 3-input instructions natively. The latter is becoming the new norm on the FP/SIMD side, but is still not universal on the GPR side.
I also have little doubt that OSR is a bad idea [for hard cores] on the very low end (Cortex-M3/M4 class), where 3 inputs are not a cycle time limiter for a short scalar pipeline.

Maybe there is a window in the middle where OSR makes sense. Or maybe not.

Now low-end *soft* cores (soft as in FPGA, not "soft IP" supplied in RTL form for ASICs) are different. Here register ports are really expensive. Here OSR has merits.

Bruce Hoult

Dec 23, 2017, 4:39:40 PM
Here's what the RISC-V ISA manual V2.2 has to say:

We considered but did not include conditional moves or predicated
instructions, which can effectively replace unpredictable short
forward branches. Conditional moves are the simpler of the two, but
are difficult to use with conditional code that might cause exceptions
(memory accesses and floating-point operations). Predication adds
additional flag state to a system, additional instructions to set
and clear flags, and additional encoding overhead on every
instruction.

Both conditional move and predicated instructions add complexity to
out-of-order microarchitectures, adding an implicit third source
operand due to the need to copy the original value of the destination
architectural register into the renamed destination physical register
if the predicate is false. Also, static compile-time decisions to use
predication instead of branches can result in lower performance on
inputs not included in the compiler training set, especially given
that unpredictable branches are rare, and becoming rarer as branch
prediction techniques improve.

We note that various microarchitectural techniques exist to
dynamically convert unpredictable short forward branches into
internally predicated code to avoid the cost of flushing pipelines on
a branch mispredict [13, 17, 16] and have been implemented in
commercial processors [27].

The simplest techniques just reduce the penalty of recovering from a
mispredicted short forward branch by only flushing instructions in the
branch shadow instead of the entire fetch pipeline, or by fetching
instructions from both sides using wide instruction fetch or idle
instruction fetch slots. More complex techniques for out-of-order
cores add internal predicates on instructions in the branch shadow,
with the internal predicate value written by the branch instruction,
allowing the branch and following instructions to be executed
speculatively and out-of-order with respect to other code [27].

Reference [27] is:

[27] Balaram Sinharoy, R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. Van Norstrand,
B. J. Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie, D. Q. Nguyen, B. Blaner, C. F.
Marino, E. Retter, and P. Williams. IBM POWER7 multicore server processor. IBM Journal
of Research and Development, 55(3):1–1, 2011.

> And sometimes, esp in microcontroller workloads, worst case execution time is the only execution time that matters. And branch prediction is not the best in achieving that goal.
>
> In my case that triggered this thread it was not even a worst case time itself, but a simple fact that wothout branches (except for branch at the end of the loop) it is so much easier to measure a worst case execution time.

Branch mispredicts don't add a lot of time to the worst case in simple CPUs with short pipelines. Three cycles, just like the three extra instructions for your branchless version, is pretty common.

If you're that close to not being in spec then you probably want to use a faster processor anyway :-)

> I also have little doubt that OSR is a bad idea [for hard cores] on the very low end (Cortex-M3/M4 class) where 3 inputs are not a cycle time limiter for a short scalar pipeline.

I note that SiFive's RISC-V Cortex-M3/M4 competitor runs at 320 MHz in 180 nm. The fastest ARMs I know of in that process are 180 MHz.

Whether something is a cycle time limiter depends on what else is limiting the cycle time instead :-) :-)

already...@yahoo.com

Dec 23, 2017, 5:23:32 PM
Yes, I had read this piece of Asanovic propaganda too.
The techniques he mentioned were implemented only on ultra-high-end IBM processors, irrelevant for most of us.

> > And sometimes, esp in microcontroller workloads, worst case execution time is the only execution time that matters. And branch prediction is not the best in achieving that goal.
> >
> > In my case that triggered this thread it was not even a worst case time itself, but a simple fact that wothout branches (except for branch at the end of the loop) it is so much easier to measure a worst case execution time.
>
> Branch mispredicts don't add a lot of time to the worst case in simple CPUs with short pipelines. Three cycles, just like the three extra instructions for your branchless version, is pretty common.

You are thinking of a 4-stage pipeline. The Nios II/f pipeline is longer - 6 stages.
And a dynamic branch predictor greatly increases variability in timing.

>
> If you're that close to not being in spec then you probably want to use a faster processor anyway :-)

Not when you have a very simple fixed task and the requirements will never change.
Which was my case.
In such a situation something like 90% is o.k. Sometimes even 95%.


>
> > I also have little doubt that OSR is a bad idea [for hard cores] on the very low end (Cortex-M3/M4 class) where 3 inputs are not a cycle time limiter for a short scalar pipeline.
>
> I note that SiFive's RISC-V Cortex-MS/M4 competitor runs at 320 MHz in 180 nm. The fastest ARMs I know of in that process are 180 MHz.

Did you check the Intel PXA250? I am not 100% certain, but it seems likely that it was fabricated at 180nm.

As to Cortex-M3/M4 at 180nm, people don't do it, not because it's impossible (I have no idea whether it's possible or not), but because it's a very bad idea. The M3/M4 has no instruction cache and flash at 180nm is not exactly fast. A flash accelerator helps somewhat, but not enough for useful operation above something like 120MHz.
And SRAM at 180nm is relatively expensive, so running all code out of SRAM tends to be bad economy. One tends to buy an MCU with as much SRAM as needed for data, but no more.

I believe you that 180 MHz parts exist, but even 180 MHz is more of a gimmick than something really useful. 320 MHz - even more so.

As to SiFive being a "Cortex-M3/M4 competitor": if they said it themselves, then to me it sounds like megalomania.

Ivan Godard

Dec 23, 2017, 6:03:58 PM
On 12/23/2017 12:36 PM, already...@yahoo.com wrote:

> Ivan, Mill is *really* off topic here.
> At least as much off topic as x86 or ARM or SPARC or PPC or IPF or zArch. More off topic than original MIPS and Alpha, both of which are also off topic.
> And this time I am not joking.
>

My apologies - I thought the topic was religion.

Quadibloc

Dec 23, 2017, 7:21:08 PM
In any case, the Mill is only off-topic for the thread, not for the newsgroup,
and thread drift is a perfectly legitimate phenomenon.

John Savard

already...@yahoo.com

Dec 23, 2017, 7:47:55 PM
According to William Blake, the Mill is closely related to religion, so maybe you are right.
But I am not sure that you agree with Old Bill's characterization of your Mill.

David Brown

Dec 23, 2017, 7:50:52 PM
On 23/12/17 23:23, already...@yahoo.com wrote:
> On Saturday, December 23, 2017 at 11:39:40 PM UTC+2, Bruce Hoult
> wrote:
>> On Sunday, December 24, 2017 at 12:15:15 AM UTC+3,
>> already...@yahoo.com wrote:

<snip>

>>
>>> I also have little doubt that OSR is a bad idea [for hard cores]
>>> on the very low end (Cortex-M3/M4 class) where 3 inputs are not a
>>> cycle time limiter for a short scalar pipeline.
>>
>> I note that SiFive's RISC-V Cortex-MS/M4 competitor runs at 320 MHz
>> in 180 nm. The fastest ARMs I know of in that process are 180 MHz.
>
> Did you check Intel PXA250? I am not 100% certain, but it sounds
> likely that it was fabricated at 180nm.
>
> As to Cortex-M3/M4 at 180nm, people don't do it not because it's
> impossible (I have no idea if it's possible or not) but because it's
> very bad idea. M3/M4 has no instruction cache and the flash at 180nm
> is not exactly fast. Flash accelerator helps somewhat, but not enough
> for usefull operation above something like 120MHz. And SRAM at 180nm
> is relatively expensive, so running all code out of SRAM tend to be
> bad economy. One tends to by MCU with as much SRAM as needed for
> data, but no more.
>
> I believe you that there exist 180 MHz parts, but even 180 MHz is
> more a gimmick than something really useful. 320 MHz - even more so.


There are indeed 180 MHz Cortex-M4 parts, such as the NXP Kinetis K65.
It has 8 KB shared instruction/data cache. That is the fastest M4 that
I know of. Until recently, the fastest M7 I had heard of was 240 MHz.
But there are new M7 devices coming out at 600 MHz. I don't know the
process size, but I think it is small - a key difference here is that
these microcontrollers do not have any on-board flash. As I understand
it, the speed limitation of M3/M4/M7 microcontrollers has not been the
core itself, but using a process that is also cheap for flash. Once the
decision was made to drop the flash (in the NXP i.MX RT series, for
example), a smaller and faster process could be used.



>
> As to SiFive being "Cortex-MS/M4 competitor" if they said it by
> themselves then to me it sounds as megalomania.
>

I suspect whoever said it meant competitive in terms of processor power
for a given area, cost, or power usage - rather than competing on
numbers of chips produced!

already...@yahoo.com

Dec 23, 2017, 8:46:49 PM
If it has a cache, that certainly changes the trade-offs.
An instruction cache might even not be such a bad idea.
But a data cache? I certainly don't want a data cache on an M4!
It kills 3/4 of the beautiful simplicity of using DMA with peripherals.
The simplicity that allows even inexperienced programmers who know nothing about cache coherence to deliver useful results with these little things.
Then again, my experience with M3/M4 was exclusively with STMicro parts. I can believe that with NXP it never was beautifully simple, so a little more pain matters less.
Or, maybe, they understood that a data cache is a bad idea, but since ARM does not provide caches as part of the M4 IP, adding a unified cache was much simpler for their designers than adding an ICache alone?

Anyway, there is no chance that I will use an M4 with a D$ in those designs where I have control.

MitchAlsup

Dec 23, 2017, 9:47:19 PM
Bit field insert is the 3rd instruction in the 3-operand group.
>
> It could maybe be interesting -- assuming an in-order, single dispatch machine -- to have the forwarding path (only) be able to transmit the full unrounded result of a multiply to an immediately-following add.

Driving all of the numerical analysts crazy--not necessarily a bad thing.

Bruce Hoult

Dec 23, 2017, 11:42:43 PM
The SiFive part has 16 KB instruction cache, but no data cache (16 KB data scratchpad). And external flash, at least in the SoC they are selling off the shelf. But that's only a demo. Their business model is to build customized SoCs containing exactly what the customer wants, at low NRE cost.

Depending on the customer needs, you might want to include flash on chip and run at lower MHz, or you might want to put a small program in mask ROM, or ...

Bruce Hoult

Dec 24, 2017, 12:19:08 AM
Yes.

Variable position or size bit field insert is unusual. The usual case of a fixed size and position needs to read only two register operands if the other information is immediate in the instruction. See PowerPC RLWIMI/RLIMI.

If everything is variable at runtime then you need (assuming the operand to be inserted is right justified at the start and zero-padded) to read four registers, not three: start, size, src, dst.

If you're restricted to reading two and writing one, then that means an absolute minimum of three instructions are needed. However I don't see any way to design such instructions. The best I can do is:

bitmask tmp,shift,size // rotl(~((1<<size)-1), shift)
and dst,dst,tmp
lsl tmp,src,shift
or dst,dst,tmp

If src is not already isolated (zero padded) then an additional bic/andc/andn is needed after the lsl. If you have a bic instruction then bitmask doesn't need the ~ and the rotl can be an lsl (and swap the and and bic).
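
For reference, the whole thing in C (a sketch; it assumes src is already zero-padded above size bits and that shift + size <= 64 with size nonzero, so the shifts are defined):

#include <stdint.h>

/* insert the low `size` bits of src into dst at bit position `shift` */
uint64_t bf_insert(uint64_t dst, uint64_t src, unsigned shift, unsigned size)
{
    uint64_t tmp = ~(((1ULL << size) - 1) << shift);  /* bitmask: 0s in the field */
    dst &= tmp;                                       /* and: clear the field   */
    return dst | (src << shift);                      /* lsl + or               */
}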

> > It could maybe be interesting -- assuming an in-order, single dispatch machine -- to have the forwarding path (only) be able to transmit the full unrounded result of a multiply to an immediately-following add.
>
> Driving all of the numerical analysts crazy--not necessarily a bad thing.

I should have emphasised "BE ABLE TO". Meaning it can also choose not to do that. Only expansion of the FMADD family would suppress rounding/normalisation of the forwarded result.

Anton Ertl

Dec 24, 2017, 10:24:41 AM
already...@yahoo.com writes:
>
>First, let's define orthodox Stanford RISC (OSR):
>1. No arithmetic flags
>2. instructions have at most 2 register inputs and one register output.
>3. Instructions that have relatively long immediate field (say, >6 bits) have at most 1 register input.
...
>Original MIPS is not an OSR
>Alpha is not an OSR.

Why not?

>So, what are possible conclusions?
>A. Branch avoidance is a stupid idea.

I don't think so. There are branches that are hard to predict, and in
those cases branch avoidance can be beneficial. OTOH, how often do
compilers know in practice that branches are hard to predict?

>B. Orthodox Stanford RISC is a stupid idea.

It looks like a sweet spot in instruction set design, certainly for
the kind of implementation that the first RISC implementations were
targeting, but it has more enduring value, as RISC-V and Aarch64 (more
orthodox than ARM, although less than RISC-V) demonstrate.

But of course, for every design, there are some cases where it does
not fit well, and that require more effort than one would like.
Branch avoidance is such a case for OSR. Given that it is needed
relatively rarely, it may not be so bad that it requires more effort.

Your rule 1 is the one that the fewest RISC architectures follow;
there is a large variation here. Having a separate register file for
flags (as in IA-64 and PowerPC) appears to be a good way to avoid both
problems: having to deal with a fixed flag register (SPARC, ARM,
IA-32); and needing an extra GPR port for branch avoidance;
disadvantage: you need an extra instruction for getting a comparison
result into a GPR, and that may cost more than longer branch avoidance
sequences.

>C. Michael_S is stupid, because he doesn't see how branch avoidance can be implemented nicely and efficiently on OSR.

The Alpha way is shorter than the ones you presented:

For r=x<y?a:b you get

sel_a = x<y
r = b
if sel_a then r = a (CMOVE)

On an implementation without register renaming, CMOVE needs only two
read ports (sel_a and a). On an implementation with register
renaming, it needs three; but if you can afford that kind of
implementation, you should be able to afford the workaround that's
necessary for that, or maybe even a more flexible register port
regime. Of course, if register renaming implementations are your
primary target, it may be better to have a three-address CMOVE from
the start:

sel_a = x<y
r = sel_a ? a : b;

Or maybe such an implementation can combine the last two instructions
of the former sequence into the last instruction of the latter
sequence.

The C-inspired comparison instructions that produce 1 as truth values
don't help, as your Nios2 example shows. If you have comparison
instructions that produce all-bits-set as truth value, you could avoid
the second instruction in the Nios2 sequence without needing the
special MIPSr6 instructions. If you have an ANDNOT instruction, you
can do

seln_a = x<y
am = a & seln_a;
bm = b & ~seln_a
r = am|bm;

One cycle less latency than the xor-and-xor variant that Nios2 uses.

In addition to conditional move, conditional store could be helpful.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

already...@yahoo.com

Dec 24, 2017, 11:02:53 AM
On Sunday, December 24, 2017 at 5:24:41 PM UTC+2, Anton Ertl wrote:
> already...@yahoo.com writes:
> >
> >First, let's define orthodox Stanford RISC (OSR):
> >1. No arithmetic flags
> >2. instructions have at most 2 register inputs and one register output.
> >3. Instructions that have relatively long immediate field (say, >6 bits) have at most 1 register input.
> ...
> >Original MIPS is not an OSR
> >Alpha is not an OSR.
>
> Why not?

Alpha - only because of cmove (3 inputs)

MIPS had many non-OSR instructions:
MOVN/MOVZ (3 register inputs), multiplication (2 destination registers), Load Word Right (2 register inputs + 16-bit immediate input), MADD/MADDU (4 sources, 2 destinations).

already...@yahoo.com

Dec 24, 2017, 11:28:08 AM
On Sunday, December 24, 2017 at 5:24:41 PM UTC+2, Anton Ertl wrote:
> already...@yahoo.com writes:
> >
> >First, let's define orthodox Stanford RISC (OSR):
> >1. No arithmetic flags
> >2. instructions have at most 2 register inputs and one register output.
> >3. Instructions that have relatively long immediate field (say, >6 bits) have at most 1 register input.
> ...
> >Original MIPS is not an OSR
> >Alpha is not an OSR.
>
> Why not?
>
> >So, what are possible conclusions?
> >A. Branch avoidance is a stupid idea.
>
> I don't think so. There are branches that are hard to predict, and in
> those cases branch avoidance can be beneficial. OTOH, how often do
> compilers know in practice that branches are hard to predict?
>
> >B. Orthodox Stanford RISC is a stupid idea.
>
> It looks like a sweet spot in instruction set design, certainly for
> the kind of implementation that the first RISC implementations were
> targeting, but it has more enduring value, as RISC-V and Aarch64 (more
> orthodox than ARM, although less than RISC-V) demonstrate.
>

Aarch64 violates every single OSR rule except two (32-bit instructions, 32 general-purpose registers). That includes a very basic rule that I didn't mention, because it was irrelevant to a discussion of branch avoidance:
The effective address of [integer] loads and stores depends on [no more than] one register.


> But of course, for every design, there are some cases where it does
> not fit well, and that require more effort than one would like.
> Branch avoidance is such a case for OSR. Given that it is needed
> relatively rarely, it may not be so bad that it requires more effort.
>
> Your rule 1 is the one that the fewest RISC architectures follow;
> there is a large variation here. Having a separate register file for
> flags (as in IA-64 and PowerPC) appears to be a good way to avoid both
> problems: having to deal with a fixed flag register (SPARC, ARM,
> IA-32); and needing an extra GPR port for branch avoidance;
> disadvantage: you need an extra instruction for getting a comparison
> result into a GPR, and that may cost more than longer branch avoidance
> sequences.
>
> >C. Michael_S is stupid, because he doesn't see how branch avoidance can be =
> >implemented nicely and efficiently on OSR.
>
> The Alpha way is shorter than the ones you presented:
>
> For r=x<y?a:b you get
>
> sel_a = x<y
> r = b
> if sel_a then r = a (CMOVE)
>

Non-OSR, so it doesn't count.

> On an implementation without register renaming, CMOVE needs only two
> read ports (sel_a and a). On an implementation with register
> renaming, it needs three; but if you can afford that kind of
> implementation, you should be able to afford the workaround that's
> necessary for that, or maybe even a more flexible register port
> regime.

Or it could be cracked, as in EV6/EV7.

> Of course, if register renaming implementations are your
> primary target, it may be better to have a three-address CMOVE from
> the start:

Sounds logical. For non-OSR.

>
> sel_a = x<y
> r = sel_a ? a : b;
>
> Or maybe such an implementation can combine the last two instructions
> of the former sequence into the last instruction of the latter
> sequence.
>
> The C-inspired comparison instructions that produce 1 as truth values
> don't help, as your Nios2 example shows. If you have comparison
> instructions that produce all-bits-set as truth value, you could avoid
> the second instruction in the Nios2 sequence without needing the
> special MIPSr6 instructions. If you have an ANDNOT instruction, you
> can do
>
> seln_a = x<y
> am = a & seln_a;
> bm = b & ~seln_a
> r = am|bm;
>
> One cycle less latency than the xor-and-xor variant that Nios2 uses.

Not Nios2 (the compiler never does branch avoidance on its own), but myself programming Nios2.
BTW, all existing Nios2 implementations are scalar, so the latency does not matter.
According to my understanding, the same applies to RISC-V, but in its case there are plans for superscalar implementations.

>
> In addition to conditional move, conditional store could be helpful.

Conditional store is pretty rare, even outside of Stanford orthodoxy. I don't know why.

EricP

Dec 24, 2017, 12:40:31 PM
MitchAlsup wrote:
>
> My ISA would encode this as:
>
> CMP Rt,Rx,Ry
> PLT Rt,0110 // predicate on less than
> MOV Rr,Ra
> MOV Rr,Rb
>
> The PLT (Predicate on less than) casts a predicate shadow forward over
> (up to) the next 8 instructions (using the 16-bit offset that would have
> been used as a branch offset if it were a branch instruction). The encodings
> provide for:
>
> 00 execute
> 10 do not execute
> 01 execute if condition false
> 11 execute if condition true
>
> Predicates can be combined so that fairly complicated branchless code
> paths can be encoded. Execute allows a subsequent predicate instruction
> to turn off execute or an instruction common to both paths, no-execute
> allows a subsequent predicate to enable execution.

Interesting idea, but doesn't this cause big headaches later?
For example, a page fault part way through the predicated sequence.

It seems there are 2 ways to implement PLT:
in the decoder or in the dispatcher.

For a standard pipeline, the decoder would be simplest
because the predicated sequence can be 8 instructions long and the
decode-to-writeback pipeline stages are 3 or 4 long.
But that means we have to feed the predicate data
register back into the decoder.
For data-dependent decoding, that's getting VAX-y bad.

If it's in the dispatcher, it decodes all instructions and
has to retain the mask of which to execute,
which gets merged with the predicate register for issue.
So again a data-dependent feedback loop, but not as bad as
in the decoder. If these are OoO issue then the executed
state can be mixed up, as there are data dependencies
as well as predicate dependencies.

In both cases if an exception or interrupt occurs during
the dependent sequence we have to remember where we were,
what was supposed to execute, what has been done,
and what is still to do. Restoring the pipeline or
OoO issue queue would be a challenge because we now
have to know the internal micro state of the machine.

Since the instructions are variable length, you have
to restart the decode sequence from the PLT but with a different
mask accounting for what has already been executed.
But that means a special return-from-exception code path
for just that instruction. Yeech.

PLT just seems more trouble in the long run.

Eric



EricP

Dec 24, 2017, 1:23:40 PM
EricP wrote:
>
> If its in the dispatcher, it decodes all instructions and
> it has to retain a the mask as to which to execute,
> which gets merged with the predicate register for issue.
> So again data dependent feedback loop but not as bad as
> in the decoder. If these are OoO issue then the executed
> state can be mixed up as there as data dependencies
> as well as predicate dependencies.

Actually, this wouldn't work because any register renamings are
predicated too. If a predicated instruction that writes a register
does not execute, the destination register would be invalid.
So the execute decision has to occur before the renamer.

Eric

MitchAlsup

Dec 24, 2017, 2:32:10 PM
On Sunday, December 24, 2017 at 11:40:31 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> >
> > My ISA would encode this as:
> >
> > CMP Rt,Rx,Ry
> > PLT Rt,0110 // predicate on less than
> > MOV Rr,Ra
> > MOV Rr,Rb
> >
> > The PLT (Predicate on less than) casts a predicate shadow forward over
> > (up to) the next 8 instructions (using the 16-bit offset that would have
> > been used as a branch offset if it were a branch instruction). The encodings
> > provide for:
> >
> > 00 execute
> > 10 do not execute
> > 01 execute if condition false
> > 11 execute if condition true
> >
> > Predicates can be combined so that fairly complicated branchless code
> > paths can be encoded. Execute allows a subsequent predicate instruction
> > to turn off execute or an instruction common to both paths, no-execute
> > allows a subsequent predicate to enable execution.
>
> Interesting idea, but doesn't this cause big headaches later?
> For example, a page fault part way through the predicated sequence.

Part of the machine logout on interrupts or exceptions is the cast shadow.
Thus recovering is straightforward.

The big trick here is that EVERY instruction can be in the shadow of a
predicate and yet expend no bits supporting predication.
>
> It seems there are 2 ways to implement PLT:
> in the decoder or in the dispatcher.
>
> For a standard pipelined, the decoder would be simplest
> because the predicated sequence can be 8 inst long and the
> decoder->write back pipeline stages are 3 or 4 long.
> But that means we have to feed the predicate data
> register back into the decoder.
> For data dependent decoding that's getting Vax-y bad there.
>
> If its in the dispatcher, it decodes all instructions and
> it has to retain a the mask as to which to execute,

It has to retain the mask in any event.

> which gets merged with the predicate register for issue.
> So again data dependent feedback loop but not as bad as
> in the decoder. If these are OoO issue then the executed
> state can be mixed up as there as data dependencies
> as well as predicate dependencies.
>
> In both cases if an exception or interrupt occurs during
> the dependent sequence we have to remember where we were,

The machine keeps its current shadow in a register (along with other
flags such as FP inexact and others) so dumping and reloading state
is straightforward and does not incur any additional overhead.

> what was supposed to execute, what has been done,
> and what is still to do. Restoring the pipeline or
> OoO issue queue would be a challenge because we now
> have to know the internal micro state of the machine.

Which is readily available.

> Since the instructions are variable length, then you have
> to restart the decode sequence from PLT but with a different
> mask accounting for what has already been executed.

No, the shadow is part of the save/restore machine state. In fact,
if the exception handler wants to simulate an instruction to completion
rather than have it re-execute (for whatever reason), the handler sets
the low-order bit of the shadow register before returning, causing that
instruction to be skipped. This actually simplifies things rather than
making them harder.

> But that means a special return from exception code path
> for just that instruction. Yeech.

Wrong.

MitchAlsup

Dec 24, 2017, 2:39:30 PM
Given the above code snippet, the decoder sees that there are 2 writes to
logical register Rr. One is Alive, the other is Dead, so both instructions
"get" the same physical register allocation and there is no artificial
sequentiality imposed.

There are rules under which the compiler operates in generating branchless
code sequences (i.e., casting of shadows). The rules are derived from the
"reducible" flow graph" coding style of modern languages. Should the flow
graph not be reducible, the compiler is restricted from producing branchless
sequences. Random code generators will end up taking machine check::compiler
faults if they misuse the facility.

Terje Mathisen

Dec 24, 2017, 4:07:30 PM
Anton Ertl wrote:
> For r=x<y?a:b you get
>
> sel_a = x<y
> r = b
> if sel_a then r = a (CMOVE)
>
> On an implementation without register renaming, CMOVE needs only two
> read ports (sel_a and a). On an implementation with register
> renaming, it needs three; but if you can afford that kind of
> implementation, you should be able to afford the workaround that's
> necessary for that, or maybe even a more flexible register port
> regime. Of course, if register renaming implementations are your
> primary target, it may be better to have a three-address CMOVE from
> the start:
>
> sel_a = x<y
> r = sel_a ? a : b;

I believe modern x64 cpus can use the rename network to suppress
reg-to-reg moves, in which case the first version effectively becomes
the second.

Since I have written megabytes of x86 asm, almost all of it before such
optimizations were available in the cpus, I tend to dislike CMOV
specifically due to the additional latency it causes.
>
> Or maybe such an implementation can combine the last two instructions
> of the former sequence into the last instruction of the latter
> sequence.

If the MOV is handled before the EX stage then that is what should happen.
>
> The C-inspired comparison instructions that produce 1 as truth values
> don't help, as your Nios2 example shows. If you have comparison
> instructions that produce all-bits-set as truth value, you could avoid
> the second instruction in the Nios2 sequence without needing the
> special MIPSr6 instructions. If you have an ANDNOT instruction, you
> can do

This was obvious long before the SIMD instructions appeared, since all
the comparison operations there tend to generate -1/0 instead of 1/0.
>
> seln_a = x<y
> am = a & seln_a;
> bm = b & ~seln_a
> r = am|bm;
>
> One cycle less latency than the xor-and-xor variant that Nios2 uses.
>
> In addition to conditional move, conditional store could be helpful.


The fun part about conditional store is that any architecture that
supports partial word updates must have byte enable lines, right?

On x86 those are reachable via a masked store that conditionally writes
up to 16 bytes from an SSE register. It is unfortunately quite slow afaik?
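
For the record, the 16-byte SSE2 version is reachable from C via the MASKMOVDQU intrinsic; a sketch (stores only the bytes whose mask byte has its high bit set):

#include <emmintrin.h>

/* conditionally store the bytes of data to dst, one byte-enable per lane */
void masked_store16(char *dst, __m128i data, __m128i mask)
{
    _mm_maskmoveu_si128(data, mask, dst);
}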

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Dec 24, 2017, 4:25:47 PM
Eric, I had the same immediate reaction as you, but then I realized that
one implementation would (blindly) execute every instruction in the
shadow, but that instruction result writeback would be predicated.

At this point the problem would seem to change into a dynamic rename
race, where subsequent instructions which had already started executing
would need to get new inputs, i.e. a micro-fault and restart.
Alternatively any PLT shadow could effectively turn off pipelining but
that would kill the performance, something Mitch would never be willing
to do!

To me the question boils down to: "How late is it possible to redirect
the inputs of an already-started instructions?"

It seems obvious that any instruction that modifies the PLT shadow must
be serializing, at least to the extent that no following instructions can
start in the same cycle.

On a wide issue machine it seems quite likely that all the
to-be-executed parts of a maximum size (8 instr, 50% to be suppressed)
PLT could run in the same cycle, but this would require very early
determination of the PLT shadow (during decoding?) and that seems
impossible.

It is a really neat concept as long as it can be done fast and without
spending a lot of power.

MitchAlsup

Dec 24, 2017, 4:57:36 PM
There are restrictions placed on the compiler (or random code generator)
when producing these kinds of sequences. These restrictions greatly
ameliorate the issues you are discussing. These restrictions are along the
same lines as I had to impose on the compiler when performing multi-inst
ATOMIC events.

> To me the question boils down to: "How late is it possible to redirect
> the inputs of an already-started instructions?"

Forwarding
>
> It seems obvious that any instruction that modifies the PLT shadow must
> be serializing, at least to the event that no following instructions can
> start in the same cycle.

The logic of merging multiple Pcc (predicate on some condition) instructions
into the "current running shadow" is only 3 gates deep, plus a shift by the
number of instructions between the first and the subsequent ones. With a small
shadow (8), the amount of logic is inconsequential, and far from speed
paths.

> On a wide issue machine it seems quite likely that all the
> to-be-executed parts of a maximum size (8 instr, 50% to be suppressed)
> PLT could run in the same cycle, but this would require very early
> determination of the PLT shadow (during decoding?) and that seems
> impossible.
>
> It is a really neat concept as long as it can be done fast and without
> spending a lot of power.

The trick is the restrictions placed on the compiler.

Terje Mathisen

Dec 24, 2017, 5:21:56 PM
MitchAlsup wrote:
> On Sunday, December 24, 2017 at 3:25:47 PM UTC-6, Terje Mathisen wrote:
> The logic of merging multiple Pcc (predicate on some condition)
> instructions into the "current running shadow" is only 3 gates deap
> and a shift of the number of instructions between the first and
> subsequent. With a small shadow (8), the amount of logic is
> inconsequential, and far from speed paths.
>
>> On a wide issue machine it seems quite likely that all the
>> to-be-executed parts of a maximum size (8 instr, 50% to be
>> suppressed) PLT could run in the same cycle, but this would require
>> very early determination of the PLT shadow (during decoding?) and
>> that seems impossible.
>>
>> It is a really neat concept as long as it can be done fast and
>> without spending a lot of power.
>
> The trick is the restrictions placed on the compiler.
>
OK, thanks!

So this basically means that as long as the compiler follows all the
rules, this is a _very_ effective way to avoid branches.

Disobeying the rules still has to work, but in that case you get OoO
replays or stalls, just like a branch predictor miss or the classic PRS
stall back in the PPro days, right?

Terje
PS. Merry Christmas everyone, here in Norway we've unwrapped all the
gifts and it is getting late, tomorrow we'll do some more xc skiing. :-)

Anton Ertl

Dec 24, 2017, 5:45:23 PM
already...@yahoo.com writes:
>On Sunday, December 24, 2017 at 5:24:41 PM UTC+2, Anton Ertl wrote:
>> already...@yahoo.com writes:
>> >
>> >First, let's define orthodox Stanford RISC (OSR):
>> >1. No arithmetic flags
>> >2. instructions have at most 2 register inputs and one register output.
>> >3. Instructions that have relatively long immediate field (say, >6 bits) have at most 1 register input.
>> ...
>> >Original MIPS is not an OSR
>> >Alpha is not an OSR.
>>
>> Why not?
>
>Alpha - only because of cmove (3 inputs)

But CMOVE has only two inputs in the instruction format, and can be
implemented as reading only two inputs. On register-renaming
implementations it is more effective to implement it as reading three
inputs, though (even if it means splitting the instruction into two
micro-instructions).

Anton Ertl

Dec 24, 2017, 6:00:53 PM
Terje Mathisen <terje.m...@tmsw.no> writes:
>On x86 those are reachable via a masked store that conditionally writes
>up to 16 bytes from a SSE register. It is unfortunatly quite slow afaik?

Depends on the implementation. I wrote about vmaskmovps: "On Skylake,
it's ok, Haswell not so great, and on Zen it is abysmally slow."

already...@yahoo.com

Dec 24, 2017, 6:43:07 PM
Even on in-order implementations, except for the absolutely simplest and most primitive, the destination register of cmove will play some of the roles of a register input.
You don't have to read it from the register file, but on any half-advanced uarch it would have to be protected by an interlock in the same way as the input registers.

Also, if the pipeline has stages between register read and writeback (all Alphas had them) then you will need a bypass. Well, I am not sure about that, actually, and it's too late to sit and think.

And Alpha never had a simple and primitive implementation. Even EV4 had advanced features.

Melzzzzz

Dec 24, 2017, 6:53:07 PM
On 2017-12-24, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>>On x86 those are reachable via a masked store that conditionally writes
>>up to 16 bytes from a SSE register. It is unfortunatly quite slow afaik?
>
> Depends on the implementation. I wrote about vmaskmovps: "On Skylake,
> it's ok, Haswell not so great, and on Zen it is abysmally slow."

On my i7 4790 it's 4 cycles to store 256 bits in memory.

>
> - anton


--
press any key to continue or any other to quit...

Melzzzzz

Dec 24, 2017, 11:14:14 PM
On 2017-12-24, Melzzzzz <Melz...@zzzzz.com> wrote:
> On 2017-12-24, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>On x86 those are reachable via a masked store that conditionally writes
>>>up to 16 bytes from a SSE register. It is unfortunatly quite slow afaik?
>>
>> Depends on the implementation. I wrote about vmaskmovps: "On Skylake,
>> it's ok, Haswell not so great, and on Zen it is abysmally slow."
>
> On my i7 4790 it's 4 cycles to store 256 bits in memory.

Actually it's 2 cycles, I made an error in array index ;)

MitchAlsup

Dec 25, 2017, 12:56:47 AM
On Sunday, December 24, 2017 at 4:21:56 PM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Sunday, December 24, 2017 at 3:25:47 PM UTC-6, Terje Mathisen wrote:
> > The logic of merging multiple Pcc (predicate on some condition)
> > instructions into the "current running shadow" is only 3 gates deap
> > and a shift of the number of instructions between the first and
> > subsequent. With a small shadow (8), the amount of logic is
> > inconsequential, and far from speed paths.
> >
> >> On a wide issue machine it seems quite likely that all the
> >> to-be-executed parts of a maximum size (8 instr, 50% to be
> >> suppressed) PLT could run in the same cycle, but this would require
> >> very early determination of the PLT shadow (during decoding?) and
> >> that seems impossible.
> >>
> >> It is a really neat concept as long as it can be done fast and
> >> without spending a lot of power.
> >
> > The trick is the restrictions placed on the compiler.
> >
> OK, thanks!
>
> So this basically means that as long as the compiler follows all the
> rules, this is a _very_ effective way to avoid branches.
>
> Disobeying the rules still has to work, but in that case you get OoO
> replays or stalls, just like a branch predictor miss or the classic PRS
> stall back in the PPro days, right?

Disobeying the compiler DF rules invokes Illegal OpCode <sequence>.
Disobeying the rules should never work! That is how exploits get into
the system.

MitchAlsup

Dec 25, 2017, 1:01:15 AM
On Sunday, December 24, 2017 at 4:45:23 PM UTC-6, Anton Ertl wrote:
> already...@yahoo.com writes:
> >On Sunday, December 24, 2017 at 5:24:41 PM UTC+2, Anton Ertl wrote:
> >> already...@yahoo.com writes:
> >> >
> >> >First, let's define orthodox Stanford RISC (OSR):
> >> >1. No arithmetic flags
> >> >2. instructions have at most 2 register inputs and one register output.
> >> >3. Instructions that have relatively long immediate field (say, >6 bits) have at most 1 register input.
> >> ...
> >> >Original MIPS is not an OSR
> >> >Alpha is not an OSR.
> >>
> >> Why not?
> >
> >Alpha - only because of cmove (3 inputs)
>
> But CMOVE has only two inputs in the instruction format, and can be
> implemented as reading only two inputs. On register-renaming
> implementations it is more effective to implement it as reading three
> inputs, though (even if it means splitting the instruction into two
> micro-instructions).

This is one of the things that people who have designed x86s see and people
who have not do not.

In x86-land your typical micro-op has 5 RS comparators: X, Y, C, O,
ZAPS! Yes, there are more comparators in the Reservation Stations dealing
with condition codes (CZAPS) than there are dealing with the operands X, Y!

The growth of the RSs is not inconsequential, but it enables a lot better
data flow than lumping all of the CCs into a single dynamic tag.

Anton Ertl

Dec 25, 2017, 5:01:39 AM
already...@yahoo.com writes:
>On Monday, December 25, 2017 at 12:45:23 AM UTC+2, Anton Ertl wrote:
>> But CMOVE has only two inputs in the instruction format, and can be
>> implemented as reading only two inputs. On register-renaming
>> implementations it is more effective to implement it as reading three
>> inputs, though (even if it means splitting the instruction into two
>> micro-instructions).

>Even on in-order implementations, except for absolutely simplest and primitive, destination register of cmove will play some of the roles of register input.
>You don't have to read it from register file, but on any half advanced uarch it would have to be protected by interlock in the same way as input registers.

You mean for the case:

x = load(a)
if f then x = y (CMOVE)

A simple implementation where the CMOVE just conditionally suppresses
the writeback does not have to wait for the result of the previous
instruction to become available as input. It does have to wait for
the writeback of the previous instruction to complete; in classical
five-stage implementations this does not need extra interlocking,
because writeback is at the same stage of the pipeline for all
instructions, and the pipeline is stopped for longer-latency
instructions (such as on cache misses).
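
In C, a sketch of the semantics of that simple two-read-port implementation (illustration only, not any particular pipeline):

#include <stdint.h>

/* CMOVE with two register reads (flag, src): the old dst value is
   never read; the writeback is simply gated by the flag */
void cmove(uint64_t regs[], int flag_r, int src_r, int dst_r)
{
    if (regs[flag_r] != 0)
        regs[dst_r] = regs[src_r];
}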

A more complex implementation could suppress the writeback of earlier
instructions if a later instruction writes to the same register and
there is no read in between. With CMOVE, the writeback of the earlier
instruction would only be suppressed if CMOVE does not suppress its
own writeback.

>Also, if the pipeline has stages between register read and writeback (all Alphas had them) then you will need a bypass.

Bypasses are pretty common (even in the five-stage pipeline), but no
problem: Even without CMOVE, there is a multiplexer (or, for deeper
and wider machines, several) for choosing between the register content
and the bypass value. If the flag input of CMOVE is true, the bypass
value is chosen, if it is false, the register value is chosen.

>And Alpha never had simple and primitive implementation. Even EV4 had advanced features.

And yet they chose to include CMOVE. What does that tell you?

They only cracked it into two instructions in the register-renaming
EV6. What does that tell you?

Terje Mathisen

Dec 25, 2017, 5:03:55 AM
Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> On x86 those are reachable via a masked store that conditionally writes
>> up to 16 bytes from a SSE register. It is unfortunatly quite slow afaik?
>
> Depends on the implementation. I wrote about vmaskmovps: "On Skylake,
> it's ok, Haswell not so great, and on Zen it is abysmally slow."
>

Thanks Anton!

Terje Mathisen

Dec 25, 2017, 5:05:35 AM
Melzzzzz wrote:
> On 2017-12-24, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>> On x86 those are reachable via a masked store that conditionally
>>> writes up to 16 bytes from a SSE register. It is unfortunatly
>>> quite slow afaik?
>>
>> Depends on the implementation. I wrote about vmaskmovps: "On
>> Skylake, it's ok, Haswell not so great, and on Zen it is abysmally
>> slow."
>
> On my i7 4790 it's 4 cycles to store 256 bits in memory.

Presumably independent of the byte mask enable pattern?

I.e. any bit pattern takes the same time?

Terje Mathisen

Dec 25, 2017, 5:13:48 AM
MitchAlsup wrote:
> On Sunday, December 24, 2017 at 4:21:56 PM UTC-6, Terje Mathisen wrote:
>> So this basically means that as long as the compiler follows all the
>> rules, this is a _very_ effective way to avoid branches.
>>
>> Disobeying the rules still has to work, but in that case you get OoO
>> replays or stalls, just like a branch predictor miss or the classic PRS
>> stall back in the PPro days, right?
>
> Disobeying the compiler DF rules invokes Illegal OpCode <sequence>.
> Disobeying the rules should never work! That is how exploits get into
> the system.
>
I think you misunderstood me, Mitch:

What I assumed was that a PLT shadow could contain any otherwise valid
instructions, but that only some specific patterns would run at full speed.

I.e. for debugging it is very useful to be able to single-step through
machine code, and to be able to alter anything between steps.

On a classic non-pipelined/non-superscalar CPU, PLT would be almost
trivial to implement; now it seems to me that you don't have any form of
ultimate fallback to "do one instruction from start to finish before
starting the next" as a way to allow arbitrary shadow code?

A performance bug is simply a bug and should trap instead of running
10-100X slower, sort of like the problem with how many FPUs have been
handling denormals.

Terje

Anton Ertl

unread,
Dec 25, 2017, 5:16:14 AM12/25/17
to
Melzzzzz <Melz...@zzzzz.com> writes:
>On 2017-12-24, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>On x86 those are reachable via a masked store that conditionally writes
>>>up to 16 bytes from an SSE register. It is unfortunately quite slow afaik?
>>
>> Depends on the implementation. I wrote about vmaskmovps: "On Skylake,
>> it's ok, Haswell not so great, and on Zen it is abysmally slow."
>
>On my i7 4790 it's 4 cycles to store 256 bits in memory.
[corrected to 2 cycles in the next posting]

My judgement is based on the results for my memcpy implementations
<http://www.complang.tuwien.ac.at/anton/move/move.zip>
<https://github.com/AntonErtl/move>.

Here I show those block sizes where vmaskmovps is chosen in the AVX
code:

Haswell (Core i7-4790K)
8 32 64 block size
11 14 sse
14 14 avx
11 12 sse aligned
15 13 avx aligned
11 13 17 sse blksz-1
14 14 13 avx blksz-1

Skylake (Core i5-6600K)
8 32 64 block size
10 12 sse
10 10 avx
12 14 sse aligned
12 12 avx aligned
10 12 14 sse blksz-1
10 10 10 avx blksz-1

The SSE variant compensates for the absence of maskmovps with
additional branches (predictable in this benchmark setup), yet is
faster for small block sizes on the Haswell. Only on the Skylake does
vmaskmovps give an advantage over branching.
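
For reference, here is a minimal sketch (my illustration, not code from
the benchmark) of the kind of branchless tail store being measured. The
store itself is the AVX vmaskmovps; building the mask with a per-lane
compare needs AVX2:

#include <immintrin.h>

/* Store the first n floats (0 <= n <= 8) of v to dst with a single
 * vmaskmovps instead of branching on n. Lane i is written iff the
 * top bit of mask lane i is set, i.e. iff i < n. */
static void store_tail(float *dst, __m256 v, int n)
{
    const __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), idx);
    _mm256_maskstore_ps(dst, mask, v);  /* vmaskmovps */
}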

Anton Ertl

unread,
Dec 25, 2017, 6:25:39 AM12/25/17
to
Could you elaborate on what this has to do with the stuff you cited?

Just in case there is a misunderstanding: We were discussing Alpha
CMOVE, not IA-32/AMD64 CMOVcc.

What is the A in ZAPS? The flags that affect CMOVcc and Jcc are CF,
OF, PF, SF, ZF.

Given this complexity, and your ample experience, what is your current
position on implementing flags?

1) Do them in general-purpose registers

a) like MIPS/Alpha, with different comparison instructions and true/false
in the GPR

b) like the 88000, with one integer comparison instruction
producing a bunch of flags in the GPR, and many branch
instructions (or a branch instruction with an immediate
parameter that indicates which flag is tested; see the sketch
after this list).

2) Do them in multiple special-purpose registers

a) like IA-64 with different comparison instructions and true/false
in the SPR.

b) with one integer comparison instruction and multiple flags in
the SPR (somewhat like PowerPC, but the selection of flags there
requires different comparison instructions IIRC), and consequently
multiple branch instructions.

3) Have just one special-purpose register per flag

a) instructions set all flags or none (SPARC, ARM?), allowing one
SPR and one RS-comparator for all these flags.

b) different instructions set different subsets of these flags, so
it's better to treat them as different registers as far as
RS-comparators are concerned.
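
To make 1a and 1b concrete, a small C model; the bundle's bit
assignments are made up for illustration, not the 88100's actual
encoding:

#include <stdint.h>

/* 1a (MIPS/Alpha style): one comparison per predicate, 0/1 in a GPR;
 * the branch just tests the GPR for nonzero. */
static uint64_t slt(int64_t x, int64_t y) { return x < y; }

/* 1b (88000 style): one compare produces a bundle of flag bits in a
 * GPR; the branch selects which bit to test. */
enum { F_EQ = 1u << 0, F_NE = 1u << 1, F_LT = 1u << 2,
       F_GE = 1u << 3, F_GT = 1u << 4, F_LE = 1u << 5 };

static uint32_t cmp_bundle(int64_t x, int64_t y)
{
    uint32_t f = 0;
    f |= (x == y) ? F_EQ : F_NE;
    f |= (x <  y) ? F_LT : F_GE;
    f |= (x >  y) ? F_GT : F_LE;
    return f;
}

/* bb1-style branch condition: test one bit of the bundle */
#define BB1(bit, bundle) (((bundle) & (bit)) != 0)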

Anton Ertl

unread,
Dec 25, 2017, 6:33:24 AM12/25/17
to
Terje Mathisen <terje.m...@tmsw.no> writes:
>A performance bug is simply a bug and should trap instead of running
>10-100X slower, sort of like the problem with how many FPUs have been
>handling denormals.

Depends on context. As an end user who can do nothing about the bug,
you probably prefer the slowness over the trap (if it's too slow, you
still have the option to terminate the program). As a developer, you
may prefer the trap, but only after you have hunted down the
functionality bugs.

Melzzzzz

unread,
Dec 25, 2017, 9:38:20 AM12/25/17
to
On 2017-12-25, Terje Mathisen <terje.m...@tmsw.no> wrote:
> Melzzzzz wrote:
>> On 2017-12-24, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>>> On x86 those are reachable via a masked store that conditionally
>>>> writes up to 16 bytes from an SSE register. It is unfortunately
>>>> quite slow afaik?
>>>
>>> Depends on the implementation. I wrote about vmaskmovps: "On
>>> Skylake, it's ok, Haswell not so great, and on Zen it is abysmally
>>> slow."
>>
>> On my i7 4790 it's 4 cycles to store 256 bits in memory.
>
> Presumably independent of the byte mask enable pattern?

Yeah, I just measured the instruction without changing the mask
or the data source.

>
> I.e. any bit pattern takes the same time?

I haven't tested changing the source mask.

>
> Terje

Terje Mathisen

unread,
Dec 25, 2017, 10:55:12 AM12/25/17
to
Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> A performance bug is simply a bug and should trap instead of running
>> 10-100X slower, sort of like the problem with how many FPUs have been
>> handling denormals.
>
> Depends on context. As an end user who can do nothing about the bug,
> you probably prefer the slowness over the trap (if it's too slow, you
> still have the option to terminate the program). As a developer, you
> may prefer the trap, but only after you have hunted down the
> functionality bugs.

I agree here: correct code should always execute correctly, but if the
slowdown is too severe then it needs to be easy to discover why.

I.e. I think the various denormal mishandling/abuses/performance bugs to
be pretty bad.

Particularly for SIMD code truncating to zero is almost always better
than a sw trap & fixup.

Anton Ertl

unread,
Dec 25, 2017, 11:21:15 AM12/25/17
to
Terje Mathisen <terje.m...@tmsw.no> writes:
>I.e. I think the various denormal mishandling/abuses/performance bugs to
>be pretty bad.
>
>Particularly for SIMD code truncating to zero is almost always better
>than a sw trap & fixup.

Get a real processor:-). They don't do sw trap and fixup there (maybe
microcode fixups, though). And if the code is written for denormals,
flush-to-zero won't cut it.

On Zen denormal handling is pretty fast, and flush-to-zero is hardly
faster. From <2017Aug1...@mips.complang.tuwien.ac.at>:

             2.7GHz      4GHz        4GHz
             Sandy       Skylake     Zen
             i7-2620     i5-6600K    R5 1600X
micro
Large:       0.051161    0.028260    0.018399
Small:       0.016829    0.012012    0.011826
micro-ignore (with flush-to-zero)
Large:       0.029944    0.012030    0.017606
Small:       0.017007    0.012000    0.011438

Small and Large do the same number of computations, Small without
denormals/flush-to-zero, Large with.

already...@yahoo.com

unread,
Dec 25, 2017, 11:54:53 AM12/25/17
to
On Monday, December 25, 2017 at 12:01:39 PM UTC+2, Anton Ertl wrote:
> already...@yahoo.com writes:
> >On Monday, December 25, 2017 at 12:45:23 AM UTC+2, Anton Ertl wrote:
> >> But CMOVE has only two inputs in the instruction format, and can be
> >> implemented as reading only two inputs. On register-renaming
> >> implementations it is more effective to implement it as reading three
> >> inputs, though (even if it means splitting the instruction into two
> >> micro-instructions).
>
> >Even on in-order implementations, except for absolutely simplest and primitive, destination register of cmove will play some of the roles of register input.
> >You don't have to read it from register file, but on any half advanced uarch it would have to be protected by interlock in the same way as input registers.
>
> You mean for the case:
>
> x = load(a)
> if f then x = y (CMOVE)
>

As an example. But there could be other cases, no less common:
A: x is a result of multicycle computational instructions (on EV4 that includes not only [non-pipelined] mul/div, but also [pipelined] shifts, cmovs and even compares)
B: x is a result of singlecycle computational instructions on superscalar machine

> A simple implementation where the CMOVE just conditionally suppresses
> the writeback does not have to wait for the result of the previous
> instruction to become available as input. It does have to wait for
> the writeback of the previous instruction to complete; in classical
> five-stage implementations this does not need extra interlocking,
> because writeback is at the same stage of the pipeline for all
> instructions, and the pipeline is stopped for longer-latency
> instructions (such as on cache misses).

Repeating myself, Alphas like those never existed.
BTW, I think that your reference to the R2000 pipeline as "classical" is a mother duck syndrome. I started to take an interest in CPU performance issues on TI C20 DSPs, so in my mind a 4-stage pipeline is more "classical". Others could have yet different perspectives. For those who grew up on ARM, 3-stage is classical.


>
> A more complex implementation could suppress the writeback of earlier
> instructions if a later instruction writes to the same register and
> there is no read in between. With CMOVE, the writeback of the earlier
> instruction would only be suppressed if CMOVE does not suppress its
> own writeback.
>
> >Also, if the pipeline has stages between register read and writeback (all Alphas had them) then you will need a bypass.
>
> Bypasses are pretty common (even in the five-stage pipeline), but no
> problem:
> Even without CMOVE, there is a multiplexer (or, for deeper
> and wider machines, several) for choosing between the register content
> and the bypass value. If the flag input of CMOVE is true, the bypass
> value is chosen, if it is false, the register value is chosen.

And what if the register is not written yet?

>
> >And Alpha never had a simple and primitive implementation. Even EV4 had advanced features.
>
> And yet they chose to include CMOVE. What does that tell you?
>


It tells me (wild guess) that after they implemented it they became sorry that the Alpha ISA (which was mostly frozen a few years earlier) didn't have a more powerful 3-input-1-output CMOVE instruction.

> They only cracked it into two instructions in the register-renaming
> EV6. What does that tell you?
>

The same, but with higher certainty.

already...@yahoo.com

unread,
Dec 25, 2017, 12:29:21 PM12/25/17
to
Please read 'C30' DSPs.
C20 is fixed-point with a short pipeline. Many years later I wrote code for two of its derivatives (C55 and C28), but never with the original.

Anton Ertl

unread,
Dec 25, 2017, 12:53:50 PM12/25/17
to
already...@yahoo.com writes:
>On Monday, December 25, 2017 at 12:01:39 PM UTC+2, Anton Ertl wrote:
>> A simple implementation where the CMOVE just conditionally suppresses
>> the writeback does not have to wait for the result of the previous
>> instruction to become available as input. It does have to wait for
>> the writeback of the previous instruction to complete; in classical
>> five-stage implementations this does not need extra interlocking,
>> because writeback is at the same stage of the pipeline for all
>> instructions, and the pipeline is stopped for longer-latency
>> instructions (such as on cache misses).
>
>Repeating myself, Alphas like those never existed.
>BTW, I think that your reference to the R2000 pipeline as "classical" is a mother duck syndrome.

My RISC mother duck is HPPA (HP9000/840 and /825 IIRC), then the next
one was an 88100, and I did scheduler work for that where the
writeback placement actually played a role. I don't think that the
88100 writeback arbitration mechanism is in any way classical, though,
and after reading about the 5-stage pipeline, I wonder why Mitch Alsup
chose this mechanism instead of inserting an extra stage to the
integer unit, so that at least integer instructions and loads could
avoid writeback collisions.

>> A more complex implementation could suppress the writeback of earlier
>> instructions if a later instruction writes to the same register and
>> there is no read in between. With CMOVE, the writeback of the earlier
>> instruction would only be suppressed if CMOVE does not suppress its
>> own writeback.
>>
>> >Also, if the pipeline has stages between register read and writeback (all Alphas had them) then you will need a bypass.
>>
>> Bypasses are pretty common (even in the five-stage pipeline), but no
>> problem:
>> Even without CMOVE, there is a multiplexer (or, for deeper
>> and wider machines, several) for choosing between the register content
>> and the bypass value. If the flag input of CMOVE is true, the bypass
>> value is chosen, if it is false, the register value is chosen.
>
>And what if register is not written yet?

Then the reader will grab the result from another bypass, and CMOVE
influences the mux control for the bypasses in a more complex way.

What if the result of the earlier instruction is not yet done? The
reader can wait until the both the earlier instruction and the CMOVE
are finished. I.e., the CMOVE does not read-interlock, but the next
reader of the target register of the CMOVE does.

Or, alternatively, and probably simpler, you have a write-interlock on
the CMOVE (you need such a mechanism anyway on an in-order
architecture with out-of-order results), and the following reader then
just waits for the CMOVE.

Alternatively, let the CMOVE flag also control what happens with the
output of the earlier instruction. If the flag is true, just suppress
the writeback of the earlier instruction, and the following reader can
continue one stage after the CMOVE. If the flag is false, the CMOVE
turns into a noop.

In a register-renaming machine, one could also do this, but that would
mean that the renamer has to wait for the flag, and the CMOVE would
get a relatively long latency wrt the flag, but eliminate the
dependence on the unused value. It also reduces the number of
register reads, but is probably a loss in most situations.

>> >And Alpha never had a simple and primitive implementation. Even EV4 had advanced features.
>>
>> And yet they chose to include CMOVE. What does that tell you?
>>
>
>
>It tells me (wild guess) that after they implemented it they became sorry that the Alpha ISA (which was mostly frozen a few years earlier) didn't have a more powerful 3-input-1-output CMOVE instruction.

They had played around with RISC projects for a decade before
releasing Alpha, so they probably had a good idea of how the pieces
come together, therefore I have my doubts. Adding a fifth and sixth
read port to the EV4's register file would have been expensive in
area, and possibly in power and maybe clock rate, and CMOVE is not
frequent enough to justify this cost.

already...@yahoo.com

unread,
Dec 25, 2017, 1:18:12 PM12/25/17
to
They would need no additional read ports. It's not like EV4 was able to pair anything with anything. It had very restrictive pairing rules in place. All it would cost is one more rule: "cmove pairs only with integer loads, branches and floating-point instructions". As a payback, in addition to non-destruction, they would likely be able to reduce the latency of CMOVE to 1 clock.

Anton Ertl

unread,
Dec 25, 2017, 1:45:47 PM12/25/17
to
already...@yahoo.com writes:
>On Monday, December 25, 2017 at 7:53:50 PM UTC+2, Anton Ertl wrote:
[Three-input CMOVE and EV4]
>> Adding a fifth and sixth
>> read port to the EV4's register file would have been expensive in
>> area, and possibly in power and maybe clock rate, and CMOVE is not so
>> frequent to justify this cost.
>
>
>They would need no additional read ports. It's not like EV4 was able to pair anything with anything. It had very restrictive pairing rules in place. All it would cost is one more rule: "cmove pairs only with integer loads, branches and floating-point instructions".

I.e., three-input CMOVE pairs like an integer instruction, but needs
one additional register read. So they would need an additional
read port (not 2, but also not 0).

already...@yahoo.com

unread,
Dec 25, 2017, 2:25:18 PM12/25/17
to
The 21064 pairing table is not very easy to parse, but my understanding, both from the table and from Figure 2-2 in the HRM, was that integer ALU instructions can be paired both with Integer Loads and with Integer Stores.
Our hypothetical CMOVE4 would be paired only with integer loads, but not stores.
It would use the same register read port as the store data.

MitchAlsup

unread,
Dec 25, 2017, 4:03:33 PM12/25/17
to
On Monday, December 25, 2017 at 4:13:48 AM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Sunday, December 24, 2017 at 4:21:56 PM UTC-6, Terje Mathisen wrote:
> >> So this basically means that as long as the compiler follows all the
> >> rules, this is a _very_ effective way to avoid branches.
> >>
> >> Disobeying the rules still has to work, but in that case you get OoO
> >> replays or stalls, just like a branch predictor miss or the classic PRS
> >> stall back in the PPro days, right?
> >
> > Disobeying the compiler DF rules invokes Illegal OpCode <sequence>.
> > Disobeying the rules should never work! That is how exploits get into
> > the system.
> >
> I think you misunderstood me Mitch:
>
> What I assumed was that a PLT shadow could contain any otherwise valid
> instructions but that only some specific patterns would run full speed.

Any instruction can have a shadow cast upon it.
Those instructions can have any OpCode.
Not all combinations of destination registers are legal. In particular
those combinations that make it hard to rename DST are restricted, checked,
and cause faults.
All combinations of source registers are legal.
All instructions operate under standard data flow (under shadow or not).

> I.e. for debugging it is very useful to be able to single-step through
> machine code, and be able to alter anything between steps.

To single step the machine, the debugger examines the current shadow
and if the next instruction is supposed to execute, the subsequent inst
has a Trap cast upon it. If the instruction is not supposed to execute
it is simply skipped and the debugger returns control to human. Control
is transferred to the thread, one instruction executes, the subsequent
transfers control back.

On the other hand it remains illegal to single step through
multi-instruction ATOMIC events. The taking of a Trap necessarily
exits the ATOMIC nature of the sequence. If one attempts to single
step through an ATOMIC event the only reasonable course of events
is for control to transfer to the fail control point.

> On a classic non-pipelined/non-superscalar cpu PLT would be almost
> trivial to implement, now it seems to me that you don't have any form of
> ultimate fallback to "do one instruction from start to finish before
> starting the next" as a way to allow arbitrary shadow code?

I am sure I don't want to allow arbitrary shadow code; especially as far as
destination registers are concerned. The code under any shadow must be as
if it came from a reducible flow graph. I don't want arbitrary code jumping
into a sequence of code that is covered by a shadow, either.

If/when the code being compiled is irreducible, predication is not allowed
on irreducible sections.

There are also certain restrictions on code sequences for security that
compilers won't produce and I see no particular reason HW should put up
with such sequences, either.

MitchAlsup

unread,
Dec 25, 2017, 4:18:51 PM12/25/17
to
On Monday, December 25, 2017 at 5:25:39 AM UTC-6, Anton Ertl wrote:
> MitchAlsup <?????@aol.com> writes:
> >On Sunday, December 24, 2017 at 4:45:23 PM UTC-6, Anton Ertl wrote:
> >> But CMOVE has only two inputs in the instruction format, and can be
> >> implemented as reading only two inputs. On register-renaming
> >> implementations it is more effective to implement it as reading three
> >> inputs, though (even if it means splitting the instruction into two
> >> micro-instructions).
> >
> >This is one of the things people who have designed x86s see that people
> >who have not do not.
> >
> >in x86-land your typical micro-op has 5-RS comparators, X, Y, C, O,
> >ZAPS! Yes, there are more comparators in the Reservation Stations dealing
> >with condition codes (CZAPS) than there are dealing with operands X, Y!
> >
> >The growth of the RSs is not inconsequential, but enables a lot better
> >data flow than lumping all of the CCs into a single dynamic tag.
>
> Could you elaborate on what this has to do with the stuff you cited?
>
> Just in case there is a misunderstanding: We were discussing Alpha
> CMOVE, not IA-32/AMD64 CMOVcc.

You were discussing reservation stations (maybe directed towards Alpha)
but making statements about RSs that are not universal. I was adding
data to the kind of reservations stations that most computers use.
>
> What is the A in ZAPS? The flags that affect CMOVcc and Jcc are CF,
> OF, PF, SF, ZF.

Auxiliary Flag (AF, bit 4 in the condition codes field), mainly used to support BCD.

> Given this complexity, and your ample experience, what is your current
> position on implementing flags?
>
> 1) Do them in general-purpose registers
>
> a) like MIPS/Alpha, with different comparison instructions and true/false
> in the GPR
>
> b) like the 88000, with one integer comparison instruction
> producing a bunch of flags in the GPR, and many branch
> instructions (or a branch instruction with an immediate
> parameter that indicates which flag is tested).

I prefer this format. It allows for a comparison to perform boundary
checking ( 0 < Rx && Rx < Ry ) for {(<,<),(<=,<),(<,<=),(<=,<=)},
and it allows access to other machine data (memory interference)
that should be queryable but otherwise non-accessible. It also allows
the compare instruction to classify INT and FP data which can then be
queried.

There is an OpenGL FP function where one supplies a 10-bit mask and an
FP operand and gets back a TRUE if the FP operand is classified as
one of the bits set in the given mask. This is useful in isolating
things like -0.0 from +0.0 for ATAN2 operand processing.
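
A sketch of such a classify-under-mask primitive in C; the 10-class
breakdown (signaling/quiet NaN plus sign-separated infinity, normal,
denormal, zero) and the bit order are my assumptions for illustration:

#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum {  /* one class bit per IEEE class; ordering is illustrative */
    C_SNAN  = 1u << 0, C_QNAN  = 1u << 1,
    C_NINF  = 1u << 2, C_NNORM = 1u << 3, C_NDEN = 1u << 4,
    C_NZERO = 1u << 5, C_PZERO = 1u << 6, C_PDEN = 1u << 7,
    C_PNORM = 1u << 8, C_PINF  = 1u << 9
};

/* TRUE iff x falls into one of the classes selected by the mask. */
static bool fp_class_test(float x, uint32_t mask)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bool neg = (bits >> 31) != 0;
    uint32_t cls;
    switch (fpclassify(x)) {
    case FP_NAN:       cls = (bits & 0x00400000) ? C_QNAN : C_SNAN; break;
    case FP_INFINITE:  cls = neg ? C_NINF  : C_PINF;  break;
    case FP_NORMAL:    cls = neg ? C_NNORM : C_PNORM; break;
    case FP_SUBNORMAL: cls = neg ? C_NDEN  : C_PDEN;  break;
    default:           cls = neg ? C_NZERO : C_PZERO; break;
    }
    return (cls & mask) != 0;
}

/* e.g. the -0.0 vs +0.0 distinction for ATAN2 operand processing:
 *   fp_class_test(y, C_NZERO) */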

MitchAlsup

unread,
Dec 25, 2017, 4:21:46 PM12/25/17
to
On Monday, December 25, 2017 at 9:55:12 AM UTC-6, Terje Mathisen wrote:
> Anton Ertl wrote:
> > Terje Mathisen <terje.m...@tmsw.no> writes:
> >> A performance bug is simply a bug and should trap instead of running
> >> 10-100X slower, sort of like the problem with how many FPUs have been
> >> handling denormals.
> >
> > Depends on context. As an end user who can do nothing about the bug,
> > you probably prefer the slowness over the trap (if it's too slow, you
> > still have the option to terminate the program). As a developer, you
> > may prefer the trap, but only after you have hunted down the
> > functionality bugs.
>
> I agree here: correct code should always execute correctly, but if the
> slowdown is too severe then it needs to be easy to discover why.

Perhaps it is simply Apple trying to encourage its customers to buy the
next model of the currently owned device.

> I.e. I think the various denormal mishandling/abuses/performance bugs to
> be pretty bad.
>
> Particularly for SIMD code truncating to zero is almost always better
> than a sw trap & fixup.

Once the FP is based on FMAD functionality (IEEE 754-2008 requirement) there
is no longer ANY excuse for the implementation NOT to support Denorms at
full speed and with full IEEE 754 fidelity.

MitchAlsup

unread,
Dec 25, 2017, 4:32:34 PM12/25/17
to
On Monday, December 25, 2017 at 11:53:50 AM UTC-6, Anton Ertl wrote:
> already...@yahoo.com writes:
> >On Monday, December 25, 2017 at 12:01:39 PM UTC+2, Anton Ertl wrote:
> >> A simple implementation where the CMOVE just conditionally suppresses
> >> the writeback does not have to wait for the result of the previous
> >> instruction to become available as input. It does have to wait for
> >> the writeback of the previous instruction to complete; in classical
> >> five-stage implementations this does not need extra interlocking,
> >> because writeback is at the same stage of the pipeline for all
> >> instructions, and the pipeline is stopped for longer-latency
> >> instructions (such as on cache misses).
> >
> >Repeating myself, Alphas like those never existed.
> >BTW, I think that your reference to the R2000 pipeline as "classical" is a mother duck syndrome.
>
> My RISC mother duck is HPPA (HP9000/840 and /825 IIRC), then the next
> one was an 88100, and I did scheduler work for that where the
> writeback placement actually played a role. I don't think that the
> 88100 writeback arbitration mechanism is in any way classical, though,
> and after reading about the 5-stage pipeline, I wonder why Mitch Alsup
> chose this mechanism instead of inserting an extra stage to the
> integer unit, so that at least integer instructions and loads could
> avoid writeback collisions.

Die size, team size, and schedule pressure.

It could have been fixed with an additional port to the register file, but
the die would not fit in the reticle--project cancelled. I looked at this.

It could have been fixed with an additional stage of pipelining, die
size pressure, and several months of delay--project cancelled. I looked
at this.

It could have been fixed with an additional stage in the write back pipeline
(similar to the above) but merging the integer and LD write backs. It looked
to have no way to tell the compiler which way the pipeline chose to work,
had performance problems--and would have cost a bunch of time we did not
have--result: project cancelled.

Remember the whole 88100 was implemented with 10 people--including the
compiler team, implementation team, verification team, architecture team.
The 88200 had another 6 whole people.

Terje Mathisen

unread,
Dec 25, 2017, 5:56:53 PM12/25/17
to
MitchAlsup wrote:
> On Monday, December 25, 2017 at 4:13:48 AM UTC-6, Terje Mathisen wrote:
> On the other hand it remains illegal to single step through
> multi-instruction ATOMIC events. The taking of a Trap necessarily
> exits the ATOMIC nature of the sequence. If one attempts to single
> step through an ATOMIC event the only reasonable course of events is
> for control to transfer to the fail control point.

As you know, x86 has a few interesting exceptions, like a 32-bit
segment:offset load of the (16-bit) stack: if you try to single-step
through that, the CPU will execute both instructions before returning to
the debugger.
>
>> On a classic non-pipelined/non-superscalar cpu PLT would be almost
>> trivial to implement, now it seems to me that you don't have any
>> form of ultimate fallback to "do one instruction from start to
>> finish before starting the next" as a way to allow arbitrary shadow
>> code?
>
> I am sure I don't want to allow arbitrary shadow code; especially as
> far as destination registers are concerned. The code under any shadow
> must be as if it came from a reducible flow graph. I don't want
> arbitrary code jumping into a sequence of code that is covered by a
> shadow, either.

That is understandable, but reminds me of both The Story of Mel and my
own jumping into the immediate part of another instruction: It works but
is significantly slower than normal code.
>
> If/when the code being compiled is irreducible, predication is not
> allowed on irreducible sections.
>
> There are also certain restrictions on code sequences for security
> that compilers won't produce and I see no particular reason HW should
> put up with such sequences, either.
>
:-)

Terje Mathisen

unread,
Dec 25, 2017, 6:00:38 PM12/25/17
to
MitchAlsup wrote:
> On Monday, December 25, 2017 at 9:55:12 AM UTC-6, Terje Mathisen wrote:
>> I agree here: correct code should always execute correctly, but if
>> the slowdown is too severe then it needs to be easy to discover
>> why.
>
> Perhaps it is simply Apple trying to encourage its customer to buy
> the next model of the currently owned device.
:-)
>
>> I.e. I think the various denormal mishandling/abuses/performance
>> bugs to be pretty bad.
>>
>> Particularly for SIMD code truncating to zero is almost always
>> better than a sw trap & fixup.
>
> Once the FP is based on FMAD functionality (IEEE 754-2008
> requirement) there is no longer ANY excuse for the implementation NOT
> to support Denorms at full speed and with full IEEE 754 fidelity.
>
We're in violent agreement, I also believe strongly that there is no
longer any valid reason to not support denorm at full speed.

MitchAlsup

unread,
Dec 25, 2017, 10:50:25 PM12/25/17
to
We used to put the only instruction of an ELSE clause as the immediate of
a MOV instruction on the 680x0, so as to avoid the delay-inducing branch around
the (single-instruction) ELSE clause.

But that was yesteryear, and there is no longer any reason that such codes
should be allowed to execute.

Quadibloc

unread,
Dec 25, 2017, 11:56:29 PM12/25/17
to
On Monday, December 25, 2017 at 8:50:25 PM UTC-7, MitchAlsup wrote:

> We used to put the only instruction of an ELSE clause as the immediate of
> a MOV instruction in 680x0 so as to avoid the delay inducing branch around
> the (single instruction) ELSE clause.

> BUt that was yesteryear, and there is no longer any reason that such codes
> should be allowed to execute.

I suppose that there is a legitimate way to get the same result; a special
no-op instruction that is billed as an unconditional forwards branch...
which performs the branch by being a long instruction, but to the system
that detects shenanigans, it's noted that branching to the second byte of
*this particular special instruction* is legitimate.

There may be no reason for that on modern out-of-order CPUs, but what
about tiny ones for really small microcontrollers?

John Savard

Bruce Hoult

unread,
Dec 26, 2017, 3:50:46 AM12/26/17
to
Hmm.

That may be possible in Thumb2, but it would be tricky, as determining whether an instruction is 32 bit requires particular values in the hi four bits of both 16 bit chunks.

On RISC-V it looks easy enough. An "LUI r0,#imm20" (r0 := #imm20 << 12) is legal, is a no-op (r0 is wired to zero), and the entire 2nd 16 bit chunk can have arbitrary contents.
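
A quick sketch of that encoding (the field layout is per the RISC-V
spec: bits [31:12] of LUI are immediate, rd sits in bits [11:7]; the
helper name is mine):

#include <stdint.h>

/* Build a 32-bit "LUI x0, imm" whose upper 16-bit parcel equals an
 * arbitrary value 'hidden'. The whole word executes as a no-op
 * (writes to x0 are discarded), yet a branch into its second parcel
 * sees 'hidden' as the start of an instruction. */
static uint32_t lui_x0_hiding(uint16_t hidden)
{
    const uint32_t OP_LUI = 0x37;    /* major opcode 0110111 */
    const uint32_t RD_X0  = 0u << 7; /* rd = x0 */
    return ((uint32_t)hidden << 16) | RD_X0 | OP_LUI;
    /* imm bits [15:12] are left zero; any value would do */
}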

Niklas Holsti

unread,
Dec 26, 2017, 3:54:54 AM12/26/17
to
The compilers from Byte Craft still glory in such tricks, for small systems.

For example, see http://bytecraft.com/node/291.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

Terje Mathisen

unread,
Dec 26, 2017, 4:58:05 AM12/26/17
to
Quadibloc wrote:
> On Monday, December 25, 2017 at 8:50:25 PM UTC-7, MitchAlsup wrote:
>
>> We used to put the only instruction of an ELSE clause as the
>> immediate of a MOV instruction in 680x0 so as to avoid the delay
>> inducing branch around the (single instruction) ELSE clause.
>
>> BUt that was yesteryear, and there is no longer any reason that
>> such codes should be allowed to execute.
>
> I suppose that there is a legitimate way to get the same result; a
> special no-op instruction that is billed as an unconditional forwards
> branch... which performs the branch by being a long instruction, but
> to the system that detects shenanigans, it's noted that branching to
> the second byte of *this particular special instruction* is
> legitimate.

This has existed for a long time; it is a single-instruction SKIP.
>
> There may be no reason for that on modern out-of-order CPUs, but what
> about tiny ones for really small microcontrollers?

That's the type of cpu where they were most common, right?

Bruce Hoult

unread,
Dec 26, 2017, 6:23:08 AM12/26/17
to
Of course such a thing is not *necessary* in Thumb2, because you can use the IT instruction to predicate off the next 1-4 instructions anyway.

Interesting, by the way, that ARMv8 deprecates most uses of IT. Now you are supposed to use it only to predicate a single instruction, and that instruction should be a 16 bit Thumb instruction. i.e. The IT and the following instruction effectively make up a pseudo 32-bit predicated instruction similar to ARM32, but with massively inefficient encoding (very few options for what the instruction can actually do).

David Brown

unread,
Dec 26, 2017, 7:21:42 AM12/26/17
to
On 24/12/17 02:46, already...@yahoo.com wrote:
> On Sunday, December 24, 2017 at 2:50:52 AM UTC+2, David Brown wrote:
>> On 23/12/17 23:23, already...@yahoo.com wrote:
>>> On Saturday, December 23, 2017 at 11:39:40 PM UTC+2, Bruce Hoult
>>> wrote:
>>>> On Sunday, December 24, 2017 at 12:15:15 AM UTC+3,
>>>> already...@yahoo.com wrote:
>>
>> <snip>
>>
>>>>
>>>>> I also have little doubt that OSR is a bad idea [for hard cores]
>>>>> on the very low end (Cortex-M3/M4 class) where 3 inputs are not a
>>>>> cycle time limiter for a short scalar pipeline.
>>>>
>>>> I note that SiFive's RISC-V Cortex-M3/M4 competitor runs at 320 MHz
>>>> in 180 nm. The fastest ARMs I know of in that process are 180 MHz.
>>>
>>> Did you check Intel PXA250? I am not 100% certain, but it sounds
>>> likely that it was fabricated at 180nm.
>>>
>>> As to Cortex-M3/M4 at 180nm, people don't do it not because it's
>>> impossible (I have no idea if it's possible or not) but because it's
>>> very bad idea. M3/M4 has no instruction cache and the flash at 180nm
>>> is not exactly fast. Flash accelerator helps somewhat, but not enough
>>> for useful operation above something like 120MHz. And SRAM at 180nm
>>> is relatively expensive, so running all code out of SRAM tends to be
>>> bad economy. One tends to buy MCUs with as much SRAM as needed for
>>> data, but no more.
>>>
>>> I believe you that there exist 180 MHz parts, but even 180 MHz is
>>> more a gimmick than something really useful. 320 MHz - even more so.
>>
>>
>> There are indeed 180 MHz Cortex-M4 parts, such as the NXP Kinetis K65.
>> It has 8 KB shared instruction/data cache.
>
> If it has cache, it certainly changes the trade-offs.
> Instruction cache even could be not such a bad idea.
> But data cache? I certainly don't want data cache on M4!
> It kills 3/4th of beautiful simplicity of using DMA with peripherals.

I have not used one of these chips, but I have used other
microcontrollers with cache (such as a dual PPC core microcontroller at
about 150-200 MHz, IIRC). There the DMA was synchronised with the cache
through cache snooping, so there was no problem.

You also have control over which areas of memory are cached. Peripheral
regions are not cached, for example. And on devices where DMA does not
synchronise with the cache, you can have an area of ram that is uncached
to make DMA work easier.
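
For instance (a sketch; the section name and the matching linker-script
and MPU setup are assumptions, the attribute syntax is GCC/Clang):

#include <stdint.h>

/* Place the DMA buffer in a section that the linker script maps to a
 * region the MPU marks non-cacheable, so the DMA engine and the core
 * always see the same bytes without explicit cache maintenance. */
__attribute__((section(".dma_noncache"), aligned(32)))
static volatile uint8_t dma_rx_buf[512];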

Caches make it more complicated to do accurate or predictable timings,
but usually such high-end microcontrollers have good enough peripherals
that you use timers, DMA, and such like to get accurate timings.


> The simplicity that allows even inexperienced programmers who know nothing about cache coherence to deliver useful results with these little things.

These are more advanced parts than the cheaper M3/M4 devices. You have
to know how to use them - or at the very least, how to work with
existing software for them.

> Then again, my experience with M3/M4 was exclusively with STMicro parts. I can believe that with NXP it never was beautifully simple so a little more pain matters less.

There is little significant difference between the families offered
here, in terms of complexity to use.

> Or, maybe, they understood that data cache is a bad idea, but since ARM does not provide caches as part of the M4 IP, adding a unified cache was much simpler for their designers than adding an ICache alone?

Data cache is useful if you have slow ram. On most M3/M4
microcontrollers, you usually use the on-chip ram - and it will be
single cycle. Therefore it does not benefit from cache. Peripherals,
of course, do not benefit from cache. You use data cache on devices
that have external ram - DDR, SDRAM, etc. Most M3/M4 microcontrollers
don't have data cache simply because it is of no use - not because some
manufacturers "understand" caches and some do not!

>
> Anyway, there is no chance that I will use M4 with D$ in those designs where I have control.
>

That's your choice. But I think you are misunderstanding the purpose of
these caches, and how they work.

>
>
>
>> That is the fastest M4 that
>> I know of. Until recently, the fastest M7 I had heard of was 240 MHz.
>> But there are new M7 devices coming out at 600 MHz. I don't know the
>> process size, but I think it is small - a key difference here is that
>> these microcontrollers do not have any on-board flash. As I understand
>> it, the speed limitation of M3/M4/M7 microcontrollers has not been the
>> core itself, but using a process that is also cheap for flash. Once the
>> decision was made to drop the flash (in the NXP i.MX RT series, for
>> example), a smaller and faster process could be used.
>>
>>
>>
>>>
>>> As to SiFive being "Cortex-M3/M4 competitor" if they said it by
>>> themselves then to me it sounds as megalomania.
>>>
>>
>> I suspect whoever said it meant competitive in terms of processor power
>> for a given area, cost, or power usage - rather than competing on
>> numbers of chips produced!
>>
>>>>
>>>> Whether something is a cycle time limiter depends on what else is
>>>> limiting the cycle time instead :-) :-)
>>>
>

Anton Ertl

unread,
Dec 26, 2017, 8:23:15 AM12/26/17
to
already...@yahoo.com writes:
>21064 pairing table is not very easy to parse, but my understanding, both from the table and from Figure 2-2 in HRM was that integer ALU instructions can be paired both with Integer Loads and with Integer Stores.
>Our hypothetical CMOVE4 would be paired only with integer loads, but not stores.
>It will use the same register read port as store data.

Ok, so it will cost some additional decoding effort, then a mux and
control logic to use the resulting register number instead of the one
from the other instruction, and similar complications for the
bypasses, then lines to ship the result to the CMOVE4 unit. It does not
look cheap enough to me, nor is the instruction frequent enough, that
they would have regretted not going that way.

Anton Ertl

unread,
Dec 26, 2017, 8:36:42 AM12/26/17
to
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, December 25, 2017 at 5:25:39 AM UTC-6, Anton Ertl wrote:
>> MitchAlsup <?????@aol.com> writes:
>> >On Sunday, December 24, 2017 at 4:45:23 PM UTC-6, Anton Ertl wrote:
>> >> But CMOVE has only two inputs in the instruction format, and can be
>> >> implemented as reading only two inputs. On register-renaming
>> >> implementations it is more effective to implement it as reading three
>> >> inputs, though (even if it means splitting the instruction into two
>> >> micro-instructions).
>> >
>> >This is one of the things people who have designed x86s see that people
>> >who have not do not.
>> >
>> >in x86-land your typical micro-op has 5-RS comparators, X, Y, C, O,
>> >ZAPS! Yes, there are more comparators in the Reservation Stations dealing
>> >with condition codes (CZAPS) than there are dealing with operands X, Y!
>> >
>> >The growth of the RSs is not inconsequential, but enables a lot better
>> >data flow than lumping all of the CCs into a single dynamic tag.
>>
>> Could you elaborate on what this has to do with the stuff you cited?
...
>You were discussing reservation stations (maybe directed towards Alpha)
>but making statements about RSs that are not universal.

I did not have reservation stations in mind, but a more abstract
notion of register renaming. Do you disagree with my statement "On
register-renaming implementations it is more effective to implement it
as reading three inputs"? Or which other statement is not universal
wrt reservation stations?

already...@yahoo.com

unread,
Dec 26, 2017, 8:37:41 AM12/26/17
to
What you described sounds, in today's terms, more like a very slow Cortex-A5 than an M3/M4 or even an M7.
It reminds me of the PowerQUICC 850, with which I worked in several projects. Most of them went nowhere after the first Internet bubble burst, but their demise had nothing to do with the merits of the 850, which was a damn nice chip. But its niche was completely different from what is today occupied by M3/M4-based MCUs.

> You also have control over which areas of memory are cached. Peripheral
> regions are not cached, for example. And on devices where DMA does not
> synchronise with the cache, you can have an area of ram that is uncached
> to make DMA work easier.
>
> Caches make it more complicated to do accurate or predictable timings,
> but usually such high-end microcontrollers have good enough peripherals
> that you use timers, DMA, and such like to get accurate timings.
>
>
> > The simplicity that allows even inexperienced programmers who know nothing about cache coherence to deliver useful results with these little things.
>
> These are more advanced parts than the cheaper M3/M4 devices. You have
> to know how to use them - or at the very least, how to work with
> existing software for them.
>

I know. But I want to get results from developers that don't know. And with cacheless M3/M4s I have a good rate of success.

> > Then again, my experience with M3/M4 was exclusively with STMicro parts. I can believe that with NXP it never was beautifully simple so a little more pain matters less.
>
> There is little significant difference between the families offered
> here, in terms of complexity to use.

Maybe, between NXP and STMicro, you are right. Not when you add TI to the list, esp. their Hercules family (their Stellaris looks fine on paper, but I have no first-hand experience with them).

>
> > Or, maybe, they understood that data cache is a bad idea, but since ARM does not provide caches as part of the M4 IP, adding a unified cache was much simpler for their designers than adding an ICache alone?
>
> Data cache is useful if you have slow ram. On most M3/M4
> microcontrollers, you usually use the on-chip ram - and it will be
> single cycle. Therefore it does not benefit from cache. Peripherals,
> of course, do not benefit from cache. You use data cache on devices
> that have external ram - DDR, SDRAM, etc. Most M3/M4 microcontrollers
> don't have data cache simply because it is of no use - not because some
> manufacturers "understand" caches and some do not!
>

M3/M4 was never intended for apps with external DRAM.
If it hurts, don't do it.
There are other, more suitable cores for that purpose.


> >
> > Anyway, there is no chance that I will use M4 with D$ in those designs where I have control.
> >
>
> That's your choice. But I think you are misunderstanding the purpose of
> these caches, and how they work.
>

I think that it's no more than an attempt to push the M4 into alien territory.
Want a faster Cortex-M? That's what the M7 was invented for.

Anton Ertl

unread,
Dec 26, 2017, 9:01:09 AM12/26/17
to
MitchAlsup <Mitch...@aol.com> writes:
[88100 writeback arbitration]
>It could have been fixed with an additional stage in the write back pipeline
>(similar to the above) but merging the integer and LD write backs. It looked
>to have no way to tell the compiler which way the pipeline chose to work

How does the compiler come into play here? Functionally, this should
be transparent, and for performance, this should be both easier for
the compiler and produce slightly better performance (fewer writeback
conflicts).

>Remember the whole 88100 was implemented with 10 people--including the
>compiler team, implementation team, verification team, architecture team.
>The 88200 had another 6 whole people.

A theory I have is that the 68000 and most RISCs fell behind
Intel because Intel was able to deal with the complexity that the
growth in transistors allowed (they had their own failures, though:
iAPX432, i860, and IA-64), and most of the competition was not
prepared for that beyond a certain point, and ran into delays, or they
could not keep up in the MHz race. DEC was able, until they could no
longer afford it. ARM survived by focussing on low-end CPUs and then
built their empire from that base, and now they are able to design
complex CPUs. AMD and later P.A. Semi/Apple also managed that
transition.

BGB

unread,
Dec 26, 2017, 12:56:13 PM12/26/17
to
On 12/25/2017 5:30 AM, Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> A performance bug is simply a bug and should trap instead of running
>> 10-100X slower, sort of like the problem with how many FPUs have been
>> handling denormals.
>
> Depends on context. As an end user who can do nothing about the bug,
> you probably prefer the slowness over the trap (if it's too slow, you
> still have the option to terminate the program). As a developer, you
> may prefer the trap, but only after you have hunted down the
> functionality bugs.
>

though my attempt at a CPU core is arguably pretty terrible, it has had
a similar issue in dealing with some obscure legacy SuperH ops which
violate the normal load/store design (there are a few mostly unused ops
which would do multiple memory accesses, and a few others which do
read/modify/write).

as-is, deviating from normal load/store would basically make a big mess,
so it doesn't seem worth it (more so for edge-case instructions which are
basically unused). I may make a partial exception for the "CAS"
instruction, which serves a special purpose (implementing mutex locks),
but am not entirely sure as of yet how it will be implemented.


in my 64-bit ISA, these ops are effectively deprecated/dropped (and a
lot of the space from various ops was reassigned to things which are
arguably more useful, like "MOV.Q Rm, @(SP, disp4)" and similar; see 1 below).

1: 64-bit ops which were given a direct encoding via 16-bit I-Forms
MOV.Q Rm, @(SP, disp4) //R0..R15
MOV.Q @(SP, disp4), Rn //R0..R15
MOV.Q Rm, @Rn //R8..R14, R0
MOV.Q @Rm, Rn //R8..R14, R0
MOV.Q Rm, @(Rn, R0) //R8..R14, R0
MOV.Q @(Rm, R0), Rn //R8..R14, R0
CMPQ/EQ Rm, Rn //R8..R14, R0
CMPQ/GT Rm, Rn //R8..R14, R0
...
many use 3-bit registers to save encoding space.
these are the most common regs in my compiler output.
generally, these are the regs used for holding local vars.
use of other registers here results in a 32-bit instruction form.
these at the cost of ops like 16-bit "MOV.B R0, @(GBR, disp8)", ...
the GBR-based MOV.x forms were demoted to 32-bit instruction-forms.
where, GBR is generally used similarly to FS or GS in x86-64.


for the 32-bit ISA, the existing plan is to leave these ops
unimplemented and to trap and emulate in firmware if they are
encountered (and, if used in an ISR, this will effectively crash the
processor; but neither my C compiler nor GCC appears to use them, so...).

also, the FPU doesn't bother with supporting denormals either (only does
"denormals as zero").


a CMOV style instruction is tempting though, as in my current core,
branches will cost several cycles whereas a CMOV should be single-cycle.
likely:
CMOVT Rm, Rn //MOV if SR.T is Set
CMOVF Rm, Rn //MOV if SR.T is Clear
but, could instead make a case for a 3-register form:
CMOVTF Rs, Rt, Rn //Rn=SR.T?Rs:Rt
(but, actually worth the encoding space?)
but, dunno...


but, at this point, it is more an open question whether I can make "something
that works" than "something that is good" (as there is a fair bit of
subtlety and wonk in Verilog: "reg[255:0] somearr[1023:0];" is
seemingly ok, but with "reg[31:0] somearr[8191:0];" synthesis will lose
its crap and try to eat the whole FPGA with LUTRAMs rather than use
RAMB36E1 blocks, despite both essentially representing the same amount
of memory...).

well, and never mind the situation right at the moment: the computer
dumped due to a power failure, and now Vivado is no longer able
to start (it just gives "Runtime Error" messages when an attempt is
made to launch it), so I'm now running software updates in an attempt to fix
it, grr...


but, my first iteration of CPU core may end up looking more like a
microcontroller than what I originally imagined.

also, nevermind if all this is technically pointless...

Ivan Godard

unread,
Dec 26, 2017, 2:39:33 PM12/26/17
to
On 12/26/2017 9:56 AM, BGB wrote:
> On 12/25/2017 5:30 AM, Anton Ertl wrote:
>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>> A performance bug is simply a bug and should trap instead of running
>>> 10-100X slower, sort of like the problem with how many FPUs have been
>>> handling denormals.
>>
>> Depends on context.  As an end user who can do nothing about the bug,
>> you probably prefer the slowness over the trap (if it's too slow, you
>> still have the option to terminate the program).  As a developer, you
>> may prefer the trap, but only after you have hunted down the
>> functionality bugs.
>>
>
> though my attempt at a CPU core is arguably pretty terrible, it has been
> a similar issue with dealing with some obscure legacy SuperH ops which
> violate the normal load/store design (there are a few mostly unused ops
> which would do multiple memory accesses, and a few others which do
> read/modify/write).
>
> as-is, deviating from normal load/store would basically make a big mess,
> so doesn't seem worth it (more so for edge-case instructions which are
> basically unused). I may make a partial exception for the "CAS"
> instruction, which serves a special purpose (implementing mutex locks),
> but am not entirely sure as of yet how it will be implemented.

CAS is outdated due to the ABA problem. DCAS is marginally better if you
use it right, but forces additional lock fields into the data structures.
As you have a D$ IIRC, why not use a modern cache-based atomic primitive?
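
For readers who haven't met it, a minimal C11 sketch of where ABA bites
a CAS-based stack pop:

#include <stdatomic.h>
#include <stddef.h>

/* Lock-free pop via CAS. The ABA failure: thread 1 loads head==A and
 * next==B, then stalls; meanwhile other threads pop A, pop B, free B,
 * and push A back. Thread 1's CAS still sees head==A and succeeds,
 * installing the stale (now dangling) B as the new head. */
struct node { struct node *next; };

static _Atomic(struct node *) head;

static struct node *pop(void)
{
    struct node *old, *nxt;
    do {
        old = atomic_load(&head);
        if (old == NULL)
            return NULL;
        nxt = old->next;  /* may be stale by the time the CAS runs */
    } while (!atomic_compare_exchange_weak(&head, &old, nxt));
    return old;
}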

Nick Maclaren

unread,
Dec 26, 2017, 2:59:39 PM12/26/17
to
In article <p1u8hi$vun$1...@dont-email.me>,
Ivan Godard <iv...@millcomputing.com> wrote:
>
>CAS is outdated due to the ABA problem. DCAS is marginally better is you
>use it right, but forces additional lock fields in the data structures.
>As you have a D$ IIRC, why not use a modern cache-based atomic primitive?

CAS works fine for many purposes; the ABA problem matters only if you
want to use a potentially non-unique value as an agent id, which is
generally a bad idea for many other reasons. I agree that DCAS has
advantages, such as being able to update a value and id simultaneously
without taking out a lock.

I am not sure what you mean by D$ - could you expand it?


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Dec 26, 2017, 3:23:31 PM12/26/17
to
Ivan believes in what he calls "optimistic locking".
Modern, in his terms = 198x, the years when disco music and LL/SC were fashionable.

Personally, I have no technical opinion about the merits of "pessimistic" vs 'optimistic', but I see industry trends. And the trends are opposite to what Ivan says: both POWER and ARM, which were originally in the LL/SC camp, recently added compare-and-swap and fetch-and-add to their arsenals.
On the other hand, x86, which was always in the "pessimistic" camp, is in no hurry to add the lite optimistic stuff advocated by Ivan. They (specifically, Intel) added heavy optimistic stuff in the form of TSX, but so far it looks like a failure.

Ivan Godard

unread,
Dec 26, 2017, 3:27:13 PM12/26/17
to
D$ - top level data cache

In addition to other legacy locking primitives, IBM and Intel offer a
limited atomic transaction primitive that uses the D$ to hold the write
set. Mill, having no legacy, offers only the cache-based form, from
which all the legacy primitives can be constructed if desired.

Note that these are atomic primitives, not general HTM.

Chris M. Thomasson

unread,
Dec 26, 2017, 3:27:56 PM12/26/17
to
On 12/26/2017 11:59 AM, Nick Maclaren wrote:
> In article <p1u8hi$vun$1...@dont-email.me>,
> Ivan Godard <iv...@millcomputing.com> wrote:
>>
>> CAS is outdated due to the ABA problem.

Not true.

Chris M. Thomasson

unread,
Dec 26, 2017, 3:34:44 PM12/26/17
to
Wrt ABA, the free-pool manipulation is still in use over in Appendix A.

http://www-01.ibm.com/support/docview.wss?uid=isg26480faec85f44e2385256d5200627dee&aid=1
(refer to appendix A-48, Free-Pool Manipulation)

ABA galore... :^)

Ivan Godard

unread,
Dec 26, 2017, 3:36:06 PM12/26/17
to
You are out of date. We replaced LL/SC (actually DLL/DSC) maybe a decade
ago when we stumbled on the work that also led to TSX and the similar
IBM facility.

Does anyone know what the underlying hardware is on PPC and ARM?
Underneath the microcode, that is?

Chris M. Thomasson

unread,
Dec 26, 2017, 3:45:49 PM12/26/17
to
On 12/26/2017 12:27 PM, Chris M. Thomasson wrote:
Fwiw, check out the following links for some more detailed background
information. The thread in this group just happens to be loaded with
insightful comments from some very smart individuals:

https://groups.google.com/d/topic/comp.arch/Bx-OAJTjU-U/discussion
(try to read it all, or at least search the thread's text for CAS,
lock-free, etc... ;^)

http://people.csail.mit.edu/shanir/publications/K-Compare.pdf
(obstruction free at best wrt LL/SC?)

Chris M. Thomasson

unread,
Dec 26, 2017, 3:54:14 PM12/26/17
to
Sorry for all of the posts! Here is an older, detailed conversation I
had with Joe Seigh about KCSS:

https://groups.google.com/d/topic/comp.programming.threads/shshLdF1uqs/discussion

Nick Maclaren

unread,
Dec 26, 2017, 4:07:01 PM12/26/17
to
In article <p1ubav$ise$1...@dont-email.me>,
Ivan Godard <iv...@millcomputing.com> wrote:
>>
>> I am not sure what you mean by D$ - could you expand it?
>
>D$ - top level data cache

Do you mean some special memory that has globally sequential semantics?
If so, I have favoured that for several decades, though no architectures
seemed to provide it :-(


Regards,
Nick Maclaren.

Chris M. Thomasson

unread,
Dec 26, 2017, 4:42:16 PM12/26/17
to
On 12/26/2017 1:06 PM, Nick Maclaren wrote:
> In article <p1ubav$ise$1...@dont-email.me>,
> Ivan Godard <iv...@millcomputing.com> wrote:
>>>
>>> I am not sure what you mean by D$ - could you expand it?
>>
>> D$ - top level data cache
>
> Do you mean some special memory that has globally sequential semantics?

Why do that? What about memory that is like SPARC in RMO mode? Imvvvvho,
the latter should be able to scale much better. Heck, at least dependent
loads issue a consumer memory barrier in C11 and C++11. Unlike the DEC
Alpha! That is raw.
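
In C11 terms (a minimal sketch; the names are illustrative):

#include <stdatomic.h>
#include <stddef.h>

/* A dependent load through an atomic pointer. On most ISAs the address
 * dependency orders the two loads for free; Alpha needs a real barrier
 * here, which is why C11/C++11 had to give this ordering a name
 * (memory_order_consume). */
struct msg { int payload; };

static _Atomic(struct msg *) mbox;

static int read_payload(void)
{
    struct msg *m = atomic_load_explicit(&mbox, memory_order_consume);
    return m ? m->payload : -1;  /* second load depends on the first */
}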

MitchAlsup

unread,
Dec 26, 2017, 5:37:44 PM12/26/17
to
I always considered that if you have a data path that is inherently
3-operand (like FMAD), you simply support the FU capabilities at
the register file and all the paths in between. It may be that in many
circumstances one of the 3 ports is unnecessary, but the third port
also enables the compiler to unroll loops and not get squashed by a lack of
register ports.

I tried to do a design that sorted register read requests (from a bundle
of preformatted instructions) to the available register read ports, but
it ended up being a routing nightmare both into and out of the register
file. I came away from this with the notion "don't do that again" {similar
to the way the vast majority of CPU designers don't try to do a logical
L1 cache.}

You may choose otherwise at your own peril.

MitchAlsup

unread,
Dec 26, 2017, 5:49:40 PM12/26/17
to
Basically because you can never guarantee that 2 cache lines remain in the cache for any specifiable amount of time.

What you should do, instead, is to make an ATOMIC attempt fail in
certain situations where it smells like it should succeed (DCAS checks
out OK, but it's been a week between when the cache was read and when DCAS is applied).
This is what ASF at AMD did and what ESM does in my new ISA. It is similar
to how MIPS does LL/SC.

MitchAlsup

unread,
Dec 26, 2017, 5:52:04 PM12/26/17
to
This reminds me of a problem SUN had back in the S-bus days, where
cacheable locks were used to lock uncacheable data. The cacheable
locks could be transferred between CPUs faster than the data would
arrive at the target location. This led to all sorts of "interesting
effects".

Anne & Lynn Wheeler

unread,
Dec 26, 2017, 6:54:31 PM12/26/17
to
"Chris M. Thomasson" <invalid_chr...@invalid.invalid> writes:
> Wrt, ABA, the free-pool manipulation is still in use over in Appendix A.
>
> http://www-01.ibm.com/support/docview.wss?uid=isg26480faec85f44e2385256d5200627dee&aid=1
> (refer to appendix A-48, Free-Pool Manipulation)
>
> ABA galore... :^)


compare&swap was invented by Charlie when he was working on fine grain
CP/67 (kernel) multiprocessor locking at the science center (mneumonic
chosen because CAS are Charlie's initials). Trying to get it added to
370 architecture was rebuffed because the POK favorite son operating
system people said that all they needed was 360 test&set (supporting
multiprocessor single global kernel spin-lock).

The 370 architecture owners said that to get it added to 370, additional uses
would be needed for justification. Thus were born several examples for
use by multi-threaded applications (like large DBMS systems) ... that
still appear in current mainframe principles of operation. Original
addition to 370 was CS (compare and swap) for locks, counters, pointers,
push/pop lists, etc ... as well as CDS (compare double and swap) for
more complex structures (like forward/backward linked lists ... above
cited example ... all courtesy of Charlie and the science center).

Over the next decade or two, other platforms added instructions
with similar semantics ... especially uptake by large commercial RDBMS
operations (regardless of whether multiprocessor or not). Original POWER
(RIOS) for RS/6000 had no multiprocessor and/or cache serialization
capability and didn't bother with compare&swap. However, RS/6000 trying
to move into the commercial RDBMS market put it at a performance disadvantage
because even simple serialization had to be done with unix kernel
locking calls. Eventually AIX implemented compare&swap simulation as
fastptath in the kernel call interrupt routine.

note that following the above mentioned section in the mainframe
principles of operation appendix is the more recently defined PLO
(perform locked operation) ... which can perform operations on two or
more discontiguous storage locations.
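
(a rough C11 analogue of the CS push from that appendix, assuming a
singly linked free pool; the pop side is the one that needs CDS's
version counter to dodge ABA:)

#include <stdatomic.h>

struct elem { struct elem *next; };
_Atomic(struct elem *) free_pool;

/* push is safe with a plain compare-and-swap */
void pool_push(struct elem *e) {
    struct elem *head = atomic_load_explicit(&free_pool,
                                             memory_order_relaxed);
    do {
        e->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &free_pool, &head, e,
                 memory_order_release, memory_order_relaxed));
}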

--
virtualization experience starting Jan1968, online at home since Mar1970

BGB

unread,
Dec 26, 2017, 7:01:54 PM12/26/17
to
On 12/26/2017 1:59 PM, Nick Maclaren wrote:
> In article <p1u8hi$vun$1...@dont-email.me>,
> Ivan Godard <iv...@millcomputing.com> wrote:
>>
>> CAS is outdated due to the ABA problem. DCAS is marginally better if you
>> use it right, but forces additional lock fields in the data structures.
>> As you have a D$ IIRC, why not use a modern cache-based atomic primitive?
>
> CAS works fine for many purposes; the ABA problem matters only if you
> want to use a potentially non-unique value as an agent id, which is
> generally a bad idea for many other reasons. I agree that DCAS has
> advantages, such as being able to update a value and id simultaneously
> without taking out a lock.
>

adding CAS to the ISA wasn't originally my idea, but a feature which the
J-Core people had added (for sake of supporting multicore SH chips), and
my design sort of inherited it. the drawback, though, is that it doesn't
really match the hardware capabilities of what I am implementing.

a possible workaround would be doing a sort of mutex in hardware, which
is kept "locked" during the duration of a synchronization op (like CAS);
and any other cores will essentially block on this HW mutex until it is
released (then the CAS operation would be emulated or done in microcode,
as with most of the other similar edge-case ops).
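
(roughly like this hedged C sketch; hw_mutex_lock/hw_mutex_unlock are
invented names for the blocking HW mutex described above:)

/* emulate CAS by serializing every core on one hardware mutex;
   hw_mutex_lock()/hw_mutex_unlock() are hypothetical primitives */
extern void hw_mutex_lock(void);
extern void hw_mutex_unlock(void);

int cas_emul(volatile int *addr, int expect, int newval) {
    int ok;
    hw_mutex_lock();
    ok = (*addr == expect);
    if (ok)
        *addr = newval;
    hw_mutex_unlock();
    return ok;
}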

then again, I could possibly just do my own thing here, and see if I can
implement mutexes in some other way that is friendlier to load/store and
a traditional 5-stage pipeline.

I guess the alternative, granted, is to actually implement support for
this sort of stuff in hardware.

eg:
IF, ID, EX //most basic ops (1 cycle)
IF, ID, EX, MA+ //store (2+ cycles)
IF, ID, EX, MA+, WB //load (2+ cycles)
IF, ID, EX1, MA+, EX2, MA+ //edge-case ops (eg: CAS.L)
IF, ID, EX1, MA+, EX2, MA+, EX3 //MAC.L and MAC.W

possibly by simply allowing each execute stage to daisy-chain to another
execute stage after the prior MA+ stage completes.

in this case, IF/ID could be stalled, and an EX2 value would be injected
back into the EX stage (which may in turn produce another chained EX2
operation, with IF/ID unstalling once there is no longer a chained EX2)?

note that the functionality to store the result into a register is glued
onto EX, so an explicit WB is only needed on a load. but, I guess if EX
chaining were used, the WB could essentially just become a dummy EX2
operation (which stores the loaded value into a register).

sane?...


> I am not sure what you mean by D$ - could you expand it?
>

I am not familiar with this either.


>
> Regards,
> Nick Maclaren.
>

Quadibloc

unread,
Dec 26, 2017, 7:23:56 PM12/26/17
to
As long as you acknowledge that the cause was not that Intel was
inherently more clever, but instead had the cash flow rolling in thanks
to the x86 being used in the IBM PC and consequently for Windows.

MitchAlsup

unread,
Dec 26, 2017, 9:10:02 PM12/26/17
to
On Tuesday, December 26, 2017 at 6:01:54 PM UTC-6, BGB wrote:

> sane?...

If you would like to see another way to do this send me an e-mail address.

Ivan Godard

unread,
Dec 26, 2017, 9:43:52 PM12/26/17
to
Well, we don't have false sharing, which removes most ping-ponging. But
random exponential backoff is near trivial hardware. Yes, it's not a
zero chance of livelock, but it's arbitrarily small and that's good
enough for real work as opposed to ivory tower algorithms.
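
(in software terms the policy is tiny; a sketch assuming some per-core
entropy source - rand32() and cpu_pause() are stand-ins:)

#include <stdint.h>

extern uint32_t rand32(void);  /* stand-in entropy source */
extern void cpu_pause(void);   /* stand-in pause/yield hint */

/* randomized exponential backoff, capped so waits stay bounded;
   the random mask is what breaks lock-step resonance */
void backoff(unsigned attempt) {
    unsigned cap = attempt < 16 ? attempt : 16;
    uint32_t spins = rand32() & ((1u << cap) - 1);
    while (spins--)
        cpu_pause();
}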

David Brown

unread,
Dec 27, 2017, 7:00:28 AM12/27/17
to
It is much more like a Cortex-R device. It is faster than a Cortex-M
(until recent M7, anyway), has dual cores that can run in lock-step for
high reliability operations, has "real time" support like predictable
timings, fast and consistent interrupt support, controllable bus
priorities, etc. It is heavily targeted to automotive usage.

But it just happens to be a microcontroller I have used that has data
cache, which is why I brought it up.

>
>> You also have control over which areas of memory are cached.
>> Peripheral regions are not cached, for example. And on devices
>> where DMA does not synchronise with the cache, you can have an area
>> of ram that is uncached to make DMA work easier.
>>
>> Caches make it more complicated to do accurate or predictable
>> timings, but usually such high-end microcontrollers have good
>> enough peripherals that you use timers, DMA, and such like to get
>> accurate timings.
>>
>>
>>> The simplicity that allows even to unexperienced programmers that
>>> know nothing about cache coherence to delivery useful results
>>> with these little things.
>>
>> These are more advanced parts than the cheaper M3/M4 devices. You
>> have to know how to use them - or at the very least, how to work
>> with existing software for them.
>>
>
> I know. But I want to get results from developers that don't know.
> And with cacheless M3/M4s I have a good rate of success.
>

That sounds completely wrong to me. I never expect to get good results
from developers that are working with something that they don't
understand. It does not matter whether the chip is an 8051 or a
super-scalar SMP device with 3 layers of cache - if the developer does
not know how to work with it correctly, they need to learn (or be
taught). It is fine to re-use existing knowledge and avoid spending
time and money on learning something new when all else is equal, but it
seems silly to dismiss a class of devices out of hand, merely because
your programmers would have to learn something.

>>> Then again, my experience with M3/M4 was exclusively with STMicro
>>> parts. I can believe that with NXP it never was beautifully
>>> simple so a little more pain matters less.
>>
>> There is little significant difference between the families offered
>> here, in terms of complexity to use.
>
> May be, between NXP and STMicro you are right. Not when you add TI to
> the list, esp. their Hercules family (their Stellaris looks fine on
> paper, but I have no 1st hand experience with them).

The Hercules family are Cortex-R devices, as far as I know, aimed at
high reliability and high safety systems. There is much more involved
in using them than in using "ordinary" microcontrollers such as the NXP
Kinetis family or the TI Stellaris devices. Cortex-R is a different
core for a different kind of job.

>
>>
>>> Or, may be, they understood that data cache is a bad idea, but
>>> since ARM does not provide caches as part of M4 IP, adding
>>> unified cache was much simpler for their designers than adding
>>> ICache alone?
>>
>> Data cache is useful if you have slow ram. On most M3/M4
>> microcontrollers, you usually use the on-chip ram - and it will be
>> single cycle. Therefore it does not benefit from cache.
>> Peripherals, of course, do not benefit from cache. You use data
>> cache on devices that have external ram - DDR, SDRAM, etc. Most
>> M3/M4 microcontrollers don't have data cache simply because it is
>> of no use - not because some manufacturers "understand" caches and
>> some do not!
>>
>
> M3/M4 was never intended for apps with external DRAM. If it hurts,
> don't do it. There are other, more suitable cores for that purpose.
>

M3/M4 works perfectly well with external memory, and there are many
microcontrollers with those cores and with support for external RAM.
There are lots of applications for which an M3 or M4 core is a suitable
size and speed, but where you might want more ram than is cost-effective
on board the microcontroller.

I happen to have a copy of the ARM "Cortex-M4 Devices Generic User
Guide" open - I don't have the M3 equivalent on hand to compare. But in
the memory model diagram, the range 0x6000'0000 to 0x9fff'ffff is
explicitly marked as "1.0 GB External RAM". I think that is a good
indication that the core was made with external RAM in mind.

>
>>>
>>> Anyway, there is no chance that I will use M4 with D$ in those
>>> designs where I have control.
>>>
>>
>> That's your choice. But I think you are misunderstanding the
>> purpose of these caches, and how they work.
>>
>
> I think that it's no more than an attempt to push M4 into alien
> territory. Want faster Cortex-M? That's what M7 is invented for.

Stepping from an M4 to an M7 is a bit of a jump, especially for the
manufacturers. Making an existing M4 design a bit faster is easier -
especially given that the M4 designs with external memory support were
around for many years before the M7 core became available.

Sure, an M7 is a faster processor and therefore likely to be a nicer
choice when you are dealing with a lot of memory. But there is in fact
very little difference between an M7 and an M4 from the programmer's
viewpoint - other than the speed.

David Brown

unread,
Dec 27, 2017, 7:14:33 AM12/27/17
to
On a two-core PPC microcontroller I used, there was a "semaphore"
peripheral block. Each semaphore was either unlocked (0), or holds a
core number (1 or 2 in this case). A core can write its number to the
semaphore when it is unlocked, or it can write 0 if it currently owns
the semaphore and wants to unlock it. Such a system could easily be
extended to include an id or thread number (with or without having a
core number included) and give a fast and convenient locking mechanism.
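
(in use it might look like the sketch below; the register address and
the core-id argument are made up for illustration:)

#include <stdint.h>

#define SEMA (*(volatile uint32_t *)0x40001000u)  /* hypothetical address */

/* spin until our write sticks: the peripheral only accepts a
   core number while the semaphore currently reads 0 (unlocked) */
void sema_lock(uint32_t my_core_id) {
    do {
        SEMA = my_core_id;
    } while (SEMA != my_core_id);
}

void sema_unlock(uint32_t my_core_id) {
    if (SEMA == my_core_id)  /* only the owner may release */
        SEMA = 0;
}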


Nick Maclaren

unread,
Dec 27, 2017, 7:46:39 AM12/27/17
to
In article <p1v1d6$irn$1...@dont-email.me>,
Ivan Godard <iv...@millcomputing.com> wrote:
>
>>> CAS is outdated due to the ABA problem. DCAS is marginally better if you
>>> use it right, but forces additional lock fields in the data structures.
>>> As you have a D$ IIRC, why not use a modern cache-based atomic primitive?
>>
>> Basically because you can never guarantee that 2 cache lines remain in
>the cache for any specifiable amount of time.
>>
>> What you should do, instead, is to make an ATOMIC attempt fail in
>> certain situations where it smells like it should succeed (DCAS checks
>> out OK but it's been a week between when cache was read and DCAS is applied).
>> This is what ASF at AMD did and what ESM does in my new ISA. It is similar
>> to how MIPS does LL/SC.

Such rules are a Real Pain for HPC, just as the equivalents are for
the internet. It isn't unreasonable for extreme search algorithms to
go a week between updating the best-so-far location. What it means
is that you will get unpredictable failures unless the software always
generates a redundant load before the CAS. If the rule is 'read by
another thread', it's a Right Royal Pain to fix up. Been there - done
that (though not in the same context) :-(

>Well, we don't have false sharing, which removes most ping-ponging. But
>random exponential backoff is near trivial hardware. Yes, it's not a
>zero chance of livelock, but it's arbitrarily small and that's good
>enough for real work as opposed to ivory tower algorithms.

It depends on what you mean by random. There are zillions of cases
where problems have arisen because of 'almost impossible' resonances.
Provided that you ensure that the 'randomness' is independent not
only of anything the program can do, but of how (including how fast)
it actually executes, then fine.


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Dec 27, 2017, 8:13:21 AM12/27/17
to
Remember, right now Ivan is not interested in large-scale SMP.
At most a few dozen cores on a single die. In the case of Gold, probably
fewer than three dozen.

Terje Mathisen

unread,
Dec 27, 2017, 9:18:56 AM12/27/17
to
already...@yahoo.com wrote:
> On Tuesday, December 26, 2017 at 9:59:39 PM UTC+2, Nick Maclaren
> wrote:
>> In article <p1u8hi$vun$1...@dont-email.me>, Ivan Godard
>> <iv...@millcomputing.com> wrote:
>>>
>>> CAS is outdated due to the ABA problem. DCAS is marginally better
>> if you use it right, but forces additional lock fields in the
>>> data structures. As you have a D$ IIRC, why not use a modern
>>> cache-based atomic primitive?
>>
>> CAS works fine for many purposes; the ABA problem matters only if
>> you want to use a potentially non-unique value as an agent id,
>> which is generally a bad idea for many other reasons. I agree that
>> DCAS has advantages, such as being able to update a value and id
>> simultaneously without taking out a lock.
>>
>> I am not sure what you mean by D$ - could you expand it?
>
> Ivan believes in what he calls "optimistic locking". Modern, in his
> terms =198x, years when disco music and LL/SC were fashionable.
:-)

>
> Personally, I have no technical opinion about the merits of
> "pessimistic" vs 'optimistic', but I see industry trends. And trends
> are opposite to what Ivan say: both POWER and ARM that were
> originally in the LL/SC camp, recently added compare-and-swap and
> fetch-and-add to their arsenal. On the other hand, x86 that was

If I could have just a single primitive I would vote for XADD i.e.
interlocked_fetch_and_add().

> always in the "pessimistic" camp, does not hurry up to add lite
> optimistic stuff, advocated by Ivan. They (specifically, Intel) added
> heavy optimistic stuff in the form of TSX, but so far it looks like a
> failure.
>
Their previous extension was XADD which is much easier to use for a guy
like me.
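
(in C11 terms that is just atomic_fetch_add; e.g. a shared ticket
counter, one XADD-class instruction on x86:)

#include <stdatomic.h>

_Atomic unsigned next_ticket;

/* returns the old value and bumps the counter in one atomic
   step - no compare, no retry loop */
unsigned take_ticket(void) {
    return atomic_fetch_add_explicit(&next_ticket, 1u,
                                     memory_order_relaxed);
}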

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

unread,
Dec 27, 2017, 9:29:35 AM12/27/17
to
BGB wrote:
>
> though my attempt at a CPU core is arguably pretty terrible, it has been
> a similar issue with dealing with some obscure legacy SuperH ops which
> violate the normal load/store design (there are a few mostly unused ops
> which would do multiple memory accesses, and a few others which do
> read/modify/write).
>
> as-is, deviating from normal load/store would basically make a big mess,
> so doesn't seem worth it (more so for edge-case instructions which are
> basically unused). I may make a partial exception for the "CAS"
> instruction, which serves a special purpose (implementing mutex locks),
> but am not entirely sure as of yet how it will be implemented.

If you move the atomic operations to the cache controller
you can implement them while retaining a RISC style pipeline.

Atomic Fetch and ADD, Fetch and AND, Fetch and OR, Fetch and XOR
all look to the pipeline like a specialty type of load,
just a different command sent from core to cache controller.
The CacheCon however fetches the line exclusive if not
already so and performs the simple ALU op itself.

Atomic Swap can be cracked into a specialty store followed
by a specialty load. The store has the physical address and
new value which is saved in CacheCon in a special register,
and it begins to fetch the line exclusive if not already so,
and ack's the store to the core.
When the specialty load arrives the CacheCon does the exchange.

Atomic CAS can be similarly cracked into a specialty store of
the physical address and compare value which is saved in CacheCon,
a specialty store of the new value which is saved,
and a specialty load of status that triggers the
compare and swap and returns the status value.

Mods to the cache controller are a simple ALU, a couple of
registers to hold temps, and some sequencing and glue logic.
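
(a behavioural C sketch of the cracked CAS, treating the controller's
temp registers as plain state; every cachecon_* and line_* name here
is invented:)

#include <stdint.h>

/* CacheCon temp registers, filled by the two specialty stores */
static uint64_t cc_addr, cc_compare, cc_newval;

extern uint64_t line_read64(uint64_t pa);  /* assumes line held exclusive */
extern void     line_write64(uint64_t pa, uint64_t v);

void cachecon_cas_store1(uint64_t pa, uint64_t cmp) {
    cc_addr = pa; cc_compare = cmp;  /* also starts fetching exclusive */
}
void cachecon_cas_store2(uint64_t nv) { cc_newval = nv; }

/* the final specialty load: does the compare-and-swap and returns
   a status word the core sees as ordinary load data */
uint64_t cachecon_cas_load(void) {
    if (line_read64(cc_addr) != cc_compare)
        return 0;  /* fail: memory no longer matches */
    line_write64(cc_addr, cc_newval);
    return 1;      /* success */
}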

Eric


Nick Maclaren

unread,
Dec 27, 2017, 9:46:48 AM12/27/17
to
In article <b1b835ac-4775-4ffe...@googlegroups.com>,
<already...@yahoo.com> wrote:
>>
>> >Well, we don't have false sharing, which removes most ping-ponging. But
>> >random exponential backoff is near trivial hardware. Yes, it's not a
>> >zero chance of livelock, but it's arbitrarily small and that's good
>> >enough for real work as opposed to ivory tower algorithms.
>>
>> It depends on what you mean by random. There are zillions of cases
>> where problems have arisen because of 'almost impossible' resonances.
>> Provided that you ensure that the 'randomness' is independent not
>> only of anything the program can do, but of how (including how fast)
>> it actually executes, then fine.
>
>Remember, right now Ivan is not interested in large-scale SMP.
>At most few dozens of core on a single die. In case of Gold, probably
>less than 3 dozens.

That doesn't matter for this problem. Most of the problems arise
because the delay causes lines to be lost from the cache, pages
from the TLB etc., which then slow down the algorithm, which causes
it to fail and the whole cycle repeats. But it can occur in more
subtle ways, too.

Yes, it's fair to blame the authors of the (kernel) scheduler, memory
management and algorithm for this, but they are often working with no
knowledge of what the others are assuming (though not documenting).
Given that mess, it's terribly easy to add an association in the delay,
and cause weird failures. Getting exponential backoff to work almost
all of the time is easy; avoiding extremely rare failure modes (which
often cause a particular use to fail) is much harder.



Regards,
Nick Maclaren.

EricP

unread,
Dec 27, 2017, 11:28:01 AM12/27/17
to
EricP wrote:
>
> If you move the atomic operations to the cache controller
> you can implement them while retaining a RISC style pipeline.
>
> Atomic Fetch and ADD, Fetch and AND, Fetch and OR, Fetch and XOR
> all look to the pipeline like a specialty type of load,
^^^^
ooops, just waking up...

all look to the pipeline like a specialty type of store of the
second operand which is saved in the cache controller,
and then a load of the original value which triggers the operation,

> just a different command sent from core to cache controller.
> The CacheCon however fetches the line exclusive if not
> already so and performs the simple ALU op itself.

Eric

Quadibloc

unread,
Dec 27, 2017, 1:17:13 PM12/27/17
to
On Wednesday, December 27, 2017 at 7:29:35 AM UTC-7, EricP wrote:

> If you move the atomic operations to the cache controller
> you can implement them while retaining a RISC style pipeline.

But that would make them privileged operations on a typical system.

John Savard

MitchAlsup

unread,
Dec 27, 2017, 1:22:03 PM12/27/17
to
On Wednesday, December 27, 2017 at 6:46:39 AM UTC-6, Nick Maclaren wrote:
> In article <p1v1d6$irn$1...@dont-email.me>,
> Ivan Godard <iv...@millcomputing.com> wrote:
> >
> >>> CAS is outdated due to the ABA problem. DCAS is marginally better if you
> >>> use it right, but forces additional lock fields in the data structures.
> >>> As you have a D$ IIRC, why not use a modern cache-based atomic primitive?
> >>
> >> Basically because you can never guarantee that 2 cache lines remain in
> >the cache for any specifiable amount of time.
> >>
> >> What you should do, instead, is to make an ATOMIC attempt fail in
> >> certain situations where it smells like it should succeed (DCAS checks
> >> out OK but it's been a week between when cache was read and DCAS is applied).
> >> This is what ASF at AMD did and what ESM does in my new ISA. It is similar
> >> to how MIPS does LL/SC.
>
> Such rules are a Real Pain for HPC, just as the equivalents are for
> the internet. It isn't unreasonable for extreme search algorithms to
> go a week between updating the best-so-far location. What it means
> is that you will get unpredictable failures unless the software always
> generates a redundant load before the CAS. If the rule is 'read by
> another thread', it's a Right Royal Pain to fix up. Been there - done
> that (though not in the same context) :-(

Reads do not alter the value in memory and do not necessarily count as
interference. Writes do.

MitchAlsup

unread,
Dec 27, 2017, 1:23:19 PM12/27/17
to
On Wednesday, December 27, 2017 at 8:29:35 AM UTC-6, EricP wrote:
> BGB wrote:
> >
> > though my attempt at a CPU core is arguably pretty terrible, it has been
> > a similar issue with dealing with some obscure legacy SuperH ops which
> > violate the normal load/store design (there are a few mostly unused ops
> > which would do multiple memory accesses, and a few others which do
> > read/modify/write).
> >
> > as-is, deviating from normal load/store would basically make a big mess,
> > so doesn't seem worth it (more so for edge-case instructions which are
> > basically unused). I may make a partial exception for the "CAS"
> > instruction, which serves a special purpose (implementing mutex locks),
> > but am not entirely sure as of yet how it will be implemented.
>
> If you move the atomic operations to the cache controller
> you can implement them while retaining a RISC style pipeline.

In my opinion, every layer in the memory hierarchy should be able to
perform fetch-integerOP-store operations, DRAM included.