AMD Bulldozer optimization guide

Tim McCaffrey

Jan 11, 2012, 1:15:30 PM

http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf

Talks a little about internals of the processor.

I haven't had time to dig, but it seems like the Intel processors (since
Core2) have been a bit more lenient with regard to memory access. If so, I
suspect that is where most of the performance difference lies.

- Tim

Tim McCaffrey

Jan 11, 2012, 1:19:53 PM

In article <jekjk2$1v5$1...@USTR-NEWS.TR.UNISYS.COM>, timca...@aol.com says...

A poster on RealWorldTech.com (Y) did a nice summary of the models:

>That updated guide contains following models and features:
>00h-0Fh - current BD(?)

>00h-1Fh - 2 DDR3 channels

>10h-1fh - no HT; no L3; 1-2 modules; L1 DTLB has been increased to 64M; 2
>DDR3 channels; FMA, F16C, BMI and TBM; IOMMUv2

>10h-2Fh - different FPU inst. latencies from 00h-0Fh; the depth of the load
>queue is increased to 44 entries; L1 DTLB has been increased to 64M

>20h-2Fh - FMA, F16C, BMI and TBM; L1 DTLB has been increased to 64M; 10
>cores per node ~ up to 5 modules; 4 DDR3 channels

>In addition, the guide refers to 30h-3Fh and 40h-4Fh models through their
>BIOS and Kernel Developer Guides.

>Some instructions can be issued in EXx and also in AGx pipes in 20+h models.

James Van Buskirk

Jan 12, 2012, 11:57:24 AM

"Tim McCaffrey" <timca...@aol.com> wrote in message
news:jekjk2$1v5$1...@USTR-NEWS.TR.UNISYS.COM...

> http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf

> Talks a little about internals of the processor.

I am having a hard time reading this document. In section 2.11 it
says that two 128-bit FMAC and two 128-bit integer ALU ops can be
issued and executed (to the FPU) per cycle, so that would seem to
me to say the one FMAC pipe could do 4 single-precision FADDs or
FMULs or FMACs each cycle and the other could do so also at the same
time.
However, it says at the start of chapter 10 that the SIMD
instructions provide a theoretical single-precision peak
throughput of four additions and four multiplications per clock
cycle which seems to me to contradict what was said in section 2.11.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Paul A. Clayton

Jan 12, 2012, 2:39:03 PM

On Jan 12, 11:57 am, "James Van Buskirk" <not_va...@comcast.net>
wrote:
[snip]

> I am having a hard time reading this document. In section 2.11 it
> says that two 128-bit FMAC and two 128-bit integer ALU ops can be
> issued and executed (to the FPU) per cycle, so that would seem to
> me to say the one FMAC pipe could do 4 single-precision FADDs or
> FMULs or FMACs each cycle and the other could do so also at the same
> time.
> However, it says at the start of chapter 10 that the SIMD
> instructions provide a theoretical single-precision peak
> throughput of four additions and four multiplications per clock
> cycle which seems to me to contradict what was said in section 2.11.

My guess is that it means four FADDs and four separate FMULs
where each uses a distinct 128-bit wide FMAC pipeline. I.e.,
if FMACs cannot be used, the peak throughput is half the number
of floating point operations (where an FMAC counts as two).
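
A quick arithmetic sketch of that reading, assuming two 128-bit FMAC pipes with four single-precision lanes each. The function name and constants are only illustrative, not anything from the guide:

```python
# Peak single-precision throughput per cycle for one FPU, assuming
# two 128-bit FMAC pipes (an assumption taken from the section 2.11 reading).
PIPES = 2
LANES = 128 // 32  # single-precision elements per 128-bit operand


def peak_flops_per_cycle(use_fmac: bool) -> int:
    """Return floating-point operations per cycle across both pipes.

    With FMAC, every lane performs a multiply and an add (2 flops).
    Without FMAC, every lane performs one add or one multiply (1 flop).
    """
    flops_per_lane = 2 if use_fmac else 1
    return PIPES * LANES * flops_per_lane


without_fmac = peak_flops_per_cycle(use_fmac=False)  # 4 adds + 4 multiplies
with_fmac = peak_flops_per_cycle(use_fmac=True)      # 8 fused multiply-adds
print(without_fmac, with_fmac)
```

Under these assumptions the chapter 10 figure (four additions plus four multiplications) is the non-FMAC case, and the FMAC case doubles it.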

Andy (Super) Glew

Jan 13, 2012, 2:39:49 AM

On 1/11/2012 10:15 AM, Tim McCaffrey wrote:
> http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf
>
> Talks a little about internals of the processor.

Some notes, and some thoughts:

p. 21 - a single macro-op can handle a load and a store to the same address,
whereas a micro-op can only be a load or a store.

This isn't new to Bulldozer - e.g. the Family 10h and 12h processors had
it - but I still find it interesting. Heck, I think that it would be
entirely reasonable to perform a single address calculation and TLB
lookup for both the load part and the store part of an instruction like
INC mem (unlocked, to avoid other issues). Saves power, reduces uops, and

Many folks discuss whether or not stores should be split into separate
address and store operations within the processor. P6 did this in
microcode, but more recent Intel processors have fused them, firing store
address and store data separately out of a single scheduler entry. Mitch has
discussed his design (and I think also CDC? or Burroughs?) where you
send an address to an address register, and then read or write the
corresponding data register to perform a memory reference
- which to some extent exposes the separation of address and data parts
of both loads and stores.

With separate loads and stores, an x86 instruction like INC mem might
look like:
INC mem
tmp := load( M[seg:+basereg+indexreg+offset] )
tmp := tmp + 1
store( M[seg:+basereg+indexreg+offset] := tmp )

With split store_address and store_data:
INC mem
tmp := load( M[seg:+basereg+indexreg+offset] )
tmp := tmp + 1
store_address( M[seg:+basereg+indexreg+offset] )
store_data( tmp )

which you can really think of as:

INC mem
tmp := load( M[seg:+basereg+indexreg+offset] )
tmp := tmp + 1
store_buffer_entry.addr := &M[seg:+basereg+indexreg+offset]
store_buffer_entry.data := tmp
// and, much later, after retirement, not an OOO uop:
// M[store_buffer_entry.addr] := store_buffer_entry.data

Sharing the address computation for loads and stores:
INC mem
lsq_entry.addr := &M[seg:+basereg+indexreg+offset]
tmp := load( M[lsq_entry.addr] )
tmp := tmp + 1
lsq_entry.store_data := tmp
// and, much later, after retirement, not an OOO uop:
// M[lsq_entry.addr] := lsq_entry.store_data

Optimizing:
INC mem
tmp := load( M[lsq_entry.addr := &M[seg:+basereg+indexreg+offset]])
lsq_entry.store_data := tmp+1
// and, much later, after retirement, not an OOO uop:
// M[lsq_entry.addr] := lsq_entry.store_data
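
A toy model of the shared-address variant, just to make the bookkeeping concrete. Everything here (LsqEntry, inc_mem, commit, the agen counter) is a made-up name for illustration, segmentation is ignored, and it says nothing about how Bulldozer actually implements this:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LsqEntry:
    """One load/store-queue entry shared by the load and the store halves."""
    addr: int
    store_data: Optional[int] = None


def inc_mem(memory: Dict[int, int], base: int, index: int, offset: int,
            stats: Dict[str, int]) -> LsqEntry:
    # One address calculation (in hardware: also one TLB lookup) for both halves.
    entry = LsqEntry(addr=base + index + offset)
    stats["agen"] += 1
    tmp = memory.get(entry.addr, 0)   # load part reuses entry.addr
    entry.store_data = tmp + 1        # ALU result goes straight into the entry
    return entry


def commit(memory: Dict[int, int], entry: LsqEntry) -> None:
    # Happens after retirement; this is not an out-of-order uop.
    memory[entry.addr] = entry.store_data


memory = {0x1008: 41}
stats = {"agen": 0}
entry = inc_mem(memory, base=0x1000, index=0, offset=8, stats=stats)
value_before_commit = memory[0x1008]  # store not yet visible in memory
commit(memory, entry)
value_after_commit = memory[0x1008]
```

The point of the model is only that the address is generated once and the incremented value becomes visible at commit, not at execution.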

p. 24 - the usual melange of partial registers and merging

p. 25 - load-execute instructions for unaligned data

32B instruction fetch
p. 26 - align hot loops to 32 bytes instead of 16
(32B instruction fetch)

p. 30 - superforwarding

p. 33
D$ - 4 ways, way prediction
16 banks, 16 bytes wide (i.e. intra cache line banks + inter)
4 cycles load-to-use

L2$ load to use latency 20+ cycles
mostly inclusive

L3$ - non-inclusive victim cache
- not strictly exclusive, if only because of other cores
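
A small sketch of what the 16-bank, 16-byte-wide D$ arrangement could mean for two loads in the same cycle. The guide does not say which address bits select the bank; bits 7:4 are my assumption here, and the function names are invented:

```python
BANK_BYTES = 16   # each bank is 16 bytes wide (from the guide)
NUM_BANKS = 16    # 16 banks (from the guide)


def bank_of(addr: int) -> int:
    """Bank index under the ASSUMED mapping: address bits 7:4 pick the bank."""
    return (addr // BANK_BYTES) % NUM_BANKS


def loads_conflict(addr_a: int, addr_b: int) -> bool:
    """True if two same-cycle loads would want the same bank in this model."""
    return bank_of(addr_a) == bank_of(addr_b)


print(loads_conflict(0x1000, 0x1010))  # neighbouring 16-byte chunks
print(loads_conflict(0x1000, 0x1100))  # 256 bytes apart
```

Real hardware may well let two loads to the same 16-byte chunk proceed together; this simplified model ignores that case.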

p. 34
Branch Prediction
Predicted taken bubble
- 1 cycle predicted by L1 BTB
- 4 cycles predicted by L2 BTB
Minimum branch misprediction penalty
- 20 cycles conditional and indirect
- 15 cycles unconditional direct and returns

(GLEW COMMENT: I'm a little puzzled by this. I think that it means
that the unconditional direct branches and returns are recognized by the
decoder, even if BTB hitting. Which is standard. But it also seems to
imply that they have not optimized, much, the path reporting these,
what P6 called "shortstop branch mispredictions".
Also, unconditional direct branches can always be handled at the
decoder, but returns can USUALLY but not always. You can recognize a
return that missed the BTB, and use your return stack predictor - but it
is always possible that your return stack predictor may be wrong. So I
think this 15 cycle penalty applies when the return misses both BTBs,
but hits the return stack predictor.)

"... accessed using the fetch address of the current window. Each BTB
entry includes information about a branch and its target."
- since there may be more than one branch in a fetch block, I think they
are using the "ways" to handle that possibility. Which is a slight
terminology stretch, but one that I have already encountered.

"Most of the time, as calls are fetched, the next return address is
pushed onto the return stack and subsequent returns pop a predicted
return address off the top of the stack. However, mispredictions
sometimes arise during speculative execution. Mechanisms exist to
restore the stack to a consistent state after these mispredictions."
- i.e. the stack isn't purely a stack.
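
One common way to do that kind of repair, sketched below: checkpoint the top-of-stack pointer plus the top entry at each prediction, and put both back on a misprediction. This is a generic scheme, not a description of AMD's mechanism, and the class and method names are invented:

```python
class ReturnStack:
    """Return-address predictor with checkpoint/restore (illustrative only)."""

    def __init__(self, depth: int = 8) -> None:
        self.depth = depth
        self.entries = [0] * depth
        self.top = 0  # index of the next free slot; wraps modulo depth

    def push(self, return_addr: int) -> None:
        self.entries[self.top % self.depth] = return_addr
        self.top += 1

    def pop(self) -> int:
        self.top -= 1
        return self.entries[self.top % self.depth]

    def checkpoint(self):
        # Save the pointer and the current top entry, which a wrong-path
        # pop followed by a push would overwrite.
        tos = (self.top - 1) % self.depth
        return (self.top, tos, self.entries[tos])

    def restore(self, ckpt) -> None:
        # Repairs only the pointer and the single saved entry.
        self.top, tos, value = ckpt
        self.entries[tos] = value


rs = ReturnStack()
rs.push(0x400100)        # call on the correct path
ckpt = rs.checkpoint()   # taken when a later branch is predicted
rs.pop()                 # wrong-path return
rs.push(0xBAD)           # wrong-path call overwrites the old top slot
rs.restore(ckpt)         # misprediction detected
recovered = rs.pop()
```

Restoring the pointer alone would have predicted 0xBAD here; deeper wrong-path damage (two pops, then two pushes) is not repaired by this sketch.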

Q: does Intel do this yet?

p. 35 TLBs

L1-DTLB - 32 entries, fully associative, multiple page sizes
L2-DTLB - 1024 with 4K, 2M, 1G, 8 way set associative
- Q: how does this work?

p. 38 LSU

2 128 bit loads/cycle (to different L1 banks)
+ 1 128 bit store
Q: how does this reconcile with
"The LS unit is composed of two largely independent pipelines
enabling the execution of two memory operations per cycle"
-- I think it may mean that store commit is special. (Or ???)

24 entry store queue.
40 entry load buffer

p. 39

4 write combining buffers
*and* Write Coalescing Cache

- I suspect the write-through L1$ was causing pains.

Prefetchers

One prefetcher from DRAM ... into a prefetch buffer, NOT the L1, L2, or
L3. (Again, not new with Bulldozer, but still interesting.) They don't
say how big a prefetch buffer.

And another prefetcher into the L1 or L2.

At page 104, they talk about 3 prefetchers: L1, L2 region prefetcher,
and the DRAM prefetcher.

p. 40 - 2 or 4 DRAM channels.

p. 41 - Hypertransport assist, aka probe filtering (not new).

Steals part of the L3 to use as a directory.

p. 52 store-to-load-forwarding prediction
"The AMD Family 15h processor contains hardware to accelerate such
store-to-load dependencies, allowing the load to obtain the store data
before it has been written to memory."

They don't describe the predictor.

The University of Wisconsin sued Intel over a patent on such a predictor.
My advisor, Guri Sohi, and several of my friends (Andreas Moshovos,
etc.) were inventors on that patent. Intel settled.
I'm sure AMD did, too.

http://dotnet.sys-con.com/node/499104

p. 80 - load-execute instructions are preferred over separate load and
execute instructions. (Not new.)

p. 81 - unaligned large (SIMD, 128 bit and 64 bit) accesses are recommended. Better
misaligned support.

p. 82 - "Take Advantage of x86 and AMD64 Complex Addressing Modes"

segbase + basereg + indexreg<<scale + offset not penalized.

(x86s thrash about this. P6 did, Banias didn't, etc. Bulldozer does
it, like P6.) (Well, except for segbase?)

p. 85 - partial register writes

Seems to use a merge on write scheme, not merge on read.

Optimization for initializing instructions, and writing upper bits of XMM
registers. 2 zero-bit flags dataflow through the processor.

"Another optimization recognizes MOVLPD/MOVHPD pairs and internally
converts the MOVLPD to a MOVSD xmm, mem."

p. 89 LEAVE instruction is recommended.

p. 90 SHLD deprecated - VectorPath.

Boo hoo, for people like me who like BitBlt.

p. 92, NOPs of various sizes.

"Note that NOP instructions which contain more than three prefix bytes
degrade performance"

p. 98 "A misaligned store or load operation suffers a minimum one-cycle
penalty in the processor’s load-store pipeline. Also, using misaligned
loads and stores increase the likelihood of encountering a store-to-load
forwarding pitfall, especially when operating in long mode"

"Store forwarding only occurs when the load
virtual address exactly matches the store virtual address and the store
size is greater than or equal to
the load size."
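
The quoted forwarding rule, written out as a predicate. The function name is mine; the two conditions are taken directly from the quote:

```python
def can_forward(store_addr: int, store_size: int,
                load_addr: int, load_size: int) -> bool:
    """Store-to-load forwarding rule as quoted from the guide:
    exact virtual-address match, and the store at least as wide as the load."""
    return load_addr == store_addr and store_size >= load_size


print(can_forward(0x1000, 8, 0x1000, 4))  # narrower load, same address
print(can_forward(0x1000, 4, 0x1000, 8))  # load wider than the store
print(can_forward(0x1000, 8, 0x1004, 4))  # load from the upper half of the store
```

Note the third case: a load of the upper half of a wider store does not forward under this rule, even though all its bytes are in the store.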

p. 102 "Choose linear addresses for the source and destination operands
of REP MOVS/CMPS that are not an
exact multiple of 4K pages away from each other."
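
A trivial check for that REP MOVS/CMPS advice, assuming the addresses in question are the linear addresses of the two buffers. The helper name is invented:

```python
PAGE_BYTES = 4096


def is_4k_multiple_apart(src: int, dst: int) -> bool:
    """True if src and dst differ by an exact multiple of 4K,
    the layout the guide says to avoid for REP MOVS/CMPS operands."""
    return (dst - src) % PAGE_BYTES == 0


print(is_4k_multiple_apart(0x10000, 0x12000))  # exactly two pages apart
print(is_4k_multiple_apart(0x10000, 0x12040))  # offset by an extra 64 bytes
```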

They describe the STLF process in more detail than I am used to seeing:

"As mentioned in the previous section, store-to-load forwarding occurs
when the store address
matches the load address. This address match is split into two stages.
In the first stage, bits 4:11 of the
store and the load addresses are matched. In addition the double word
mask of the store and load
addresses is matched. The double word mask indicates whether the
load/store pair is accessing the
same double word in a 16-byte bank. If both these parameters match, then
a store-to-load forward is
initiated. In the second stage the remaining bits 12:47 of the store and
load addresses is matched. If
the remaining bits match, then the STLF is considered as a true STLF and
is allowed to proceed.
Otherwise it is considered as a false STLF and the load is cancelled and
retried."
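
A sketch of that two-stage match. The bit ranges are from the quote; treating the doubleword-mask "match" as mask equality, and assuming accesses do not cross a 16-byte boundary, are my simplifications, and all names are invented:

```python
def bits(value: int, lo: int, hi: int) -> int:
    """Extract bits lo..hi (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def dword_mask(addr: int, size: int) -> int:
    """4-bit mask of the doublewords touched inside the 16-byte bank.
    ASSUMES the access does not cross a 16-byte boundary."""
    first = (addr & 0xF) // 4
    last = min(((addr & 0xF) + size - 1) // 4, 3)
    mask = 0
    for dword in range(first, last + 1):
        mask |= 1 << dword
    return mask


def stlf_check(store_addr: int, store_size: int,
               load_addr: int, load_size: int) -> str:
    # Stage 1: bits 4:11 and the doubleword mask must match.
    stage1 = (bits(store_addr, 4, 11) == bits(load_addr, 4, 11)
              and dword_mask(store_addr, store_size)
              == dword_mask(load_addr, load_size))
    if not stage1:
        return "no forward"
    # Stage 2: the remaining bits 12:47 decide true versus false STLF.
    if bits(store_addr, 12, 47) == bits(load_addr, 12, 47):
        return "true STLF"
    return "false STLF (load cancelled and retried)"


print(stlf_check(0x70001008, 4, 0x70001008, 4))
print(stlf_check(0x70001008, 4, 0x70002008, 4))  # same low bits, 4K apart
```

The second call shows why addresses an exact multiple of 4K apart are a hazard: stage 1 fires, stage 2 rejects, and the load is replayed.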

p. 105 prefetching into unmapped pages can result in a significant delay.

(Hmm, I think this means that the AMD prefetcher can prefetch across 4KB
boundaries. Does Intel do this yet? I.e. it operates on virtual, not
physical, addresses.)

I suspect this means that invalid pages are NOT placed into the TLB.

If it is true that invalid addresses are not placed into the TLB, then
every prefetch to the same page may produce a TLB miss.

GLEW OPINION: you need to cache invalid TLB entries, to constrain
prefetch and other speculation. You may want to limit the number of
invalid TLB entries, so as to prevent invalids thrashing out valids.
Other schemes for constraining ...

p. 107 streaming stores are slower on Bulldozer!!!:

1 stream - ok

2 streams - 3X slower. "Only" 1.5X slower by Family 15hv2

4 streams - 3X slower. Comparable

"Using non-temporal stores but not writing out an entire cacheline may
cause performance to be up to six times slower than previous AMD
processors."

- I'm trying to figure out why for this last. Some processors have
burst writes with byte valids - but I did not think AMD did. Otherwise,
why would it be so much slower?

p. 115 - use WC instructions to write code

"If normal store instructions are used to
write the code to memory, then the L2 cache lines will be in a modified
state. When the processor
eventually tries to execute the code, it will miss in the instruction
cache. Because the instruction
cache cannot contain cache lines that are in a modified state, the data
must be flushed to memory
before it can be fetched into the instruction cache. This unnecessarily
evicts possibly useful
information from the caches."

From this we learn several things:

(1) Can't have data in I$ and in M state in D$. I.e. the O part of
MOESI does not apply to the I and D cache.

(2) flushing to memory invalidates the D$? no flush that leaves clean
data behind. (Heck, ARM's cache protocol can support that.)

p. 116 - "On Family 15h parts, instruction caches will invalidate
aliases to a physical page that differ in virtual
addresses bits 14:12. When one physical address is mapped by two or more
virtual addresses that
differ in this way, a performance decrease may be observed. This problem
can be observed primarily
in Linux and other Operating Systems which enable ASLR (Address Space
Layout Randomization)
by default."

Ah, using virtual address bits... Ah, yes: " 64-Kbyte, 2-way set
associative L1 instruction cache. Each line in this cache is 64 bytes long."

Not the D$.
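
The arithmetic behind bits 14:12, as a sketch. The cache geometry (64 Kbyte, 2-way, 64-byte lines) is from the guide; the 4 KB page size and the helper name are my assumptions:

```python
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 64
PAGE_BYTES = 4096  # assumed base page size

sets = CACHE_BYTES // (WAYS * LINE_BYTES)          # number of sets
offset_bits = LINE_BYTES.bit_length() - 1          # byte-in-line bits
index_bits = sets.bit_length() - 1                 # set-index bits
index_lo = offset_bits
index_hi = offset_bits + index_bits - 1
page_bits = PAGE_BYTES.bit_length() - 1

# Index bits that lie above the page offset are virtual, not physical.
virtual_index_bits = list(range(max(index_lo, page_bits), index_hi + 1))


def aliases_differ_in_index(va1: int, va2: int) -> bool:
    """True if two virtual mappings of one physical page differ in bits 14:12."""
    return ((va1 >> 12) & 0x7) != ((va2 >> 12) & 0x7)


print(sets, (index_lo, index_hi), virtual_index_bits)
```

So a 512-set cache indexes with bits 14:6, and the top three of those come from the virtual page number, which is exactly where ASLR, PIC, and page merging can make two mappings of one physical page disagree.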

Waddaya want to bet they did not have any ASLR workloads? And some
"smart" designer or architect, who had not kept up with security, wanted
to be clever by applying what they had learned in school?

"Disable ASLR. Note that this decision has security ramifications."
No shit.

I'm a security wonk. This pisses me off.

More glass jaws

* PIC (Position Independent Code)

* Virtualization merging identical physical pages at different virtual
addresses.

p. 120 compare-branch fusion, adjacent instructions. single uop.

p. 123 - CALL next instruction;POP idiom

p. 127 - "Branches Not-Taken Preferable to Branches Taken ...
Correctly-predicted taken branches have at least one prediction-based
bubble while not-taken branches do not. In addition, taken branches
consume more branch prediction resources."

p. 132 "With the pipelined floating-point adder allowing one FADD every
cycle [still to confirm]," :-)

p. 134 "For functions that create fewer than 25 machine instructions
once inlined, it is likely that the function-call
overhead is close to, or more than, the time spent executing the
function body. In these cases,
function inlining is recommended"

"Function-call overhead on the AMD Family 15h processors can be low
because calls and returns are
executed very quickly due to the use of prediction mechanisms. However,
there is still overhead..."

p. 147 - XOR reg, reg idiom - "AMD Family 15h processors are able to
avoid the false read dependency on the XOR instruction."

Sounds as if this is new with Bulldozer - which is surprising. Intel has
done this since P6.

p. 169 move between integer GPRs and XMM through memory. STLF stalls...

p. 169 Store combining or coalescing:

"The store data path on Family 15h is 128-bits wide. Stores are written
to both the L1 Data Cache and
the L2 via the CU module into the WCC (Write Combining Cache). Writes to
the Data cache which
are unaligned in an "address" are written in two cycles. If consecutive
unaligned addressed 128-bit
loads are written they can be coalesced such that the 64-bit portions of
128-bit writes which were
unaligned can be merged and written 128-bits at a time, removing most
the stall penalties. This is
performed in the Store Coalescing Buffer (SCB). A similar operation is
performed for those writes
which go to the L2 via the CU and WCC. There, the writes are coalesced
in the Coalescing Buffer
(CB). 128-bit stores are preferable because they can be dispatched in 1
uop and they only require one
store queue entry, thus putting less pressure on resources during
execution."

p. 187 MOV optimization

"The latency of certain XMM(SSE/AVX) move instructions that provide an
input operand to a
subsequent compute instruction can be hidden in all cases. This does not
apply to 256-bit operations."

"Move instructions of this type have no latency cost regardless of
location, as the hardware now
provides the alias to each and every instance.
This hardware optimization was initially designed to work with the
MOVAPD, MOVAPS,
MOVDQA, MOVDQU, MOVUPD, and MOVUPS instructions, but works well with
their AVX
variants regardless of 128-bit versions. Other SIMD move instructions
cause a two-cycle delay in
executing the dependent compute instruction. If at all possible, every
effort should be made to use
move instructions that the processor hardware can optimize."

p. 190 16 bit FP (F16c)

I had not noticed that AMD had added 16 bit FP to Bulldozer.
Oh, actually, I had - it's just called CVT16.

Intel had added it to LRB (R.I.P.) Q: has Intel added 16 bit FP to x86
yet, other than the LRB/MIC family?

I'm just waiting for people to start doing arithmetic in FP16. Perhaps
not in FP16 on all operands, but possibly

FP32 += FP16

FP32 = FP32*FP16 + FP16

FP32 = FP32*FP16 + FP32

etc.
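
To make the mixed forms concrete, here is a software emulation of FP32 = FP32*FP16 + FP32 using Python's struct module for the format conversions. It is only a sketch: the FP16 operand is widened first, the arithmetic is done in double and rounded once to FP32, so it is not bit-exact with a real fused operation. Function names are invented:

```python
import struct


def fp16_bits_to_float(bits: int) -> float:
    """Decode an IEEE 754 binary16 bit pattern (the job of a convert instruction)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_fp32(x: float) -> float:
    """Round a Python float (binary64) to binary32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fma_fp32_fp16(acc: float, a: float, b_fp16_bits: int) -> float:
    """FP32 = FP32*FP16 + FP32, widening the FP16 operand before the math."""
    b = fp16_bits_to_float(b_fp16_bits)
    return to_fp32(a * b + acc)


HALF_ONE = 0x3C00    # 1.0 in binary16
HALF_POINT5 = 0x3800  # 0.5 in binary16
print(fp16_bits_to_float(HALF_ONE), fma_fp32_fp16(1.0, 3.0, HALF_POINT5))
```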

p. 203

"When a modification [to code] does cross an aligned 8-byte boundary,
then care must be taken that the executor
not see an invalid combination of the original code and the new code.
There is no architectural store
instruction, including instructions that use the lock prefix, to ensure
that an executor will not see a
combination of the original code and the new code."

Lesson learned the hard way: atomicity of instruction fetch matters.

p. 210 Memory Barriers

Memory Barriers in WB Memory
In AMD64 architecture, when using writeback (WB) type memory without
streaming stores, the only
barriers that require an explicit barrier instruction are of the types
Store/Load and Store/Store. In WB
memory, all other barriers are implicit.

3 ways: SFENCE or MFENCE, locked instructions, serializing instructions
such as CPUID

Note that LFENCE is not mentioned. Q: does AMD make LFENCE a NOP in WB?

p. 229
"When migrating virtual machines between platforms with different
operating frequencies, there may
be problems with software that is dependent on a constant frequency TSC.
Family 15h processors
provide a new MSR called "Timestamp Counter Ratio (TscRateMsr)" which
allows the frequency of
the timestamp counter to be scaled to a fraction of the maximum
processor frequency of the host
system."

p. 244

CMPXCHG8 is one cycle slower than CMPXCHG16/32/64. Interesting. I
wonder why?

p. 315 special bypass latencies

Paul A. Clayton

Jan 13, 2012, 8:26:13 AM

On Jan 13, 2:39 am, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 1/11/2012 10:15 AM, Tim McCaffrey wrote:
>
> >http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf
>
> > Talks a little about internals of the processor.
>
> Some notes, and some thoughts:
>
> p. 21 - a single macro-op can handle a load and a store to the same address,
> whereas a micro-op can only be a load or a store.

[snip]

> With separate loads and stores, an x86 instruction like INC mem might
> look like:
> INC mem
> tmp := load( M[seg:+basereg+indexreg+offset] )
> tmp := tmp + 1
> store( M[seg:+basereg+indexreg+offset] := tmp )

[snip]

> Optimizing
> INC mem
> tmp := load( M[lsq_entry.addr := &M[seg:+basereg+indexreg+offset]])
> lsq_entry.store_data := tmp+1
> // and, much later, after retirement, not an OOO uop:
> // M[lsq_entry.addr] := lsq_entry.store_data

Yes, and the optimization of sending the execution result to the
store queue could be applied to other local stores.
Since the store queue would presumably be a different
physical storage structure, multiplexing the sending
of the result (in cases where it is also preserved)
might not be a problem.

This also raises the question of whether communicating
deadness of a register value after it is stored would
be sufficiently useful.

[snip]

> p. 33
> D$ - 4 ways, way prediction
> 16 banks, 16 bytes wide (i.e. intra cache line banks + inter)
> 4 cycles load-to-use

I wonder what the way predictor is like.

[snip]

> "Most of the time, as calls are fetched, the next return address is
> pushed onto the return stack and subsequent returns pop a predicted
> return address off the top of the stack. However, mispredictions
> sometimes arise during speculative execution. Mechanisms exist to
> restore the stack to a consistent state after these mispredictions."
> - i.e. the stack isn't purely a stack.
>
> Q: does Intel do this yet?

Yes, according to David Kanter's article on Nehalem
( http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=4 ).

>
> p. 35 TLBs
>
> L1-DTLB - 32 entries, fully associative, multiple page sizes
> L2-DTLB - 1024 with 4K, 2M, 1G, 8 way set associative
> - Q: how does this work?

Wild guess: hash-rehash. With huge pages and up to 32
L1 entries, the latency for a second or third probe
might not be so bad.

> p. 38 LSU
>
> 2 128 bit loads/cycle (to different L1 banks)
> + 1 128 bit store
> Q: how does this reconcile with
> "The LS unit is composed of two largely independent pipelines
> enabling the execution of two memory operations per cycle"
> -- I think it may mean that store commit is special. (Or ???)

Special store commit would be a good guess.

[snip]

> p. 52 store-to-load-forwarding prediction
> "The AMD Family 15h processor contains hardware to accelerate such
> store-to-load dependencies, allowing the load to obtain the store data
> before it has been written to memory."
>
> They don't describe the predictor.

Unfortunately for the curious, predictors often seem
to be left undescribed.

> The University of Wisconsin sued Intel over a patent on such a predictor.
> My advisor, Guri Sohi, and several of my friends (Andreas Moshovos,
> etc.) were inventors on that patent. Intel settled.
> I'm sure AMD did, too.
>
> http://dotnet.sys-con.com/node/499104

So keeping such things secret allows violating patents? :-/

> p. 82 - "Take Advantage of x86 and AMD64 Complex Addressing Modes"
>
> segbase + basereg + indexreg<<scale + offset not penalized.
>
> (x86s thrash about this. P6 did, Banias didn't, etc. Bulldozer does
> it, like P6.) (Well, except for segbase?)

I am guessing that this is meant to allow low-overhead
use for thread-local storage.

> "Another optimization recognizes MOVLPD/MOVHPD pairs and internally
> converts the MOVLPD to a MOVSD xmm, mem."

So _very_ good support for unaligned memory accesses.

[snip]

> p. 105 prefetching into unmapped pages can result in a significant delay.
>
> (Hmm, I think this means that the AMD prefetcher can prefetch across 4KB
> boundaries. Does Intel do this yet? I.e. it operates on virtual, not
> physical, addresses.)
>
> I suspect this means that invalid pages are NOT placed into the TLB.
>
> If it is true that invalid addresses are not placed into the TLB, then
> every prefetch to the same page may produce a TLB miss.
>
> GLEW OPINION: you need to cache invalid TLB entries, to constrain
> prefetch and other speculation. You may want to limit the number of
> invalid TLB entries, so as to prevent invalids thrashing out valids.
> Other schemes for constraining ...

In the case of next page within a cache block of PTEs, the
TLB entry could have two bits to indicate if previous and
subsequent pages are valid (just supporting subsequent
page information might be adequate).

Andy (Super) Glew

Jan 13, 2012, 1:14:13 PM

On 1/13/2012 5:26 AM, Paul A. Clayton wrote:
> On Jan 13, 2:39 am, "Andy (Super) Glew"<a...@SPAM.comp-arch.net>
> wrote:

>> p. 35 TLBs
>>
>> L1-DTLB - 32 entries, fully associative, multiple page sizes
>> L2-DTLB - 1024 with 4K, 2M, 1G, 8 way set associative
>> - Q: how does this work?
>
> Wild guess: hash-rehash. With huge pages and up to 32
> L1 entries, the latency for a second or third probe
> might not be so bad.

Ah, yes, thanks.

We've talked about this before.

I think that hash-rehash probing to allow multiple page sizes to be
stored in a TLB is something I really need to write up for the
comp-arch.net wiki. Since I, at least, keep forgetting it.

Andy (Super) Glew

Jan 13, 2012, 1:24:08 PM

On 1/13/2012 5:26 AM, Paul A. Clayton wrote:

> On Jan 13, 2:39 am, "Andy (Super) Glew"<a...@SPAM.comp-arch.net>

> [snip]
>> p. 105 prefetching into unmapped pages can result in a significant delay.
>>
>> (Hmm, I think this means that the AMD prefetcher can prefetch across 4KB
>> boundaries. Does Intel do this yet? I.e. it operates on virtual, not
>> physical, addresses.)
>>
>> I suspect this means that invalid pages are NOT placed into the TLB.
>>
>> If it is true that invalid addresses are not placed into the TLB, then
>> every prefetch to the same page may produce a TLB miss.
>>
>> GLEW OPINION: you need to cache invalid TLB entries, to constrain
>> prefetch and other speculation. You may want to limit the number of
>> invalid TLB entries, so as to prevent invalids thrashing out valids.
>> Other schemes for constraining ...
>
> In the case of next page within a cache block of PTEs, the
> TLB entry could have two bits to indicate if previous and
> subsequent pages are valid (just supporting subsequent
> page information might be adequate).

I think AMD said that the largest stride they predict cannot be more
than a page, so this *might* work - but it would require the prefetcher
to look up both the address to be prefetched and the real-fetch that
caused the prefetch.

Methinks it is just simpler to cache invalid TLB entries.

Heck, if you are loading a cache line of TLB entries from the page
tables - e.g. 8 at a time - and not storing them individually, but
having a larger TLB storage block that holds all 8 - then you are
already storing invalid TLB entries, since there may be only one valid
TLB entry in a block of multiple adjacent TLB entries.

Heck, doing just this would solve many (but not all) problems with
speculative TLB misses to invalid pages. It would automatically limit
the amount of wasted space.
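
A toy version of the block-of-entries idea, to show how invalid neighbours come along for free. The block size of 8 is the "e.g. 8 at a time" above; the class, the dict-based page table, and the walk counter are invented for illustration, and eviction is not modeled:

```python
ENTRIES_PER_BLOCK = 8  # e.g. eight PTEs loaded together from one cache line


class BlockTlb:
    """TLB that stores whole blocks of adjacent translations, valid or not."""

    def __init__(self, page_table: dict) -> None:
        self.page_table = page_table  # vpn -> pfn; a missing vpn is invalid
        self.blocks = {}              # block number -> list of pfn or None
        self.walks = 0                # number of page-table walks performed

    def lookup(self, vpn: int):
        block_no, slot = divmod(vpn, ENTRIES_PER_BLOCK)
        if block_no not in self.blocks:
            # One walk fills the whole block, invalid entries included.
            self.walks += 1
            base = block_no * ENTRIES_PER_BLOCK
            self.blocks[block_no] = [self.page_table.get(base + i)
                                     for i in range(ENTRIES_PER_BLOCK)]
        return self.blocks[block_no][slot]  # None means "known invalid"


tlb = BlockTlb({16: 0x99})     # only virtual page 16 is mapped
hit = tlb.lookup(16)           # walks once, returns the translation
miss_1 = tlb.lookup(17)        # invalid neighbour, answered without a walk
miss_2 = tlb.lookup(17)        # repeated prefetch to the same invalid page
print(hit, miss_1, miss_2, tlb.walks)
```

In this model repeated prefetches into page 17 cost no further walks, which is the property being argued for.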

It's just one step more to caching an entire block of invalid TLB
entries. And then reducing the block size to one.

Here's something fun: merge adjacent TLB entries in a fully associative
multipagesize TLB, by adjusting a bit-per-bit mask. Really easy for
invalid TLB entries. A bit harder for valid, since you would have to
detect that the physical addresses are adjacent. Or, perhaps, different
only in a few lower bits.

Brett Davis

Jan 13, 2012, 1:48:55 PM

In article <4F0FDFC5...@SPAM.comp-arch.net>,

"Andy (Super) Glew" <an...@SPAM.comp-arch.net> wrote:

> On 1/11/2012 10:15 AM, Tim McCaffrey wrote:
> > http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf
> >
> > Talks a little about internals of the processor.
>
> Some notes, and some thoughts:
>

> p. 39
>
> 4 write combining buffers
> *and* Write Coalescing Cache
>
> - I suspect the write-through L1$ was causing pains.

So are there bugs with this that can be fixed with more hardware,
or is AMD going to dump the write-through L1 cache?
Is a write-through L1 a mistake, as some benchmarks show?
(Not enough L1 write bandwidth.)

There are tradeoffs on the other side of the coin, AMD thought
they had good reasons to do this.

> p. 80 - load-execute instructions are preferred over separate load and
> execute instructions. (Not new.)

Does this have any implications on future non-x86 designs?
If you were designing a new high end CPU instruction set would you include
add from memory opcodes?

> p. 81 - unaligned large (SIMD, 128 bit and 64 bit) accesses are recommended. Better
> misaligned support.

The world is misaligned, CPUs are finally getting on board with reality.

> p. 82 - "Take Advantage of x86 and AMD64 Complex Addressing Modes"
>
> segbase + basereg + indexreg<<scale + offset not penalized.
>
> (x86s thrash about this. P6 did, Banias didn't, etc. Bulldozer does
> it, like P6.) (Well, except for segbase?)

The example used two registers and an offset, not three.
I wonder about the hit for segbase also.
LEA gets much of the benefit of ARM's shifted compute.

> p. 90 SHLD deprecated - VectorPath.
>
> Boo hoo, for people like me who like BitBlt.

To implement SHLD in one cycle would seem to require two shifters and an OR.

> p. 132 "With the pipelined floating-point adder allowing one FADD every
> cycle [still to confirm]," :-)

The hardware guys seem to hate the idea of three read ports on an ALU,
with just 2 ports it takes 3 cycles for 2 consecutive MACs.
(1.5 cycles per MAC average.)
Floating point math tends to have long chains, converted to a streaming
accumulator format you would only need ~1.2 read ports on average.
Mixing a few MAC instructions into the chain and two ports are actually
plenty.

Is this how things really work, or are we still in the age of independent
RISC micro-ops?

Page 20 intro says instructions are decoded into fixed length (RISC)
macro-ops for tracking, which are then broken into micro-ops for
execution.

> p. 190 16 bit FP (F16c)
>
> I had not noticed that AMD had added 16 bit FP to Bulldozer.
> Oh, actually, I had - it's just called CVT16.
>
> Intel had added it to LRB (R.I.P.) Q: has Intel added 16 bit FP to x86
> yet, other than the LRB/MIC family?
>
> I'm just waiting for people to start doing arithmetic in FP16. Perhaps
> not in FP16 on all operands, but possibly
>
> FP32 += FP16
>
> FP32 = FP32*FP16 + FP16
>
>
> FP32 = FP32*FP16 + FP32
>
> etc.

The opcode bits required to multiply different types are excessive,
no one mixes singles and doubles, so I cannot see it ever happening.
ConVerT 16 bit float pretty much implies that no math will ever be
done in 16 bit float format.

AMD has a separate CVT unit on pipe 0, it's not a free operation.

Is my world view wrong on this?

Mixed precision arithmetic seems to be a hot topic, mostly on RAM
bandwidth concerns.

Terje Mathisen

Jan 13, 2012, 2:12:32 PM

They probably exist, but I don't know about any architecture which does
arithmetic in FP16:

You either have implicit or explicit conversion to FP32 before you do
any actual fp work.

In the places where this is most common (i.e. graphics) you actually
need FP32 in order to avoid artifacts.

Afair LRB/MIC has the conversion circuitry as part of the load unit, so
it can do load-execute even if the input happens to be 4 FP16 variables
in a 64-bit word where each of the inputs is replicated to 4 FP32 vector
slots.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Tim McCaffrey

Jan 13, 2012, 2:48:29 PM

In article <2ac6u8-...@ntp6.tmsw.no>, "terje.mathisenattmsw.no" says...

>In the places where this is most common (i.e. graphics) you actually
>need FP32 in order to avoid artifacts.

I'm probably mis-remembering, but I thought FP16 was used mostly for
Z-Buffers (which makes sense), not for doing math.

- Tim

Terje Mathisen

Jan 13, 2012, 3:36:42 PM

OK, that could well work.

I've seen it for high dynamic range textures, where you do convert on load.

Paul A. Clayton

Jan 13, 2012, 4:39:11 PM

On Jan 13, 2:12 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
[snip]

> They probably exist, but I don't know about any architecture which does
> arithmetic in FP16:
>
> You either have implicit or explicit conversion to FP32 before you do
> any actual fp work.
>
> In the places where this is most common (i.e. graphics) you actually
> need FP32 in order to avoid artifacts.

I thought the new IEEE standard indicated that it was only
an in-memory format. Of course, this would only discourage
the use of the format in computation not prohibit its use.

Paul A. Clayton

Jan 13, 2012, 4:44:18 PM

On Jan 13, 1:14 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip hash-rehash speculation]

> Ah, yes, thanks.
>
> We've talked about this before.
>
> I think that hash-rehash probing to allow multiple page sizes to be
> stored in a TLB is something I really need to write up for the
> comp-arch.net wiki. Since I, at least, keep forgetting it.

This was already done a while ago.

http://semipublic.comp-arch.net/wiki/TLB_Structures_for_Multiple_Page_Sizes#Multiple_Indexing_Methods

"One simple way of implementing this is to initially
probe the TLB assuming a particular page size and on
a miss to probe the TLB for each alternative page size."

It might be appropriate to add the "hash-rehash" term,
which could link to an article pointing to other uses
of the technique (data caches mainly, I think).
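
A minimal sketch of the probe-then-reprobe scheme quoted above, in Python. The class name, the two page sizes, the probe order and the 16-set size are illustrative assumptions, not details of any real TLB:

```python
# Hypothetical model: one TLB structure probed once per candidate page size.
PAGE_SHIFTS = (12, 21)   # assumed probe order: 4 KiB first, then 2 MiB
NUM_SETS = 16            # illustrative size only


class RehashTLB:
    def __init__(self):
        # Each set maps (page_shift, virtual page number) -> physical frame number.
        self.sets = [dict() for _ in range(NUM_SETS)]

    def _index(self, vpn):
        return vpn % NUM_SETS

    def insert(self, vaddr, paddr, page_shift):
        vpn = vaddr >> page_shift
        self.sets[self._index(vpn)][(page_shift, vpn)] = paddr >> page_shift

    def lookup(self, vaddr):
        """Return (paddr, probes) on a hit, or (None, probes) on a miss."""
        probes = 0
        for shift in PAGE_SHIFTS:
            probes += 1
            vpn = vaddr >> shift          # the index depends on the assumed page size
            pfn = self.sets[self._index(vpn)].get((shift, vpn))
            if pfn is not None:
                offset = vaddr & ((1 << shift) - 1)
                return (pfn << shift) | offset, probes
        return None, probes
```

In this model a 4 KiB hit costs one probe and a 2 MiB hit costs two, which is the latency price of keeping both page sizes in one structure.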

Paul A. Clayton

Jan 13, 2012, 6:10:26 PM

On Jan 11, 1:15 pm, timcaff...@aol.com (Tim McCaffrey) wrote:
> http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf
>
> Talks a little about internals of the processor.

Interestingly (if it is not a documentation mistake), gselect is
used rather than gshare (section 2.6 [p. 34], "The global
predictor is accessed via a 2-bit address hash and a 12-bit
global history."). I am not certain if the selector being
per-address is unusual (it might be appropriate to merge the
selector and branch identifier table, in which case per-address
makes sense).
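
For readers who do not have the two schemes at their fingertips, here is a rough Python sketch of the difference in index formation. The 2-bit and 12-bit widths come from the quoted guide text; the function names and the 14-bit gshare width are assumptions made for comparison only:

```python
ADDR_BITS = 2       # "2-bit address hash" from the quoted guide text
HISTORY_BITS = 12   # "12-bit global history" from the quoted guide text


def gselect_index(pc_hash, history):
    """gselect: concatenate address-hash bits with global-history bits."""
    a = pc_hash & ((1 << ADDR_BITS) - 1)
    h = history & ((1 << HISTORY_BITS) - 1)
    return (a << HISTORY_BITS) | h


def gshare_index(pc, history, index_bits=ADDR_BITS + HISTORY_BITS):
    """gshare: XOR address bits with global-history bits into one index."""
    mask = (1 << index_bits) - 1
    return (pc ^ history) & mask
```

Both produce a 14-bit table index here; gselect spends only two of those bits on the branch address, while gshare folds more address bits in at the cost of extra aliasing patterns.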

An architectural wrinkle that I did not know about: SSE
instructions are defined not to affect the upper bits of YMM
(AVX) registers. I wonder what the rationale was for this
behavior. Zeroing destinations (as AVX does and, I think, x86
does for 64-bit registers) seems to make more sense, especially
for SIMD registers.

Andy (Super) Glew

Jan 14, 2012, 8:23:51 PM

On 1/13/2012 10:48 AM, Brett Davis wrote:
> In article<4F0FDFC5...@SPAM.comp-arch.net>,
> "Andy (Super) Glew"<an...@SPAM.comp-arch.net> wrote:

>> p. 80 - load-execute instructions are preferred over separate load and
>> execute instructions. (Not new.)
>
> Does this have any implications on future non-x86 designs?
> If you were designing a new high end CPU instruction set would you include
> add from memory opcodes?

I would not.

However, I might make one of the register numbers stand for
"Non-register Bypass".

So that we might have

non-register-bypass := LOAD( M[...] )
Rdest := ADD( Rsrc + non-register-bypass )

And I might consider requiring that such instructions be adjacent,
and that no exceptions could occur between them.

Essentially, broadening the register file, but in a way such that you
never have to preserve the value of the non-register bypass in
architectural state.

Essentially providing load-op instructions, without increasing the
instruction set.

Andy (Super) Glew

Jan 14, 2012, 8:25:32 PM

On 1/13/2012 10:48 AM, Brett Davis wrote:

> In article<4F0FDFC5...@SPAM.comp-arch.net>,
> "Andy (Super) Glew"<an...@SPAM.comp-arch.net> wrote:

>> p. 82 - "Take Advantage of x86 and AMD64 Complex Addressing Modes"
>>
>> segbase + basereg + indexreg<<scale + offset not penalized.
>>
>> (x86s thrash about this. P6 did, Banias didn't, etc. Bulldozer does
>> it, like P6.) (Well, except for segbase?)
>
> The example used two registers and an offset, not three.
> I wonder about the hit for segbase also.
> LEA gets much of the benefit of ARM's shifted compute.

AMD traditionally takes an extra cycle for non-zero segbase. Recent
Intel processors may do so.

In 64-bit all segbases are zero. Except for VmWare...

Andy (Super) Glew

Jan 14, 2012, 8:27:25 PM

On 1/13/2012 10:48 AM, Brett Davis wrote:

> In article<4F0FDFC5...@SPAM.comp-arch.net>,
> "Andy (Super) Glew"<an...@SPAM.comp-arch.net> wrote:

>> p. 90 SHLD deprecated - VectorPath.
>>
>> Boo hoo, for people like me who like BitBlt.
>
> To implement SHLD in one cycle would seem to require two shifters and an OR.

Funnel shift => form a double-width value by concatenating, and then
shift that and extract a single-width chunk.

But since the datapaths are probably bit interleaved, it amounts to the
same thing.
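
A small Python model of the two views, to show they compute the same result. The function names are made up for this sketch, and reducing the count modulo the operand width is a simplification of the real x86 count-masking rules:

```python
def shld_funnel(dest, src, count, width=32):
    """Funnel view: concatenate dest:src, shift, extract one width-bit chunk."""
    mask = (1 << width) - 1
    count %= width                                   # simplified count handling
    wide = ((dest & mask) << width) | (src & mask)   # double-width value
    return (wide >> (width - count)) & mask          # window starting `count` bits down


def shld_two_shifters(dest, src, count, width=32):
    """Two shifters and an OR, as suggested in the quoted question."""
    mask = (1 << width) - 1
    count %= width
    return (((dest & mask) << count) | ((src & mask) >> (width - count))) & mask
```

For example, shifting 0x12345678 left by 8 with 0x9ABCDEF0 as the fill source gives 0x3456789A either way.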

Andy (Super) Glew

Jan 14, 2012, 8:30:15 PM

Note that I said stuff like

FP32 += FP16

as opposed to doing a conversion

FP32.tmp := FP16
FP32 += FP32.tmp

It's just instruction count, and preserving the putative advantages of
load-execute.

Andy (Super) Glew

Jan 14, 2012, 8:37:12 PM

On 1/13/2012 3:10 PM, Paul A. Clayton wrote:
> On Jan 11, 1:15 pm, timcaff...@aol.com (Tim McCaffrey) wrote:
>> http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf
>>
>> Talks a little about internals of the processor.
>
> Interestingly (if it is not a documentation mistake), gselect is
> used rather than gshare (section 2.6 [p. 34], "The global
> predictor is accessed via a 2-bit address hash and a 12-bit
> global history."). I am not certain if the selector being
> per-address is unusual (it might be appropriate to merge the
> selector and branch identifier table, in which case per-address
> makes sense).

I think this is the classic McFarling predictor.

Per-branch identifier and chooser,
choosing between a per-branch prediction
and a history like (gshare) prediction.

It gets more fun when your history-based predictor has valid bits as well
as taken/not-taken counters.

E.g. you might let a branch normally live in the per-branch predictor,
but update a gshare predictor only when the per-branch predictor
mispredicts. Then let a strongly confident per-history predictor
override the per-branch predictor.

You can do this in layers, so that longer and longer "exceptions to the
norm" can be handled.

E.g. the branch may normally be taken.

But on history 1001 it may be non-taken - get that from gshare-4

But on history 1001011 it may be taken again. Get that from gshare-7

But on history 1001011011 it may be non-taken again. Get that from
gshare-10.

Etc.

At some point the long histories become sparse, so you want tagging,
possibly associativity.
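
A toy Python model of the layered "exceptions to the norm" idea, under loud assumptions: histories are strings with the most recent outcome first (so "1001" is a prefix of "1001011", matching the example above), the tables are exact-match dictionaries standing in for tagged entries, and the class name, default-taken guess and allocation policy are all invented for this sketch:

```python
class LayeredPredictor:
    """Per-branch default plus exception tables keyed by longer and longer histories."""

    def __init__(self, history_lengths=(4, 7, 10)):
        self.lengths = history_lengths
        self.default = {}                                  # pc -> usual direction
        self.layers = [dict() for _ in history_lengths]    # (pc, history prefix) -> direction

    def _lookup(self, pc, history):
        """Return (index of providing layer or -1, predicted direction)."""
        provider, pred = -1, self.default.get(pc, True)    # assumed default: taken
        for i, length in enumerate(self.lengths):
            if len(history) < length:
                break
            hit = self.layers[i].get((pc, history[:length]))
            if hit is not None:
                provider, pred = i, hit                    # longer match overrides shorter
        return provider, pred

    def predict(self, pc, history):
        return self._lookup(pc, history)[1]

    def update(self, pc, history, taken):
        if pc not in self.default:
            self.default[pc] = taken
            return
        provider, pred = self._lookup(pc, history)
        if pred == taken:
            return
        # On a mispredict, record the exception one layer beyond the provider.
        layer = min(provider + 1, len(self.lengths) - 1)
        if len(history) >= self.lengths[layer]:
            self.layers[layer][(pc, history[:self.lengths[layer]])] = taken
```

Entries are only allocated on mispredicts, so a well-behaved branch never leaves the per-branch table; a real design would use counters, partial tags and finite tables rather than unbounded dictionaries.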

Brett Davis

Jan 15, 2012, 2:41:14 PM

In article <4F122C27...@SPAM.comp-arch.net>,
"Andy (Super) Glew" <an...@SPAM.comp-arch.net> wrote:

You sold me on the idea, but for a RISC chip I would do the conversion
as part of the math opcodes. The load opcodes have all bits committed,
whereas on MIPS the integer instructions have 5 bits of unused zeros,
and the float math opcodes underutilize the 5-bit field.

Format conversion from FP16 is just a shift and sign copy, almost free.
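
As a rough illustration of how cheap the widening is for normal numbers, here is a Python sketch that builds the binary32 bit pattern from a binary16 one. Besides the shift and sign copy it also rebiases the exponent (bias 15 to bias 127); zero, subnormals, Inf and NaN need extra handling that this sketch deliberately leaves out. The function name is made up for the example:

```python
import struct


def half_bits_to_float(h):
    """Widen a normal-range IEEE binary16 bit pattern via a binary32 bit pattern."""
    sign = (h >> 15) & 0x1
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0 or exp == 0x1F:
        raise ValueError("zero, subnormal, Inf and NaN are not handled in this sketch")
    bits32 = (sign << 31) | ((exp + (127 - 15)) << 23) | (frac << 13)
    # Reinterpret the 32-bit pattern as a float.
    return struct.unpack("<f", struct.pack("<I", bits32))[0]
```

The result can be cross-checked against Python's own half-precision support (the "e" format of the struct module).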

Lots of hits for mixed precision floating point:
http://www.google.com/search?source=ig&hl=en&rlz=&q=mixed+precision+floating+point

Tacit

Jan 15, 2012, 8:07:19 PM

Hello, Andy.

«I think it may mean that store commit is special. (Or ???)» — …or LSU
is almost like Intel's SB has: 2 address ports, 2 integer data ports
(Intel has 3), (how many?) FPU ports and 3 L1D ports. So peak 48 B/cl
(32 R + 16 W) can be achieved with SSE (not sure — Intel don't) or
AVX, but not with scalars. Needs testing…

«p. 41 - Hypertransport assist, aka probe filtering (not new). Steals
part of the L3 to use as a directory» — and some of ways too. Is it 1
or 2 MB in size here?

«They don't describe the predictor» — cause there is no predictor.
Probably, when preceding write's address isn't known, speculating
loads are fetched and kept in load queue until address mismatch is
confirmed (if not — cleared and refetched). IMO, Intel's MA can send
loaded data to EU and reset the pipeline on address conflict. So they
have actual IP-based predictor to lower mispredict chance by stopping
questionable loads before sending to EU. There's no indication AMD has
it.

«p. 102 "Choose linear addresses for the source and destination
operands of REP MOVS/CMPS that are not an exact multiple of 4K pages
away from each other."» — well, that's the stupidest of things. Isn't
most time string instructions operate with large size page-aligned
data? (So start page offsets are 0's — and they'll collide with STLF
early detection.) What would they recommend doing instead?

«I think this means that the AMD prefetcher can prefetch across 4KB
boundaries. Does Intel do this yet?» — no, but IB soon will.

«I suspect this means that invalid pages are NOT placed into the TLB»
— do you mean non-resident in RAM?

«If it is true that invalid addresses are not placed into the TLB,
then every prefetch to the same page may produce a TLB miss» — why
bother? It's in advance, and there's no exception handling (if non-
resident page or access violation), since access is speculative.

«Some processors have burst writes with byte valids - but I did not
think AMD did. Otherwise, why would it be so much slower?» — AMD did,
cause there is maskmov*. Maybe, ECC bits calculation is in effect? L1*
are protected by parity only, L2 — with ECC, the only place to keep
written data before saving to L2 is WCC, so it must be ECC-protected.
Or they consider it as «short term store» not worth of ECC?…

Oh, and they failed to mention there is no single WCC in the module at
all :-D

«Can't have data in I$ and in M state in D$. I.e. the O part of MOESI
does not apply to the I and D cache» — that's normal. L1I has SI
states. L1D (with write-through) — probably, SI too. M-line with code
needs to be converted to S-state to be copied to L1I (somehow, L1I
doesn't allow keeping L2's M-line with S-state), so it must be saved
in memory through L3 cache (by evicting something possibly useful from
it). And there may be many L3's in SMP system, so they used «caches»
in plural :)

«flushing to memory invalidates the D$?» — I think they misused the
word Flush instead of Store or Save. There's no need to invalidate the
data, which has just been requested by L1I, and immediately read it in
S-state.

«Waddaya want to bet they did not have any ASLR workloads? And some
"smart" designer or architect, who had not kept up with security,
wanted to be clever by applying what they had learned in school?» —
well, 16-way 64 KB cache may have appeared too slow even for a 4-cl access
(with target frequencies). What else could they do?

«p. 123 - CALL next instruction;POP idiom» — wouldn't it be better to
macro-decode both instructions into «mov r64, rIP» mop?…

«p. 132 "With the pipelined floating-point adder allowing one FADD
every cycle [still to confirm]," :-) » — aha, probably, they still not
sure themselves :) Where's that customer hotline number — I'll
tell'em…

«p. 169 move between integer GPRs and XMM through memory. STLF
stalls...» — why? Address and size will be the same.

«I had not noticed that AMD had added 16 bit FP to Bulldozer» — it's
not FP, just converts.

«Q: has Intel added 16 bit FP to x86 yet, other than the LRB/MIC
family?» — CVT16 will be in Ivy bridge.

«I'm just waiting for people to start doing arithmetic in FP16» —
probably not. At best, they'll add moves, scalar/constant broadcasts
and compares. That's enough to implement sort and search algorithms
without major FP-EU change and opcode additions.

«Q: does AMD make LFENCE a NOP in WB?» — yes.

«CMPXCHG8 is one cycle slower than CMPXCHG16/32/64. Interesting. I
wonder why?» — typo? :) We've seen a lot of those.

«p. 315 special bypass latencies» — these are usual intrapath
bypasses. Useful to investigate EU position relative to RF.

More strange things on BD later…

Terje Mathisen

Jan 16, 2012, 3:35:44 AM

Tacit wrote:
> «p. 123 - CALL next instruction;POP idiom» — wouldn't it be better to
> macro-decode both instructions into «mov r64, rIP» mop?…

That would be a worthwhile tweak to any x86 which tries to maintain a
call/ret stack cache.

I noticed one really fun tip on page 121:

"Branches That Depend On Random Data

Avoid conditional branches that depend on random data, as these branches
are difficult to predict."

Imagine that!

This does mean that AMD would prefer if you don't use this cpu to decode
CABAC-encoded h264 video, since that compression format has one totally
unpredictable, as well as unavoidable, branch for every _bit_ decoded.
(The reference implementation actually has three branches per bit, but
the two extra are related to the bit extraction code which isn't too
hard to optimize.)

With a maximum rate (40 Mbit/s) video and ~20 cycles lost on a
mispredict, you'll average ~400 M wasted cycles per second on just this
one operation: That's quite a bit of power as well.
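
The ~400 M figure follows if a truly random branch is mispredicted about half the time. A back-of-the-envelope check in Python; the 50% rate is the implied assumption, and the 3.6 GHz clock used for the core-share estimate is purely illustrative:

```python
def wasted_cycles_per_second(bits_per_second, mispredict_rate, penalty_cycles):
    """Cycles lost per second to one unpredictable branch per decoded bit."""
    return bits_per_second * mispredict_rate * penalty_cycles


# 40 Mbit/s and ~20 cycles come from the post; 0.5 is the coin-flip assumption.
wasted = wasted_cycles_per_second(40e6, 0.5, 20)

# Share of one core at an assumed 3.6 GHz clock (not a figure from the post).
core_fraction = wasted / 3.6e9
```

Under those assumptions the loss is a bit over a tenth of one core, before counting any of the useful decoding work.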

OTOH, current cpus will need very good sw, running all 4+ cores flat out
in order to handle that decoding job. In reality you end up outsourcing
critical parts to the GPU or other dedicated hw.

The Partial Loop Unrolling example on page 131++ is pretty bad, since it
advocates using unrolled scalar code to do a vector addition: If the
loop count is large enough to support unrolling, it is definitely large
enough to support 2-way or 4-way SIMD code (SSE2 or AVX registers)!

Noob

Jan 16, 2012, 4:23:15 AM

Andy (Super) Glew wrote:

> p. 147 - XOR reg, reg idiom - "AMD Family 15h processors are able
> to avoid the false read dependency on the XOR instruction."
>
> Sounds as if this is new with Bulldozer - which is surprising.

This is /most definitely/ NOT new.

cf. the original K7 optimization guide (22007.pdf)

<quote>

Use XOR Instruction to Clear Integer Registers

To clear an integer register to all 0s, use "XOR reg, reg".
The AMD Athlon processor is able to avoid the false read
dependency on the XOR instruction.

Example 1 (Acceptable):
MOV REG, 0

Example 2 (Preferred):
XOR REG, REG

</quote>

Paul A. Clayton

Jan 16, 2012, 6:36:18 AM

Tacit wrote:
[snip interesting comments]

> «I suspect this means that invalid pages are NOT placed into the TLB»
> — do you mean non-resident in RAM?

Yes, I believe he means invalid translations, though technically such
could still be in RAM (an OS might mark a page as invalid to make
it a replacement candidate; it is also possible that a dirty victim
page might still be waiting for write to swap and still be in RAM).

> «If it is true that invalid addresses are not placed into the TLB,
> then every prefetch to the same page may produce a TLB miss» — why
> bother? It's in advance, and there's no exception handling (if non-
> resident page or access violation), since access is speculative.

With hardware-based TLB fill, in some cases the overhead of
filling a TLB can be very small (e.g., a hit in the PDE caching
structure and a hit for the PTE in cache), so speculatively
handling the miss (at least for low-overhead cases) may be
worthwhile.

[snip]

> «Waddaya want to bet they did not have any ASLR workloads? And some
> "smart" designer or architect, who had not kept up with security,
> wanted to be clever by applying what they had learned in school?» —
> well, 16-way 64 KB cache may have appeared too slow even for a 4-cl access
> (with target frequencies). What else could they do?

The performance decrease might be from a reduction in
snooping bandwidth (in which case replicating the tags
or placing aliasing information in an inclusive L2 cache
might have helped).

For an Icache, it might also be practical to use physical
indexing since addresses come either from the next
block (so caching the current and next page translation--
or even just the three extra indexing bits--would suffice)
or the BTB (which could provide a physical address).
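
The "three extra indexing bits" can be derived from the cache geometry. A quick Python check, assuming the 64 KB, 2-way instruction cache and 4 KB pages that the three-bit figure implies; the helper name is invented for this sketch and it only handles power-of-two sizes:

```python
def extra_index_bits(cache_bytes, ways, page_bytes=4096):
    """Index bits that lie above the page offset in a virtually indexed cache."""
    way_bytes = cache_bytes // ways
    bits = (way_bytes.bit_length() - 1) - (page_bytes.bit_length() - 1)
    return max(bits, 0)


icache_bits = extra_index_bits(64 * 1024, ways=2)    # 32 KB per way vs 4 KB pages
alias_positions = 1 << icache_bits                   # places one physical line could land
sixteen_way_bits = extra_index_bits(64 * 1024, ways=16)
```

With 2 ways there are three untranslated index bits and so eight possible alias positions; the 16-way alternative mentioned in the quote needs none, which is why it sidesteps aliasing at the cost of access time.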

[snip]

> «p. 123 - CALL next instruction;POP idiom» — wouldn't it be better to
> macro-decode both instructions into «mov r64, rIP» mop?…

While such an optimization would violate the definition of
the architecture (by not faulting on a page violation and
by not recording the rIP in the redzone of the stack), I
suspect that no software exploits such.

Tacit

Jan 16, 2012, 9:19:34 AM

Paul A. Clayton:

«or the BTB (which could provide a physical address)» — that's too
much space to use. Physical indexing is slower (+1 cl ?), but more
elegant and power efficient (reading just 1 way — where it hits).

«While such an optimization would violate the definition of the
architecture (by not faulting on a page violation and by not recording
the rIP in the redzone of the stack)» — correct. Needs a check for rSP
page crossing.

Paul A. Clayton

Jan 16, 2012, 10:48:05 AM

On Jan 14, 8:23 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 1/13/2012 10:48 AM, Brett Davis wrote:
>

> > In article<4F0FDFC5.6030...@SPAM.comp-arch.net>,
> > "Andy (Super) Glew"<a...@SPAM.comp-arch.net> wrote:

> >> p. 80 - load-execute instructions are preferred over separate load and
> >> execute instructions. (Not new.)
>
> > Does this have any implications on future non-x86 designs?
> > If you were designing a new high end CPU instruction set would you include
> > add from memory opcodes?
>
> I would not.
>
> However, I might make one of the register numbers stand for
> "Non-register Bypass".
>
> So that we might have
>
> non-register-bypass := LOAD( M[...] )
> Rdest := ADD( Rsrc + non-register-bypass )
>
> And I might consider requiring that such instructions be adjacent,
> and that no exceptions could occur between them.

I am curious why you propose this form of instruction fusion
but prefer to use multiple instruction addresses to implement
delayed branches.

Paul A. Clayton

Jan 16, 2012, 10:59:05 AM

On Jan 16, 9:19 am, Tacit <tacit.mu...@gmail.com> wrote:
> Paul A. Clayton:
>
> «or the BTB (which could provide a physical address)» — that's too
> much space to use. Physical indexing is slower (+1 cl ?), but more
> elegant and power efficient (reading just 1 way — where it hits).

How is using a physical address vs. a virtual address in the
BTB using more space? The extra space would be in caching
the translation for the current page and next sequential page.

How is indexing by physical address slower? How does it allow
reading only one way? (Unless you are thinking of a set and
way predictor and not a predictor of the address.)

> «While such an optimization would violate the definition of the
> architecture (by not faulting on a page violation and by not recording
> the rIP in the redzone of the stack)» — correct. Needs a check for rSP
> page crossing.

It would also need to write the rIP if it was not overwritten
before being accessed (the standard x86-64 ABI protects the
stack redzone from being written by exceptions).

However, I suspect that such an Architectural incompatibility
would not be noticed. (Famous last words?)

Tacit

Jan 16, 2012, 1:09:18 PM

On 16 янв, 17:59, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
«How is using a physical address vs. a virtual address in the BTB
using more space?» — can prediction be made with PA's? With checking
every time in TLB? If not, BTB should contain both PA and VA, nobody
does that.

«The extra space would be in caching the translation for the current
page and next sequential page» — too clumsy. With more than 12 bits
for tagging, problems arise in evicting VA synonyms (up to 14 in AMD's
case — will take 7 more tag checks; nothing new).

«How is indexing by physical address slower?» — can't check tags & TLB
at the same time. TLB first, tags second.

«How does it allow reading only one way?» — if tag access is before
set and way selections. Can be concurrent, as now. Not caused by PA/VA
debate — just for energy save. Atom, Bobcat and others do that.

Andy (Super) Glew

Jan 16, 2012, 1:36:54 PM

On 1/15/2012 5:07 PM, Tacit wrote:

> «p. 102 "Choose linear addresses for the source and destination
> operands of REP MOVS/CMPS that are not an exact multiple of 4K pages
> away from each other."» — well, that's the stupidest of things. Isn't
> most time string instructions operate with large size page-aligned
> data? (So start page offsets are 0's — and they'll collide with STLF
> early detection.) What would they recommend doing instead?

The good thing about REP MOVS instructions is that you can tune them.

The bad thing about REP MOVS instructions is if you don't tune them.

> «I think this means that the AMD prefetcher can prefetch across 4KB
> boundaries. Does Intel do this yet?» — no, but IB soon will.
>
> «I suspect this means that invalid pages are NOT placed into the TLB»
> — do you mean non-resident in RAM?

Anything that makes the translation invalid. E.g. marked as not-present
(which, by the way, often means "in RAM but I want a page fault if you
reference it so that I can do OS virtual memory scheduling." Not just
"not in RAM.")

But also, it might be kernel mode accessible but not user mode
accessible. Some systems do not load a translation into the TLB if it
is accessible, just not by the current instruction/mode.

Similarly, they might not load if you are on a speculative path, trying
to get write permission for a store or read/write for an INC mem.

> «If it is true that invalid addresses are not placed into the TLB,
> then every prefetch to the same page may produce a TLB miss» — why
> bother? It's in advance, and there's no exception handling (if non-
> resident page or access violation), since access is speculative.

Performance.

Some processors do TLB misses speculatively.

If you keep doing speculative references to the same invalid page, but
never put it into a TLB, then (a) you waste power redoing the same
reference, (b) the extra misses may delay TLB miss handling that is
really needed.

> «Some processors have burst writes with byte valids - but I did not
> think AMD did. Otherwise, why would it be so much slower?» — AMD did,
> cause there is maskmov*.

Not necessarily true. Some implementations of MASKMOV
a) do the masking when storing into a cache line in the cache
b) do the masking when doing a partial, non burst write

but imply nothing about burst write partials.

> «p. 123 - CALL next instruction;POP idiom» — wouldn't it be better to
> macro-decode both instructions into «mov r64, rIP» mop?…

Unfortunately has a memory side effect.

> «p. 169 move between integer GPRs and XMM through memory. STLF
> stalls...» — why? Address and size will be the same.

I assumed that it also applies to YMM, 128 and 256 bits.

Paul A. Clayton

Jan 16, 2012, 5:21:59 PM

Tacit wrote:
> On 16 янв, 17:59, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> «How is using a physical address vs. a virtual address in the BTB
> using more space?» — can prediction be made with PA's? With checking
> every time in TLB? If not, BTB should contain both PA and VA, nobody
> does that.

In theory the predictors could be physically indexed, though such
might introduce excessive aliasing (similar to data cache aliasing
that motivates page coloring).

I had not considered the TLB look-up issue (for permissions).
That seems to be a problem with just having the physical address.
While a physically indexed ITLB might be possible on some
Architectures, I think x86 technically allows virtual aliases to
have different permissions even within the same page table.

Rather than having the physical address, the BTB could hold
just the physical address bits necessary for physically indexing
the cache (adding three bits to the data portion of each BTB
entry in the Bulldozer case).

> «The extra space would be in caching the translation for the current
> page and next sequential page» — too clumsy. With more than 12 bits
> for tagging, problems arise in evicting VA synonyms (up to 14 in AMD's
> case — will take 7 more tag checks; nothing new).

Hmm? Such is just a microTLB 'tagged' with the current rIP.

> «How is indexing by physical address slower?» — can't check tags & TLB
> at the same time. TLB first, tags second.

No, tags would use physical addresses. The TLB access would
be a problem because virtual address aliases with different
permissions are technically possible. (I do not know if anyone
actually uses such aliasing.)

Overall, just keeping three extra bits in each BTB looks more
attractive--if one must support a physically indexed Icache.

> «How does it allow reading only one way?» — if tag access is before
> set and way selections. Can be concurrent, as now. Not caused by PA/VA
> debate — just for energy save. Atom, Bobcat and others do that.

I did not know that anyone did tag-sequential accesses for L1
caches (other than those using CAMs for tag checks for highly
associative caches). Way prediction is another matter. (Of
course one could have way selection--where either a hit in the
selected way or a miss is guaranteed--using smaller
selection tags perhaps with clever allocation, but prediction
probably works as well.)

Tacit

Jan 16, 2012, 8:44:54 PM

On 16 янв, 20:36, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:

«Anything that makes the translation invalid» — OK, but the original
phrase was «An over-run in an unmapped page can result in a
significant delay». Could it be that Unmapped means just «Not present
in TLB»? Handling it is long: 2 TLB misses and an actual translation.
If it's causing an exception — that should stop prefetching.

«If you keep doing speculative references to the same invalid page,
but never put it into a TLB, then …» — yes, right, so that's exactly
why they should avoid that. BD is a high-performance CPU — why avoid
invalids in TLB?… I think it's just a language issue.

«Some implementations of MASKMOV a) do the masking when storing into a
cache line in the cache» — yes, this option.

«b) do the masking when doing a partial, non burst write» — in a write
buffer entry?

«but imply nothing about burst write partials» — well, if a) and b) are
present, doesn't it mean burst writes can be partial?

«I assumed that it also applies to YMM, 128 and 256 bits» — how does that
explain the «STLF stalls» phrase? :)

Tacit

Jan 16, 2012, 8:57:49 PM

On 17 янв, 00:21, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:

«BTB could hold just the physical address bits necessary for
physically indexing the cache (adding three bits to the data portion
of each BTB entry in the Bulldozer case)» — and if that line gets
evicted? And then refetched in a different way of the same set? BTB size is
very large and spans more than 64 KB of code.

«No, tags would use physical addresses» — exactly, so where do you get
PA in case of BTB miss? :)

«I did not know that anyone did tag-sequential accesses» — it's not
tag-sequential (that's for pseudoassociative caches of 90-s). Just TLB
+ tags first and only proper way later.

«Way prediction is another matter» — BTW, I wonder what do they mean
by «way-predicted L1D»?…

Andy (Super) Glew

Jan 17, 2012, 12:09:00 PM

On 1/16/2012 5:44 PM, Tacit wrote:
> On 16 янв, 20:36, "Andy (Super) Glew"<a...@SPAM.comp-arch.net> wrote:
>
> «Anything that makes the translation invalid» — OK, but the original
> phrase was «An over-run in an unmapped page can result in a
> significant delay». Could it be that Unmapped means just «Not present
> in TLB»? Handling it is long: 2 TLB misses and an actual translation.
> If it's causing an exception — that should stop prefetching.

The first time any page is accessed, including unmapped pages, it will
cause a TLB miss.

However, because mapped pages (valid pages) are placed into the TLB,
subsequent TLB misses to that page are avoided.

Whereas unmapped pages are not, in most machines, placed in the TLB. So
subsequent accesses to the unmapped page will cause a TLB miss and page
table walk EACH AND EVERY TIME.

If the TLB miss is done in order at retirement, no problem: there will
be a fault.

But if the TLB miss is done speculatively, repeatedly, on the wrong path
- e.g. prefetching one page ahead on a loop, repeatedly causing
speculative accesses past the end of the loop at the end of the array
which will not retire and not cause a page fault - that can be a
performance problem.
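
A toy Python model of that effect, counting page-table walks with and without caching of invalid translations. Everything here (class name, the `cache_invalid` switch, the single-level page table as a dictionary) is a hypothetical simplification, not a description of any shipping TLB:

```python
class TinyTLB:
    def __init__(self, page_table, cache_invalid=False):
        self.page_table = page_table        # vpn -> pfn, mapped pages only
        self.cache_invalid = cache_invalid  # whether invalid translations are kept
        self.entries = {}                   # vpn -> pfn, or None for a cached invalid
        self.walks = 0

    def translate(self, vaddr):
        """Return the frame number for vaddr, or None if the page is unmapped."""
        vpn = vaddr >> 12
        if vpn in self.entries:
            return self.entries[vpn]        # TLB hit, no walk
        self.walks += 1                     # TLB miss: walk the page table
        pfn = self.page_table.get(vpn)
        if pfn is not None or self.cache_invalid:
            self.entries[vpn] = pfn
        return pfn


def run_loop(tlb, iterations=100):
    # Each iteration touches a mapped page and speculatively overruns into
    # the unmapped page that follows it.
    for _ in range(iterations):
        tlb.translate(0x0000)
        tlb.translate(0x1000)
    return tlb.walks
```

With invalid translations left out of the TLB, 100 iterations cost 101 walks in this model; keeping them costs 2.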

Oh, I see that you aren't asking what this means any more, just
postulating meanings.

"Unmapped" is used in at least two senses: (1) unmapped in TLB, i.e.
not present in the TLB. (2) unmapped by the page tables.

Software doesn't have much control over (1), except by using large
pages, etc. If what they are saying is "TLB misses are a performance
problem", well, duh!

Software does have control over (2).

2 TLB misses and an actual translation is more consistent with (2) than (1).

>
> «If you keep doing speculative references to the same invalid page,
> but never put it into a TLB, then …» — yes, right, so that's exactly
> why they should avoid that. BD is a high-performance CPU — why avoid
> invalids in TLB?… I think it's just a language issue.
>
> «Some implementations of MASKMOV a) do the masking when storing into a
> cache line in the cache» — yes, this option.
>
> «b) do the masking when doing a partial, non burst write» — in a write
> buffer entry?
>
> «but imply nothing about burst write partials» — well, if a) and b) are
> present, doesn't it mean burst writes can be partial?

Apparently we have a language issue ourselves.

You may be talking about a burst of uncached writes.

I am talking about an efficient burst bus transfer. Which is usually
full-line - which only a few machines, many of which have my
fingerprints on them, have partials for.

Tacit

Jan 17, 2012, 3:13:27 PM

On 17 янв, 19:09, "Andy (Super) Glew" <a...@SPAM.comp-arch.net> wrote:

«If the TLB miss is done in order at retirement, no problem: there
will be a fault. But if the TLB miss is done speculatively,
repeatedly, on the wrong path … that can be a performance problem» —
it'll be stupid not to use priorities when handling TLB requests (as
LSU & caches do). Demand requests should be processed ahead of
speculations. However, in this case there would be no significant
performance reduction…

«I am talking about an efficient burst bus transfer» — I'm too. I.e. 4
16-B transactions with 1 address. If partial — bit-masks should also
be transferred.

«Which is usually full-line - which only a few machines, many of which
have my fingerprints on them, have partials for» — OK, suppose K10
couldn't do that too. How's that explains 6-times slowdown in BD with
same 16 B/clock? ECC recalc?

MitchAlsup

Jan 19, 2012, 4:07:37 PM

On Saturday, January 14, 2012 7:23:51 PM UTC-6, Andy (Super) Glew wrote:
> On 1/13/2012 10:48 AM, Brett Davis wrote:
> > In article<4F0FDFC5...@SPAM.comp-arch.net>,
> > "Andy (Super) Glew"<an...@SPAM.comp-arch.net> wrote:
>
> >> p. 80 - load-execute instructions are preferred over separate load and
> >> execute instructions. (Not new.)
> >
> > Does this have any implications on future non-x86 designs?
> > If you were designing a new high end CPU instruction set would you include
> > add from memory opcodes?
>
> I would not.

I have recently been spending some time thinking on this subject. In particular, I have been thinking about the kinds of payloads an instruction could carry.

One of the BAD points of RISC instruction set design is that the compiler/assembler/programmer has to invent constants/displacements/immediates,... Many of these displacements need more address bits than the instruction carries naturally (16-bits typically, 13-bits SPARC, 12-bits 360+). The lack of enough displacement bits creates a lot of instructions that simply paste bits. Decoders are ALREADY good at pasting bits, arguably better than instruction sets.

One of the GOOD points of x86 is that it can express difficult memory addressing modes without making all memory reference instructions carry baggage that they do not need. Thus, x86 did not fall into the trap of VAX where everything is register or memory reference of any flavor.

However, A) it is not difficult to make a load-op-store pipeline for a low-mid-range machine, and B) it is not difficult to break Load-op-Store into multiple excursions down independent pipelines for higher end machines.

So, if one had an instruction that stands by itself and has the property that it simply announces more bits for future instructions yet to be decoded, these bits can be stripped off by the decoder and made available for when the necessarily dependent instruction is fetched and decoded. The dependent instruction is necessarily near the payload instruction and it makes little sense to allow arbitrary distance between same.

Using a payload instruction model one can endow a RISC-like instruction set with a lot more expressiveness, making in effect a Medium complexity instruction set or MISC.

Let us review the kinds of things one can express with a payload instruction:

LD Rx,[Rb+Ri<<s+Displacement32] // 1 payload+1 Load instruction

FADD Rx,[Rb+Ri<<s+Displacement32] // 1 payload+1 FADD instruction

FMUL [Rb+Ri<<s+Displacement48],Ry // 1 payload+1 FMUL+1 Store

In addition to pasting immediate bits for displacements, constants and immediates, payload instructions could provide access to sub-registers within a short vector register:

MOVB Rx<63:48>,Ry

Or create Calls to subroutines very far away:

CALL Displacement48

Without having to build or load the target address from a prebuilt table.
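
As a rough illustration of the decoder side of this idea, here is a minimal sketch in Python. The tuple encoding, the `PAYLOAD` mnemonic, the 16-bit short displacement and the `decode` function are all hypothetical, invented for this example; the sketch only shows the "strip the payload and paste its bits onto the next instruction" step, not any real ISA.

```python
# Hypothetical decoder sketch: a PAYLOAD instruction carries extra bits
# that are pasted onto the next instruction's short displacement field.
# All mnemonics, field widths and the tuple encoding are illustrative only.

SHORT_DISP_BITS = 16  # assumed width of the displacement a normal instruction carries


def decode(stream):
    """Turn a list of (mnemonic, operand...) tuples into decoded ops.

    ("PAYLOAD", bits) is consumed by the decoder and never issued;
    its bits become the upper part of the next instruction's displacement.
    """
    decoded = []
    pending = None  # payload bits waiting for their dependent instruction
    for insn in stream:
        if insn[0] == "PAYLOAD":
            if pending is not None:
                raise ValueError("payload must be followed by a consumer")
            pending = insn[1]
            continue
        mnemonic, reg, short_disp = insn
        disp = short_disp & ((1 << SHORT_DISP_BITS) - 1)
        if pending is not None:
            disp |= pending << SHORT_DISP_BITS  # paste the extra bits on top
            pending = None
        decoded.append((mnemonic, reg, disp))
    if pending is not None:
        raise ValueError("dangling payload at end of stream")
    return decoded


program = [
    ("LD", "r1", 0x0010),   # fits in the short field, no payload needed
    ("PAYLOAD", 0x1234),    # upper 16 bits for the next instruction
    ("LD", "r2", 0x5678),   # effective displacement 0x12345678
]
ops = decode(program)
```

The payload never reaches the back end in this model: three fetched instructions become two decoded operations, which is the bit-pasting work the decoder is claimed to be good at.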

Thus, while I currently think the best instruction set is somewhere smaller and less complex than x86, it is necessarily significantly larger than RISC. Not that it will ever happen, though.....

Mitch

Paul A. Clayton

Jan 19, 2012, 6:24:50 PM1/19/12

to

On Jan 19, 4:07 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip food for thought on PAYLOAD instructions]

> Thus, while I currently think the best instruction set
> is somewhere smaller and less complex than x86, it is
> necessarily significantly larger than RISC. Not that it
> will ever happen, though.....

Well, not ALL hope is lost. Fujitsu did use a much more
limited form of PAYLOAD instruction (you had a hand in
that, correct?), and a few years ago Renesas introduced
a new (CISC) ISA (I think mainly targeting code density
for low-end 32-bit processors).

The combination of the potential for binary translation
and the possible end of Moore's Law may increase the
attractiveness of ISA innovation. Of course, a century
from now computer architecture may be almost
unrecognizable--I could hope that by then no x86
binaries will be running (except as curiosities).

Paul A. Clayton

Jan 19, 2012, 6:50:45 PM1/19/12

to

On Jan 16, 8:57 pm, Tacit <tacit.mu...@gmail.com> wrote:
> On 17 Jan, 00:21, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
>
> «BTB could hold just the physical address bits necessary for
> physically indexing the cache (adding three bits to the data portion
> of each BTB entry in the Bulldozer case)»
>
> — and if that line gets
> evicted? And then refetched in different way of same set? BTB size is
> very large and span for more than 64 KB of code.

I was thinking of only including the physical address bits,
so the cache could be addressed using an accurate prediction.
This would not be a way prediction but 'merely' a prediction
of a few bits of the physical address. On a BTB miss, a
modest extra penalty for making a translation after the
branch address is calculated would be involved. If the
non-virtual bits of the cache index were used last, it
might be possible to hide some of this delay.

> «No, tags would use physical addresses»
>
> — exactly, so where do you get
> PA in case of BTB miss? :)

Yes, that is a problem. A set associative TLB where
either virtual or physical addresses could be used as
tags might be useful to address such; but I seem to
be heading into a maze of twisty passages (with
little chance of a reasonable complexity/benefit
trade-off).

> «I did not know that anyone did tag-sequential accesses»
>
> — it's not
> tag-sequential (that's for pseudoassociative caches of 90-s).
> Just TLB + tags first and only proper way later.

I meant tag-sequential in the sense that tags are
checked before the data array is fully accessed
(like most L2 caches seem to be implemented).
I am not certain if "tag serial" is used for the
pseudo-associative sequential checking of individual
tags. I am not certain what the accepted terms
are for these cases.

> «Way prediction is another matter»
>
> — BTW, I wonder what do they mean
> by «way-predicted L1D»?…

Well, the R10000 used an MRU-based one-bit
per entry predictor for its 2-way set associative
off-chip L2 cache. (Only 8 Kib were provided--
on chip--, so for some larger versions of the
cache the MRU bit was shared among multiple sets.)
Of course, for L1 accesses, the latency of
checking an access-address based predictor (and
sending the prediction to the row decoder of the
data array) may be too great (for smallish caches?).
Some academic papers have proposed using
predictors indexed by the instruction address.
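
For what it is worth, a toy model of a one-bit MRU way predictor might look like the sketch below. It assumes a 2-way cache with one predictor bit per set; the class name, sizes and result strings are made up for illustration, and it is not a description of the R10000's actual implementation.

```python
# Toy model of a one-bit MRU way predictor for a 2-way set-associative cache.
# Sizes and names are hypothetical; this is not the R10000's real design.

class TwoWayCache:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.tags = [[None, None] for _ in range(num_sets)]  # tag per way
        self.mru = [0] * num_sets   # predicted (most recently used) way per set
        self.correct = 0
        self.wrong = 0

    def access(self, line_addr):
        """Access a cache line; return 'hit-predicted', 'hit-mispredicted' or 'miss'."""
        s = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        guess = self.mru[s]
        if self.tags[s][guess] == tag:
            self.correct += 1
            return "hit-predicted"       # first probe found the data
        other = 1 - guess
        if self.tags[s][other] == tag:
            self.wrong += 1
            self.mru[s] = other          # retrain the predictor
            return "hit-mispredicted"    # needed a second probe
        # Miss: replace the non-MRU way and make it the new MRU.
        self.tags[s][other] = tag
        self.mru[s] = other
        return "miss"


cache = TwoWayCache(num_sets=4)
trace = [0, 4, 0, 4, 4, 0]   # lines 0 and 4 map to the same set
results = [cache.access(a) for a in trace]
```

In this model, strictly alternating between two lines of the same set makes the MRU bit wrong on every hit, while repeated accesses to one line are predicted correctly.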

Unfortunately for the curious outsiders,
detailed information about predictors is less
commonly shared with outsiders (whether because
such is considered a trade secret or is not
considered worthwhile documenting).

Tacit

Jan 22, 2012, 3:45:31 PM1/22/12

to

Paul A. Clayton:
«On a BTB miss, a modest extra penalty for making a translation after
the branch address is calculated would be involved» — so we have to
read and decode the instruction itself, which is way down the
pipeline. Not so modest ;)

«R10000 used an MRU-based one-bit per entry predictor for its 2-way
set associative off-chip L2 cache» — when implemented with 4 ways,
it'll be too stupid. Something more accurate is needed.

«the latency of checking an access-address based predictor (and

sending the prediction to the row decoder of the data array) may be

too great» — so that's the problem. Save time and get a smart
prediction.

«detailed information about predictors is less commonly shared with
outsiders» — maybe there is a way to investigate it with some very
intricate synthetic test. I suppose that when prediction fails, read
latency increases by 1 clock (to 5). If we can find a certain
«improper» pattern, it'll have 5-clock latency for all reads, which can
be measured easily.

Joe keane

Jan 22, 2012, 6:40:33 PM1/22/12

to

In article <11457910.970.1327007257462.JavaMail.geo-discussion-forums@yqcz12>,

MitchAlsup <comp...@googlegroups.com> wrote:
>One of the BAD points of RISC instruction set design is that the
>compiler/assembler/programmer has to invent
>constants/displacements/immediates,... Many of these displacements need
>more address bits than the instruction carries naturally (16-bits
>typically, 13-bits SPARC, 12-bits 360+). The lack of enough displacement
>bits creates a lot of instructions that simply paste bits. Decoders are
>ALREADY good at pasting bits, arguably better than instruction sets.

On the RT, r0 always points to the function's "constant pool". So if
you need a big constant (e.g. full word), you just load it from there,
normal load/store. It makes no attempt to provide a way to get bigger
immediates.

I think it's fine. You don't really miss the immediates.

Paul A. Clayton

Jan 22, 2012, 9:54:07 PM1/22/12

to

On Jan 22, 3:45 pm, Tacit <tacit.mu...@gmail.com> wrote:
> Paul A. Clayton:
> «On a BTB miss, a modest extra penalty for making a translation after
> the branch address is calculated would be involved»
>
> — so we have to
> read and decode the instruction itself, which is way down the
> pipeline. Not so modest ;)

The penalty I was referring to was only the lack of
a few bits to physically index the cache requiring
a TLB look-up. (If the TLB look-up was fast and
the Icache read sufficiently slow, the TLB might be
able to provide the last few bits of index in time
to avoid extra delay--though such might not be
practical.) Of course, any BTB miss will have the
additional latency you mention, but that is the
case whether the cache is physically indexed or
not.

> «R10000 used an MRU-based one-bit per entry predictor for its 2-way
> set associative off-chip L2 cache»
>
> — when implemented with 4 ways,
> it'll be too stupid. More accurate thing is needed.

Agree. (A couple of days ago, I presented here a
hybrid approach that might work.)

> «the latency of checking an access-address based predictor (and
> sending the prediction to the row decoder of the data array) may be
> too great»
>
> — so that's the problem. Save time and get a smart
> prediction.

That is just my guess. If one has enough time,
checking a simple hash of the virtual address
may be enough (a little like the Pentium4's
predictive way selection). Using a CAM-like
structure might reduce latency, but would have a
significant area and power penalty. (I think
Mitch Alsup recently posted here that such an
approach was rejected on one project because of
the excessive costs.)

> «detailed information about predictors is less commonly shared with
> outsiders»
>
> — maybe there is a way to investigate it with some very
> intricate synthetic test. I suppose, when prediction fails, read
> latency is increasing by 1 clock (to 5). If we can find certain
> «improper» pattern, it'll have 5 cl. latency for all reads, which can
> be measured easily.

I suspect reverse engineering a way predictor is
very difficult, especially if it uses an instruction
address component.

Paul A. Clayton

Jan 22, 2012, 10:26:48 PM1/22/12

to

On Jan 22, 6:40 pm, j...@panix.com (Joe keane) wrote:
[snip]

> On the RT, r0 always points to the function's "contant pool". So if
> you need a big constant (e.g. full word), you just load it from there,
> normal load/store. It doesn't make any attempt of a way to get bigger
> immediates.
>
> I think it's fine. You don't really miss the immediates.

I think immediates have some nice properties. A wider
instruction fetch tends to be easier to implement than
adding a load port (or a quarter of a load port :-)
and immediates take advantage of the prefetching and
spatial locality of the instruction stream. A load
(without load-operate instructions--and given that a
major purpose of large immediates is for load
instructions (did even VAX have an instruction that
loaded an address like [R1 + [R2 + small_immediate]]?))
would also use up a temporary register. An immediate
also hides its load-to-use delay in the fetch latency,
though something like a Knapsack Cache could be
indexed early enough and quickly enough to hide latency
on a hit (the Knapsack Cache effectively always hit
because it was a selection of the global variables
that are pre-loaded into the cache).

Of course, large immediates also tend to reduce the effective
decode bandwidth. 32-bit immediates are probably on
the positive side of the decode trade-off, but 64-bit
immediates might be worse (for decoding) than a
separate load instruction. Of course, such depends on
what the limitations on decoding are, and the performance
impact would depend on buffering (and OoO execution
resources).

Andy (Super) Glew

Jan 23, 2012, 1:17:33 PM1/23/12

to

Except in locore.

Nearly every OS, on machines that do not switch some registers or run on
a special interrupt stack in at least some modes, has some sort of
"locore" - an area of memory whose address is such that it can be
constructed via an immediate WITHOUT DESTROYING ANY REGISTERS, so that
you can save the registers that you would have to destroy in order to
access any region of memory further afield.

Heck, quite a few have locore even if they switch registers on an event.
It's always nice to have something that you can guarantee access to.

---

"Constant pools" - a frequently used constant in a constant pool
occupies DRAM, several places in the cache (full cache line, including
neighbours that you should probably arrange to access together), and the
register it is currently held in. And load instructions are necessary
to pull it in, costing power.

Constants embedded in the instruction stream occupy those bits in the
instruction stream, howsoever many times they are used. It is likely
that such large immediates can be moved to registers just as often as
constant pool memory locations can be moved to registers.

So, the differences between a constant pool and constants in the
instruction stream are
(a) the difference between the decoder extracting the bits,
and the load instruction executing to get the bits
(b) the cost of the cache occupancy.

Oh, and there's an additional benefit to immediate constants embedded in
the instruction stream: they need never occupy scheduler bandwidth, RF
read port bandwidth, etc. So you would probably NOT want to optimize
them to registers as often as you need to optimize constant pool values,
because even if the code size may be bigger due to having replicas of
the constant, the power costs are lower.

Hmm...

---

GPUs have "constant register files", and/or allow memory locations in a
constant pool to be specified via a register number like field in an
instruction, that is added to a base register to get the address.

Typically there are lots of swizzles and transforms available to be
applied to constants, as well as "ordinary" GPU registers.

I suppose that there are some optimizations that you can create for a
nearly constant register file.

Definitely, this saves a few bits on the destination register, since you
don't write to constant registers (very often, and not in the
instruction set - you may load and store the constant registers en masse).

You wouldn't have to rename the constants, you could just serialize when
they are changed.

Constant registers aren't attached to a bypass network. However, if you
are doing register file port reduction, they may participate.

MitchAlsup

Jan 23, 2012, 1:02:08 PM1/23/12

to

On Sunday, January 22, 2012 5:40:33 PM UTC-6, Joe keane wrote:
> I think it's fine. You don't really miss the immediates.

So, you really want to clutter up the data-cache with the immediates that can be placed in the instruction cache?

Mitch

Tacit

Jan 23, 2012, 1:45:34 PM1/23/12

to

Paul A. Clayton:
«The penalty I was referring to was only the lack of a few bits to
physically index the cache requiring a TLB look-up» — so we need to
keep both PA and VA in BTB entries? That's too large.

«checking a simple hash of the virtual address may be enough (a little
like the Pentium4's predictive way selection)» — P4 and other CPU's
since are using «Vhints», but these are not counted as real prediction.

Tacit

Jan 23, 2012, 2:17:39 PM1/23/12

to

Paul A. Clayton:
«64-bit immediates might be worse (for decoding) than a separate load
instruction» — why. No current CPU is restricting imm64's. However,
pipeline lane bit-width is critical for high-frequency CPU's. Also,
mop-cache size is still in bits, not only in mops :) Both P4 and SB
use compaction methods to compress long constants (imm and ofs) in mop-
caches. Separate constant cache is also possible.

Tacit

Jan 23, 2012, 3:28:20 PM1/23/12

to

Andy (Super) Glew:
«GPUs have "constant register files", and/or allow memory locations in

a constant pool to be specified via a register number like field in an

instruction, that is added to a base register to get the address»

— good for them, but I still think for all-purpose CPU it'll be better
to have immediates. With 2 additions: IP-based const addressing and a
const-cache. IP+offset is possible in x86-64, but has anyone (compiler
or software vendor) used it yet for constant lookup in the code stream?
I mean this:

instr r1,imm64
<some more code, no more than 127 bytes>
instr r1,r2,[IP-off8] ; = address of imm64

We can save 1(5) bytes by replacing imm32(64) with its offset. And
it's the only way to load float imm's (with broadcasts — can be
vectors too). But to speed things up we need a const-cache. It'll be in
the front-end and contain IP-relative constants. The decoder should
calculate the IP+off8 value and put it in the mop with the «code-placed
constant» flag up. The const-cache fetches the data and replaces the
address with it. The back-end will have no clue it wasn't a constant 1
clock ago :) Keeping the const-cache in sync is easy, since it's read-only.
Unlike SB's mop-cache, there's no need to evict constants for an evicted
line from L1I (except if it was invalidated-as-modified). Is that
cool? :)

BTW, why does x86 still not have float and/or scalar-to-vector imm's?

Paul A. Clayton

Jan 23, 2012, 3:32:01 PM1/23/12

to

On Jan 23, 2:17 pm, Tacit <tacit.mu...@gmail.com> wrote:
> Paul A. Clayton:
> «64-bit immediates might be worse (for decoding) than a separate load
> instruction»
>
> — why. No current CPU is restricting imm64's.

I was referring to general ISA design concepts. A
64-bit immediate will take up the fetch width of
about two instructions whereas a load instruction
would only be one extra instruction.

> However,
> pipeline lane bit-width is critical for high-frequency CPU's. Also,
> mop-cache size is still in bits, not only in mops :) Both P4 and SB
> use compaction methods to compress long constants (imm and ofs) in mop-
> caches. Separate constant cache is also possible.

Yes, optimizations are possible for both immediates
and constant pools. At some level (even if only Icache
fill), large immediates will cost more instruction
bandwidth than load instructions. (Not that
transferring--and likely multiplying--this cost to data
accesses necessarily makes sense!)

Joe keane

Jan 23, 2012, 3:37:04 PM1/23/12

to

In article <11457910.970.1327007257462.JavaMail.geo-discussion-forums@yqcz12>,
MitchAlsup <comp...@googlegroups.com> wrote:

>Many of these displacements need
>more address bits than the instruction carries naturally (16-bits
>typically, 13-bits SPARC, 12-bits 360+).

Thinking about it, the 6502 was like this too.

You can't do this:

lda $####

And there's no this:

lda (XY)

[even though you think it should be, never really figured that out]

But it did have zero-page stuff out the yin-yang.

Although it's a pretty crude machine, it is more effective to think of it
as having 256 registers. Wow, that's a lot of registers. You can do
stuff like global register allocation...

Paul A. Clayton

Jan 23, 2012, 3:22:26 PM1/23/12

to

On Jan 23, 1:45 pm, Tacit <tacit.mu...@gmail.com> wrote:
> Paul A. Clayton:
> «The penalty I was referring to was only the lack of a few bits to
> physically index the cache requiring a TLB look-up»
>
>— so we need to
> keep both PA and VA in BTB entries? That's too large.

Only a few extra PA bits (enough to physically index
the cache).

> «checking a simple hash of the virtual address may be enough (a little
> like the Pentium4's predictive way selection)»
>
>— P4 and other CPU's
> since are using «Vhints», but these are not counted as real prediction.

That is why I wrote "a little like"; the P4's Dcache did
use prediction, but only to allow early selection (and I
think forwarding) of the value with later correction.
A similar mechanism might be used to make a prediction
in time to select which way to read without increasing
load latency too much.

Thomas Womack

Jan 23, 2012, 3:54:32 PM1/23/12

to

In article <jfkgdg$fbu$1...@reader1.panix.com>, Joe keane <j...@panix.com> wrote:
>In article <11457910.970.1327007257462.JavaMail.geo-discussion-forums@yqcz12>,
>MitchAlsup <comp...@googlegroups.com> wrote:
>>Many of these displacements need
>>more address bits than the instruction carries naturally (16-bits
>>typically, 13-bits SPARC, 12-bits 360+).
>
>Thinking about it, the 6502 was like this too.
>
>You can't do this:
>
> lda $####

Unless I'm missing something, that's opcode AD, load-absolute; the
eight addressing modes are:

LDA #&37       a=0x37
LDA &37        a=mem[0x37]
LDA &37,x      a=mem[(0x37+x)&0xff]
LDA &1234      a=mem[0x1234]
LDA &1234,x    a=mem[0x1234+x]
LDA &1234,y    a=mem[0x1234+y]
LDA (&70,x)    a=mem[256*mem[(0x71+x)&0xff] + mem[(0x70+x)&0xff]]
LDA (&70),y    a=mem[256*mem[0x71] + mem[0x70] + y]
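
The modes listed above can be sketched as a small Python function. This is only an illustration of the address arithmetic: the mode labels, the `lda` function and the flat `bytearray` memory are assumptions of the sketch, and it ignores timing and page-crossing penalties.

```python
# Sketch of 6502 effective-address calculation for the LDA modes listed above.
# mem is a 64 KiB bytearray; mode names are informal labels, not assembler syntax.

def lda(mem, mode, operand, x=0, y=0):
    """Return the value LDA would load for the given addressing mode."""
    if mode == "immediate":            # LDA #&37
        return operand
    if mode == "zp":                   # LDA &37
        return mem[operand]
    if mode == "zp,x":                 # LDA &37,x  (wraps within page zero)
        return mem[(operand + x) & 0xFF]
    if mode == "abs":                  # LDA &1234
        return mem[operand]
    if mode == "abs,x":                # LDA &1234,x
        return mem[(operand + x) & 0xFFFF]
    if mode == "abs,y":                # LDA &1234,y
        return mem[(operand + y) & 0xFFFF]
    if mode == "(zp,x)":               # LDA (&70,x)
        lo = mem[(operand + x) & 0xFF]
        hi = mem[(operand + 1 + x) & 0xFF]
        return mem[256 * hi + lo]
    if mode == "(zp),y":               # LDA (&70),y
        lo = mem[operand]
        hi = mem[(operand + 1) & 0xFF]
        return mem[(256 * hi + lo + y) & 0xFFFF]
    raise ValueError(mode)


mem = bytearray(0x10000)
mem[0x70], mem[0x71] = 0x34, 0x12      # zero-page pointer to 0x1234
mem[0x1234] = 0xAA
mem[0x1236] = 0xBB
```

For example, `lda(mem, "(zp),y", 0x70, y=2)` follows the pointer at 0x70/0x71 to 0x1234 and then adds Y, which is how zero page ends up serving as a bank of 16-bit pointer registers.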

>And there's no this:
>
> lda (XY)
>
>[even though you think it should be, never really figured that out]

That would use three registers in a single instruction.

Did anyone manage to write a multi-tasking OS on a 6502? You might
well need to use some sort of paging support to get enough memory in,
but that sounds as if you'd only need two memory-mapped registers for
'implicit bits 21-13 of memory address to use when accessing an
address 0x4000-0x7fff' and 'implicit bits 21-13 of memory address to
use when accessing 0x8000-0xbfff'
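
A minimal sketch of that two-window scheme, in Python. Everything here is hypothetical (function name, register names, which regions stay unmapped). Since a 16 KiB window already covers 14 offset bits, the sketch takes each bank register to supply physical address bits 21-14; that choice is an assumption of the sketch.

```python
# Hypothetical bank-switching sketch: two 16 KiB windows, each with a bank register.
# A 16 KiB window leaves 14 offset bits, so the bank register here supplies
# physical address bits 21-14 (an assumption made for this sketch).

WINDOW_BITS = 14
WINDOW_SIZE = 1 << WINDOW_BITS        # 0x4000


def translate(addr, bank_4000, bank_8000):
    """Map a 16-bit CPU address to a 22-bit physical address."""
    if 0x4000 <= addr <= 0x7FFF:
        return (bank_4000 << WINDOW_BITS) | (addr & (WINDOW_SIZE - 1))
    if 0x8000 <= addr <= 0xBFFF:
        return (bank_8000 << WINDOW_BITS) | (addr & (WINDOW_SIZE - 1))
    return addr                        # zero page, stack, ROM, I/O: unmapped


phys_a = translate(0x4010, bank_4000=0x05, bank_8000=0x09)
phys_b = translate(0x8010, bank_4000=0x05, bank_8000=0x09)
phys_c = translate(0x0080, bank_4000=0x05, bank_8000=0x09)
```

With 8-bit bank registers this reaches 4 MiB of physical memory, and a task switch only has to rewrite the two registers rather than copy the windowed memory.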

I wonder how much of zero-page you'd want to save on task switch.

Tom

Tim McCaffrey

Jan 23, 2012, 4:29:13 PM1/23/12

to

In article <jfkgdg$fbu$1...@reader1.panix.com>, j...@panix.com says...

The hardware stack is only 256 bytes, so you better do global allocation,
because you don't have any temporary space to store them....

(and the OS wants to use that zero-page as well, and since the I/O is
memory mapped it wants to be located there as well....)

What made the 6502 popular was price ($10 when the 6800 and 8080 were going
for $500). There is a lesson there...

- Tim

Robert Wessel

Jan 23, 2012, 5:02:01 PM1/23/12

to

On Mon, 23 Jan 2012 21:29:13 +0000 (UTC), timca...@aol.com (Tim
McCaffrey) wrote:

>In article <jfkgdg$fbu$1...@reader1.panix.com>, j...@panix.com says...
>>
>>In article
><11457910.970.1327007257462.JavaMail.geo-discussion-forums@yqcz12>,
>>MitchAlsup <comp...@googlegroups.com> wrote:
>>>Many of these displacements need
>>>more address bits than the instruction carries naturally (16-bits
>>>typically, 13-bits SPARC, 12-bits 360+).
>>
>>Thinking about it, the 6502 was like this too.
>>
>>You can't do this:
>>
>> lda $####
>>
>>And there's no this:
>>
>> lda (XY)
>>
>>[even though you think it should be, never really figured that out]
>>
>>But it did have zero-page stuff out the yin-yang.
>>
>>Although it's pretty crude machine, it is more effective to think of it
>>as having 256 registers. Wow, that's a lot of registers. You can do
>>stuff like global register allocation...
>
>The hardware stack is only 256 bytes, so you better do global allocation,
>because you don't have any temporary space to store them....
>
>(and the OS wants to use that zero-page as well, and the since the I/O is
>memory mapped it wants to be located there as well....)

Hmmm... Most of the bigger systems (IOW "PCs" of the era), didn't put
any I/O on zero-page. As one example, the Apple II put its I/O in the
$C000-$CFFF range (not counting the display buffers in lower addresses, or
the ROMs/bank switched memory in the area above that). IIRC, the
Commodore PET put its I/O in the $Exxx block. I did see it on at
least one embedded application.

As for the hardware stack, since accessing it outside the
push/pop/call/return instructions was somewhat clumsy, and the size
was so limited, most systems that needed a stack for C- or Pascal-like
activation records used a separate area pointed to by a zero-page
word. If you wanted to set up a few tasks, and split the stack into
smaller chunks (one for each thread), you were *really* limited, and
using the stack for anything other than return addresses and a modest
number of temporary register saves was pretty much impossible.

MitchAlsup

Jan 23, 2012, 5:35:10 PM1/23/12

to

On Monday, January 23, 2012 1:17:39 PM UTC-6, Tacit wrote:
> No current CPU is restricting imm64's.

No current CPU HAS imm64s. That is: a 64 bit immediate that comes from the instruction byte stream and is used in the attached instruction. There are x86s with disp64s that are restricted to MOV instructions (no base register or indexing) {A4 and A5}.

And what do you do when the constant pool overflows the 16-bit immediate one needs to load the 'immediate'? {Blow a<nother> register to point at the <current> constant pool of this compilation unit?}

Mitch

Tim McCaffrey

Jan 23, 2012, 6:40:59 PM1/23/12

to

In article
<30590264.1084.1327358110422.JavaMail.geo-discussion-forums@yqnv21>,
Mitch...@aol.com says...

You can do a load immediate on x64, and disp64 only works for RAX.

One of the big shortcomings of the x64 ISA in my opinion (after writing a
code generator for it).

- Tim

Tacit

Jan 23, 2012, 7:06:23 PM1/23/12

to

Paul A. Clayton:
«Only a few extra PA bits (enough to physically index the cache)» —
sorry, but I still don't get it. Suppose we have 9 PA bits [3…11] in
BTB to select a set from 2-way 64K L1I with 64B lines (i.e. AMD's
usual L1I). The other PA bits are checked against the 2 tags, no TLB
involved, power+time saved. Now, the BTB misses: PA bits [3…9] can be
calculated (if it's not indirect), but we need bits [9…11], having only
the full VA. A TLB lookup is inevitable, isn't it?
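
As a side calculation, the sketch below works out which set-index bits of such a cache lie above the page offset. It assumes 4 KiB pages and plain byte-address bit numbering (bit 0 is the least significant bit), so its bit positions are this sketch's convention and need not match the ranges written in the post; the function name is made up.

```python
# Which set-index bits of a cache lie above the page offset?
# Parameters follow the thread: 64 KiB, 2-way, 64-byte lines; 4 KiB pages assumed.
# Bit positions use plain byte-address numbering (bit 0 = least significant).

def index_bits(cache_bytes, ways, line_bytes, page_bytes=4096):
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1       # log2 of a power of two
    set_bits = sets.bit_length() - 1
    page_bits = page_bytes.bit_length() - 1
    index = range(offset_bits, offset_bits + set_bits)
    untranslated = [b for b in index if b < page_bits]   # same in VA and PA
    translated = [b for b in index if b >= page_bits]    # need the TLB or a prediction
    return sets, untranslated, translated


sets, low, high = index_bits(64 * 1024, 2, 64)
```

Under these assumptions there are 512 sets, six index bits come straight from the page offset, and three index bits depend on translation, which is consistent with the "three bits" per BTB entry mentioned earlier in the thread.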

Tacit

Jan 23, 2012, 7:14:54 PM1/23/12

to

MitchAlsup:
«No current CPU HAS imm64s» — ahem. Would you rather read this —
http://flatassembler.net/docs.php?article=manual#2.1.19 ? (Below the
table.)

«what do you do when the constant pool overflows the 16-bit immediate
one needs to load the 'immediate'?» — obviously, load the const from
L1I :)

Paul A. Clayton

Jan 23, 2012, 7:41:31 PM1/23/12

to

On Jan 23, 5:35 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip]

> And what do you do when the constant pool overflows the
> 16-bit immediate one needs to load the 'immediate'?
> {Blow a<nother> register to pooint at the <current>
> constant pool of this compilation unit?}

Insane ISA designer speaking: Use the instruction
address as a base address, mask off about 20 bits,
and use a shifted 16-bit immediate to load a 32-bit
or 64-bit constant. More insanely, an 8-bit
immediate form might only mask off 12 bits. This
might provide access to enough constants. (Using
a negative offset--with 0 being max_negative--
would facilitate using such for mutable global
values by allowing them to be in a different
page [at least on multiple-address space OSes;
such could be extended to a SASOS by providing
the equivalent of an ASID which might simply
replace the most significant bits of the
instruction address].)

Paul A. Clayton

Jan 23, 2012, 8:07:19 PM1/23/12

to

Yes, there would be a TLB lookup, but indexing would only
need the result of that lookup if there was a BTB miss
(which is the less common case). I was assuming that
the TLB would be accessed to check tags. The use of the
few extra PA bits was only meant to facilitate indexing
with the physical address (not avoid TLB lookups).

(Aside: Using a pointer to the page number could
compress tags, BTB entries, and TLB entries--as suggested
by André Seznec [and others].)

Tacit

Jan 23, 2012, 10:07:51 PM1/23/12

to

Paul A. Clayton:
«Use the instruction address as a base address, mask off about 20

bits, and use a shifted 16-bit immediate to load a 32-bit or 64-bit

constant» — nope, code constant isn't aligned, you need lower bits
too :) Otherwise — align host instruction with long nop.

Tacit

Jan 23, 2012, 10:17:03 PM1/23/12

to

Paul A. Clayton:
«Using a pointer to the page number could compress tags, BTB entries,
and TLB entries» — a pointer used and kept where?

Andy (Super) Glew

Jan 23, 2012, 11:10:50 PM1/23/12

to

On 1/23/2012 12:28 PM, Tacit wrote:
> Andy (Super) Glew:
> «GPUs have "constant register files", and/or allow memory locations in
> a constant pool to be specified via a register number like field in an
> instruction, that is added to a base register to get the address»
>
> — good for them, but I still think for all-purpose CPU it'll be better
> to have immediates.

I agree.

By the way, one of my first computer architecture studies, back in
undergrad, was on how many immediate bits to allocate. This was back
when people recommended eliminating immediates completely.

> With 2 additions: IP-based const addressing and a
> const-cache. IP+offset is possible in x86-64, but if someone (compiler
> or softmaker) used it yet for constant lookup in code stream? I mean
> this:
>
> instr r1,imm64
> <some more code, no more than 127 bytes>
> instr r1,r2,[IP-off8] ; = address of imm64
>
> We can save 1(5) bytes by replacing imm32(64) with its offset. And
> it's the only way to load float imm's (with broadcasts — can be
> vectors too). But to speed things up we need const-cache. I'll be in
> the front-end and contain IP-relative constants. Decoder should
> calcutate IP+off8 value and put it in the mop with «code-placed
> constant» flag up. Const-cache fetches the data and replaces address
> with it. Back-end will have no clue it wasn't a constant 1 clock
> ago :) Keeping const-cache in sync is easy, since it's read-only.
> Unlike SB's mop-cache, here's no need to evict constants for evicted
> line from L1I (except if it was invalidated-as-modified). Is that
> cool? :)
>
> BTW, why x86 still don't have float and/or scalar-to-vector imm's?

x87 has a variety of floats accessible from a "constant ROM".

FLDZ    0.0
FLD1    1.0
FLDL2E  log2(e)
FLDLG2  log10(2)
FLDLN2  ln(2)
FLDPI   pi

F2XM1

Not in post x87

Tacit

Jan 24, 2012, 12:42:30 AM1/24/12

to

Andy (Super) Glew:
«x87 has a variety of float accessible from a "constant ROM"» — these
are just 6 hardwired constants, not all potentially needed. Besides,
x87 is long obsolete; does anyone use it now? Even transcendental
functions (trigs, logs) are faster with SSE2+ (by partial sum of
series with LUTs).

«By the way, one of my first computer architecture studies, back in
undergrad, was on how many immediate bits to allocate» — oh, BTW, do
you know if there's any open and fresh research with various code
statistics for x86? I mean stuff like: avg. instruction length, # of
instructions between jumps (all and certain types); avg. immediate,
displacements, loads and stores per instruction; partial register
usage, address modes usage, etc. — thousands of numbers. I'm sure big
companies have done (and are doing) such research with large-scale
emulations for a lot of popular code, but none of them are open for
public.

George Neuner

Jan 24, 2012, 2:30:20 AM1/24/12

to

On 23 Jan 2012 20:54:32 +0000 (GMT), Thomas Womack

<two...@chiark.greenend.org.uk> wrote:

>Did anyone manage to write a multi-tasking OS on a 6502?

Simple task switchers on top of some DOS, absolutely. However, I'm
not aware of anything I would have called a "multitasking OS" on the
6502.

There existed at least 2 working prototypes of a 65816 multitasking
system on the Apple //gs - both layered over the existing Apple GS/OS
system. One was a Macintosh-like MultiFinder, the other (which I
participated in) was a more Unix-like multitasking shell.

The '816 extended the basic 6502 register set to 16-bit, added 8-bit
bank registers to the program counter, accumulator and index registers
allowing 16MB to be addressed in 64KB segments, added a 16-bit base
register for the direct (zero) page allowing it to be placed anywhere
within the first 64KB, and the ISA somewhat rounded out the asymmetric
6502 addressing modes.
See: http://www.65xx.com/wdc/documentation/w65c816s.pdf

>You might
>well need to use some sort of paging support to get enough memory in,

The main problem for multitasking was the 16-bit stack, which was
constrained to memory bank 0 and under GS/OS was further limited to
about 32KB. The '816 had block move instructions that could have been
used to swap out the stack and direct pages, but the project I worked
on (ab)used the Ensoniq voice chip in the //gs.

The Ensoniq chip in the //gs had a dedicated wave RAM and a couple of
DMA channels. We stole one DMA channel and 32KB of the RAM to buffer
stack data during a context switch. We used DMA to copy the current
stack to the Ensoniq buffer while simultaneously bringing in the new
stack using a CPU block move. Then while the new task proceeded, the
buffered stack was DMA'd to its final save location in memory.

Using the DMA trick we were able to do a task switch in about 15ms on
the stock 2.8 MHz hardware. Because we leveraged GS/OS and some of the
//gs Toolbox services, our prototype system was limited to running 14
simultaneous tasks (and each task to 32KB of stack) ... but it worked.
There was a preliminary plan in the works to support multitasking //gs
Toolbox (graphical) apps - saving internal Toolbox state - and to
allow creating more tasks, but interest in developing for the //gs
faded before anything more came of it.

Oh well. It was fun at the time 8-)

>I wonder how much of zero-page you'd want to save on task switch.

Most likely all of it. Typical 6502 programs made extensive use of
zero page pseudo-registers.

>Tom

George

Paul A. Clayton

Jan 24, 2012, 7:24:07 AM

I guess I was not clear. I meant that the constant
pool would be located at a (e.g.) 1 MiB aligned
address based on the instruction address. Each
MiB of code could have 64 Ki 32-bit constants and
32 Ki 64-bit constants (or just 64 Ki 64-bit
constants). I seem to recall that ARM's AArch64
has a similar operation which (IIRC) loads the
PC masking off a large number of bits, so the
idea might not be crazy (or I may just be
misremembering).

Paul A. Clayton

Jan 24, 2012, 7:31:06 AM

In André Seznec's "Don't use the page number, but a pointer
to it" (1996), a separate table of physical page numbers
was proposed. The pointer would simply index the table.
For virtual addresses, using a TLB as the table was
presented as a possibility.
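A rough sketch of the indirection as summarized above, not of the paper's actual hardware design: page numbers live once in a small table, and whatever would otherwise hold a full page number holds a short index instead. The class name, the 4 KiB page size and the allocate-on-first-use policy are all assumptions for illustration.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages

class PageNumberTable:
    """Table of page numbers; users store a small index ("pointer") into it."""

    def __init__(self):
        self.entries = []     # index -> page number
        self.index_of = {}    # page number -> index

    def pointer_for(self, page_number: int) -> int:
        """Return the table index for a page number, allocating a slot if needed."""
        if page_number not in self.index_of:
            self.index_of[page_number] = len(self.entries)
            self.entries.append(page_number)
        return self.index_of[page_number]

    def full_address(self, pointer: int, offset: int) -> int:
        """Rebuild the full address from a pointer plus an in-page offset."""
        return (self.entries[pointer] << PAGE_SHIFT) | offset
```

The saving comes from the pointer being much narrower than the page number whenever only a modest number of distinct pages are live at once.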

Paul A. Clayton

Jan 24, 2012, 8:46:03 AM

On Jan 24, 7:24 am, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
[snip]

> address based on the instruction address. Each
> MiB of code could have 64 Ki 32-bit constants and
> 32 Ki 64-bit constants (or just 64 Ki 64-bit
> constants). I seem to recall that ARM's AArch64
> has a similar operation which (IIRC) loads the
> PC masking off a large number of bits, so the
> idea might not be crazy (or I may just be
> misremembering).

Well, my memory was not exceedingly defective. The
ARM AArch64 instruction (described on page 35,
section 5.3.4 of the "ARMv8 Instruction Set Overview"
[PRD03-GENC-010197 15.0]) takes the PC, adds a
21-bit sign-extended immediate that has been shifted
left 12 bits, and masks off the low 12 bits of the
result.
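A small sketch of that address computation as described here: add the shifted, sign-extended 21-bit immediate to the PC, then clear the low 12 bits. The function names are mine, and the wrap to 64 bits is an assumption for illustration.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def page_relative_address(pc: int, imm21: int) -> int:
    """PC plus (sign-extended imm21 << 12), with the low 12 bits masked off."""
    page_offset = sign_extend(imm21, 21) << 12
    return ((pc + page_offset) & ~0xFFF) & MASK64
```

A signed 21-bit immediate shifted left by 12 spans 2^33 bytes, so the result can land on any 4 KiB-aligned address within about 4 GiB either side of the PC.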

George Neuner

Jan 24, 2012, 10:37:08 AM

On 23 Jan 2012 20:54:32 +0000 (GMT), Thomas Womack
<two...@chiark.greenend.org.uk> wrote:

>Did anyone manage to write a multi-tasking OS on a 6502? You might
>well need to use some sort of paging support to get enough memory in,

Thinking about it a little more, if I absolutely *had* to do a tasking
6502 application now, I would use the 65802, which internally is an
'816 but externally is '02 pin compatible. The '802 can be run in
8-bit mode while still allowing the direct page and stack to be
anywhere within the 64KB address space.

>Tom

George

George Neuner

Jan 24, 2012, 10:48:36 AM

On Mon, 23 Jan 2012 16:02:01 -0600, Robert Wessel
<robert...@yahoo.com> wrote:

>As for the [6502] hardware stack, since accessing it outside the

>push/pop/call/return instructions was somewhat clumsy, and the size
>was so limited, most systems that needed a stack for C- or Pascal-like
>activation records used a separate area pointed to by a zero-page
>word.

That general clumsiness carried over to the 65802/816 also, but these
chips had a 16-bit stack register and the 256-byte direct (zero) page
was relocatable via a 16-bit base register. The convention on these
chips was to abuse the direct page register as a frame pointer.

George

vince

Jan 24, 2012, 12:46:52 PM

On Jan 24, 2:30 am, George Neuner <gneun...@comcast.net> wrote:
> On 23 Jan 2012 20:54:32 +0000 (GMT), Thomas Womack
>

> <twom...@chiark.greenend.org.uk> wrote:
> >Did anyone manage to write a multi-tasking OS on a 6502?
>
> Simple task switchers on top of some DOS, absolutely. However, I'm
> not aware of anything I would have called an "multitasking OS" on the
> 6502.

Try Lunix ( http://lng.sourceforge.net/ ).

I always wanted to do an Apple II port, but pre-IIgs there really
isn't a good programmable interrupt source unless you have a mouse
card installed or else solder together a custom board.

Vince

Joe keane

Jan 24, 2012, 1:39:27 PM

In article <a6e*cj...@news.chiark.greenend.org.uk>,
Thomas Womack <two...@chiark.greenend.org.uk> wrote:
>Unless

Sorry, i wrote the first thing to come to my head [i knew there were two
modes like that but i couldn't remember which used X and which used Y]
and said 'of course we have to fix that' which at some point i forgot.

>LDA (&70,x) a=mem[256*mem[(0x71+x)&0xff] + mem[(0x70+x)&0xff]]
>LDA (&70),y a=mem[256*mem[0x71] + mem[0x70] + y]

I mean those two, there are no two-byte versions.

Normally you would say, so what, just use two instructions. But how?
So this zero-page stuff is not a useful hack, it's almost mandatory.
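For anyone who has forgotten which is which, here is a sketch of the two modes quoted above, following those formulas. `mem` is just any 64 KB indexable byte array; the function names are made up.

```python
def lda_indexed_indirect(mem, zp, x):
    """LDA (zp,X): X is added to the zero-page address, then the pointer is fetched."""
    lo = mem[(zp + x) & 0xFF]
    hi = mem[(zp + 1 + x) & 0xFF]
    return mem[(hi << 8) | lo]

def lda_indirect_indexed(mem, zp, y):
    """LDA (zp),Y: the pointer is fetched from zero page, then Y is added to it."""
    lo = mem[zp & 0xFF]
    hi = mem[(zp + 1) & 0xFF]
    return mem[(((hi << 8) | lo) + y) & 0xFFFF]
```

In both cases the 16-bit pointer itself has to sit in zero page, which is the point being made: there is no form of either mode that takes the pointer from an arbitrary two-byte address.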

I mentioned the 6819 a while back. It's cool, kind of a fusion of 6502
and 8086 but with all the stupid parts better.

George Neuner

Jan 24, 2012, 2:49:57 PM

Speaking without looking

On Tue, 24 Jan 2012 10:37:08 -0500, George Neuner
<gneu...@comcast.net> wrote:

>if I absolutely *had* to do a tasking 6502 application now,

>I would use the 65802 ...

It was brought to my attention that the '802 no longer is in
production. However, WDC does offer '816 cores so it is possible to
implement a custom version in ASIC or FPGA.

George

Joe keane

Jan 24, 2012, 5:28:07 PM

In article <4F1DA43D...@SPAM.comp-arch.net>,
Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>Nearly every OS, on machines that do not switch some registers or run on
>a special interrupt stack in at least some modes, have some sort of
>"locore" - an area of memory whose address is such that it can be
>constructed via an immediate WITHOUT DESTROYING ANY REGISTERS, so that
>you can save the registers that you would have to destroy in order to
>access any region of memory further afield.

Of course, this is what the machine gives you. If this is an
architectural assumption, rather than some guy's convention, that r0 is
as important as the PC, one might assume that it saves r0 as well.

e.g.

save r0 to r99
save PC to r98
move _function_pointer_i_want to r0
nop
move (r0) to PC
nop

However one may assume that the machine has *some* way to make an
interrupt handler, otherwise no one would use it.

I remember, in context of Z80 successor, there's some sort of 'swap to
alternate registers' which is precisely intended for this.

In fact, i think SPARC has some 'alternate global'.

Andy (Super) Glew

Jan 24, 2012, 10:59:52 PM

This is what *some* machines give you. Most sane machines, one might hope.

[Woah, I feel a comp-arch.net article coming on. I have long wanted to
address the issues of interrupt architecture, but I have long felt that
most treatments miss something fundamental. And I don't like just
listing one example after the other, because so many people stop at the
first example, which is one reason why there are so many messed up
interrupt architectures. It is a big enough topic that I never feel
that I get a least postable unit. But the journey of a thousand miles
begins with a single step, and this might be it.
Wiki in
http://semipublic.comp-arch.net/wiki/Interrupt_Delivery_and_State_Switching.
I expect, hope that I can evolve it into something more coherent. Now,
let me try to start off with a proper introduction.]

= Interrupt State Manipulation =

One of the biggest issues in an [[Interrupt Architecture]] is how to
manipulate state. How to save the state of the interrupted code,
thread, or process (the interruptee), and how to load the state of the
interrupt handling code (which I will call the interrupt handler, rather
than the interruptor, since in most cases the interruptor is the I/O
device that causes an interrupt, or the logic that detects an
exception). Where to save the state. Who can access what parts of the
state. How much state is loaded. What other events or interrupts can
interrupt the interrupt handler.

Worst or simplest: machines that take [[interrupts on the current
stack]] that the user code, the interruptee, is running on. Bad, bad,
bad. [OK, I usually try to tone these editorial comments down for the
wiki.] Bad in so many ways: (a) the user might have deliberately set the
stack to the edge of an invalid page, so the interrupt will take an
unacceptable page fault. (b) the fact that an interrupt has occurred,
the fact that the user state is saved on the user stack, might be
visible to other threads in the same user process. I.e. it is not
properly virtualized.

Basically, you can't trust the user stack. So if you do an interrupt
without switching stacks, you must save to [[locore]].

It is a bit better if you save interruptee PC and other state in some
special registers, and switch in a new OS PC and stack pointer from some
other special registers. Just a bit better - because what happens if
the OS interrupt handler is itself interrupted. Some RISCs said that
interrupts would be blocked until the OS first level interrupt handler
had done its work... but that is shortsighted. NMIs (Non Maskable
Interrupts) happen. As, for that matter, do interrupts that switch
virtual machines.

Variants:
* [[interrupt register save/load]]

save_PC_reg := PC; PC := load_PC_reg
save_reg := GPR; GPR := load_reg

* [[interrupt register mode switch]]

PC_lvalue() { if mode0 then ordinary_PC else alternate_PC; }
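A toy model of the two variants, to make the difference concrete. This is only a sketch: the field names follow the pseudocode above, and the initial values (handler entry point and so on) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SaveLoadCPU:
    """Variant 1: dispatch copies state out to save registers, in from load registers."""
    pc: int = 0
    gpr: int = 0
    save_pc_reg: int = 0
    save_reg: int = 0
    load_pc_reg: int = 0x1000   # hypothetical handler entry point
    load_reg: int = 0x8000      # hypothetical handler register value

    def interrupt(self):
        self.save_pc_reg, self.pc = self.pc, self.load_pc_reg
        self.save_reg, self.gpr = self.gpr, self.load_reg

    def interrupt_return(self):
        self.pc, self.gpr = self.save_pc_reg, self.save_reg

@dataclass
class ModeSwitchCPU:
    """Variant 2: nothing is copied; a mode bit selects which physical PC is 'the' PC."""
    mode0: bool = True
    ordinary_pc: int = 0
    alternate_pc: int = 0x1000  # hypothetical handler entry point

    @property
    def pc(self) -> int:
        return self.ordinary_pc if self.mode0 else self.alternate_pc

    def interrupt(self):
        self.mode0 = False

    def interrupt_return(self):
        self.mode0 = True
```

Calling `interrupt()` twice on `SaveLoadCPU` without an intervening return overwrites `save_pc_reg`, which is exactly the nested-interrupt problem: one set of save registers covers only one level.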

It is a little bit better if the [[interrupt stack pointer]] is not just
a special register that is loaded into the stack pointer after the user
stack pointer has been saved by the interrupt dispatch hardware or
microcode - but if it is a real stack pointer, so that if an interrupt
occurs while the interrupt stack pointer is active, then it is incremented.

I.e. in the N-1st case there is an [[interrupt stack base register]]
that is loaded. Whereas in the Nth case there is a true [[interrupt
stack pointer]], that is loaded if not active, or incremented if active.
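The distinction can be sketched as follows, assuming fixed-size frames and a stack that grows upward (both assumptions made only to keep the example small; the names are hypothetical).

```python
def frame_from_base_register(base: int) -> int:
    """N-1st case: every interrupt gets the same frame, so a nested one overwrites it."""
    return base

class InterruptStack:
    """Nth case: loaded if not active, incremented if already active."""

    def __init__(self, base: int, frame_size: int):
        self.base = base
        self.frame_size = frame_size
        self.depth = 0          # 0 means the interrupt stack is not active

    def enter(self) -> int:
        """Return the frame address for a newly taken interrupt."""
        frame = self.base + self.depth * self.frame_size
        self.depth += 1
        return frame

    def leave(self):
        assert self.depth > 0
        self.depth -= 1
```

With `InterruptStack(0x8000, 64)`, the first interrupt gets the frame at 0x8000 and a nested one gets 0x8040, so the outer frame survives.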

But this doesn't work for virtual machine interrupts, because the guest
(the OS on which the virtual machine is running) is not supposed to see
the state get interrupted. [[virtual machine interruptee/interrupt
handler state isolation requirement]].

Observations, attempting to generalize:

a) interruptee state needs to get saved in a place that is not
necessarily visible to the interruptee.

b) state switch must include the PC.

c) state switch must include whatever control register and mode bits
should apply to the interrupt handling code.

d) if state switch does not include at least one register which can be
used to create addresses, then locore must be used.

e) the interrupt handler must somehow know what interrupt number,
etc., it is handling - if only so that it can know what dedicated
datastructures in locore it can use while doing the rest of the state
switch in software.

Here "know" means "the interrupt handler's behavior must be adjustable
so that interrupts that do not block each other can access disjoint
state save and reload areas."

e.1) interrupt vectors can accomplish this. Not necessarily a vector
per interrupt source, but per class of interrupts that can mutually
interrupt each other. E.g. at least ordinary I/O interrupts, and NMIs;
probably also priority classes.

e.2) a register that contains an interrupt number can accomplish this.
But not necessarily a control register unless you can guarantee
blockout. Loading an interrupt ID into an ordinary register can work,
but only if that register is itself saved or shadowed.

(By the way, it is amazing how many times I have seen, or written, code
that starts off in separate interrupt vectors, does some state
manipulation in locore, loads an interrupt ID number into an ordinary
register, and then transfers back to a common interrupt handler. I.e.
how often e.1 is transformed into e.2.)
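A sketch of that e.1-to-e.2 pattern, with a dictionary standing in for the CPU's registers and another for the per-class save slots in locore. Everything here (names, the single register `r0`) is hypothetical and only illustrates the shape of the code.

```python
LOCORE = {}   # hypothetical: one save slot per interrupt class

def common_handler(cpu: dict) -> int:
    """Shared handler: finds out which class it is serving from r0."""
    class_id = cpu["r0"]
    # ... the real work for this interrupt class would go here ...
    cpu["r0"] = LOCORE.pop(class_id)   # restore the interruptee's r0
    return class_id

def make_vector_stub(class_id: int, handler):
    """Build the tiny per-vector entry stub for one interrupt class."""
    def stub(cpu: dict) -> int:
        LOCORE[class_id] = cpu["r0"]   # save r0 in this class's own locore slot
        cpu["r0"] = class_id           # r0 now carries the interrupt ID
        return handler(cpu)            # transfer to the common handler
    return stub
```

Because each class saves into its own slot, two classes that can interrupt each other do not trample each other's saved register.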

f) saving or shadowing registers to special registers works - but only
if there is one set of such registers for every group of interrupts that
can be interrupted. Or if you believe that large classes of interrupts
can be briefly blocked.

Which amounts to the same thing.

= [[Interrupt classes that can interrupt each other]] =

The key here is "what other interrupts or events can interrupt an
interrupt handler, at what times"? Or, put another way "What parts of
an interrupt handler are atomic, wrt which other interrupts or events."

We may allow a low priority interrupt handler to block a high priority
ordinary interrupt, but only briefly. I.e. we may start the interrupt
handler "all blocked", for some definition of all that includes multiple
priority classes.

But we probably should not block NMI. (Although, yes, I know that many
systems have non-maskable-interrupt masking bits. Even more have
mechanisms that are supposed to block the sources of NMIs.)

We probably cannot block machine check interrupts. (Or, rather, soft
machine checks can be blocked or delayed, but hard machine checks mean
that the interruptee is dead.)

We probably cannot or should not block virtualization events (VMEXITs)
in guest interrupt handler code.

And there may be more "interrupt classes that cannot be blocked". Secure
monitors that are outside the virtual machine manager - DRM security
modes, etc. Hardware virtual machine layers that are used to fix bugs
under the hypervisor. Etc.

Each such "interrupt class that cannot be blocked" needs its place to
save state. If special registers, multiple.

Or, you can avoid special registers by saving to dedicated areas in
external memory. But saving to memory usually requires microcode. Such
microcode, by the way, really is software running in a mode that can
block all interrupts. It's just software provided by the HW vendor. Or
whatever vendor runs the software that underlies all other software.
(Hint: not the OS, neither Microsoft nor Linux. Probably not a VMM
vendor like VMware. Probably not Wind River. Hopefully not the NSA or the
MPAA. Q: what vendor(s) must be trusted by all other vendors in the
system? A: the hardware + ... other vendors to be named later.)

(I once was talking to a guy about virtual machine layers. He had the
mindset that each layer required a new register set. Memory simplifies.)

=

Let me close by mentioning the Gould SEL interrupt system. Which Mitch
may have had some involvement with.

Each interrupt class had a dedicated area in memory. In locore, for
some of the usual reasons.

Some interruptee registers were saved to this area. Some handler
registers were loaded from this area. The area was big enough for
handler software to manipulate more registers, before it did its job.

Sometimes the handler transferred this state back to the interruptee's
OS stack. While it might have been nice not to have to bounce data like
this, I see this as an optimization. I.e. you should always provide
the simpler scheme as a fallback.

I don't recall the Gould hardware doing this, but in general the amount
of state that gets saved/loaded may need to be [[parameterizable |
parameterizable interrupt state save/load]]. In many cases we may want
to minimize it. But in other cases we may need to maximize, e.g. in
some situations the interrupt handler is not allowed to see any of the
interruptee state. Which either means that the interruptee state must
all be saved, and then zeroed, before being delivered to the interrupt
handler, or that somehow it must be locked, so that attempts by the
interrupt handler to access it can be intercepted, e.g. trapped.
[[Delayed interrupt state save/load]] (which introduces complexities
such as remembering where, and how, and who.)

Some variant of this is the only interrupt and event delivery scheme
that I am familiar with that is a [[scalable interrupt architecture]].
That does not have to be completely redesigned when you add a new class
of interrupts. Add such a new class, and you just give it a register
pointing to the dedicated interruptee state save and interrupt handler
state load areas. And maybe add a few extra registers to the list of
what gets saved and loaded. You don't even have to have a register
pointing to the areas: you can have a single register pointing to a
vector table.

Problem is, saving state to memory is microcode. And/or PALcode. If you
can't live with that, but must switch registers in hardware, then be
prepared for regular arguments about whether a new interrupt class can
be avoided, and trying to argue that certain registers do not need to be
shadowed. (And then getting it wrong, and having to fix it.)

The gotcha is that on modern systems, not all interrupt handlers are
allowed to see the interruptee state. I.e. sometimes the interruptee
save area and the interrupt handler load area must be in disjoint memory
spaces, protected from each other. This is not a big thing to solve.
You have just got to live with it. (Although many people try to fight
against it. They say it is just an evil artifact of DRM and the MPAA,
but, although DRM and the MPAA are the biggest reasons, there are other
reasons to want to isolate interruptee and interrupt handler from each
other.)

= [[Interrupt handling by thread switching]] =

Finally, interrupt handling by switching to a different thread. Doesn't
really change much. Amounts to the same thing.

If the interruptee and all possible interrupt handlers that can be
running at the same time can all fit into registers - separate, isolated
registers - great. That is just dedicated registers for each interrupt
class.

And if the thread state(s) cannot fit into registers, but must be
spilled to memory, then it is equivalent to what I said in the
previous section.

[a] there is an interruptee/thread state save area (or thread area) that
[a.1] may need to be isolated from the interruptee/thread
[a.2] may need to be isolated from the interrupt handler.

[b] there is an interrupt handler/thread state context area. At least
one for every class of events that can be simultaneously interrupting
each other.

The nice thing about thinking about things as interrupt threads is that
it tends to drive you to conclusions about the state that are more
scalable. You are less likely to try to take shortcuts that will prove
problematic. Threads/processes may have entire register files...
Threads/processes may be in different protection domains.

= At least three privilege domains =

Finally^2, note that there are at least three privilege domains here.
(I originally said "levels", but corrected myself to say "domains",
since a ranking from high to low is not necessarily implied.)

1) the interruptee

2) the interrupt handler

3) the code that

3.1) saves the interruptee state to a save area that is accessible to
neither 1) nor 2).

3.2) loads the interrupt handler state.

I think that you can get away with these three levels. Level 3) may be
hardware, or microcode, or some universally trusted software layer. (Hah!)

But you can also split 3.1) and 3.2) into separate privilege levels.

And more.

= [[Interrupt return]] =

Similar issues arise when returning from an interrupt.
The simple cases are simple.

The harder cases are harder.

E.g. how does a

= [[Lessons from History]] =

I think that some instruction set architectures, in 2011 at the time
of writing quite old, attempt to address these issues.

E.g. "Interrupt to nested task"

etc.

I do not think that all are completely general. And, I must admit, I am
trying to find a general solution - a general framework in which to
describe the issues, if not a general interrupt architecture.

nm...@cam.ac.uk

Jan 25, 2012, 4:03:41 AM

In article <4F1F7E38...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>
>Worst or simplest: machines that take [[interrupts on the current
>stack]] that the user code, the interruptee, is running on. Bad, bad,
>bad. [OK, I usually try to tone these editorial comments down for the
>wiki.] Bad in so many ways: (a) the user might have deliberately set the
>stack to the edge of an invalid page, so the interrupt will take an
>unacceptable page fault. (b) the fact that an interrupt has occurred,
>the fact that the user state is saved on the user stack, might be
>visible to other threads in the same user process. I.e. it is not
>properly virtualized.

Not at all. It is the ideal way to handle interrupts that should
be kept in the context of the executing thread - e.g. floating-point
wittering. If the stack is in a mess, then it just changes one user
error for another.

I agree that it is catastrophic for unrelated interrupts.

>= [[Interrupt classes that can interrupt each other]] =
>
>The key here is "what other interrupts or events can interrupt an
>interrupt handler, at what times"? Or, put another way "What parts of
>an interrupt handler are atomic, wrt which other interrupts or events."

Right. Any properly engineered scheme works; any hacked about one
doesn't. This becomes absolutely critical when, as all reasonable
systems should, you allow user code to be able to handle its own
interrupts.

>= [[Interrupt handling by thread switching]] =
>
>Finally, interrupt handling by switching to a different thread. Doesn't
>really change much. Amounts to the same thing.
>
>If the interruptee and all possible interrupt handlers that can be
>running at the same time can all fit into registers - separate, isolated
>registers - great. That is just dedicated registers for each interrupt
>class.
>
>And if the thread state(s) cannot fit into registers, but must be
>spilled to memory, then it is equivalent to what I said in the
>previous section.

Not at all. Sorry. This is a ghastly idea for precisely the same
classes of interrupt that can be handled in the interrupted thread
(with its stack). Handling those in another thread causes just so
MANY unnecessary problems!

But for unrelated interrupts, it's better.

Regards,
Nick Maclaren.

Paul A. Clayton

Jan 25, 2012, 10:51:24 AM

On Jan 24, 10:59 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]

> (b) the fact that an interrupt has occurred,
> the fact that the user state is saved on the user stack, might be
> visible to other threads in the same user process. I.e. it is not
> properly virtualized.

Do some garbage collectors require visibility of register
contents that might contain addresses?

I admit that I do not understand the great benefit of
true (invisible except for performance) virtualization.
Allowing a guest to know that it is a guest does not
seem that problematic for most uses. (For some
security-oriented uses, secrecy might be useful.)
Obviously a more general mechanism is preferred if
it has no/minimal additional costs, but sometimes it
seems that excessive effort is expended to get the
last 1%. (Of course, I also undervalue compatibility,
so my views on this are both ignorant and distorted.)

MitchAlsup

Jan 25, 2012, 1:23:02 PM

to an...@spam.comp-arch.net

On Tuesday, January 24, 2012 9:59:52 PM UTC-6, Andy (Super) Glew wrote:
> I don't recall the Gould hardware doing this, but in general the amount
> of state that gets saved/loaded may need to be [[parameterizable |
> parameterizable interrupt state save/load]].

Background, the S.E.L. (A.K.A. Gould) machines were 360-like minicomputers. As such most of their function was controlled with a Program Status Doubleword.

I was the architect of the task-to-task interrupt architecture of the Gould 32/87. One program could do an SVC, and the SVC would point to an interrupt vector doubleword. One word contained the program status and the memory mapping tables; the other word contained a save/restore table address. HW would save the registers of the SVC-ing task to its current save area, load the PSDW of the new task, and restore the registers from that task's save area. Presto, a complete context switch without an excursion through the OS. Effective context switch time was one Store-multiple instruction plus one Load-multiple instruction, plus one load doubleword, plus some cycles to resynch the memory management. Real Time people loved this, Unix not so much.
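In very rough software terms the sequence looks something like the sketch below. This is a toy model of the steps just listed, not the 32/87's actual formats: the register count of 8, the class names and the way the PSDW is represented are assumptions.

```python
from dataclasses import dataclass, field

NUM_REGS = 8   # assumed register count, for illustration only

@dataclass
class Task:
    psdw: int                                      # program status doubleword (opaque here)
    save_area: list = field(default_factory=lambda: [0] * NUM_REGS)

@dataclass
class CPU:
    regs: list
    current: Task

def svc_switch(cpu: CPU, target: Task) -> None:
    """Task-to-task switch with no excursion through the OS."""
    cpu.current.save_area[:] = cpu.regs   # "store multiple" into the caller's save area
    cpu.current = target                  # load the new task's PSDW (and mapping)
    cpu.regs[:] = target.save_area        # "load multiple" from the new task's save area
```

The three statements correspond to the store-multiple, the doubleword load and the load-multiple in the timing estimate.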

----------------------------------------------------------

It seems to me that the interrupt/exception/syscall architectures of modern machines are not very good, and are not supported very well with instructions to offload much of the work/effort. Unfortunately the one machine that sort-of got it right got so many other things wrong--SPARC. The light weight context switch to routines to flush and fill register windows is more what we are looking for, yet we end up with things that smell a lot more like FAR CALL through a Call-gate that changes many more things than necessary.

So, it seems to me that what we want is a save area where CPU registers can be dumped that is associated with the current context but not necessarily accessible by the current context, a save area where new registers can be picked up but not otherwise accessible by the new context, and that the amount of registers being saved and restored can be minimized but always remains sufficient and the whole caboodle remains 'safe' in a privilege sense, a protection sense, and a security sense. This necessarily drags the memory management registers through the process.

Given that many entrances to the OS are serviced immediately and then return with little likelihood of needing to schedule a different task, one can apply lazy 'evaluation' of the save and restore process (possibly controlled by the control vector of the to and from save areas.) Given lazy 'evaluation' it simply makes sense for the HW to manage this function. If the need arises to finish the save/restore job so another context switch can be performed, HW (or microcode) is in the position to know what registers are affected, where they go, and how to get them there.
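One way to picture the lazy scheme, as a software sketch only (in the proposal this bookkeeping would live in HW or microcode; the class and method names are hypothetical): a register is spilled the first time the handler overwrites it, and the remainder is written out only if a full context switch turns out to be needed.

```python
class LazyContext:
    def __init__(self, regs: dict):
        self.live = dict(regs)    # register file, initially the interruptee's values
        self.save_area = {}       # what has actually been written out so far

    def handler_write(self, name: str, value: int) -> None:
        """Spill the old value once, just before the handler clobbers it."""
        if name not in self.save_area:
            self.save_area[name] = self.live[name]
        self.live[name] = value

    def finish_save(self) -> None:
        """Only when another context switch is needed: save the untouched rest."""
        for name, value in self.live.items():
            self.save_area.setdefault(name, value)

    def quick_return(self) -> None:
        """Common case: restore just the registers the handler touched."""
        self.live.update(self.save_area)
        self.save_area.clear()
```

In the common quick-return case the cost is proportional to the registers actually used by the handler rather than to the size of the register file.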

Mitch

Robert Wessel

Jan 25, 2012, 5:44:25 PM

There's not really a problem giving the guest OS a way to find out
that it's running under a hypervisor, and almost all (all?)
hypervisors provide some sort of interface*, even for nominally
"purely" virtualized guests.

But the problem is exposing the VM by altering the ISA assumed by the
OS. If the OS can be modified to run under the VM, you can certainly
take some liberties with the ISA (IOW, the OS promises not to become
confused by incorrect information exposed in certain places). But if
the OS has not, or cannot, be modified, any alteration of the ISA can
lead to issues if the OS uses them. Let's say a guest OS has some
function that can be invoked by code at different privilege levels,
but that it needs to behave differently based on which. The OS could
keep track of what mode it's in independently (which would not cause
problems), or it could check the current privilege level by looking at
(on x86) the flags register. If the IOPL is "funny" because the VM is
using one of the privilege levels to run itself, the OS will likely
become confused and do something wrong. Now some *other* method of
letting the guest know is no problem, and that can even be architected
into the ISA (on S/370, for example, the first byte of the value stored by
"Store CPU ID" is defined to be 0xff under VM). The problem is not
knowing what the OS might do if it's lied to. Put another way, if you
alter the ISA, the OS will suddenly be running on some ISA that it was
not written for. Bad results follow.

*All sorts of things can usefully be done that way. For example, even
if the guest OS is completely unaware of the VM, you can still write a
device driver for that OS that can talk to the VM. This can be used
to implement more efficient I/O devices. For example, rather than
having the hypervisor tease meaning out of a series of virtualized
port/memory mapped I/O operations to figure out that the guest OS wants
to read a sector of a virtual disk image, the device driver can just
say "read sector 123 on drive 7" directly, leading to efficiencies at
both ends. Another useful function is being able to issue various
configuration/control commands to the hypervisor. Let's say there's a
real tape drive (which is obviously not shareable) on the system,
there might be a VM command to attach the tape drive to a particular
guest. A bit of an application and a device driver can pass that
command to the hypervisor from a guest.
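To make the I/O efficiency point concrete, here is a toy comparison. It is purely illustrative: the register names, the three-write sequence and the `trapped_ops` counter are invented for the example and do not describe any real device or hypervisor interface.

```python
class ToyHypervisor:
    def __init__(self, disks: dict):
        self.disks = disks          # drive number -> {sector number: data}
        self.trapped_ops = 0        # how many guest operations had to be intercepted

    def port_write(self, device_state: dict, register: str, value):
        """Emulated device: every register write traps and must be decoded."""
        self.trapped_ops += 1
        device_state[register] = value
        if register == "command" and value == "read":
            return self.disks[device_state["drive"]][device_state["sector"]]
        return None

    def read_sector(self, drive: int, sector: int):
        """Direct request from a VM-aware driver: one trap, no decoding."""
        self.trapped_ops += 1
        return self.disks[drive][sector]
```

Reading sector 123 of drive 7 through `port_write` takes three trapped operations in this toy model, while `read_sector(7, 123)` takes one and never requires the hypervisor to reconstruct the guest's intent.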

Andy (Super) Glew

Jan 25, 2012, 10:18:08 PM

On 1/25/2012 10:23 AM, MitchAlsup wrote:
> On Tuesday, January 24, 2012 9:59:52 PM UTC-6, Andy (Super) Glew wrote:
>> I don't recall the Gould hardware doing this, but in general the amount
>> of state that gets saved/loaded may need to be [[parameterizable |
>> parameterizable interrupt state save/load]].
>
> Background, the S.E.L. (A.K.A. Gould) machines were 360-like
> minicomputers. As such most of their function was controlled with a
> Program Status Doubleword.
>
> I was the architect of the task-to-task interrupt architecture of
> the Gould 32/87. One program could do an SVC, the SVC would point to
> an interrupt vector DoubleWord. One word contained the program
> status and the memory mapping tables, the other word contained a
> save/restore table address. HW would save the registers of the
> SVC-ing task to its current save area, load the PSDW of the new task
> and restore the registers from that task's save area. Presto, a
> complete context switch without an excursion through the
> OS. Effective context switch time was one Store-multiple instruction
> plus one Load-multiple instruction, plus one load doubleword, plus
> some cycles to resynch the memory management. Real Time people loved
> this, Unix not so much.

I did Real Time UNIX on this machine. I don't recall using the
task-to-task interrupt architecture you describe, but the I/O
interrupt architecture was very similar. (Did you do that, Mitch?)
And it was very pleasant to work with.

> So, it seems to me that what we want is a save area where CPU
> registers can be dumped that is associated with the current context
> but not necessarily accessible by the current context, a save area
> where new registers can be picked up but not otherwise accessible
> by the new context, and that the amount of registers being saved and
> restored can be minimized but always remains sufficient and the
> whole caboodle remains 'safe' in privilege sense, protection sense,
> and in a secure sense. This necessarily drags the memory management
> registers through the process.

Here's where I think many systems mess up.

Different events need different amounts of state saving / loading.

Virtual machine events need almost all control register state to be
saved for the guest and loaded for the host. Only a few user mode
registers need to be saved / loaded, however - enough so that the host
can do work.

(By the way, I say again: when I say save / load, I am equally happy if
you switch to an alternate control register set. Or if you switch to a
mode, call it VMX, where if that mode bit is set various control
registers are hardwired to certain fixed values (i.e. [[switch to
alternate control register set in ROM]]). Perhaps it is best to say
"register switch" rather than "register save/load" - although I spent
much of my last post explaining that you need an alternate register set
to switch to for every possible set of nested interruption. Many bugs
occur because people try to escape that simple fact.
And, I also say that if it is done in memory, you can painlessly
add new events at almost any time. Whereas if you are using switched
register (sub)sets, you will realize that an event that you thought was
not nestable must be, and tie yourself in knots.)

An event that can be handled by the same thread, on the same thread
stack, doesn't need memory management registers to be switched. Heck,
it doesn't need the stack pointer to be switched.

An event that is delivered to a user process that switches to the OS
must necessarily change privilege. But it may not need to change memory
management, since many (but not all) OSes live in all user process
memory maps. And it must be possible to switch stacks, although that
can be done by staging through locore.

An event that is delivered to OS code that is already running on the OS
interrupt stack must NOT switch stack pointers.

I can easily imagine a mask that indicates what needs to be
saved/loaded, or switched. Or a list of such registers. And if you don't
like interpreting such a list, just hardwire the few special cases you
want to support. But if you don't hardwire the worst case, realize that
you may have to add it later.
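
As a rough illustration of the mask idea, here is a toy Python sketch. The register-group names and event classes are invented for the example (they are not taken from any real ISA); the only point is that each event class carries its own mask and the worst case is always available.

```python
from enum import Flag, auto


class Regs(Flag):
    """Hypothetical groups of state that an event may need switched."""
    SCRATCH = auto()      # a few integer registers for the handler
    STACK_PTR = auto()
    PRIVILEGE = auto()
    MEM_MGMT = auto()     # page-table base and related registers
    CONTROL = auto()      # the remaining control registers
    USER_REGS = auto()    # the rest of the user-mode registers


# The worst case switches every group.
WORST_CASE = (Regs.SCRATCH | Regs.STACK_PTR | Regs.PRIVILEGE
              | Regs.MEM_MGMT | Regs.CONTROL | Regs.USER_REGS)

# One mask per event class, following the cases discussed above.
SWITCH_MASK = {
    "same_thread": Regs.SCRATCH,
    "user_to_os": Regs.SCRATCH | Regs.STACK_PTR | Regs.PRIVILEGE,
    "os_on_interrupt_stack": Regs.SCRATCH,   # must NOT switch stacks
    "vm_exit": (Regs.SCRATCH | Regs.STACK_PTR | Regs.PRIVILEGE
                | Regs.MEM_MGMT | Regs.CONTROL),
    "untrusted_handler": WORST_CASE,
}


def groups_to_switch(event):
    """Return the names of the register groups switched for an event class."""
    mask = SWITCH_MASK[event]
    return [group.name for group in Regs if group in mask]
```

Adding a new event class later is then one more table entry, as long as the worst-case mask already exists.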

> Given that many entrances to the OS are serviced immediately and
> then return with little likelihood of needing to schedule a
> different task, one can apply lazy 'evaluation' of the save and
> restore process (possibly controlled by the control vector of the to
> and from save areas.) Given lazy 'evaluation' it simply makes sense
> for the HW to manage this function. If the need arises to finish
> the save/restore job so another context switch can be performed, HW
> (or microcode) is in the position to know what registers are
> affected, where they go, and how to get them there.
>
> Mitch

Now we have an intellectual framework to evaluate interrupt, exception,
and event architectures against.

Terje Mathisen

Jan 26, 2012, 2:44:01 AM1/26/12

to

Andy (Super) Glew wrote:
> On 1/25/2012 10:23 AM, MitchAlsup wrote:
> (By the way, I say again: when I say save / load, I am equally happy if
> you switch to an alternate control register set. Or if you switch to a
> mode, call it VMX, where if that mode bit is set various control
> registers are hardwired to certain fixed values (i.e. [[switch to
> alternate control register set in ROM]]). Perhaps it is best to say
> "register switch" rather than "register save/load" - although I spent
> much of my last post explaining that you need an alternate register set
> to switch to for every possible set of nested interruptions. Many bugs
> occur because people try to escape that simple fact.

This sounds a lot like the original Norsk Data ND10:

It had 16 hardwired interrupt/privilege levels (including the base),
with separate register sets for each level. This meant that it could
take any interrupt in a single cycle as long as 16 strictly priority
ordered processes matched what you were trying to do. The machine was
designed for process control afaik.

> And, I also say that if it is done in memory, you can painlessly add new
> events at almost any time. Whereas if you are using switched register
> (sub)sets, you will realize that an event that you thought was not
> nestable must be, and tie yourself in knots.)

And this is where the ND10 got in trouble, when it tried to run 10-100
users at the same (base) level. I presume it also would have problems
with multiple processes/drivers sharing any of the higher levels.
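
A toy Python model of the banked-register scheme as described above, and of why sharing one level hurts. The class and method names are invented, and this is a sketch of the idea only, not of actual ND10 hardware.

```python
class BankedCpu:
    """Toy model: one register bank per priority level."""

    LEVELS = 16

    def __init__(self, regs_per_bank=8):
        self.banks = [[0] * regs_per_bank for _ in range(self.LEVELS)]
        self.active = {0}          # level 0 (base) is always runnable
        self.words_copied = 0      # memory traffic caused by sharing a level

    @property
    def level(self):
        return max(self.active)    # the highest active level runs

    @property
    def regs(self):
        return self.banks[self.level]

    def raise_interrupt(self, level):
        self.active.add(level)     # switching banks copies no registers

    def finish(self):
        if self.level != 0:
            self.active.discard(self.level)

    def switch_user(self, outgoing_save, incoming_save):
        """Two users sharing the base level have to go through memory."""
        bank = self.banks[0]
        outgoing_save[:] = bank    # store the outgoing user's registers
        bank[:] = incoming_save    # load the incoming user's registers
        self.words_copied += 2 * len(bank)
```

Taking an interrupt to a higher level costs nothing here, while every switch between users at the base level copies the whole bank out and back in.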

>
> An event that can be handled by the same thread, on the same thread
> stack, doesn't need memory management registers to be switched. Heck, it
> doesn't need the stack pointer to be switched.
>
> An event that is delivered to a user process that switches to the OS
> must necessarily change privilege. But it may not need to change memory
> management, since many (but not all) OSes live in all user process
> memory maps. And it must be possible to switch stacks, although that
> can be done by staging through locore.
>
> An event that is delivered to OS code that is already running on the OS
> interrupt stack must NOT switch stack pointers.
>
> I can easily imagine a mask that indicates what needs to be
> saved/loaded, or switched. Or a list of such registers. And if you don't

That sounds like the kind of reasoning that led other architectures to
load/store multiple opcodes with a bit mask indicating the registers to
be handled.

> like interpreting such a list, just hardwire the few special cases you
> want to support. But if you don't hardwire the worst case, realize that
> you may have to add it later.

I believe it is better to have just 2 or 3 fixed variants:

1) Save/restore everything: Use as the default for any context switch

2) Save/restore all the main stuff, excepting only things like huge
vector register sets.

3) Save/restore minimum: Say 5-8 integer registers, enough to service a
simple IO request.

For (2) and (3) you can discuss if you should also protect the unsaved
parts of the context with a dirty flag, so that you will get a trap if a
driver writer messes up and accesses/modifies something she shouldn't.

If you do so, then you also have the option of doing transparent lazy
save/restore.

In an ideal world you would be able to mark a process with the resources
it has been observed to use, so that it can start by using fast context
switches, then be converted to full if needed.
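
A minimal Python sketch of the three variants plus the trap-on-unsaved-state idea, assuming made-up group names ("minimal", "main", "vector") and a software model in place of a real dirty flag:

```python
GROUPS = ("minimal", "main", "vector")

# Variant numbers follow the list above: 1 saves everything, 3 the minimum.
VARIANT_SAVES = {
    3: {"minimal"},
    2: {"minimal", "main"},
    1: {"minimal", "main", "vector"},
}


class Context:
    def __init__(self):
        self.live = {g: f"{g}-state-of-interruptee" for g in GROUPS}
        self.saved = {}
        self.traps = 0

    def enter(self, variant):
        """Save only the groups that the chosen variant covers."""
        for group in VARIANT_SAVES[variant]:
            self.saved[group] = self.live[group]

    def touch(self, group, new_value):
        """Handler writes a group; unsaved groups trap and are saved lazily."""
        if group not in self.saved:
            self.traps += 1
            self.saved[group] = self.live[group]
        self.live[group] = new_value

    def leave(self):
        """Restore whatever was saved, eagerly or lazily."""
        self.live.update(self.saved)
        self.saved = {}


def smallest_variant(observed_groups):
    """Pick the cheapest variant covering what a process was seen to use."""
    for variant in (3, 2, 1):
        if observed_groups <= VARIANT_SAVES[variant]:
            return variant
    return 1
```

A handler entered with variant 3 that touches the vector state takes one trap, and the interruptee still gets its vector state back on exit.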

>> Given that many entrances to the OS are serviced immediately and
>> then return with little likelihood of needing to schedule a
>> different task, one can apply lazy 'evaluation' of the save and
>> restore process (possibly controlled by the control vector of the to
>> and from save areas.)

And here it seems like Mitch wants the same. :-)

>> Given lazy 'evaluation' it simply makes sense
>> for the HW to manage this function. If the need arises to finish
>> the save/restore job so another context switch can be performed, HW
>> (or microcode) is in the position to know what registers are
>> affected, where they go, and how to get them there.

Indeed, this does seem like the best (or at least pretty good) approach.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Jan 26, 2012, 12:07:56 PM1/26/12

to an...@spam.comp-arch.net

On Wednesday, January 25, 2012 9:18:08 PM UTC-6, Andy (Super) Glew wrote:
> I did Real Time UNIX on this machine. I don't recall using the
> task-to-task interrupt architecture you describe, but the I/O
> interrupt architecture was very similar. (Did you do that, Mitch?)
> And it was very pleasant to work with.

The I/O interrupt architecture was already existent. What I did was
provide access to that infrastructure via the SVC instruction. Thus, I
was merely standing on the shoulders of others, here.

Mitch

Tim McCaffrey

Jan 26, 2012, 12:10:22 PM1/26/12

to

For simplicity, I liked the 68000 (switch to Supervisor stack pointer), I
forget where it pushed the current IP (user or supervisor stack). Nested
interrupts, etc., are handled fine, just make sure the supervisor stack has
the room :).

The ARM isn't too bad, although for what I was using it for I found the fast
interrupts fairly useless, but I can see scenarios where they might work out.
Nesting of interrupts is not encouraged (I recall that it was painful, but
I don't recall the specifics).

And the 8086 was ok for an embedded part, but what were the 286 architects
thinking?

- Tim

Paul A. Clayton

Jan 26, 2012, 1:10:17 PM1/26/12

to

On Jan 25, 5:44 pm, Robert Wessel <robertwess...@yahoo.com> wrote:
> On Wed, 25 Jan 2012 07:51:24 -0800 (PST), "Paul A. Clayton"
> <paaronclay...@gmail.com> wrote:
[snip]

> >I admit that I do not understand the great benefit of
> >true (invisible except for performance) virtualization.
> >Allowing a guest to know that it is a guest does not
> >seem that problematic for most uses.

[snip]

> But the problem is exposing the VM by altering the ISA assumed by the
> OS. If the OS can be modified to run under the VM, you can certainly
> take some liberties with the ISA (IOW, the OS promises not to become
> confused by incorrect information exposed in certain places). But if
> the OS has not, or cannot, be modified, any alteration of the ISA can
> lead to issues if the OS uses them. Let's say a guest OS has some
> function that can be invoked by code at different privilege levels,
> but that it needs to behave differently based on which. The OS could
> keep track of what mode it's in independently (which would not cause
> problems), or it could check the current privilege level by looking at
> (on x86) the flags register. If the IOPL is "funny" because the VM is
> using one of the privilege levels to run itself, the OS will likely
> become confused and do something wrong.

OK, that makes sense. I was thinking of a particular example
from Itanium where the page table hash function was
unprivileged. I suppose this could confuse an OS if the
actual page size was different than the page size that the
OS expected (had previously set). Not generating an
exception when non-privileged code uses this instruction
might make fixing the OS's view of reality impractically
difficult if the actual page size was different than the
one used by the OS.

True virtualization presumably does make the system
simpler (easier to understand and validate), but some
information leaks (like the Itanium hash instruction)
seem rather harmless (at least in most cases?).

I can certainly understand the temptation to ignore
writes to bits that would cause privilege exceptions
(though this is more obviously something to be avoided),
and allowing non-privileged reading of system information
like a time counter or even the base page size seems
more reasonable (so in my ignorance, I might define an
architecture to allow such).

> Now some *other* method of
> letting the guest know is no problem, and that can even be architected
> into the ISA (S/370, for example the first byte of the value stored by
> "Store CPU ID" is defined to be 0xff under VM). The problem is not
> knowing what the OS might do if it's lied to. Put another way, if you
> alter the ISA, the OS will suddenly be running on some ISA that it was
> not written for. Bad results follow.

I suppose it is possible that an OS might become
confused if the time counter jumped 100ms between "10ms"
interrupts.

Virtualization seems to have complexity similar to
cache coherence and consistency, so avoiding additional
complexity which does not have a large benefit elsewhere
seems reasonable.

Joe keane

Jan 26, 2012, 3:05:39 PM1/26/12

to

In article <jfkjf9$eq8$1...@USTR-NEWS.TR.UNISYS.COM>,
Tim McCaffrey <timca...@aol.com> wrote:
>(and the OS wants to use that zero-page as well, and since the I/O is
>memory mapped it wants to be located there as well....)

The 6510 had two locations to do I/O.

IIRC it was for the joystick and to unmap the ROMs and video/sound
chips.

Robert Wessel

Jan 26, 2012, 7:14:57 PM1/26/12

to

Therein lies the rub. How do you know it's going to be harmless?

>I can certainly understand the temptation to ignore
>writes to bits that would cause privilege exceptions
>(though this is more obviously something to be avoided),
>and allowing non-privileged reading of system information
>like a time counter or even the base page size seems
>more reasonable (so in my ignorance, I might define an
>architecture to allow such).

For performance reasons, you often want to allow such, but it does
cause problems. Which is why many ISAs, even those that meet the
Popek and Goldberg requirements, have grown virtualization "modes" of
some sort.

>> Now some *other* method of
>> letting the guest know is no problem, and that can even be architected
>> into the ISA (S/370, for example the first byte of the value stored by
>> "Store CPU ID" is defined to be 0xff under VM). The problem is not
>> knowing what the OS might do if it's lied to. Put another way, if you
>> alter the ISA, the OS will suddenly be running on some ISA that it was
>> not written for. Bad results follow.
>
>I suppose it is possible that an OS might become
>confused if the time counter jumped 100ms between "10ms"
>interrupts.
>
>Virtualization seems to have complexity similar to
>cache coherence and consistency, so avoiding additional
>complexity which does not have a large benefit elsewhere
>seems reasonable.

Although the goal in many cases remains running an unmodified OS as a
guest. That severely limits the assumptions you can make.

Andy (Super) Glew

Jan 26, 2012, 7:53:17 PM1/26/12

to

On 1/25/2012 7:51 AM, Paul A. Clayton wrote:
> On Jan 24, 10:59 pm, "Andy (Super) Glew"<a...@SPAM.comp-arch.net>
> wrote:
> [snip]
>> (b) the fact that an interrupt has occurred,
>> the fact that the user state is saved on the user stack, might be
>> visible to other threads in the same user process. I.e. it is not
>> properly virtualized.
>

> I admit that I do not understand the great benefit of
> true (invisible except for performance) virtualization.

Paravirtualization is fine for many uses. E.g. for webhosting, cloud
computing.

But if you are using virtualization to do something like debug a guest
system [*] that was exhibiting a bug on real hardware, well, you would
like the behavior to be as similar as possible so that you can get the
same bugs. Performance differences are hard to avoid.

Similarly for testing a guest system in a VM.

Similarly, if you want to run a guest OS which is only prepared to run
on real hardware... (BTW, I have a Windows 9 tablet PC, and I would
like to run an instance of Windows 9 in a VM on this tablet, i.e. Win 9
on top of Win 9, for intellectual property reasons. Does my license apply?)

I'm a big advocate of hardware and system vendors using dedicated VM
layers to provide services, fix bugs, etc. Let's say that Intel has a
bug in a chipset - say that they have flipped the polarity of a bit in a
control register. Fix it by trapping on accesses to that control
register. Trap to a private VM layer, not a hypervisor that you have to
share with other vendors... It would be nice if you didn't break
compatibility.

(How many such layers do you need? Well, there's
0) the CPU vendor
1) the chipset vendor - which may be a different organization in
same company
2) the OEM, like Dell, or HP
3) the "real" Hypervisor, like VMware
4) the OS
5) the app software
6) and did I mention I want to run virtual machines in user mode,
without having to install OS and VMM and hypervisor software
?!)

Security monitors may not monitor 100% of the time - they may only
occasionally, periodically, when suspicious, slip a VM under the guest
system to look for malware behaving badly. If the malware can detect,
via a virtualization hole, that the good guys are looking at it...

Similarly for the VMs in honeypots. (BTW, while I think it is great to
run potential malware in a VM to try to detect bad stuff, (1) I would
not assume that the VM is perfect, so I would also run it on an
appropriately isolated net, and (2) I would also not assume that the
malware can't detect that it is being run virtualized, so I would have
some bare hardware systems with hardware/network isolation and monitoring.)

Conversely, malware with a VM based rootkit probably doesn't want
security software to detect that it is there.

This raises an important issue: how can you protect against VM based
rootkits, if perfect virtualization is possible? (And perfect
virtualization is possible, if only in software emulation.)

One approach is to provide a carefully designed virtualization hole,
involving the TPM (Trusted Platform Module).

Let us assume that the TPM is physically secure. (Which is not true in
all cases, but might be true if you want to protect a machine you own
against malware downloaded from the net.)

You've booted. Now, how do you prove to your credit card company website
that it is you, and not a corrupt you?

Well, the TPM can measure you. You can send the measurements to
external sites. To prevent replay attacks, the external sites may want
to converse with the TPM, challenge response, with your possibly corrupt
PC in the middle.
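
To make the replay point concrete, here is a toy Python sketch of a nonce-based challenge/response. It is only an illustration under loud assumptions: the `ToyTpm` class is invented, a running SHA-256 stands in for the measurement, and a shared-key HMAC stands in for the signature that a real TPM would normally produce with an asymmetric key.

```python
import hashlib
import hmac


class ToyTpm:
    """Invented stand-in for a TPM; not a real TPM interface."""

    def __init__(self, key):
        self._key = key                       # never leaves the "chip"
        self._measurement = hashlib.sha256()

    def measure(self, blob):
        """Fold a booted component into the running measurement."""
        self._measurement.update(blob)

    def quote(self, nonce):
        """Answer a challenge: bind the measurement to the fresh nonce."""
        digest = self._measurement.digest()
        tag = hmac.new(self._key, nonce + digest, hashlib.sha256).digest()
        return digest, tag


def verify(key, nonce, digest, tag, expected_digest):
    """Remote site: accept only a known-good measurement for *this* nonce."""
    good_tag = hmac.new(key, nonce + digest, hashlib.sha256).digest()
    return (hmac.compare_digest(good_tag, tag)
            and hmac.compare_digest(digest, expected_digest))
```

Because the tag covers the nonce, a possibly corrupt PC in the middle can relay the exchange but cannot reuse an old answer for a new challenge. In practice the verifier would draw each nonce from a random source such as `secrets.token_bytes`.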

OK, now, how can you now trust your system, even if you have already
booted and were possibly corrupt? (Many systems are only secure if the
secure stuff is the very first thing to run. Nice if you can get it, but
vulnerable.) Late secure boot. Provide an operation that signals to an
"external" physically secure TPM (by external I mean outside the CPU,
although possibly on the same chip). Tell the secure module on the TPM
to do something the CPU can't do. Verify externally? And/or verify
against records. Provide some sort of secure I/O.

Andy (Super) Glew

Jan 26, 2012, 7:59:58 PM1/26/12

to

On 1/25/2012 11:44 PM, Terje Mathisen wrote:

> Andy (Super) Glew wrote:

>> I can easily imagine a mask that indicates what needs to be
>> saved/loaded, or switched. Or a list of such registers. And if you don't
>
> That sounds like the kind of reasoning that led other architectures to
> load/store multiple opcodes with a bit mask indicating the registers to
> be handled.
>
>> like interpreting such a list, just hardwire the few special cases you
>> want to support. But if you don;t hardwire the worst case, realize that
>> you may have to add it later.
>
> I believe it is better to have just 2 or 3 fixed variants:
>
> 1) Save/restore everything: Use as the default for any context switch
>
> 2) Save/restore all the main stuff, excepting only things like huge
> vector register sets.
>
> 3) Save/restore minimum: Say 5-8 integer registers, enough to service a
> simple IO request.

I'm fine by this.

But let me point out that save / load everything is *EVERYTHING*.

I.e. if you add a virtual machine mode that can take interrupts, you add
more.

People run into troubles when they have 1), 2), 3), and then add more
stuff. But don't want to change the definition of the existing "save
everything" interrupt. They tie themselves into knots trying to
persuade themselves that they really don't need to save everything.

BTW, save everything doesn't mean that hardware/firmware has to do it.

It just has to be possible to write a software routine, which is
appropriately isolated, to accomplish all of the above.

And if you create a special mode for that software routine, well, you
will have to save that within 5 years as well.

Andy (Super) Glew

Jan 26, 2012, 10:58:06 PM1/26/12

to

On 1/26/2012 4:53 PM, Andy (Super) Glew wrote:

> Paravirtualization is fine for many uses. E.g. for webhosting, cloud
> computing.
>
> But if you are using virtualization to do something like debug a guest
> system [*] that was exhibiting a bug on real hardware

Note that I do not say "a guest OS".

I mean a guest system - possibly one that is itself running a VMM or
hypervisor.

Yes: you may need to slide a new virtual machine layer underneath the
one that is already there.

Paul A. Clayton

Jan 26, 2012, 11:50:34 PM1/26/12

to

On Jan 26, 7:53 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]

> (How many such layers do you need? Well, there's
> 0) the CPU vendor
> 1) the chipset vendor - which may be a different organization in
> same company
> 2) the OEM, like Dell, or HP
> 3) the "real" Hypervisor, like VMware
> 4) the OS
> 5) the app software
> 6) and did I mention I want to run virtual machines in user mode,
> without having to install OS and VMM and hypervisor software
> ?!)

I am more inclined toward Mike Haertel's "Any performance
problem can be solved by removing a level of indirection"
than David Wheeler's "All problems in computer science
can be solved by another level of indirection", but I
have never had to solve a bug problem at any level and
live in a dream world where cooperation and communication
(and perfection!) are the norm. :-)

> Security monitors may not monitor 100% of the time - they may only
> occasionally, periodically, when suspicious, slip a VM under the guest
> system to look for malware behaving badly. If the malware can detect,
> via a virtualization hole, that the good guys are looking at it...

Yeah, I was thinking of that when I wrote "(For some
security-oriented uses, secrecy might be useful.)"

> Similarly for the VMs in honeypots. (BTW, while I think it is great to
> run potential malware in a VM to try to detect bad stuff, (1) I would
> not assume that the VM is perfect, so I would also run it on an
> appropriately isolated net, and (2) I would also not assume that the
> malware can't detect that it is being run virtualized, so I would have
> some bare hardware systems with hardware/network isolation and monitoring.)

Well, that is just (the often neglected) principle of
layered security.

> Conversely, malware with a VM based rootkit probably doesn't want
> security software to detect that it is there.

And this is especially scary with firmware-implanted
malware. Even putting in a new hard drive will not
remove the infection. It also seems that some
vendors are not paranoid about avoiding software
corruption (ISTR hearing about a case where a
software CD contained malware.).

Terje Mathisen

Jan 27, 2012, 2:16:12 AM1/27/12

to

Andy (Super) Glew wrote:
>> I believe it is better to have just 2 or 3 fixed variants:
>>
>> 1) Save/restore everything: Use as the default for any context switch
>>
>> 2) Save/restore all the main stuff, excepting only things like huge
>> vector register sets.
>>
>> 3) Save/restore minimum: Say 5-8 integer registers, enough to service a
>> simple IO request.
>
> I'm fine by this.
>
> But let me point out that save / load everything is *EVERYTHING*.

I do know.

You have personally told me several stories about how hard it was to get
an unnamed OS vendor to accept a new (variable) save/restore context
instruction, where the OS could query the HW on startup to determine how
much space *EVERYTHING* would require.

>
> I.e. if you add a virtual machine mode that can take interrupts, you add
> more.
>
> People run into troubles when they have 1), 2), 3), and then add more
> stuff. But don't want to change the definition of the existing "save
> everything" interrupt. They tie themselves into knots trying to persuade
> themselves that they really don't need to save everything.
>

Yes indeed, and I believe said OS vendor punted by setting aside 2x-3x
the currently required size, just so the save area size could be
compiled in as a constant. (I assume they added a startup check that the
size was large enough!)
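
A small Python sketch of that startup check, with invented feature names and byte counts (they are not real hardware numbers), and a plain function standing in for the hardware query:

```python
# Hypothetical per-feature save sizes in bytes; not real hardware numbers.
FEATURE_BYTES = {"integer": 128, "fp_simd": 512, "wide_vector": 1024}

# What the OS reserved at build time: 2x the 640 bytes needed when it shipped.
COMPILED_IN_SAVE_AREA = 1280


def required_save_bytes(enabled_features):
    """Stand-in for asking the hardware how big "everything" is right now."""
    return sum(FEATURE_BYTES[name] for name in enabled_features)


def check_at_startup(enabled_features, reserved=COMPILED_IN_SAVE_AREA):
    """Refuse to boot rather than overrun a fixed-size save area."""
    needed = required_save_bytes(enabled_features)
    if needed > reserved:
        raise RuntimeError(
            f"save area too small: need {needed}, have {reserved}")
    return needed
```

The 2x headroom holds until a single new feature is bigger than everything that came before it, at which point only the startup check stands between the OS and a silent overrun.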

> BTW, save everything doesn't mean that hardware/firmware has to do it.
>
> It just has to be possible to write a software routine, which is
> appropriately isolated, to accomplish all of the above.
>
> And if you create a special mode for that software routine, well, you
> will have to save that within 5 years as well.
>

If you want to change the context size without having any OS support,
you need a (potentially) variable, potentially microcoded, HW
instruction to do it.

(Having an OS-invisible VM level which runs the routine is morally
equivalent to sw loadable microcode, right?)

EricP

Jan 27, 2012, 12:58:18 PM1/27/12

to

MitchAlsup wrote:
>
> Given that many entrances to the OS are serviced immediately and then return with little likelihood of needing to schedule a different task, one can apply lazy 'evaluation' of the save and restore process (possibly controlled by the control vector of the to and from save areas.) Given lazy 'evaluation' it simply makes sense for the HW to manage this function. If the need arises to finish the save/restore job so another context switch can be performed, HW (or microcode) is in the position to know what registers are affected, where they go, and how to get them there.
>
> Mitch

While I like the lazy-load/optimized-store idea for task switching
I don't think it would be any help for interrupts.

These days most OS interrupt handling and driver code would be
written in C, and a particular compiler on an architecture will
specify in its ABI which general registers are destroyed by a call.

There is a thin layer of assembler that interfaces between
the hardware and the C handler code, and it need only save
the few registers not preserved by the ABI.
e.g. for MS on x86: EAX, ECX, EDX
on x64: RAX, RCX, RDX, R8, R9, R10, R11

(On WinNT, floating point is mostly forbidden in drivers,
or the state must be save/restored manually.
Partly this is to avoid saving float registers,
partly to avoid dealing with potential exceptions.)

Since all other general registers are preserved by the calling
standard, doing a SaveAll for every interrupt is unnecessary,
so an optimized context save/restore won't help much.

For x86/x64 this is a minor difference.
However for something like an IA64 or some of the large
register set RISC cpus, not doing a SaveAll could be significant
(depends on the ABI).
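
To put numbers on that, a short Python sketch using the volatile-register lists quoted above. The full general-register lists are the standard x86 and x64 sets; the helper name is made up.

```python
# Registers the thin assembler stub must save (destroyed by a C call).
VOLATILE = {
    "x86": ["EAX", "ECX", "EDX"],
    "x64": ["RAX", "RCX", "RDX", "R8", "R9", "R10", "R11"],
}

# All general-purpose registers a SaveAll would write out.
ALL_GPRS = {
    "x86": ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"],
    "x64": ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]
           + [f"R{i}" for i in range(8, 16)],
}


def stores_avoided(arch):
    """Register stores skipped per interrupt by not doing a SaveAll."""
    return len(ALL_GPRS[arch]) - len(VOLATILE[arch])
```

That is 5 stores avoided on x86 and 9 on x64, which is why the gap only starts to matter on machines with much larger register files.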

Also for this lazy HW to be worth the trouble vs. just executing a
sequence of normal instructions, it seems to me that the load/store
queue, rather than moving 8 byte values, would have to move whole
64 byte cache lines around. That may make it pretty expensive.
If you bypass the load/store queue then coherency issues arise,
and doing a queue flush to sync everything might cancel out your
benefits.

Eric

MitchAlsup

Jan 27, 2012, 3:02:16 PM1/27/12

to

On Friday, January 27, 2012 11:58:18 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> >
> > Given that many entrances to the OS are serviced immediately and then return with little likelihood of needing to schedule a different task, one can apply lazy 'evaluation' of the save and restore process (possibly controlled by the control vector of the to and from save areas.) Given lazy 'evaluation' it simply makes sense for the HW to manage this function. If the need arises to finish the save/restore job so another context switch can be performed, HW (or microcode) is in the position to know what registers are affected, where they go, and how to get them there.
> >
> > Mitch
>
> While I like the lazy-load/optimized-store idea for task switching
> I don't think it would be any help for interrupts.
>
> These days most OS interrupt handling and driver code would be
> written in C, and a particular compiler on an architecture will
> specify in its ABI which general registers are destroyed by a call.

Still, the handler requires a few registers and a stack pointer.

But this is a good point.

> Also for this lazy HW to be worth the trouble vs. just executing a
> sequence of normal instructions, it seems to me that the load/store
> queue, rather than moving 8 byte values, would have to move whole
> 64 byte cache lines around. That may make it pretty expensive.
> If you bypass the load/store queue then coherency issues arise,
> and doing a queue flush to sync everything might cancel out your
> benefits.

Here, you need to make a clean distinction between the queues that manage loads and stores and the queues that manage lines moving back and forth in the memory hierarchy. In Opteron, there are 22 queue entries to manage loads and stores and 8 different entries to manage line traffic. They are different structures and organized differently.

The HW could easily bypass the LS queueing and use the Miss buffer directly. In addition, the Opteron queues are designed so that when the new data arrives, it pushes out the old (write back data) so there is only one real buffer that manages 2 different lines. No bypassing of cache coherence required.

Mitch

Tim McCaffrey

Jan 27, 2012, 4:46:47 PM1/27/12

to

In article <tiv9v8-...@ntp6.tmsw.no>, "terje.mathisenattmsw.no" says...

>
>Andy (Super) Glew wrote:
>>> I believe it is better to have just 2 or 3 fixed variants:
>>>
>>> 1) Save/restore everything: Use as the default for any context switch
>>>
>>> 2) Save/restore all the main stuff, excepting only things like huge
>>> vector register sets.
>>>
>>> 3) Save/restore minimum: Say 5-8 integer registers, enough to service a
>>> simple IO request.
>>
>> I'm fine by this.
>>
>> But let me point out that save / load everything is *EVERYTHING*.
>
>I do know.
>
>You have personally told me several stories about how hard it was to get
>an unnamed OS vendor to accept a new (variable) save/restore context
>instruction, where the OS could query the HW on startup to determine how
>much space *EVERYTHING* would require.
>>
>> I.e. if you add a virtual machine mode that can take interrupts, you add
>> more.
>>
>> People run into troubles when they have 1), 2), 3), and then add more
>> stuff. But don't want to change the definition of the existing "save
>> everything" interrupt. They tie themselves into knots trying to persuade
>> themselves that they really don't need to save everything.
>>
>
>Yes indeed, and I believe said OS vendor punted by setting aside 2x-3x
>the currently required size, just so the save area size could be
>compiled in as a constant. (I assume they added a startup check that the
>size was large enough!)
>

As stupid as this sounds, I can easily envision an OS design that can
cause this situation. Probably fixing it requires some very low level
changes that impact everything since boot-up (where dynamic memory management
is not available, as yet).

That doesn't mean it isn't a good idea. It would also have been a good idea
if Intel had implemented it 25 years ago when they designed the 386. CPUID,
for instance, didn't show up until the 3rd set of ISA changes (Pentium), and
programmers were writing code to test if they were on an 8086 or 8088 back
in the early 80s. How many times did Intel have to be hit with the
clue-by-4?

- Tim

Tim McCaffrey

Jan 27, 2012, 4:55:13 PM1/27/12

to

In article <pCBUq.10323$Sh7....@newsfe15.iad>,
ThatWould...@thevillage.com says...

It is fairly common after an ISR executes that you need to do a task switch.
So, instead of:

Save REGS
Do ISR (Probably saves some regs redundantly)
Task switch to new thing (load regs, switch pages, etc)

You have:
Save some REGS
Do ISR (Save some more regs) (Restore the regs)
Save the rest of the REGS
Task switch (load all the regs)

With the ABI imposed saving, I guess you might save a little time, although
it would be interesting to actually time it with today's out-of-order
load/store pipelines to see which actually is faster.
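
A back-of-envelope count of register loads and stores for the two sequences, as a Python sketch. The model is an assumption made for illustration: n registers in total, k saved on entry in the partial scheme, r saved and restored by the ISR itself, and every save or load counted as one memory operation.

```python
def eager_ops(n, r):
    """Save all n up front; the ISR still saves/restores r of its own."""
    ops = n          # save every register on entry
    ops += 2 * r     # ISR's redundant save and restore
    ops += n         # load n registers: the new task's, or the old ones back
    return ops


def partial_ops(n, k, r, task_switch):
    """Save k on entry; save the remaining n - k only if a switch happens."""
    ops = k + 2 * r
    if task_switch:
        ops += (n - k) + n   # finish the save, then load the new task
    else:
        ops += k             # just restore the k that were saved
    return ops
```

Under these assumptions the two schemes cost the same when a task switch follows (2n + 2r either way), and the partial scheme only wins on the interrupts that return to the same task. Timing on real out-of-order hardware could of course look different from an operation count.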

- Tim

Joe keane

Jan 27, 2012, 6:36:10 PM1/27/12

to

In article <4F21F70E...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:

>But let me point out that save / load everything is *EVERYTHING*.

It doesn't have to be.

If your handler doesn't touch certain state, you don't need to save it.

Probably best is a small integer parameter.

SAVE 0
;; save a few registers
SAVE 1
;; save all general registers
SAVE 2
;; also save floating-point
SAVE 3
;; save the BWY registers
SAVE 4
;; this doesn't exist yet

Andy (Super) Glew

Jan 28, 2012, 10:29:37 AM1/28/12

to

You may have missed the whole point.

Yes, if your handler doesn't touch the whole state. Those were the more
efficient layers we were talking about.

But, worst case, if the interrupt handler and the interruptee doesn't
trust each other, then you have to ACT AS IF the whole interruptee state
has been saved somewhere inaccessible to the interrupt handler, and ACT
AS IF the interrupt handler has a complete fresh state absolutely free
of any taint of the interruptee.

You can do this by saving and loading state from separate memory buffers.

Or you can do it by switching register sets.

Or having a special mode.

What you describe is the special case, the most common case, that we
have said "Yeah, do that, of course."

nm...@cam.ac.uk

Jan 28, 2012, 10:39:58 AM

In article <4F241461...@SPAM.comp-arch.net>,

So far, so good :-)

>What you describe is the special case, the most common case, that we
>have said "Yeah, do that, of course."

Ugh. That's only if you regard benchmarketing as more important
than RAS. That trick has caused large numbers of serious RAS and
security problems over the decades, on every system I know of
that has used it, from the System/360 to the Itanic.

You may still do it if you regard RAS as more important, but there
is no "of course" about it, and you use program proving to ensure
that your logic really is bulletproof.

I don't think that we are disagreeing - I am merely being more
pedantic!

Regards,
Nick Maclaren.

Andy (Super) Glew

Jan 28, 2012, 10:58:05 AM

On 1/27/2012 1:55 PM, Tim McCaffrey wrote:
> In article<pCBUq.10323$Sh7....@newsfe15.iad>,
> ThatWould...@thevillage.com says...
>>
>> MitchAlsup wrote:
>>>

>>> lazy 'evaluation' of the save and restore process
>

> It is fairly common after an ISR executes that you need to do a task switch.
> So, instead of:
>
> Save REGS
> Do ISR (Probably saves some regs redundantly)
> Task switch to new thing (load regs, switch pages, etc)
>
> You have:
> Save some REGS
> Do ISR (Save some more regs) (Restore the regs)
> Save the rest of the REGS
> Task switch (load all the regs)
>
> With the ABI imposed saving, I guess you might save a little time, although
> it would be interesting to actually time it with today's out-of-order
> load/store pipelines to see which actually is faster.
>
> - Tim

By the way, you can also make the laziness cross the task switch boundary:

...user code...
Interrupt...
HW saves some regs, marks others for lazy save
ISR saves some regs
... run for a while
... Task switch
... use of one of the lazy regs ... save some more

To do this properly, of course, you have to ensure that the lazy saving
of the rest of the registers can still be done after the task switch.
E.g. if done by hardware, the save area better not be to a virtual
address that is remapped by the task switch.

Similarly, tasks may be swapped in and out, not by remapping, but by
copying or DMA I/O. You must ensure that the lazy flush is performed
before a task is swapped to disk, or else you may be doing the lazy save
to some other task's memory. If the hardware cannot detect I/Os...
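
The swap-out hazard can be sketched as a toy model. Everything here (the Task and LazySaver classes, the resident flag, the swap_out helper) is hypothetical and only illustrates the ordering constraint: flush the deferred save before the task's memory is given away.

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.save_area = None   # filled in when the deferred save is flushed
        self.resident = True    # False once the task's memory has been swapped out


class LazySaver:
    """Holds register values whose save to memory has been deferred."""

    def __init__(self):
        self.pending_task = None
        self.pending_regs = None

    def defer(self, task, regs):
        self.pending_task = task
        self.pending_regs = list(regs)

    def flush(self):
        if self.pending_task is None:
            return
        if not self.pending_task.resident:
            # In real hardware this would silently scribble on someone else's memory.
            raise RuntimeError("lazy save target is no longer owned by the task")
        self.pending_task.save_area = self.pending_regs
        self.pending_task = None
        self.pending_regs = None


def swap_out(task, saver):
    """Swap a task out, flushing any deferred save that targets it first."""
    if saver.pending_task is task:
        saver.flush()
    task.resident = False


if __name__ == "__main__":
    t, saver = Task("A"), LazySaver()
    saver.defer(t, [1, 2, 3])
    swap_out(t, saver)
    print(t.save_area, t.resident)
```

The model raises an error where real hardware would corrupt memory, which makes the required ordering easy to see.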

The lazy save can also be done by OS software, e.g. by causing a trap if
the lazy state is accessed. But the OS software that handles the lazy
save trap must do the right thing.
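
A minimal sketch of trap-based lazy save, loosely modelled on the x87 TS-bit scheme mentioned in this thread. The LazyFpu class and its fields are assumptions for illustration, not any real OS's code: a task switch only sets a flag, and the registers are saved and reloaded the first time a different task actually touches them.

```python
class LazyFpu:
    def __init__(self, n_regs=8):
        self.n_regs = n_regs
        self.regs = [0.0] * n_regs   # live register file
        self.owner = None            # task whose values are currently in regs
        self.current = None          # task currently running
        self.ts = False              # "task switched" flag: next access traps
        self.save_area = {}          # per-task saved register contents
        self.saves = 0               # how many real saves were performed

    def task_switch(self, task):
        # Nothing is saved here; the save is deferred until a register is touched.
        self.current = task
        self.ts = True

    def read(self, index):
        if self.ts:
            self._lazy_save_trap()
        return self.regs[index]

    def write(self, index, value):
        if self.ts:
            self._lazy_save_trap()
        self.regs[index] = value

    def _lazy_save_trap(self):
        if self.owner != self.current:
            if self.owner is not None:
                self.save_area[self.owner] = list(self.regs)
                self.saves += 1
            fresh = [0.0] * self.n_regs
            self.regs = list(self.save_area.get(self.current, fresh))
            self.owner = self.current
        self.ts = False


if __name__ == "__main__":
    fpu = LazyFpu()
    fpu.task_switch("A")
    fpu.write(0, 1.5)
    fpu.task_switch("B")        # B never touches the registers
    fpu.task_switch("A")
    print(fpu.read(0), fpu.saves)
```

In the usage above, task B never touches the registers, so switching to B and back to A costs no save at all. That is the payoff that makes the scheme attractive.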

There was a bug related to this with Intel MMX, particularly related to
the EMMS instruction. People wanted to do lazy save of the MMX state,
using the existing OS trap for lazy save of the x87 state. But they
also wanted to make it clear that MMX was integer code, not really FP code.

Can you see where this went?

Interrupts when MMX state was in use set the TS bit (task switched,
triggers lazy save). Even when the x87 FPU was disabled.

The OS was set up so that FP tasks knew about lazy save, but non-FP
tasks did not. The OS assumed that FP state was saved before switching
from an FP task to a non-FP task.

But now non-FP tasks were getting the TS lazy save trap.

Since they were not expecting it, they crashed. If you were lucky.

The really sad thing was that the trap occurred in non-MMX tasks,
potentially much later, potentially several task switches away. So it
took a long time to figure out that MMX was to blame.
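
The delayed failure can be illustrated with a toy schedule. This is an abstraction of the story above, not a reconstruction of the real MMX/OS interaction. The event names, the task names, and the handles_trap table are all hypothetical.

```python
def first_unhandled_trap(schedule, handles_trap):
    """Walk a schedule of (task, event) pairs and report the first unhandled trap.

    Events:
      "interrupt": lazy state is in use, so the pending lazy-save flag gets set
      "touch":     the task executes something that trips the flag if it is set
      "run":       the task runs without tripping the flag
    Returns (step, task) for the first task that traps without a handler,
    or None if every trap is handled.
    """
    pending = False
    for step, (task, event) in enumerate(schedule):
        if event == "interrupt":
            pending = True
        elif event == "touch" and pending:
            if not handles_trap[task]:
                return (step, task)
            pending = False   # a handler performed the deferred save
    return None


if __name__ == "__main__":
    schedule = [
        ("mmx_task", "interrupt"),   # the cause
        ("other1", "run"),
        ("other2", "run"),
        ("legacy", "touch"),         # the symptom, three steps later
    ]
    handles = {"mmx_task": True, "other1": False, "other2": False, "legacy": False}
    print(first_unhandled_trap(schedule, handles))
```

The task that crashes is not the task that set the flag, and any number of switches can sit between the two.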

Moral: if you are going to do a lazy save, you have to make sure that
the entire food chain is ready to handle the lazy save. If lazy save by
trapping, you have to make sure that the lazy save trap can be handled
everywhere you go to. Don't just assume that, because the OS already
handles the lazy save trap in some places, it is handled in all
places. Similarly, if lazy save in hardware/microcode, make sure that
you can still do it everywhere and everywhen, and flush before it is not
correct to do so.

I remember the MMX / EMMS discussion because I spent a long time arguing
against the optimization that led to the bug. I was all in favor of lazy
save, but I wanted a fall back. All of my arguments were like "The OS
may be doing this", and "An OS doesn't need to have done that", but they
were all dismissed as hypothetical. People were confident that the OS
would be handling the TS lazy save trap in all circumstances. It was
not. Now, my hypothetical examples did not include exactly the case
that caused the bug, but they were close.

How should such cases be handled? I'd like to be able to do a theorem
proof, that an architectural extension is provably safe amongst all
possible, reasonably sane, OS implementations. In which case
hypotheticals are valid arguments.

Cost is related. Since computer companies usually do not pay anything
for the loss of user time caused by a bug, a comparatively small amount
of cost on the CPU vendor side outweighs a large amount of cost on the
customer side.