Delay between idea and implementation

Paul A. Clayton

unread,

Dec 23, 2011, 4:23:58 PM12/23/11

to

A recent post on the Real World Technologies (Moderated) forum
indicating that Intel's Ivy Bridge will use move elimination makes
me wonder how long it takes for an idea to reach implementation.
(The idea of handling moves during register rename--especially
moves preceding destructive operations, which are common in x86--
must be somewhat old by now.)

I am in the middle of reading "High-Bandwidth Address Translation
for Multiple-Issue Processors" (1996; Todd M. Austin, Gurindar S.
Sohi), which has some "obvious" (at least in hindsight) ideas for
reducing the number of TLB accesses--piggybacking (sharing TLB
accesses for the same virtual page), pretranslation (associating
a translation with the base pointer register), interleaving
(banking), and multi-level. Of these, I receive the impression
that only the multi-level idea has been implemented. (A power
use optimization similar to banking could be to have SP-relative
accesses be limited to a fraction of the TLB entries.)

Obviously banking would be the next simplest bandwidth enhancement,
but banking can produce variable timing, which might be excessively
problematic for an L1 TLB (and the L2 TLB is unlikely to require
as much bandwidth). Piggybacking and pretranslation would seem
likely to be much more complex but have some potential for a
reduction in power use. (Pretranslation could also be used in a
way that would cooperate with address compression in registers,
cache tags, and/or TLB tags. Even just pretranslating for the
stack pointer might be useful.)

(It is quite possible that such ideas have been implemented and
there existence is simply not documented, though some might be
somewhat detectable by software using such for reducing power
use would probably be effectively undetectable.)

As Andy Glew posted recently, avoiding complexity can prevent
clever ideas from being implemented (rightly--even if I do not
like it), but it seems that some of the above ideas are old
enough that even with due caution they should have been tried
by now.

(For high bandwidth address translation, the relatively low
number of cache accesses supported per cycle may make that
goal less useful. [In 1996, 16-wide fetch and issue may have
been considered a future possibility.] However, some of the
power-saving aspects could be helpful in modern designs, it
seems.)

Skewed associativity also seems to be a neglected idea
(though I think one Sun SPARC processor--possibly Rock, which
was never launched--used a skewed associative address
translation predictor and I read that a minor processor
design provider supplied a skewed-associative cache option),
and it also could be used to reduce power consumption. (Yes,
such could require AGU and/or array indexing integration for
L1, but even L2 caches do not support skewed associativity.)

(My own--relatively obvious--idea of having a large page
TLB that can hold large page PTEs or PDEs seems not to have
been implemented, and I do not _think_ there would be
substantial complexity barriers with such. Admittedly, such
would only apply to Architectures with hierarchical page
tables [though a similar technique could apply to linear
page tables], but such includes x86 and ARM.)

While some almost drop-in design changes might only have a
two-year to product delay, even an idea introduced during a
tock with implementation delayed to the next tock should
only have an eight year delay. (I almost receive the
impression that some ideas are tried but the project fails
and the project failure is tied to the idea. Even when an
idea contributes to a project failure--via excess complexity--
sometimes a large fraction of the complexity overhead was
handled in the failed project so that a second attempt
could be much easier. SMT at Intel came back relatively
quickly after the Pentium4 'failure', but other companies
implementing intra-core multithreading may have removed some
of the stench of failure. [Perhaps a future POWER processor
will implement runahead like POWER6--another interesting idea
in a 'failed' processor--, perhaps using a pseudo-SMT4 mode
in which the registers of one cluster are used as a checkpoint
while the other cluster is used for runahead.])

Joe keane

unread,

Dec 23, 2011, 6:20:38 PM12/23/11

to

In article <6dbf909d-8fbe-41cb...@n6g2000vbz.googlegroups.com>,

Paul A. Clayton <paaron...@gmail.com> wrote:
>A recent post on the Real World Technologies (Moderated) forum
>indicating that Intel's Ivy Bridge will use move elimination makes
>me wonder how long it takes for an idea to reach implementation.
>(The idea of handling moves during register rename--especially
>moves preceding destructive operations, which are common in x86--
>must be somewhat old by now.)

my thought on this is that it's nice if you have simple tricks

but if the simple tricks delay you from making the not-simple stuff
really competent then you actually lost

MitchAlsup

unread,

Dec 24, 2011, 1:29:48 PM12/24/11

to

Andy was at AMD when I was proposing a machine that was to do Move Eliminations (circa 2000). At that point in time, Athlon was already doing move elimination in the x87 floating point stack rename process.

Mitch

Paul A. Clayton

unread,

Dec 24, 2011, 2:41:57 PM12/24/11

to

I forgot about the x87 stack optimization. That probably had a
relatively big benefit per cost since almost every operation
uses the top of stack as one operand.

So that c.2000 proposal would have become a shipping product in
about 2005? Intel is only about 7 years late? :-)

Thanks yet again for sharing historical information.

That example may add credibility to the theory that ideas are tainted
by the 'failures' of the projects in which they are first introduced.

(Part of the non- or delayed adoption of ideas after a 'failure' may
come from a lack of knowledge transfer between "tock" designs. If
an idea does not fit the perspective of the later team, there might
be a feeling of "Why waste time looking into an uninteresting idea,
especially one that so recently failed?". Perhaps something like
generational paradigm shift theory, there needs to be a changing of
the guard before some ideas can become acceptable. [It does seem
that having "tock" teams with different perspectives would be helpful
in avoiding stagnation even if it also means that some good ideas
considered in a previous generation are not reviewed until one or
more generations later. With such complex trade-offs and minimal
ability to model choices, I am glad that I am radically unsuited to
management.])

EricP

unread,

Dec 24, 2011, 4:25:54 PM12/24/11

to

It is also possible that an alternative approach
achieves the same result in a simpler manner.
Their experience with the x87 would give them
a good idea of the pros and cons.

Move elimination lowers pressure on the
Physical Register File by tracking where the
register value moved. That means that multiple
Architectural Registers can map to a single Physical Register
so the PR usage tracking changes from a bit to a counter.
The many-to-one AR:PR map may also complicate checkpointing
and precise exception roll back.

The alternative is simply make the PRF larger.
A scan of the sample code base would tell you how many such
moves you are likely to encounter in the register window.

Anyway... someone is apparently thinking about it.
This patent was filed 24 Dec, 2010:

MOVE ELIMINATION AND NEXT PAGE PREFETCHER
http://www.google.com/patents/US20110208918

It doesn't say who the assignee is, but one of the
filers David Sager is, according to LinkedIn,
currently Principle Architect at Intel.

Eric

Paul A. Clayton

unread,

Dec 24, 2011, 5:10:53 PM12/24/11

to

On Dec 24, 4:25 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
[snip]

> It is also possible that an alternative approach
> achieves the same result in a simpler manner.
> Their experience with the x87 would give them
> a good idea of the pros and cons.

A possibly more likely scenario is that targeting
the design resources to a different area/problem
is more attractive.

> Move elimination lowers pressure on the
> Physical Register File by tracking where the
> register value moved. That means that multiple

I think a more significant affect is that it
can shorten the critical path and can reduce
the overhead of movement (8 bits being cheaper to
move than 64). It also removes an instruction
from the issue queue (effectively increasing its
capacity).

> Architectural Registers can map to a single Physical Register
> so the PR usage tracking changes from a bit to a counter.
> The many-to-one AR:PR map may also complicate checkpointing
> and precise exception roll back.

Such complexity should not be worse than macro-op
fusion (e.g., compare fused with branch), I would
guess.

> Anyway... someone is apparently thinking about it.
> This patent was filed 24 Dec, 2010:
>
> MOVE ELIMINATION AND NEXT PAGE PREFETCHER
> http://www.google.com/patents/US20110208918
>
> It doesn't say who the assignee is, but one of the
> filers David Sager is, according to LinkedIn,
> currently Principle Architect at Intel.

A very quick scan of the claims--particularly the
earlier ones--does not seem to present what I would
consider especially novel and non-obvious (for a
late 2010 filing). I read the later claims even
less carefully (and they seemed less familiar), so
there might be patent-worthy content there. It also
seems a bit odd that prefetching and move elimination
would be joined in a single patent.

EricP

unread,

Dec 25, 2011, 12:34:12 PM12/25/11

to

Paul A. Clayton wrote:
> On Dec 24, 4:25 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> [snip]
>> It is also possible that an alternative approach
>> achieves the same result in a simpler manner.
>> Their experience with the x87 would give them
>> a good idea of the pros and cons.
>
> A possibly more likely scenario is that targeting
> the design resources to a different area/problem
> is more attractive.

I'm just pointing out that engineering considerations
rather than political machinations is a viable explanation.

>> Move elimination lowers pressure on the
>> Physical Register File by tracking where the
>> register value moved. That means that multiple
>
> I think a more significant affect is that it
> can shorten the critical path and can reduce
> the overhead of movement (8 bits being cheaper to
> move than 64). It also removes an instruction
> from the issue queue (effectively increasing its
> capacity).
>
>> Architectural Registers can map to a single Physical Register
>> so the PR usage tracking changes from a bit to a counter.
>> The many-to-one AR:PR map may also complicate checkpointing
>> and precise exception roll back.
>
> Such complexity should not be worse than macro-op
> fusion (e.g., compare fused with branch), I would
> guess.

Depends if it is limited or general.
Fusion joins 2 instructions into one
(how do they handle exceptions? Maybe they only fuse instructions
that cannot cause exceptions. Otherwise they'd need to be able to
roll back to the fused instruction program counter.)

However there can be multiple moves that qualify for ME
and they are not necessarily sequential.
Register-register moves can't cause exceptions
but you still need to track the program counter
so retire knows what value to set.

So there has to be some kind of 'anchor' instruction
that moves though the queue to transport the PC to retire.

Also new moves can arrive after the 'anchor' has been dispatched
but before it retires, so new moves (and their PC) can join
the set as they travel.

And what if an intervening non-move instruction traps?
Which PC do we roll back to?

The general case is looking very messy.

Eric

Niels Jørgen Kruse

unread,

Dec 25, 2011, 1:43:57 PM12/25/11

to

EricP <ThatWould...@thevillage.com> wrote:

> Depends if it is limited or general.
> Fusion joins 2 instructions into one
> (how do they handle exceptions? Maybe they only fuse instructions
> that cannot cause exceptions. Otherwise they'd need to be able to
> roll back to the fused instruction program counter.)
>
> However there can be multiple moves that qualify for ME
> and they are not necessarily sequential.
> Register-register moves can't cause exceptions
> but you still need to track the program counter
> so retire knows what value to set.
>
> So there has to be some kind of 'anchor' instruction
> that moves though the queue to transport the PC to retire.
>
> Also new moves can arrive after the 'anchor' has been dispatched
> but before it retires, so new moves (and their PC) can join
> the set as they travel.
>
> And what if an intervening non-move instruction traps?
> Which PC do we roll back to?
>
> The general case is looking very messy.

Isn't that the sort of problem that replay solves?

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Andy "Krazy" Glew

unread,

Dec 25, 2011, 3:52:01 PM12/25/11

to

On 12/24/2011 1:25 PM, EricP wrote:
> Paul A. Clayton wrote:
>> On Dec 24, 1:29 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>>> Andy was at AMD when I was proposing a machine that was to do Move
>>> Eliminations (circa 2000). At that point in time, Athlon was already
>>> doing move elimination in the x87 floating point stack rename process.

Mitch was at Motorola when I proposed a mchine that did MOVE
elimination. For the M68000. Circa 1988. As part of my original work on
HaRRM, a Hardware Register Renaming Mechanism. (Of course, Mitch was
M88K RISC focussed at that time.)

>> So that c.2000 proposal would have become a shipping product in
>> about 2005? Intel is only about 7 years late? :-)

Heck, we discussed MOVE elimination in P6 circa 1991.

We decided not to go that way because we decided to go for a RFOB/RRF
design. If we had gone for a PRF design, as in HaRMM or circa
1987-1991, move elimination was a lot easier.

I think there are public comp.arch posts on the subject.

EricP

unread,

Dec 25, 2011, 4:48:24 PM12/25/11

to

It is also desirable to eliminate the move uOp from the
instruction stream similar to fusion. If it only fuses a
single move to a prior instruction then it could be relatively
simple and provide most of the benefit for little extra cost.

For example

add r0, [r1] // r0 = r0 + [r1]
mov r2, r1
mov r4, r6

it could fuse the mov r2, r1 onto the add so they retire together,
but additionally fusing mov r4, r6 may be too complicated.

Furthermore, one can fuse a mov after an add because if
the add traps (e.g. page faults) then we roll back to its
prior checkpoint. However if the mov was prior to the add

mov r2, r1
add r0, [r1] // r0 = r0 + [r1]

then it would be problematic because if add trapped we would
need to roll back to the state after the mov but before the add
(which we eliminated the uOp for and probably the checkpoint too.)

Eric

Niels Jørgen Kruse

unread,

Dec 26, 2011, 3:37:08 AM12/26/11

to

EricP <ThatWould...@thevillage.com> wrote:

> Furthermore, one can fuse a mov after an add because if
> the add traps (e.g. page faults) then we roll back to its
> prior checkpoint. However if the mov was prior to the add
>
> mov r2, r1
> add r0, [r1] // r0 = r0 + [r1]
>
> then it would be problematic because if add trapped we would
> need to roll back to the state after the mov but before the add
> (which we eliminated the uOp for and probably the checkpoint too.)

I was assuming that fusing instructions was disabled during replay.

Torben �gidius Mogensen

unread,

Jan 2, 2012, 6:15:52 AM1/2/12

to

"Paul A. Clayton" <paaron...@gmail.com> writes:

> A recent post on the Real World Technologies (Moderated) forum
> indicating that Intel's Ivy Bridge will use move elimination makes
> me wonder how long it takes for an idea to reach implementation.
> (The idea of handling moves during register rename--especially
> moves preceding destructive operations, which are common in x86--
> must be somewhat old by now.)

If you have enough physical registers, a lot of move elimination can be
done by the compiler. Basically, if you have code in SSA form, all move
instructions in the original code can be eliminated by simple copy
elimination, leaving only the implicit moves present in the phi-nodes
(which can be a lot, I grant). The latter can often be eliminated by
the register allocator: Using the same register for all variables in a
phi-node eliminated the implicit moves. Copying basic blocks can also
eliminate moves. As a simple example, consider the loop for calculating
Fibonacci numbers:

while (n-->0) {
t = a;
a = a+b;
b = t;
}

If you unroll once, you get:

while (n-->0) {
t = a;
a = a+b;
b = t;
if (n--==0) break;
t = a;
a = a+b;
b = t;
}

After renaming the variables in the first half, we get

while (n-->0) {
t' = a;
a' = a+b;
b' = t';
if (n--==0) {a = a'; break;}
t = a';
a = a'+b';
b = t;
}

which reveals that most of the copies can be eliminated:

while (n-->0) {
b = a+b;
if (n--==0) {a = b; break;}
a = b+a;
}

The general case is more complicated, of course, but not too bad.

My point is that you might want to consider solving the problem in the
compiler rather than throwing hardware at it. You might not get all the
benefit that a hardware solution can, but if you can get 90%, then the
cost/benefit ratio of a hardware solution might not look so good.

Torben

Terje Mathisen

unread,

Jan 2, 2012, 9:53:42 AM1/2/12

to

Torben Ćgidius Mogensen wrote:
> which reveals that most of the copies can be eliminated:
>
> while (n-->0) {
> b = a+b;
> if (n--==0) {a = b; break;}
> a = b+a;
> }

In this particular case it is of course better to handle odd counts
outside the loop, avoiding the test inside:

a = 0; b = 1;
if (n & 1) {
a = 1;
}

// The lines above can be replaced by simply:
a = n & 1; b = 1;

n >>= 1;
while (n-- > 0) {
b += a;
a += b;
}
return a;

The expected runtime should be very close to 2 cycles/iteration (the
loop check can be overlapped with the dependent adds), i.e. (n).

>
> The general case is more complicated, of course, but not too bad.
>
> My point is that you might want to consider solving the problem in the
> compiler rather than throwing hardware at it. You might not get all the
> benefit that a hardware solution can, but if you can get 90%, then the
> cost/benefit ratio of a hardware solution might not look so good.

They key problem that move elimination can solve on x86 doesn't really
exist on most other architectures:

2-operand vs 3-operand instructions

Nearly all x86 instructions (with important exceptions for LEA and IMUL)
use the target as one of the inputs, so every algorithmic fanout will
require a standalone copy/move operation.

Since OoO cores tend to allocate a new physical target register for
every operation, in order to be able to back out of a mis-speculated
operation, getting rid of MOVs can make a lot of sense.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

James Van Buskirk

unread,

Jan 2, 2012, 12:40:58 PM1/2/12

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
news:n0t8t8-...@ntp6.tmsw.no...

> They key problem that move elimination can solve on x86 doesn't really
> exist on most other architectures:

> 2-operand vs 3-operand instructions

> Nearly all x86 instructions (with important exceptions for LEA and IMUL)
> use the target as one of the inputs, so every algorithmic fanout will
> require a standalone copy/move operation.

Some SSE instructions (PSHUFL/HW, PSHUFD, MOVDDUP, MOVSL/HDUP) don't
have this problem. AVX has 3-register operations. Seems like it
should help with FFT and convolution computation.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Andy "Super" Glew

unread,

Jan 2, 2012, 1:40:08 PM1/2/12

to

On 1/2/2012 3:15 AM, Torben �gidius Mogensen wrote:
> My point is that you might want to consider solving the problem [move elimination] in the

> compiler rather than throwing hardware at it. You might not get all the
> benefit that a hardware solution can, but if you can get 90%, then the
> cost/benefit ratio of a hardware solution might not look so good.
>
> Torben

That is one of the reasons why move elimination, which as I mentioned I
first played with in the 1980s (and which was probably considered by
folks at IBM decades earlier), took so long to get into hardware.

After the x86 register file issues, then came the "Why not do it in a
new instruction set?" discussions - first the various RISC proposals to
replace x86, then Itanium, then x86-64.

Paul's original post wondered how long it takes for an idea to reach
implementation. One factor is how credible the alternatives are (that
may later turn out to be insufficient or unsatisfactory).

Andy "Super" Glew

unread,

Jan 2, 2012, 1:39:13 PM1/2/12

to

On 1/2/2012 3:15 AM, Torben Ægidius Mogensen wrote:
> My point is that you might want to consider solving the problem [move elimination] in the

> compiler rather than throwing hardware at it. You might not get all the
> benefit that a hardware solution can, but if you can get 90%, then the
> cost/benefit ratio of a hardware solution might not look so good.
>
> Torben

Torben Ægidius Mogensen

unread,

Jan 3, 2012, 8:05:10 AM1/3/12

to

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:

> They key problem that move elimination can solve on x86 doesn't really
> exist on most other architectures:
>
> 2-operand vs 3-operand instructions

That is true. I still think a lot of these moves can be eliminated by
the compiler, but not nearly as many as if you have 3-address
instructions.

Torben

Terje Mathisen

unread,

Jan 3, 2012, 8:43:37 AM1/3/12

to

As I wrote, (nearly) every time you have an algorithmic fan-out, you
will need a copy operation. (I.e. as long as the fan-out happens along
the critical path.)

Just an hour ago I wrote some code over in c.l.a.x86 where the missing
3-operand instructions turned 2-cycle latency steps into 3-cycle operations.

Another example is the Full Adder where you (afair) end up with 7 x86
instructions vs 5 on an Alpha. (I use this code for compression of bit
arrays before counting bits.)

// s0 = a ^ b ^ c;
// s1 = (a & b) | ((a ^ b) & c);

The groupings are by latency:

t = a;
a ^= b;

b &= t;
t = c;
c &= a;

a ^= t; // S0
b |= c; // S1

James Van Buskirk

unread,

Jan 3, 2012, 9:52:41 AM1/3/12

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
news:a9dbt8-...@ntp6.tmsw.no...

> Torben Ægidius Mogensen wrote:

> As I wrote, (nearly) every time you have an algorithmic fan-out, you will
> need a copy operation. (I.e. as long as the fan-out happens along the
> critical path.)

> Just an hour ago I wrote some code over in c.l.a.x86 where the missing
> 3-operand instructions turned 2-cycle latency steps into 3-cycle
> operations.

> Another example is the Full Adder where you (afair) end up with 7 x86
> instructions vs 5 on an Alpha. (I use this code for compression of bit
> arrays before counting bits.)

> // s0 = a ^ b ^ c;
> // s1 = (a & b) | ((a ^ b) & c);

> The groupings are by latency:

> t = a;
> a ^= b;

> b &= t;
> t = c;
> c &= a;

> a ^= t; // S0
> b |= c; // S1

How quickly you forget the benefits of negative logic!

http://groups.google.com/group/comp.lang.asm.x86/msg/8dab77a28ee7f005?hl=en

Terje Mathisen

unread,

Jan 3, 2012, 11:49:25 AM1/3/12

to

James Van Buskirk wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message

>> Another example is the Full Adder where you (afair) end up with 7 x86
>> instructions vs 5 on an Alpha. (I use this code for compression of bit
>> arrays before counting bits.)
>

> How quickly you forget the benefits of negative logic!
>
> http://groups.google.com/group/comp.lang.asm.x86/msg/8dab77a28ee7f005?hl=en
>

I did not!

:-)

I just wanted a relatively simple example, besides I think we agree that
no compiler will ever discover that trick you invented over in clax.

Andy "Super" Glew

unread,

Jan 4, 2012, 3:16:45 AM1/4/12

to

On 1/3/2012 8:49 AM, Terje Mathisen wrote:
> James Van Buskirk wrote:
>> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
>>> Another example is the Full Adder where you (afair) end up with 7 x86
>>> instructions vs 5 on an Alpha. (I use this code for compression of bit
>>> arrays before counting bits.)
>>
>> How quickly you forget the benefits of negative logic!
>>
>> http://groups.google.com/group/comp.lang.asm.x86/msg/8dab77a28ee7f005?hl=en
>>
>>
>
> I did not!
>
> :-)
>
> I just wanted a relatively simple example, besides I think we agree that
> no compiler will ever discover that trick you invented over in clax.
>
> Terje

I wonder if a superoptimizer and/or genetic algorithm might be able to
discover that trick?

Terje Mathisen

unread,

Jan 4, 2012, 6:34:50 AM1/4/12

to

Almost certainly not:

If you check out that post James wrote, he did a lot more than just
finding a faster way from A to B, it was more like redefining the
problem to be C to D instead.

Let me put it another way:

Compilers will generate optimal code for Itanium a long time before they
are capable of tricks like that.

Anton Ertl

unread,

Jan 4, 2012, 7:24:16 AM1/4/12

to

"Andy \"Super\" Glew" <an...@SPAM.comp-arch.net> writes:
>>> http://groups.google.com/group/comp.lang.asm.x86/msg/8dab77a28ee7f005?hl=en
...

>I wonder if a superoptimizer and/or genetic algorithm might be able to
>discover that trick?

Superoptimizer yes, if you ask it the right (not too limited)
question, but it might take longer than you are willing to wait.

Of course, now that we know what we are looking for, we can ask a
question that's wide enough to allow discovering this trick, yet
narrow enough that the answer will come up in a realistic time frame,
but I doubt that we will ask the question in such a way if we don't
know that this trick is out there.

For genetic algorithms, we also would have to ask the right question,
and have the right fitness function, and crossover and mutation
operators. Even knowing what we search for, I don't see a fitness
function that would give a not-quite-there variant of this trick a
better fitness than any other piece of code that produces the wrong
result, so I don't see any advantage of genetic algorithms over
brute-force search (like the superoptimizer) here. But an expert on
GAs might have better ideas than I have.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

nm...@cam.ac.uk

unread,

Jan 4, 2012, 8:09:55 AM1/4/12

to

In article <2012Jan...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>"Andy \"Super\" Glew" <an...@SPAM.comp-arch.net> writes:
>>>> http://groups.google.com/group/comp.lang.asm.x86/msg/8dab77a28ee7f005?hl=en
>...
>>I wonder if a superoptimizer and/or genetic algorithm might be able to
>>discover that trick?
>
>Superoptimizer yes, if you ask it the right (not too limited)
>question, but it might take longer than you are willing to wait.
>
>Of course, now that we know what we are looking for, we can ask a
>question that's wide enough to allow discovering this trick, yet
>narrow enough that the answer will come up in a realistic time frame,
>but I doubt that we will ask the question in such a way if we don't
>know that this trick is out there.

Quite. The key fact about global optimisation of ANY form is that
it is an intractable problem unless you can use some previous
knowledge to produce a targetted search method.

>For genetic algorithms, we also would have to ask the right question,
>and have the right fitness function, and crossover and mutation
>operators. Even knowing what we search for, I don't see a fitness
>function that would give a not-quite-there variant of this trick a
>better fitness than any other piece of code that produces the wrong
>result, so I don't see any advantage of genetic algorithms over
>brute-force search (like the superoptimizer) here. But an expert on
>GAs might have better ideas than I have.

I doubt it. It isn't true that they are simply snake oil, but it
isn't unfair to call them it, either. No incremental method can
find more than the local minimum, though exactly what 'local' means
depends on how much randomness you introduce (i.e. the factors you
mention). 'Genetic algorithms' were simply a gimmicky variant of
methods used in the 1960s and, while thgey might be a slightly more
systematic or efficient approach, aren't going to find any minima
that those wouldn't find.

Regards,
Nick Maclaren.

James Van Buskirk

unread,

Jan 4, 2012, 10:01:20 AM1/4/12

to

"Andy "Super" Glew" <an...@SPAM.comp-arch.net> wrote in message
news:4F040AED...@SPAM.comp-arch.net...

> On 1/3/2012 8:49 AM, Terje Mathisen wrote:

>> James Van Buskirk wrote:

>>> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
>>>> Another example is the Full Adder where you (afair) end up with 7 x86
>>>> instructions vs 5 on an Alpha. (I use this code for compression of bit
>>>> arrays before counting bits.)

>>> How quickly you forget the benefits of negative logic!

>>> http://groups.google.com/group/comp.lang.asm.x86/msg/8dab77a28ee7f005?hl=en

>> I did not!

>> I just wanted a relatively simple example, besides I think we agree that
>> no compiler will ever discover that trick you invented over in clax.

> I wonder if a superoptimizer and/or genetic algorithm might be able to
> discover that trick?

Depends on the run time of the optimizer. If it were allowed to go
on until AVX was available then it could replace a sequence such as

movaps xmm2, xmm0
xorps xmm2, xmm1

with

vxorps xmm2, xmm0, xmm1

and the problem of 2-register instructions would again be eliminated.

MitchAlsup

unread,

Jan 4, 2012, 2:03:46 PM1/4/12

to

Should I remind Terje that we (S.E.L computers) did* a compiler "optimizer" to specifically recognize the ATAN(...) and SQRT(...) loops in Whetstone and reduce these algebraiclly to constant evaluations in 1984. The performanc improved by about 17X and was one ot the reasons Whetstone was dropped as a useful benchmark.

If one can recognize that kind of "algebra" optimizing the conversation going on here is not that different. Sort-of like converting an XOR P-G generator in an ALU-adder into a OR-NAND P-G gnerator in an adder to save a gate delay.

Mitch

(*) where 'did' means hired some external compiler vendor

Terje Mathisen

unread,

Jan 4, 2012, 3:39:53 PM1/4/12

to

MitchAlsup wrote:
> Should I remind Terje that we (S.E.L computers) did* a compiler
> "optimizer" to specifically recognize the ATAN(...) and SQRT(...)
> loops in Whetstone and reduce these algebraiclly to constant
> evaluations in 1984. The performanc improved by about 17X and was one
> ot the reasons Whetstone was dropped as a useful benchmark.
>

:-)

Benchmarketing has never been a very useful activity, except when they
actually generate sales, but just like the Spec rules I would require
any such optimization to be general enough to handle a lot more than
just a single specific function.

I.e. I want a compiler, not a benchmark recognizer.

> If one can recognize that kind of "algebra" optimizing the
> conversation going on here is not that different. Sort-of like
> converting an XOR P-G generator in an ALU-adder into a OR-NAND P-G
> gnerator in an adder to save a gate delay.

If you can get your compiler to take such a bitslice algorithm written
with unsigned C vars, and figure out that inverted logic would be
faster, more power to you!

Paul A. Clayton

unread,

Jan 4, 2012, 6:54:44 PM1/4/12

to

On Jan 4, 3:39 pm, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> MitchAlsup wrote:
> > Should I remind Terje that we (S.E.L computers) did* a compiler
> > "optimizer" to specifically recognize the ATAN(...) and SQRT(...)
> > loops in Whetstone and reduce these algebraiclly to constant
> > evaluations in 1984. The performanc improved by about 17X and was one
> > ot the reasons Whetstone was dropped as a useful benchmark.

I think a Whetstone optimization of this type was mentioned in an
earlier edition of Computer Architecture: A Quantitative Approach.

> :-)
>
> Benchmarketing has never been a very useful activity, except when they
> actually generate sales, but just like the Spec rules I would require
> any such optimization to be general enough to handle a lot more than
> just a single specific function.

While recognizing that the log(sqrt(n)) equals 0.5*log(n) might not
be broadly useful, there may be some cases where an expression
is not simplified by the programmer because the simplification
would reduce ease of understanding. (I am not sure that any
commonly used programming languages support 'overloading' of
functions with semantically equivalent definitions--so that an
easily understood implementation would be provided while an
alternative optimized version would be available. Even if the
compiler did not prove equivalence--to simplify maintaining the
code--, such might be useful.) If the code is in a function that is
inlined, the compiler may be able to recognize a special case
which is not available in the source code (and profile-guided
optimization might be able to find even more optimization
opportunities, though I think most profiling does not provide
value information).

> I.e. I want a compiler, not a benchmark recognizer.

ISTR that Sun broke the 'art' subbenchmark by doing a valid
array of structures to structure of arrays transformation. In
theory, especially with profile information, a compiler could
perform a wide variety of data structure optimizations.

It is also not clear how refactoring tools should be handled.
While the main purpose of such is to improve maintainability,
some refactoring can improve performance as well. If a
refactoring (or rejuvenation) is safe, it is effectively a
compiler optimization. If it must be confirmed by a
programmer, the refactoring would still be much less heavy
weight than hand-tuning the program and so might be
considered a valid optimization method for a broader
range of applications. If dynamically testing the safety of
the optimization was sufficiently inexpensive, the fast
path could use the optimization. If the benchmark
never triggers the slow path, it may be a true optimization
for that benchmark.

Likewise for programs that have little parallelism running
on more parallel systems, it might be possible to improve
performance by running different implementations of
subtasks in parallel and choosing the fastest correct
result. (A recent paper proposed this.) Likewise, a
compiler might be able to generate a scout thread to
prefetch data into a shared cache.

One of the problems with optimizing compilers--especially
some JIT compilers--is that they can make synthetic
benchmarks (for testing a single factor) more difficult to
write.

Andy "Super" Glew

unread,

Jan 4, 2012, 8:06:42 PM1/4/12

to

Yes, but a programmer (you, Mitch?) added code or pattern recognizers to
the compiler for these, I'm guessing?

Now, constant propagation is an example of a more general technique that
would have accomplished the same effect.

I'm not so sure as Terje is that a genetic algorithm might not be able
to generate JvB's code. One formulation of such algos is to apply, in a
semi-random manner, value preserving transforms, and then explore the space.

Looking at James' code, I think it amounts to applying the transformation

POPCNT(x) = Wordsize-POPCNT(NOT(x)).

And thereafter (super) optimizing the hell out of it.

Now, perhaps nobody has thought to add such an identity to the list of
transforms that genetic algorithms might apply.

But I suspect that some such can themselves be generated by genetic
algorithms, possibly operating on symbolic representations.

A more realistic question, I think, is whether a compiler would ever be
able to decide how much time to dedicate to optimizing any such a piece
of code.

Andy "Super" Glew

unread,

Jan 4, 2012, 8:15:40 PM1/4/12

to

On 1/4/2012 4:24 AM, Anton Ertl wrote:
> "Andy \"Super\" Glew"<an...@SPAM.comp-arch.net> writes:
>>>> http://groups.google.com/group/comp.lang.asm.x86/msg/8dab77a28ee7f005?hl=en
> ...
>> I wonder if a superoptimizer and/or genetic algorithm might be able to
>> discover that trick?
>
> Superoptimizer yes, if you ask it the right (not too limited)
> question, but it might take longer than you are willing to wait.
>
> Of course, now that we know what we are looking for, we can ask a
> question that's wide enough to allow discovering this trick, yet
> narrow enough that the answer will come up in a realistic time frame,
> but I doubt that we will ask the question in such a way if we don't
> know that this trick is out there.
>
> For genetic algorithms, we also would have to ask the right question,
> and have the right fitness function, and crossover and mutation
> operators. Even knowing what we search for, I don't see a fitness
> function that would give a not-quite-there variant of this trick a
> better fitness than any other piece of code that produces the wrong
> result, so I don't see any advantage of genetic algorithms over
> brute-force search (like the superoptimizer) here. But an expert on
> GAs might have better ideas than I have.
>
> - anton

Genetic algorithms do not always have sex.

Sometimes they just explore asexual reproduction.

Or perhaps I am just using "genetic algorithms" because it is the
faddish modern term we used to call Bernouilli optimization of simulated
annealing.

(Simulated annealing = approximately asexual genetic algorithms.)

Fitness function:

* exactly what Terje and James were discussing: number of instructions
and/or estimated execution time.

Mutation operators:

a) any code scheduling rearrangement of instructions

b) a large library of transformations that are semantically neutral,
such as POP(x) = Nbits - POP(NOT(x)), etc. AArithmetic and logical
identities.

Annealing means that you do not apply the fitness function on every
generation. Instead, you occasionally allow subjects to become worse,
giving them a chance to get out of local minima.

I would be surprised if bisexual genetic algorithms did not alsioi do
the same sort of annealing.

Anton Ertl

unread,

Jan 5, 2012, 10:16:44 AM1/5/12

to

"Andy \"Super\" Glew" <an...@SPAM.comp-arch.net> writes:
>Fitness function:
>
>* exactly what Terje and James were discussing: number of instructions
> and/or estimated execution time.

I was wondering how correctness would play into the fitness function,
but you neatly define the problem away:

>Mutation operators:
>
>a) any code scheduling rearrangement of instructions
>
>b) a large library of transformations that are semantically neutral,
>such as POP(x) = Nbits - POP(NOT(x)), etc. AArithmetic and logical
>identities.

All your changes maintain correctness, so that problem is solved. Of
course the search space will be much more limited with that than that
of the superoptimizer, so there can (and probably will) be solutions
that this kind of search cannot find. But the trick mentioned earlier
can probably be found with the right library of transformations. But
would you have all the right rules in the library if you did not know
what to search for?

I also wonder if Simulated Annealing or GA (or other heuristic search
algorithms) are really better for this kind of stuff than, say, an
exhaustive best-first search.

>Annealing means that you do not apply the fitness function on every
>generation. Instead, you occasionally allow subjects to become worse,
>giving them a chance to get out of local minima.
>
>I would be surprised if bisexual genetic algorithms did not alsioi do
>the same sort of annealing.

From what I know about GA, that is achieved by having a certain
population size, and possible not being totally strict about survival
of the fittest.

nm...@cam.ac.uk

unread,

Jan 5, 2012, 10:38:27 AM1/5/12

to

In article <2012Jan...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>

>I also wonder if Simulated Annealing or GA (or other heuristic search
>algorithms) are really better for this kind of stuff than, say, an
>exhaustive best-first search.

Yes, because the latter will not necessarily finish while the
universe can still power the computer. However, the empirical
searches pioneered in the 1960s are probably as good as any of
the fancily named ones.

For example, inserting a few randomisations into a simple
steepest descents (or each all-neighbour) search, with an ad-hoc
backoff strategy to ensure improvement over a suitable timescale.
That was used in multi-dimensional scaling algorithms, which are
just a global search of a discontinuous objective function in a
very large number of dimensions.

Regards,
Nick Maclaren.

Anton Ertl

unread,

Jan 5, 2012, 10:53:09 AM1/5/12

to

nm...@cam.ac.uk writes:
>In article <2012Jan...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>
>>I also wonder if Simulated Annealing or GA (or other heuristic search
>>algorithms) are really better for this kind of stuff than, say, an
>>exhaustive best-first search.
>
>Yes, because the latter will not necessarily finish while the
>universe can still power the computer.

Finish? With any of these searches and the search space outlined
earlier, you always have a "current best" solution, and you never have
a proof that it is optimal (ok, with an exhaustive search like
best-first you may have proof if the search space has a certain
structure, but I don't think the search space outlined earlier has
this structure), so there is no real finish; you finish the program
when you want, and take the best solution up to that point.

What the best-first search gives you is the knowledge that you are
doing an exhaustive search, whereas the heuristic searches can leave
out parts completely and do other parts repeatedly. If the guidance
given by the fitness function is good, that's a good tradeoff, but for
the search space and fitness function we have been discussing, I doubt
that the fitness function really helps much in guiding the search,
that's why I have my doubts about the effectiveness of the heuristic
approaches in this setting.

Ordinary best-first search will run out of memory pretty soon, but one
can simulate it with iterative deepening which uses much less memory
(although probably more time).

In contrast to the approaches mentioned above, with the
superoptimizer, if you find a solution, you know that it's optimal
(for the given set of selectable instructions and the question asked).

nm...@cam.ac.uk

unread,

Jan 5, 2012, 11:38:33 AM1/5/12

to

In article <2012Jan...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>>
>>>I also wonder if Simulated Annealing or GA (or other heuristic search
>>>algorithms) are really better for this kind of stuff than, say, an
>>>exhaustive best-first search.
>>
>>Yes, because the latter will not necessarily finish while the
>>universe can still power the computer.
>
>Finish? With any of these searches and the search space outlined
>earlier, you always have a "current best" solution, and you never have
>a proof that it is optimal (ok, with an exhaustive search like
>best-first you may have proof if the search space has a certain
>structure, but I don't think the search space outlined earlier has
>this structure), so there is no real finish; you finish the program
>when you want, and take the best solution up to that point.

Unless you use a sufficiently subtle filling pattern for the
exhaustive search, it will check all of one area before moving
on to the next. Cutting it short will not provide a useful
result.

Designing and coding suitable filling pattern for general search
spaces isn't within most people's ability, either.

Regards,
Nick Maclaren.

Anton Ertl

unread,

Jan 5, 2012, 12:05:25 PM1/5/12

to

nm...@cam.ac.uk writes:
>In article <2012Jan...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>Finish? With any of these searches and the search space outlined
>>earlier, you always have a "current best" solution, and you never have
>>a proof that it is optimal (ok, with an exhaustive search like
>>best-first you may have proof if the search space has a certain
>>structure, but I don't think the search space outlined earlier has
>>this structure), so there is no real finish; you finish the program
>>when you want, and take the best solution up to that point.
>
>Unless you use a sufficiently subtle filling pattern for the
>exhaustive search, it will check all of one area before moving
>on to the next. Cutting it short will not provide a useful
>result.

Are you thinking about a depth-first search? I was writing about
best-first search. Or maybe these terms are too fancy for you:-).

nm...@cam.ac.uk

unread,

Jan 5, 2012, 12:48:07 PM1/5/12

to

In article <2012Jan...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>
>>>Finish? With any of these searches and the search space outlined
>>>earlier, you always have a "current best" solution, and you never have
>>>a proof that it is optimal (ok, with an exhaustive search like
>>>best-first you may have proof if the search space has a certain
>>>structure, but I don't think the search space outlined earlier has
>>>this structure), so there is no real finish; you finish the program
>>>when you want, and take the best solution up to that point.
>>
>>Unless you use a sufficiently subtle filling pattern for the
>>exhaustive search, it will check all of one area before moving
>>on to the next. Cutting it short will not provide a useful
>>result.
>
>Are you thinking about a depth-first search? I was writing about
>best-first search. Or maybe these terms are too fancy for you:-).

Sigh. The term depth-first search implies a particular ordering
of the objective space, and structuring of the relationships.
The term best-first search is a fancy neologism for heuristic
methods related to the ones I described, that were also used in
the 1960s. Unless you can specify those characteristics, your
distinction is merely hot air.

Furthermore, the word exhaustive requires either a suitable
space-filling search pattern or particular knowledge that can
exclude areas of the objective space once one point and a bound
on the minimum is known.

That's all standard mathematical search theory, and has been
shown to be reliable by five decades of computing.

Regards,
Nick Maclaren.

Andy (Super) Glew

unread,

Apr 6, 2013, 1:03:00 AM4/6/13

to

On 1/2/2012 3:15 AM, Torben �gidius Mogensen wrote:
> My point is that you might want to consider solving the problem [move elimination] in the

> compiler rather than throwing hardware at it. You might not get all the
> benefit that a hardware solution can, but if you can get 90%, then the
> cost/benefit ratio of a hardware solution might not look so good.
>
> Torben

Edward Feustel

unread,

Apr 9, 2013, 6:24:19 AM4/9/13

to

Doesn't it also depend on the cost of technology? At one point the
additional cost of memory seemed prohibitive w.r.t. to implementing
tagged/descriptor based architecture. That is not true now. At one
point, the cost or rewriting SW could prevent the implementation of a
new architecture, -- this before the more general use of architecture
independent higher level languages.
Unfortunately, the use of C instead of something more like IBM's SPL
did not help the move to a tagged/descriptor architecture since it is
so "casual" about the generation of addresses. And unfortunately this
is ingrained in OSs like Unix/Linux.
Ed Feustel

MitchAlsup

unread,

Apr 9, 2013, 11:24:55 AM4/9/13

to

On Tuesday, April 9, 2013 5:24:19 AM UTC-5, Edward Feustel wrote:
> Doesn't it also depend on the cost of technology? At one point the
> additional cost of memory seemed prohibitive w.r.t. to implementing
> tagged/descriptor based architecture. That is not true now.

Correct, but what is now ingrained is that data is 32-bits or 64-bits
or 128-bits, and that the size of memory is 4KBytes or 8KBytes. These
boundaries permiate the design space, leaving no room for descriptors
or tags. So the cost per bit WOULD allow, the organization of the bits
does not.

Mitch

John D. McCalpin

unread,

Apr 9, 2013, 2:52:46 PM4/9/13

to

I would be delighted to have more bits that I could use for tags
and descriptors, but have never been able to convince the design
team that both:
(1) the functionality will be widely demanded, and
(2) adequate performance cannot be obtained using a virtual machine
with a good JIT.

Andy (Super) Glew

unread,

Apr 9, 2013, 9:58:52 PM4/9/13

to

> On Tuesday, April 9, 2013 10:24:55 AM UTC-5, MitchAlsup wrote:
>> On Tuesday, April 9, 2013 5:24:19 AM UTC-5, Edward Feustel wrote:
>>
>>> Doesn't it also depend on the cost of technology? At one point the
>>> additional cost of memory seemed prohibitive w.r.t. to implementing
>>> tagged/descriptor based architecture. That is not true now.
>>
>> Correct, but what is now ingrained is that data is 32-bits or 64-bits
>> or 128-bits, and that the size of memory is 4KBytes or 8KBytes. These
>> boundaries permiate the design space, leaving no room for descriptors
>> or tags. So the cost per bit WOULD allow, the organization of the bits
>> does not.
>>
>> Mitch

I am not so sure that 2^N sized data packets need always endure. But I
agree with Mitch that 2^N sized data packets are what we have now, and
are unlikely to change any time soon.

I note that hardware and software data packets need not be the same.
Even if hardware mandates 2^N, software can handle 2^N-a few tag bits.
And does, on a regular basis, in JITs and interpreters for languages
like Javascript and ... But, of course, stupid C code that assumes that
int is 32 bits, etc., will break in such languages - if those languages
wrap, say, at 31 bits. And may well break if those languages transition
smoothly to extended precision - let alone FP the way so many do.
Basically,stupid C code breaks all over the place.

But even without this, there is another way: outlying, non-adjacent,
metadata. Tag bits that do not live next to the data in question.

Or, you can use Poor Man's ECC to have 2^N sized hardware data, 2^M
sized software data, plus a few adjacent bits.

--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies, which has been acquired by
Imagination Technologies; in the past of companies such as Intellectual
Ventures and QIPS, Intel, AMD, Motorola, and Gould), I reveal this only
so that the reader may account for any possible bias I may have towards
my employer's products. The statements I make here in no way represent
my employers' positions on the issue, nor am I authorized to speak on
behalf of my employers, past or present.

Terje Mathisen

unread,

Apr 10, 2013, 3:00:12 AM4/10/13

to

Andy (Super) Glew wrote:
> I am not so sure that 2^N sized data packets need always endure. But I
> agree with Mitch that 2^N sized data packets are what we have now, and
> are unlikely to change any time soon.

The only real chance for tag bits have already been used for a while, in
the form of specialized memory controllers that can steal a bit from ECC
RAM, right?

In order to use this from any language you would only need a way for the
OS to initialize data blocks as "invalid/not set", this would at least
give you the ability to detect use before set of global or heap data.

In order to do the same for stack variables you would need much more
fine-grained access, i.e. a way to write 65-bit "NaN" patterns.

Getting more than a single bit like this is of course much harder...

Ivan Godard

unread,

Apr 10, 2013, 4:16:57 AM4/10/13

to

On 4/10/2013 12:00 AM, Terje Mathisen wrote:
> Andy (Super) Glew wrote:
>> I am not so sure that 2^N sized data packets need always endure. But I
>> agree with Mitch that 2^N sized data packets are what we have now, and
>> are unlikely to change any time soon.
>
> The only real chance for tag bits have already been used for a while, in
> the form of specialized memory controllers that can steal a bit from ECC
> RAM, right?
>
> In order to use this from any language you would only need a way for the
> OS to initialize data blocks as "invalid/not set", this would at least
> give you the ability to detect use before set of global or heap data.
>
> In order to do the same for stack variables you would need much more
> fine-grained access, i.e. a way to write 65-bit "NaN" patterns.
>
> Getting more than a single bit like this is of course much harder...

Why? Storing anything except a pointer on top of a pointer would clear
the reserved bit, so ordinary memset makes everything become data even
if it had been descriptor. Trying to dereference a non-descriptor
faults. Or am I misunderstanding you?

Ivan

Terje Mathisen

unread,

Apr 10, 2013, 5:41:55 AM4/10/13

to

Yeah, I wasn't aggressive enough:

I was only considering a single bit to differentiate between initialized
and un-initialized memory, you could of course use the same bit to
designate a pointer, but at that point you would need new opcode(s) to
generate/store a pointer and compiler modifications to use them.

It may well be that the ability to uniquely identify all pointers is
more important that the ability to detect attempts to access random
garbage, but even the latter would generate some overhead in order to
flag any newly allocated stack space as such, you could probably get
most of the benefit by teaching your compiler to always zero out all
stack locations.

Paul A. Clayton

unread,

Apr 10, 2013, 8:19:53 AM4/10/13

to

On Apr 10, 3:00 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:

> Andy (Super) Glew wrote:
> > I am not so sure that 2^N sized data packets need always endure. But I
> > agree with Mitch that 2^N sized data packets are what we have now, and
> > are unlikely to change any time soon.
>
> The only real chance for tag bits have already been used for a while, in
> the form of specialized memory controllers that can steal a bit from ECC
> RAM, right?

If one cannot widen a memory channel (because of commodity
memory chip width constraints), one could add a spare channel
as the last (?) Alpha system did (though it used such as an
optional mechanism, to provide "module-kill").

Andy Glew has suggested using "Poor Man's ECC" (as done by
a recent-ish GPU?), where the metadata (ECC) is stored as
ordinary memory in a fixed address. (I think at least one
early directory system proposal used such reserved address
space for directory storage.)

Although I am not certain that Andy has mentioned it, from
his other comments on using space freed by compression for
storing metadata, I suspect that he has considered using
some such space as a cache for such alternate-storage
metadata (an adequately compressible chunk allowing a
"cache hit"). (Metadata could be sorted by criticality or
frequency of use so that the most important metadata could
have the highest hit rates.)

> In order to use this from any language you would only need a way for the
> OS to initialize data blocks as "invalid/not set", this would at least
> give you the ability to detect use before set of global or heap data.
>
> In order to do the same for stack variables you would need much more
> fine-grained access, i.e. a way to write 65-bit "NaN" patterns.
>
> Getting more than a single bit like this is of course much harder...

For some uses that treat the data as invalid, it might be
possible to use a maximally uncorrectable error. (ISTR
that this is done for handling uncorrectable errors on
writeback so that any next use [read] is extremely likely
to generate a detected an uncorrectable error.)

Some time ago I suggested that it might be possible to
use a similar mechanism to support a key checking
mechanism where the key is hashed with the data and/or
ECC metadata such that any unmatching key would produce
an uncorrectable error. (Using correctable error
encodings would allow more key values but force any
detected error to make a check using auxiliary metadata.)
Such might be used for limited match-oriented tags
(though hardware could--with some effort--extract all
the metadata and a probability of its correctness for a
given SEU probability). I have not done the math, but
I _think_ this mechanism would work.

Ivan Godard

unread,

Apr 10, 2013, 11:40:08 AM4/10/13

to

On 4/10/2013 2:41 AM, Terje Mathisen wrote:

<snip>

> I was only considering a single bit to differentiate between initialized
> and un-initialized memory, you could of course use the same bit to
> designate a pointer, but at that point you would need new opcode(s) to
> generate/store a pointer and compiler modifications to use them.

New opcodes and compiler mods shouldn't be necessary. Registers carry
the extra bit, so normal load/store preserves pointer-hood. LEA
generates a pointer, and pointer-like registers (SP, FP, whatever) are
pointers. Pointer +/- non-pointer is a pointer, all other arithmetic is
a non-pointer.

Open question whether you check pointer arithmetic at arithmetic time or
dereference time.

Ivan

nm...@cam.ac.uk

unread,

Apr 10, 2013, 12:31:10 PM4/10/13

to

In article <kk4101$l17$1...@dont-email.me>,

In a sane language, yes. In C or C++, don't even think of doing
the former!

Regards,
Nick Maclaren.

Terje Mathisen

unread,

Apr 10, 2013, 4:18:03 PM4/10/13

to

Ivan Godard wrote:
> On 4/10/2013 2:41 AM, Terje Mathisen wrote:
>
> <snip>
>
>> I was only considering a single bit to differentiate between initialized
>> and un-initialized memory, you could of course use the same bit to
>> designate a pointer, but at that point you would need new opcode(s) to
>> generate/store a pointer and compiler modifications to use them.
>
> New opcodes and compiler mods shouldn't be necessary. Registers carry
> the extra bit, so normal load/store preserves pointer-hood. LEA
> generates a pointer, and pointer-like registers (SP, FP, whatever) are
> pointers. Pointer +/- non-pointer is a pointer, all other arithmetic is
> a non-pointer.

XOR'ing two pointers together (the classic double-link in a single
pointer trick) would probably blow up, right?

Well, good riddance anyway. :-)

>
> Open question whether you check pointer arithmetic at arithmetic time or
> dereference time.

Probably deref...

Ivan Godard

unread,

Apr 10, 2013, 5:26:49 PM4/10/13

to

On 4/10/2013 1:18 PM, Terje Mathisen wrote:
> Ivan Godard wrote:
>> On 4/10/2013 2:41 AM, Terje Mathisen wrote:
>>
>> <snip>
>>
>>> I was only considering a single bit to differentiate between initialized
>>> and un-initialized memory, you could of course use the same bit to
>>> designate a pointer, but at that point you would need new opcode(s) to
>>> generate/store a pointer and compiler modifications to use them.
>>
>> New opcodes and compiler mods shouldn't be necessary. Registers carry
>> the extra bit, so normal load/store preserves pointer-hood. LEA
>> generates a pointer, and pointer-like registers (SP, FP, whatever) are
>> pointers. Pointer +/- non-pointer is a pointer, all other arithmetic is
>> a non-pointer.
>
> XOR'ing two pointers together (the classic double-link in a single
> pointer trick) would probably blow up, right?
>
> Well, good riddance anyway. :-)

Doesn't have to. XOR the indices together, and then use the result to
index the arena (which is a pointer).

>> Open question whether you check pointer arithmetic at arithmetic time or
>> dereference time.
>
> Probably deref...

I'd favor at least a mode that does it at arithmetic time, to check the
C rules. Of course, users will complain that a fault means the hardware
is broken, not that their program is buggy :-)

Ivan

nm...@cam.ac.uk

unread,

Apr 10, 2013, 5:36:05 PM4/10/13

to

In article <kk4la1$mo5$1...@dont-email.me>,

And they might be right :-( Sorry, but things ain't that simple
in C. There are significant differences between C90, C99 and C11,
as well as a variety of semi-defined states where a pointer value
can be valid but dereferencing it (or even SOME ways of doing so)
can be invalid.

Regards,
Nick Maclaren.

Terje Mathisen

unread,

Apr 11, 2013, 1:33:45 AM4/11/13

to

You _cannot_ do this, or at least not strictly, since C++ have blessed
pointers one item past the allocated array: This is a core feature of
the entire STL, right?

Ivan Godard

unread,

Apr 11, 2013, 2:07:14 AM4/11/13

to

On 4/10/2013 10:33 PM, Terje Mathisen wrote:
> Ivan Godard wrote:
>> On 4/10/2013 1:18 PM, Terje Mathisen wrote:
>>> Ivan Godard wrote:
>>>> Open question whether you check pointer arithmetic at arithmetic
>>>> time or
>>>> dereference time.
>>>
>>> Probably deref...
>>
>> I'd favor at least a mode that does it at arithmetic time, to check the
>> C rules. Of course, users will complain that a fault means the hardware
>> is broken, not that their program is buggy :-)
>
> You _cannot_ do this, or at least not strictly, since C++ have blessed
> pointers one item past the allocated array: This is a core feature of
> the entire STL, right?

Of course you can check it at the arithmetic; you just accept
one-past-the-end results without fault. And refuse to dereference them
of course. The compiler has to tell you the stride, but you'd want
pointer arithmetic to do that anyway. rD = rP + rI*imm, probably with an
optimizing instruction rD = rP + (rI << imm).

Ivan

nm...@cam.ac.uk

unread,

Apr 11, 2013, 3:29:28 AM4/11/13

to

In article <piuh3a-...@ntp-sure.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>Ivan Godard wrote:
>> On 4/10/2013 1:18 PM, Terje Mathisen wrote:
>>> Ivan Godard wrote:
>>>> Open question whether you check pointer arithmetic at arithmetic time or
>>>> dereference time.
>>>
>>> Probably deref...
>>
>> I'd favor at least a mode that does it at arithmetic time, to check the
>> C rules. Of course, users will complain that a fault means the hardware
>> is broken, not that their program is buggy :-)
>
>You _cannot_ do this, or at least not strictly, since C++ have blessed
>pointers one item past the allocated array: This is a core feature of
>the entire STL, right?

Right. Except, if that were all, we would be laughing! I have
several times tried to get this cleaned up to a state that even
WG14 or WG21 could understand it, and have failed dismally every
time. I know for certain that no mere mortal knows what the
standard intends, let alone what it says or what compilers do.

Regards,
Nick Maclaren.

Anton Ertl

unread,

Apr 11, 2013, 3:42:35 AM4/11/13

to

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>> New opcodes and compiler mods shouldn't be necessary. Registers carry
>> the extra bit, so normal load/store preserves pointer-hood. LEA
>> generates a pointer, and pointer-like registers (SP, FP, whatever) are
>> pointers. Pointer +/- non-pointer is a pointer, all other arithmetic is
>> a non-pointer.
>
>XOR'ing two pointers together (the classic double-link in a single
>pointer trick) would probably blow up, right?

pointer^pointer -> non-pointer
pointer^non-pointer -> pointer

Should work, no?

>Well, good riddance anyway. :-)

It's a cute trick, but its practical use is limited. I used it once,
maybe 25 years ago. Since then I have managed to avoid doubly-linked
lists.

Terje Mathisen

unread,

Apr 11, 2013, 3:42:34 PM4/11/13

to

Ivan Godard wrote:
> On 4/10/2013 10:33 PM, Terje Mathisen wrote:
>> You _cannot_ do this, or at least not strictly, since C++ have blessed
>> pointers one item past the allocated array: This is a core feature of
>> the entire STL, right?
>
> Of course you can check it at the arithmetic; you just accept
> one-past-the-end results without fault. And refuse to dereference them

This was exactly what I meant by "at least not strictly", i.e. pointing
at the very next byte after the allocation limit has to be OK, as long
as you never try to access it.

> of course. The compiler has to tell you the stride, but you'd want
> pointer arithmetic to do that anyway. rD = rP + rI*imm, probably with an
> optimizing instruction rD = rP + (rI << imm).

In STL you're probably adding a constant to the current pointer, so no
optimization needed.

Terje Mathisen

unread,

Apr 11, 2013, 3:46:40 PM4/11/13

to

Anton Ertl wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>>> New opcodes and compiler mods shouldn't be necessary. Registers carry
>>> the extra bit, so normal load/store preserves pointer-hood. LEA
>>> generates a pointer, and pointer-like registers (SP, FP, whatever) are
>>> pointers. Pointer +/- non-pointer is a pointer, all other arithmetic is
>>> a non-pointer.
>>
>> XOR'ing two pointers together (the classic double-link in a single
>> pointer trick) would probably blow up, right?
>
> pointer^pointer -> non-pointer
> pointer^non-pointer -> pointer
>
> Should work, no?

Except the specification said "all other arithmetic" which includes XOR
obviously, results in a non-pointer...

>
>> Well, good riddance anyway. :-)
>
> It's a cute trick, but its practical use is limited. I used it once,
> maybe 25 years ago. Since then I have managed to avoid doubly-linked
> lists.

I tend to end up with either (semi-)contiguous arrays, or plain hash
tables when next/prev isn't needed.

Single-item linked lists only make sense when each item is far larger
than a pointer, and you're doing sufficient work on them to make the
pointer access overhead negligible.

Andy (Super) Glew

unread,

Apr 15, 2013, 1:42:13 AM4/15/13

to

On 4/10/2013 12:00 AM, Terje Mathisen wrote:

> Andy (Super) Glew wrote:
>> I am not so sure that 2^N sized data packets need always endure. But I
>> agree with Mitch that 2^N sized data packets are what we have now, and
>> are unlikely to change any time soon.
>
> The only real chance for tag bits have already been used for a while, in
> the form of specialized memory controllers that can steal a bit from ECC
> RAM, right?

This has been used for a while, but it is not good enough to achieve
mass market at the moment, for the simple reason that most PCs and cell
phones do not have ECC.

And even on the systems that use ECC, they usually use codes stringer
that SECDED, that use all of the ECC bits.

So, no, my conclusion is that stealing from ECC is not the only real
chance for tag bots. It is/was a dead end that has held back computers
for 20+ years.

At the moment, outlying metadata and/or Poor Man's ECC is the only real
chance for tag bits.

>
> In order to use this from any language you would only need a way for the
> OS to initialize data blocks as "invalid/not set", this would at least
> give you the ability to detect use before set of global or heap data.
>
> In order to do the same for stack variables you would need much more
> fine-grained access, i.e. a way to write 65-bit "NaN" patterns.
>
> Getting more than a single bit like this is of course much harder...
>
> Terje
>

--

Andy (Super) Glew

unread,

Apr 15, 2013, 1:48:26 AM4/15/13

to

On 4/10/2013 5:19 AM, Paul A. Clayton wrote:
> On Apr 10, 3:00 am, Terje Mathisen <"terje.mathisen at tmsw.no">
> wrote:
>> Andy (Super) Glew wrote:
>>> I am not so sure that 2^N sized data packets need always endure. But I
>>> agree with Mitch that 2^N sized data packets are what we have now, and
>>> are unlikely to change any time soon.
>>
>> The only real chance for tag bits have already been used for a while, in
>> the form of specialized memory controllers that can steal a bit from ECC
>> RAM, right?
>
> If one cannot widen a memory channel (because of commodity
> memory chip width constraints), one could add a spare channel
> as the last (?) Alpha system did (though it used such as an
> optional mechanism, to provide "module-kill").

David Wang says this is increasngly unlikely, as DRAM widths grow past
8b to 16b, 32b, etc. 1024b on chip stacking. Adding an extra channel
for width oriented ECC also becomes a dead end.

So long as DRAMs provide 2^n channel widths...

With chip stacking, however, we may hope that 2^n + ecc_width, e.g. 1024
+ ecc_width, channels may be standardized. But, again, they will be
likely to have all ecc_width bits used for ECC.

>
> Andy Glew has suggested using "Poor Man's ECC" (as done by
> a recent-ish GPU?), where the metadata (ECC) is stored as
> ordinary memory in a fixed address. (I think at least one
> early directory system proposal used such reserved address
> space for directory storage.)
>
> Although I am not certain that Andy has mentioned it, from
> his other comments on using space freed by compression for
> storing metadata, I suspect that he has considered using
> some such space as a cache for such alternate-storage
> metadata (an adequately compressible chunk allowing a
> "cache hit"). (Metadata could be sorted by criticality or
> frequency of use so that the most important metadata could
> have the highest hit rates.)

Storing metadata in space freed up by compression in fixed size data
blocks is a performance optimization. You still need to be able to
handle the wost case of no compression - via outlying metadata.

Andy (Super) Glew

unread,

Apr 15, 2013, 1:55:05 AM4/15/13

to

C/C++ allow, require, pointers one entry past the end or before the
beginning of an array.

So you must have deref checks for at least these.

Which suggests that checking at arithmetic time would be redundant.

In other languages, checking at arithmetic time constrains optimization:

You cannot do

pointer + (int1 - int2)*scale

as

(pointer + int1*scale) - (int2*scale)

because the intermediate calculation might go out of bounds even though
the final address does not.

So at the very least you want a way to do address arithmetic and defer
the check.

>
> Ivan

Ivan Godard

unread,

Apr 15, 2013, 11:13:06 AM4/15/13

to

On 4/14/2013 10:55 PM, Andy (Super) Glew wrote:
> On 4/10/2013 2:26 PM, Ivan Godard wrote:

>> I'd favor at least a mode that does it at arithmetic time, to check the
>> C rules. Of course, users will complain that a fault means the hardware
>> is broken, not that their program is buggy :-)
>
>
> C/C++ allow, require, pointers one entry past the end or before the
> beginning of an array.

One entry after but not one entry before.

>
> So you must have deref checks for at least these.

Arithmetic check is different from and catches different errors than
deref check. The pointer may be within bounds but the deref (i.e.
pointer + size of reference) may be out of bounds.

> Which suggests that checking at arithmetic time would be redundant.

Merely a different error. I favor any and all error checking that
doesn't impact speed, and most checking that does.

> In other languages, checking at arithmetic time constrains optimization:
>
> You cannot do
>
> pointer + (int1 - int2)*scale
>
> as
>
> (pointer + int1*scale) - (int2*scale)
>
> because the intermediate calculation might go out of bounds even though
> the final address does not.

On what machine is this an optimization? Looks like worse code on all
machines I know of. Can you give a code example?

> So at the very least you want a way to do address arithmetic and defer
> the check.

???

Andy (Super) Glew

unread,

Apr 15, 2013, 11:44:56 AM4/15/13

to

int foo[100];

pointer = value known at t0, say &(foo[23]);
int1 = value known at t0
int2 = load(cache miss) = value known at t>1000+

... foo[int1+int2] ...;

pointer in eax
int1 in ebx
int2 in ecx

** Pre-optimization:

edx := lea( ebx + ecx )
edx := lea( eax + edx<<2 )

** Post-optimization

edx := lea( eax + ebx<<2 )
edx := lea( edx + ecx<<2 )

It's not much of a difference, only one cycle of latency on my
postulated 1000 cache miss. But I trust that you are not blind to
generalizations.

And, yes, I am assuming P6-class optimization of LEA. Various Intel
chips have optimized LEA to various degrees.

Ivan Godard

unread,

Apr 15, 2013, 1:41:56 PM4/15/13

to

So the gain is that you are able to hoist an op over a potential miss,
letting it be evaluated in parallel with the miss? Well, I suppose so in
your hypothetical, but how does the compiler know that it is int2 that
will miss and not int1, for which the rewrite is an anti-optimization?

And even if the compiler does know - the gain is miniscule, the
applicability is far-fetched, and the cost in compiler complexity and
entropy for the added no-check operations would be paid by everyone. So,
do you have a *real* example such that the optimization should be a
consideration in defining an new ISA? (which was the original subject,
about adding checking pointer arithmetic).

Ivan

Andy (Super) Glew

unread,

Apr 15, 2013, 2:03:55 PM4/15/13

to

No, Ivan: the cost in adding checked arithmetic operations rather than
checked derefs would be born by everyone.

You have to have unchecked arithmetic operations. Not everything is
pointer arithmetic - and pointer arithmetic uses the very same ALUs as
normal arithmetic.

You have to have checked derefs.

The only question is whether you need checked arithmetic operations as
well. They are certainly the additional cost.

Heck, man, the real question is whether there should be any checking at
all. Whether checking should not just be done by additional instructions.

I get utterly pissed off by "show me an example of X" or "prove Y"
questions, where it boils down to the winner being whoever asked the
question first. Step back a bit, and look at the entire set of
alternatives.

Andy (Super) Glew

unread,

Apr 15, 2013, 2:05:29 PM4/15/13

to

By the way, modern compilers have heuristics to do this, just as modern
compilers have heuristics to estimate branch direction.

Plus, if you are willing to go there, profiling feedback.

>
> And even if the compiler does know - the gain is miniscule, the
> applicability is far-fetched, and the cost in compiler complexity and
> entropy for the added no-check operations would be paid by everyone. So,
> do you have a *real* example such that the optimization should be a
> consideration in defining an new ISA? (which was the original subject,
> about adding checking pointer arithmetic).
>
> Ivan
>
>
>

Ivan Godard

unread,

Apr 15, 2013, 4:55:41 PM4/15/13

to

On 4/15/2013 11:03 AM, Andy (Super) Glew wrote:
> On 4/15/2013 10:41 AM, Ivan Godard wrote:
>> On 4/15/2013 8:44 AM, Andy (Super) Glew wrote:
>
>> And even if the compiler does know - the gain is miniscule, the
>> applicability is far-fetched, and the cost in compiler complexity and
>> entropy for the added no-check operations would be paid by everyone. So,
>> do you have a *real* example such that the optimization should be a
>> consideration in defining an new ISA? (which was the original subject,
>> about adding checking pointer arithmetic).
>>
>> Ivan
>
> No, Ivan: the cost in adding checked arithmetic operations rather than
> checked derefs would be born by everyone.
>
> You have to have unchecked arithmetic operations. Not everything is
> pointer arithmetic - and pointer arithmetic uses the very same ALUs as
> normal arithmetic.

Ah! You are assuming that pointer arithmetic is the same as integer
arithmetic, whereas I am assuming that it is not and so there must be
distinct pointer arithmetic operations, which (IMO) may as well be checked.

There are good reasons for pointers to not be the same as integers.

> You have to have checked derefs.
>
> The only question is whether you need checked arithmetic operations as
> well. They are certainly the additional cost.
>
> Heck, man, the real question is whether there should be any checking at
> all. Whether checking should not just be done by additional instructions.

I show my bias on that one: IMO everything that can be, should be
checked. 1) Simplifies the compiler 2) Cuts down the debugging 3) Deals
with 3rd party binaries that you can't compile a debug version of 4)
Doesn't get overlooked. 5) More humane. It's the same argument as for
using parity on DRAM :-)

> I get utterly pissed off by "show me an example of X" or "prove Y"
> questions, where it boils down to the winner being whoever asked the
> question first. Step back a bit, and look at the entire set of
> alternatives.

Sorry to offend :-( Your example optimization was meaningful whether
pointers are distinct from integers or not; I question it's
cost-benefit. I'd still like to see something with more benefit; not
trying to one-up you, trying to learn from your experience. If there
really are precluded optimizations then I would be more than willing to
rethink our pointer operations.

Ivan

MitchAlsup

unread,

Apr 16, 2013, 12:20:07 AM4/16/13

to

On Monday, April 15, 2013 12:41:56 PM UTC-5, Ivan Godard wrote:
> And even if the compiler does know - the gain is miniscule, the
> applicability is far-fetched, and the cost in compiler complexity and
> entropy for the added no-check operations would be paid by everyone.

Errr, not quite.....

Say one has ::

Push EAX
Push EBX
Push ECX

when one does the transformationa thte operation level rather than the
instruction level; you can get::

ST EAX,[SP-4]
ST EBX,[SP-8]
ST ECX,[SP-12]
SUB SP,SP,12

Making 3 serially dependent instructions into 4 parallel operations.
The serial dependency has been removed. Here, the gain is demonstrable.

Mitch

Ivan Godard

unread,

Apr 16, 2013, 12:52:41 AM4/16/13

to

Now you have me even more confused than Andy did!

I'm trying to find a case in which it is beneficial, before or after
optimization, for the result of a pointer manipulation operation to be
outside the language-permitted range of within the array or one after
it, but restored to a legal value before it is dereferenced. I grant
that Andy's case is one such, but question the benefit.

But I don't see how your example above applies at all - there's no
array, and the stack pointer, while it may be a "pointer" in machine
terms, surely isn't a pointer in linguistic terms even if you can add to
it. And even if you want to think of it as such, the "array" is the
whole potential stack, not just the occupied part, as witness that the
program can read or write from the "empty" part without a dereference
error. So you are not outside the array bounds.

Please explain, because I'm not getting it.

Andy (Super) Glew

unread,

Apr 16, 2013, 3:21:16 AM4/16/13

to

On 4/15/2013 1:55 PM, Ivan Godard wrote:
> On 4/15/2013 11:03 AM, Andy (Super) Glew wrote:
>> On 4/15/2013 10:41 AM, Ivan Godard wrote:
>>> On 4/15/2013 8:44 AM, Andy (Super) Glew wrote:
>>
>>> And even if the compiler does know - the gain is miniscule, the
>>> applicability is far-fetched, and the cost in compiler complexity and
>>> entropy for the added no-check operations would be paid by everyone. So,
>>> do you have a *real* example such that the optimization should be a
>>> consideration in defining an new ISA? (which was the original subject,
>>> about adding checking pointer arithmetic).
>>>
>>> Ivan
>>
>> No, Ivan: the cost in adding checked arithmetic operations rather than
>> checked derefs would be born by everyone.
>>
>> You have to have unchecked arithmetic operations. Not everything is
>> pointer arithmetic - and pointer arithmetic uses the very same ALUs as
>> normal arithmetic.
>
> Ah! You are assuming that pointer arithmetic is the same as integer
> arithmetic, whereas I am assuming that it is not and so there must be
> distinct pointer arithmetic operations, which (IMO) may as well be checked.
>
> There are good reasons for pointers to not be the same as integers.
>

>> Heck, man, the real question is whether there should be any checking at
>> all. Whether checking should not just be done by additional
>> instructions.
>
> I show my bias on that one: IMO everything that can be, should be
> checked. 1) Simplifies the compiler 2) Cuts down the debugging 3) Deals
> with 3rd party binaries that you can't compile a debug version of 4)
> Doesn't get overlooked. 5) More humane. It's the same argument as for
> using parity on DRAM :-)

Trouble is, you and I want very much the same things.

I want to check as much as possible.

I want to check memory dereferences to prevent buffer overflow and
dangling pointer security holes.

I also think that it would be good to check many integer values for
going outside their specified-by-the-user bounds. I may not have used
Mary, but I grew up with the Ada spec. I would hope that a restricted
range integer type would be checked if it went outside the range.

For that matter, I would like have checked ranges on floating point
values. And also check for integer overflow/underflow.

But: that's more work. And some languages do not allow
overflow/underflow checks, like C/C++ - or at least they require 2's
complement wrapping.

And it incontrovertibly requires more work for hardware to check values
and pointers on the fly than if software can check them, prove them
safe, statically. Or if software can eliminate unnecessary checks, even
if it can't do a full static analysis.

More work: more power, certainly.

I'm willing to propose, evangelize, making some checks in hardware
mandatory, unavoidable because they have proven to be big security
risks. And basically I do not want to include the compiler in the
security perimeter. Compilers have too many bugs.

But in the interests of saving power - and also in the interests of not
scaring away too many potential customers, who are less sure of the need
for checking - I'd like to allow both checked and unchecked arithmetic.

And back to LEA:

Yes, sometimes there are reasons to make pointers not be the same as
integers.

But on many machines, pointers, addresses, *ARE* the same as integers.

>
>> I get utterly pissed off by "show me an example of X" or "prove Y"
>> questions, where it boils down to the winner being whoever asked the
>> question first. Step back a bit, and look at the entire set of
>> alternatives.
>
> Sorry to offend :-( Your example optimization was meaningful whether
> pointers are distinct from integers or not; I question it's
> cost-benefit. I'd still like to see something with more benefit; not
> trying to one-up you, trying to learn from your experience. If there
> really are precluded optimizations then I would be more than willing to
> rethink our pointer operations.

>
> Ivan
>

Terje Mathisen

unread,

Apr 16, 2013, 5:21:53 AM4/16/13

to

Andy (Super) Glew wrote:
> On 4/15/2013 1:55 PM, Ivan Godard wrote:
>> I show my bias on that one: IMO everything that can be, should be
>> checked. 1) Simplifies the compiler 2) Cuts down the debugging 3) Deals
>> with 3rd party binaries that you can't compile a debug version of 4)
>> Doesn't get overlooked. 5) More humane. It's the same argument as for
>> using parity on DRAM :-)
>
> Trouble is, you and I want very much the same things.
>
> I want to check as much as possible.

And so do I. :-)

>
> I want to check memory dereferences to prevent buffer overflow and
> dangling pointer security holes.

As I stated previously, I believe deref is the only safe place for such
checks.

>
> I also think that it would be good to check many integer values for
> going outside their specified-by-the-user bounds. I may not have used
> Mary, but I grew up with the Ada spec. I would hope that a restricted
> range integer type would be checked if it went outside the range.
>
> For that matter, I would like have checked ranges on floating point
> values. And also check for integer overflow/underflow.
>
> But: that's more work. And some languages do not allow
> overflow/underflow checks, like C/C++ - or at least they require 2's
> complement wrapping.

I liked Pascal's limited ranges.

>
> And it incontrovertibly requires more work for hardware to check values
> and pointers on the fly than if software can check them, prove them
> safe, statically. Or if software can eliminate unnecessary checks, even
> if it can't do a full static analysis.
>
> More work: more power, certainly.
>
> I'm willing to propose, evangelize, making some checks in hardware
> mandatory, unavoidable because they have proven to be big security
> risks. And basically I do not want to include the compiler in the
> security perimeter. Compilers have too many bugs.
>
> But in the interests of saving power - and also in the interests of not
> scaring away too many potential customers, who are less sure of the need
> for checking - I'd like to allow both checked and unchecked arithmetic.
>
> And back to LEA:

LEA is way too useful for short multiplications to limit it to only
generate actual pointers.

>
> Yes, sometimes there are reasons to make pointers not be the same as
> integers.
>
> But on many machines, pointers, addresses, *ARE* the same as integers.

Doesn't C(++) basically require this, with intptr_t defined as an
integer type which can hold "any pointer type"?

Ivan Godard

unread,

Apr 16, 2013, 10:36:11 AM4/16/13

to

Last time I looked, C/C++ said overflow behavior was "implementation
defined". Wrap (modulo arithmetic) is permitted but not required.
Anybody - has this changed underneath me?

> And it incontrovertibly requires more work for hardware to check values
> and pointers on the fly than if software can check them, prove them
> safe, statically. Or if software can eliminate unnecessary checks, even
> if it can't do a full static analysis.
>
> More work: more power, certainly.

We support four forms of integer overflow behavior: modulo, saturating,
excepting, and double-length result. The (exposed) latencies vary by
family member; saturating is usually the slowest where there is a
difference. Which arithmetic is used is user-controlled by compiler
option, source pragma, or explicit intrinsic; we intend eventually to
integrate the choices into the type system so you can say
__saturating short sh = 3;
with a type lattice to determine what you get in mixed-type operations.

According to Art, not all that much work is required because it's built
into the ALU, which has to have overflow detection anyway, even without
behavior selection, to support the equivalent of condition codes. With a
factor of four or so power saving in the bank, I'm prepared to spend a
little of it on this. If the alternative is checking done in software
then I'm saving power overall because I'm not executing the check codes.

We do not have hardware overflow checks for non-machine ranges, because
we don't extend the type system into the hardware that much; we are not
an IAPX432. We have looked at adding range check operations, but they
are three-input in the dynamic case (where the range is run-time), and
we can do as well with two two-input checks explicit from the compiler.
The static case is more interesting, but is hard to fit into the
encoding system, so it is also two explicit checks, at least until I
figure out a good way to encode it. The alternative of treating ranges
as a hardware primitive type lets the check be two-input, but breaks for
the widest operand size: if quad is supported in hardware, how big is
range<quad> and how do you fit one through the pipe?

Pointers are not integers, and pointer arithmetic checks that the result
is still meaningful. You can klobber the mode bits by casting an integer
to pointer of course, and that will either give you a wild address
(caught at deref time) or will hose the garbage collector if you are in
a GC environment. There is no range check because these are pointers,
not caps, and we don't have any range to check against.

>
> I'm willing to propose, evangelize, making some checks in hardware
> mandatory, unavoidable because they have proven to be big security
> risks. And basically I do not want to include the compiler in the
> security perimeter. Compilers have too many bugs.

Amen.

> But in the interests of saving power - and also in the interests of not
> scaring away too many potential customers, who are less sure of the need
> for checking - I'd like to allow both checked and unchecked arithmetic.

As we do, although I wish I had cap pointers and would range check them
if I did.

> And back to LEA:
>
> Yes, sometimes there are reasons to make pointers not be the same as
> integers.
>
> But on many machines, pointers, addresses, *ARE* the same as integers.

And it's so nice not to be working on many machines :-)

Ivan

Ivan Godard

unread,

Apr 16, 2013, 10:41:40 AM4/16/13

to

On 4/16/2013 2:21 AM, Terje Mathisen wrote:
> Andy (Super) Glew wrote:

>> Yes, sometimes there are reasons to make pointers not be the same as
>> integers.
>>
>> But on many machines, pointers, addresses, *ARE* the same as integers.

But *SHOULD* they be :-)

> Doesn't C(++) basically require this, with intptr_t defined as an
> integer type which can hold "any pointer type"?

Required to be big enough, but assignment requires a cast.
char* p;
intptr_t ip = p;
will get you a compiler diagnostic.

Ivan

Piotr Wyderski

unread,

Apr 16, 2013, 12:04:56 PM4/16/13

to

Ivan Godard wrote:

> One entry after but not one entry before.

Beg to disagree: std::vector<T>::reverse_iterator etc.

Best regards, Piotr

Piotr Wyderski

unread,

Apr 16, 2013, 12:14:43 PM4/16/13

to

Terje Mathisen wrote:

> Doesn't C(++) basically require this, with intptr_t defined as an
> integer type which can hold "any pointer type"?

No, it doesn't. The standard is extremely vague here and does
not specify what a pointer is. The requirement you mentioned
above should be read that a conversion of a pointer into
intptr_t followed by a conversion back into the original
pointer type results in a value which points to the same
entity. Nothing more.

Of course, every sane implementation of both the hardware
and the compiler assumes that a pointer is a special kind
of integral type, but strictly speaking, it doesn't have
to be so.

Best regards, Piotr

Jean-Marc Bourguet

unread,

Apr 16, 2013, 12:23:59 PM4/16/13

to

Ivan Godard <iv...@ootbcomp.com> writes:

>> But: that's more work. And some languages do not allow
>> overflow/underflow checks, like C/C++ - or at least they require 2's
>> complement wrapping.
>
> Last time I looked, C/C++ said overflow behavior was "implementation
> defined". Wrap (modulo arithmetic) is permitted but not required. Anybody -
> has this changed underneath me?

signed overflow is undefined (and some optimizers use this to guess
possible range and propagate the information backwards -- so has usual
whatever the hardware provide, the compiler has the final word about
what is really available -- ; gcc has an option to make it use wrap
around instead of the default UB; its funny to see discussion between
the two camps "UB is UB and the optimizer may do whatever it want" and
"UB is there to take care of special architecture and should behave
sanely when you know the arch").

unsigned overflow is wrap around without discussion.

[...]

Is there any public info about you arch excepted your posts?

A+

--
Jean-Marc

Piotr Wyderski

unread,

Apr 16, 2013, 12:27:00 PM4/16/13

to

Ivan Godard wrote:

>> But: that's more work. And some languages do not allow
>> overflow/underflow checks, like C/C++ - or at least they require 2's
>> complement wrapping.
>
> Last time I looked, C/C++ said overflow behavior was "implementation
> defined". Wrap (modulo arithmetic) is permitted but not required.

That is what happens in case of signed integral types. Almost
nothing is specified. It is not even specified whether the
bitwise "and" operation of two valid ints is a valid int
-- famous "trap representations". The representation may
not even be binary.

In case of unsigned integrals things are much more clear.
Guaranteed 2^N modulo operations for some (unspecified) N
with no traps, overflows etc.

> Anybody - has this changed underneath me?

Not much. They've just added the (u)intXX_types.

Best regards, Piotr

Jean-Marc Bourguet

unread,

Apr 16, 2013, 12:34:09 PM4/16/13

to

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:

> Doesn't C(++) basically require this, with intptr_t defined as an integer
> type which can hold "any pointer type"?

The mapping functions for converting a pointer to an integer or an
integer to a pointer are intended to be consistent with the addressing
structure of the execution environment. (for C)

The mapping is intended to be unsurprising to those who know the
addressing structure of the underlying machine. (for C++)

But that doesn't mean the mapping should be such that
1+(uintptr_t)p == (uintptr_t)(p+1) when p is a char pointer. I'd expect
it to be different for a segmented architecture where the ISA make it
natural to put the segment part in the LSB.

A+

--
Jean-Marc

Ivan Godard

unread,

Apr 16, 2013, 12:41:54 PM4/16/13

to

On 4/16/2013 9:23 AM, Jean-Marc Bourguet wrote:

> Ivan Godard <iv...@ootbcomp.com> writes:

>
> Is there any public info about you arch excepted your posts?

Coming, slowly. We have submitted papers to Micro46 and Hot Chips. There
will be talks in the next couple of months at Asilomar Micro Workshop,
Vail Computer Elements Workshop, and Stanford EE380. The full doc (500
pages text + 450 slides) will go on our public website when the patent
filings are finally done - a backbreaking task.

Ivan

Ivan Godard

unread,

Apr 16, 2013, 12:47:45 PM4/16/13

to

On 4/16/2013 9:04 AM, Piotr Wyderski wrote:
> Ivan Godard wrote:
>
>> One entry after but not one entry before.
>
> Beg to disagree: std::vector<T>::reverse_iterator etc.

Reverse-iterator is not a pointer, and the pointer it contains (in a
naive literal implementation) is the same as for a forward iterator; the
offset of -1 is implied in the dereference.

Ivan

Piotr Wyderski

unread,

Apr 16, 2013, 12:54:59 PM4/16/13

to

Ivan Godard wrote:

> Reverse-iterator is not a pointer

It's a simple wrapper of a pointer.

> and the pointer it contains (in a naive literal implementation) is the same
> as for a forward iterator; the offset of -1 is implied in the
dereference.

I would have never implement it that way, as it causes a lot of
trouble during debugging: the debugger-visible dereference points
to a different object than the one you might naively assume it points to.

Best regards, Piotr

Andy (Super) Glew

unread,

Apr 17, 2013, 9:22:54 PM4/17/13

to

On 4/16/2013 2:21 AM, Terje Mathisen wrote:
> Andy (Super) Glew wrote:

>> I also think that it would be good to check many integer values for
>> going outside their specified-by-the-user bounds. I may not have used
>> Mary, but I grew up with the Ada spec. I would hope that a restricted
>> range integer type would be checked if it went outside the range.
>>
>> For that matter, I would like have checked ranges on floating point
>> values. And also check for integer overflow/underflow.

[when allowed. some languages, like C, do not allow overflow checking]

>>
> I liked Pascal's limited ranges.

Heck: I like the idea of having arbitrary predicates that check every
value written to particular variables.

Heck^2: I like the idea of having arbitrary predicates that can evaluate
data structure consistency at particular places. Like at transaction
boundaries,.

But I am not proposing hardware support for everything like this.

I only propose hardware support for small stuff. Like checking bonds at
dereferences.

>>
>> But on many machines, pointers, addresses, *ARE* the same as integers.
>
> Doesn't C(++) basically require this, with intptr_t defined as an
> integer type which can hold "any pointer type"?
>
> Terje
>
>

--

Andy (Super) Glew

unread,

Apr 17, 2013, 9:34:05 PM4/17/13

to

Apparently so.

This is from a working draft - I have not yet bought my own copy of the
standard.

4. Unsigned integers, declared unsigned, shall obey the results of
arithmetic modulo 2^n where n is the number of bits in the value
expression of that particular size of integer.

Here's a ore authoritative quote from the standard, second hand:
https://www.securecoding.cert.org/confluence/display/seccode/INT30-C.+Ensure+that+unsigned+integer+operations+do+not+wrap

Section 6.2.5, paragraph 9, of the C Standard [ISO/IEC 9899:2011] states:

A computation involving unsigned operands can never overflow, because a
result that cannot be represented by the resulting unsigned integer type
is reduced modulo the number that is one greater than the largest value
that can be represented by the resulting type.

Michael S

unread,

Apr 18, 2013, 11:41:30 AM4/18/13

to

On Apr 18, 4:22 am, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 4/16/2013 2:21 AM, Terje Mathisen wrote:> Andy (Super) Glew wrote:
> >> I also think that it would be good to check many integer values for
> >> going outside their specified-by-the-user bounds. I may not have used
> >> Mary, but I grew up with the Ada spec. I would hope that a restricted
> >> range integer type would be checked if it went outside the range.
>
> >> For that matter, I would like have checked ranges on floating point
> >> values. And also check for integer overflow/underflow.
>
> [when allowed. some languages, like C, do not allow overflow checking]
>
>
>
> > I liked Pascal's limited ranges.
>
> Heck: I like the idea of having arbitrary predicates that check every
> value written to particular variables.
>
> Heck^2: I like the idea of having arbitrary predicates that can evaluate
> data structure consistency at particular places. Like at transaction
> boundaries,.
>
> But I am not proposing hardware support for everything like this.
>
> I only propose hardware support for small stuff. Like checking bonds at
> dereferences.
>
>

I don't see a difference.
Bound check at dereferences, except for coarse-grain checks required
for process-level isolation (where OS does not trust the run time),
are exactly at the same level as checks, proposed by Ivan. I.e. 90% of
time compiler knows at compile time that the reference can't reach out
of bound. Or, may be, 98% if we a talking about array-intensive
workloads.
So, as you said above: "And it incontrovertibly requires more work for

hardware to check values and pointers on the fly than if software can
check them, prove them safe, statically."

Practically it means that either you trust compiler, and then there is
no need for hardware supported checks at all, or you don't trust
compiler, and then the more checks in the hardware the merrier.
Or should I spell it "the Maryier"?

timca...@aol.com

unread,

Apr 18, 2013, 1:10:57 PM4/18/13

to

>So, as you said above: "And it incontrovertibly requires more work for
>hardware to check values and pointers on the fly than if software can
>check them, prove them safe, statically."

It may be more "work" in the physics sense (more transistors using energy),
but it can be implemented so it doesn't impact performance. After all, a
memory load on a 286 wasn't any slower in protected mode vs. real mode. It
was the segment/selector load that got real slow. That could be fixed with
more segment registers, OOO, etc.

- Tim

Michael S

unread,

Apr 18, 2013, 2:10:43 PM4/18/13

to

On Apr 18, 8:10 pm, timcaff...@aol.com wrote:
> >So, as you said above: "And it incontrovertibly requires more work for
> >hardware to check values and pointers on the fly than if software can
> >check them, prove them safe, statically."
>
> It may be more "work" in the physics sense (more transistors using energy),

Which, by the way, is bad. More so, energy, less so transistors .

> but it can be implemented so it doesn't impact performance. After all, a
> memory load on a 286 wasn't any slower in protected mode vs. real mode. It
> was the segment/selector load that got real slow. That could be fixed with
> more segment registers, OOO, etc.
>
> - Tim

286 had only 4 segments and the default pairing between base register
and segment selector that was applied most of the time. When you
wanted to use non-default pairing (segment override prefix) I don't
think it was still as fast as default. And it was never (or *almost*
never, crazy things happen) used for sort of checks that Andy and Ivan
are discussing here.

In order to assure, with 286-style checks in hardware, that each and
every access falls within bounds of particular HLL object you'll
will
option 1: lots of selectors + very frequent use of an analogy of
segment override prefix
option 2: not so many selectors/overrides, but frequent reload of
segment descriptors

And still you heavily depend on compiler for doing everything right.

Methinks, both options are insane. Likely, Andy and Ivan have in mind
something that is not similar at all to 286.

May be, sparse allocation of objects in, say, 96-bit address space
coupled with parallel comparison against dozens of ranges.
Of course, that's architecture-level description. Real hardware will
do as much caching/prediction and as few actual comparisons as
possible. But even with very sophisticated hardware, I don't think it
can be made "free" from latency/throughput perspective.

Ivan Godard

unread,

Apr 18, 2013, 3:26:57 PM4/18/13

to

:-)

There are two kinds of checking, which often get muddled together in
discussions like these.

In one kind of checking, the check enforces system security and
survivability. Page-resident checks, privileged operations, that kind of
thing. These are mandatory whenever the hardware is used by more than
one entiry. You don't need them at all if there is only one user, for
example in embedded process control - your friendly local thermostat for
example.

The other kind of checking is for user sanity and debugging, where an
unchecked failure won't harm anybody else but the failing app, but the
check sure helps the guy who has to maintain the code.

Very early on we decided that the Z80 market was safe from us and that
we would presume mutually distrustful sharing of the CPU. Consequently
we have all the mandatory checks of the first kind - you can screw
yourself on the Mill, but you can't screw your buddy without his help.

In addition, as a philosophical principle and as a long-run sales
encourager we attempt to sanity-check everything we can, as early as we
can. Thus we have a lot of the second sort of checks that are uncommon
on other architectures.

In the present discussion, dereference checking that the target is
within the bounds of the thread's memory space is category one and hence
mandatory, whatever it costs. Checking at address-arithemetic time or
dereference time that the target is within a particular array (as
opposed to within the thread address space) is category two, and will be
included in the architecture if it can be done at low enough performance
or power cost.

I don't know how to do the latter check without caps. I'm sure that the
market will ignore an optional caps system, making having one a
pointless waste of effort. I'm not sure whether the market will tolerate
a mandatory caps system. Or rather, I'm not sure whether a mandatory
caps system can be made cheap enough (in power, cycles, and developer
grok cost) for the market to tolerate it.

As for software checking: I would use it only for category two, and even
there I would prefer that the check be automatic and in hardware rather
than optional and in software. The bozos (pardon the expression) who
shut off software checking because "it's too expensive" or "the code is
obviously correct already" are precisely the ones that need obligatory
hardware checking :-)

You may have heard the CDC motto: "Wrong answers faster!".

Ivan

Chris Gray

unread,

Apr 19, 2013, 12:59:24 PM4/19/13

to

Ivan Godard <iv...@ootbcomp.com> writes:

> As for software checking: I would use it only for category two, and
> even there I would prefer that the check be automatic and in hardware
> rather than optional and in software. The bozos (pardon the
> expression) who shut off software checking because "it's too
> expensive" or "the code is obviously correct already" are precisely
> the ones that need obligatory hardware checking :-)

For my Zed language, all I need are overflow/underflow checks on signed
and unsigned integer arithmetic, "nil" checks for pointers, and a range
check. I do not need restart on any of those errors. In my current bytecode
I have rangechecks against 16, 32, and 64 bit constant fields.

Get me a Mill desktop with those things, running some kind of Linux, and
I'll buy one, even if its twice the cost of an X86-64 one! Hopefully its
not too hard to generate reasonable machine code for?

My bytecode also has other checking instructions, but they are too run-
time specific to include in general purpose hardware. E.g.:

bc_pchk, /* Proc CHecK. The single argument to this instruction
is the ref table index of a type. The type will be
a proc type. The top-of-stack value is of type
Proc/Proc_t. The type of that proc is compared to
the argument type. If they match, all is well,
and the value on the stack is unchanged. If they
do not match, top-of-stack value is replaced with
a nil value. */

:-)

--
Chris Gray

Ivan Godard

unread,

Apr 19, 2013, 2:32:34 PM4/19/13

to

The Mill does not have a type system in hardware, and leaves all issues
of type to software. The reason for that choice is that type systems
vary wildly and are mutually incompatible, and we feel we have no
business trying to adjudicate what should happen in e.g. cross language
calls.

Aside from type constraints, there are many non-type "this shouldn't
happen" things that a hardware can detect though few do. Some for us:

1) floating point fault. The hardware will (optionally) take a fault on
the occurrence of any of a specified subset set of the floating point
exceptions, rather than just setting the flags for later.

2) invalid operation fault. All instruction encodings either execute a
defined and meaningful operation or fault; there are no undefined or
silently reserved bit patterns.

3) invalid address fault. This will ultimately lead to the equivalent of
a Unix segv if not handled by the application. Includes nil pointers as
well as protection errors.

4) invalid operand fault. An operation received an operand that is
explicitly invalid (as opposed to undefined); in general uninitialized
non-memory state is invalid, while uninitialized memory state is zero.

5) resource constraint fault. The encoded operation requires use of a
resource that is currently in use; reflects a compiler bug.

6) stack underflow fault. A return operation from the lowest frame

7) unhandled fault fault. An attempt to return from a fault handler

8) semantic fault. Operation was not defined for its actual arguments,
e.g. integer overflow.

9) application fault. Thrown only by an explicit operation, for use by
software.

Paul A. Clayton

unread,

Apr 19, 2013, 6:17:04 PM4/19/13

to

On Apr 19, 2:32 pm, Ivan Godard <i...@ootbcomp.com> wrote:
[snip]

> 2) invalid operation fault. All instruction encodings either execute a
> defined and meaningful operation or fault; there are no undefined or
> silently reserved bit patterns.

So you decided not to support any backward compatible
hint operations. (Providing for future addition of
hint operations is one of Andy Glew's suggestions for
ISA design.)

Given that the Mill is a _family_ of ISAs, longer-term
binary compatibility might not be considered that
important. (Dropping the need for binary compatibility
would seem to help with VLIW. It would also allow for
a modest increase in code density [no need to leave
room for future extension]. However, without binary
compatibility managing software becomes more complex.)

Is the Mill only intended to be source-level
compatible? Or is there some other plan for
compatibility?

Paul A. Clayton

unread,

Apr 19, 2013, 7:02:06 PM4/19/13

to

On Apr 15, 1:48 am, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
> On 4/10/2013 5:19 AM, Paul A. Clayton wrote:
[snip]
>> If one cannot widen a memory channel (because of commodity
>> memory chip width constraints), one could add a spare channel
>> as the last (?) Alpha system did (though it used such as an
>> optional mechanism, to provide "module-kill").
>
> David Wang says this is increasngly unlikely, as DRAM widths grow past
> 8b to 16b, 32b, etc. 1024b on chip stacking. Adding an extra channel
> for width oriented ECC also becomes a dead end.

Death to commodity memory constraints!! :-)

On a serious note, it does seem that standard DRAM
interfaces noticeably hold back performance,
energy-efficiency, and the addition of features.
Incremental evolution (add banks, increase burst
length, etc.) and extreme price-per-bit focus are
not innovation-friendly.

16-bit chips (even with burst chop length of 4)
would seem to make some more aggressive RAS features
substantially more difficult to implement. (On the
other hand, this would allow extending a 72-bit ECC
module to 80 bits "for free". :-)

> So long as DRAMs provide 2^n channel widths...

It is not just the 2^n channel width that is the
problem; the fact that each channel optimally
provides one cache line of data also constrains the
use of multiple channels.

The use of an additional channel for metadata would
only differ from "Poor Man's ECC" by using a
hardwired location for the metadata in a separate
channel (with the possibility to optimize that
channel's width, burst length, etc. for its use to
store metadata).

> With chip stacking, however, we may hope that 2^n + ecc_width, e.g. 1024
> + ecc_width, channels may be standardized. But, again, they will be
> likely to have all ecc_width bits used for ECC.

If enough users of memory considered providing extra
bits for metadata sufficiently important, I suspect
providing such would not be too difficult or
expensive. I suspect that a large part of the
problem is that not enough users care enough.

I do wonder how Itanium systems implement their
coherence directory. While such systems have the
advantages of cost being a less significant constraint
and of buyers being willing to accept proprietary
memory modules, the problem still seems interesting.
(Recent Itanium processors have on-chip memory
controllers and directory caches, so there is
presumably some "standardization" of the external
storage of this metadata.)

[snip]
>> Although I am not certain that Andy has mentioned it, from
>> his other comments on using space freed by compression for
>> storing metadata, I suspect that he has considered using
>> some such space as a cache for such alternate-storage
>> metadata (an adequately compressible chunk allowing a
>> "cache hit"). (Metadata could be sorted by criticality or
>> frequency of use so that the most important metadata could
>> have the highest hit rates.)
>
> Storing metadata in space freed up by compression in fixed size data
> blocks is a performance optimization. You still need to be able to
> handle the wost case of no compression - via outlying metadata.

Yes, thus "cache". Of course, if the metadata is
hint-oriented, then its loss is less critical. As
has been noted, some security features can be
provided with hint-like semantics. (Security metadata
that relied on compressibility would be more vulnerable
to attacks. While lack of compressibility might be
used as a clue that an attack is being attempted--and
lack of security metadata could be recognized as less
secure--, such explicitly vulnerable security metadata
makes me a bit uncomfortable, even though only presented
as a hint [in part because the degree of added security
might be greatly overestimated by the user].)

Ivan Godard

unread,

Apr 19, 2013, 7:34:15 PM4/19/13

to

On 4/19/2013 3:17 PM, Paul A. Clayton wrote:
> On Apr 19, 2:32 pm, Ivan Godard <i...@ootbcomp.com> wrote:
> [snip]
>> 2) invalid operation fault. All instruction encodings either execute a
>> defined and meaningful operation or fault; there are no undefined or
>> silently reserved bit patterns.
>
> So you decided not to support any backward compatible
> hint operations. (Providing for future addition of
> hint operations is one of Andy Glew's suggestions for
> ISA design.)

We have a different compatibility mechanism, that generalizes to such
additions. Consequently we can both have future expansion and compatibility.

> Given that the Mill is a _family_ of ISAs, longer-term
> binary compatibility might not be considered that
> important. (Dropping the need for binary compatibility
> would seem to help with VLIW. It would also allow for
> a modest increase in code density [no need to leave
> room for future extension]. However, without binary
> compatibility managing software becomes more complex.)
>
> Is the Mill only intended to be source-level
> compatible? Or is there some other plan for
> compatibility?
>

Any Mill .exe will execute on any Mill family member. It may not execute
correctly if it uses MMIO to access things that don't exist on the
platform executing it, and it cannot hard wire the MMIO addresses even
of things that do exist on both platforms.

Of course, the same it true of all chips. However, we can say that code
above the HAL layer, whether application or OS, is portable.

We believe that the compatibility mechanism is robust enough that future
platform extension, including hint operations, can be integrated, but of
course we cannot be certain of that. Breaking compatibility would be a
very major decision, and I can't see it being taken for something as
trifling as say a 20% performance gain.

Joe keane

unread,

Apr 19, 2013, 7:43:21 PM4/19/13

to

In article <kkph8g$ajp$1...@dont-email.me>,

Ivan Godard <iv...@ootbcomp.com> wrote:
>I'm sure that the market will ignore an optional caps system, making
>having one a pointless waste of effort.

http://demotivators.despair.com/demotivational/mistakesdemotivator.jpg

Andy (Super) Glew

unread,

Apr 19, 2013, 8:03:29 PM4/19/13

to

I resemble that remark!

Andy (Super) Glew

unread,

Apr 19, 2013, 8:59:19 PM4/19/13

to

I was about to say that I don't think that is too bad, as you could
always use an extra channel for the metadata - or, more likely, rotate
the metadata through several channels, as in a RAID configuration.

Which - DRAM RAID - by the way, is already available. So in some ways
this is already being done.

But, I hold myself back from saying that this isn't too bad - because it
rounds up the cache line traffic.

A workload of random word sized accesses, my favorite stalking horse,
would not only be rounded up from a word to a cache line - it would be
rounded up to two cache lines.

Writing would be worse, however: a workload of single word writes would
be rounded up from 1 word write to 2 cacheline reads & 2 cacheline
writes. Since the net effect of RAID techniques is to convert partial
writes into 2-line read-modify-writes.

However, so would any of the various flavors of Poor Man's ECC.

(Well, except the lengthwise ... hmm... this suggests that keeping the
"ECC block", which is probably the cache coherency block size, equal to
half the burst bus transfer size, might tip the balance in our favor,
since (1) for sequential access patterns it would be the same, but (2)
for random word sized access patterns half the read accesses would only
require 1 burst transfer rather than 2, and ditto for writes, half 1rmw
rather than two.

=> Average cost for random is then 1.5r for reads read and 1.5rmw for
writes. And 1.5rmw for random GUPS like patterns.

=> Whereas for dense patterns the overhead is just the size of the metadata.

With compression making it often unnecessary to fetch an extra metadata.

> The use of an additional channel for metadata would
> only differ from "Poor Man's ECC" by using a
> hardwired location for the metadata in a separate
> channel (with the possibility to optimize that
> channel's width, burst length, etc. for its use to
> store metadata).

When David opened my eyes to the barriers to getting ubiquitous width
oriented ECC (e.g. 64+8),

* I first considered trying to persuade JEDEC to allow burst lengths of
5, or 9 - keeping the x8 or x16 interface, but allowing vendors who
cared to provide "ECC DRAM chips" that kept the same pinout, but which
optionally used 1 cycle longer bursts.

When that was shot down, for reasons including the fact that DRAM
vendors already have spare DRAM bitcell arrays on chip - which they use
via things like laser editing to improve yield (I must admit that I
never liked this argument, countering that once the chip was sold there
would be no more laser editing - so why not use the extra but unused
arrays for ECC, if there were enough of them? Although I do not like
that this would move towards ECC continuing to be a premium product.
Still, premium )

* Then I tried address scaling, the classic Poor Man's ECC. Keeping the
metadata adjacent.

* Other proposals have involved having the metadata non-adjacent, at a
fixed address.

In-cache-line-at-fixed-address compression can avoid the need to fetch
many lines of metadata. So non-adjacent may be the overall winner,
since for sequential patterns that compress there is no bandwidth loss
[*], whereas with adjacent metadata

*: no bandwidth loss compared to uncompressed. I don't think that it is
reasonable to compare to compressed, since compressed main memory is
awkward. IBM's approach, of making compressed main memory really an
extra level in the hierarchy, is probably the way to go.

>
> If enough users of memory considered providing extra
> bits for metadata sufficiently important, I suspect
> providing such would not be too difficult or
> expensive. I suspect that a large part of the
> problem is that not enough users care enough.

Yep.

Many of these proposals are transient, evolutionary:

If enough people care about ECC, then DRAM systems will support ECC
properly - and then kluges like Poor Man's ECC - or the variety that
Nvidia is apparently deploying - may not be necessary.

But the kluges seem to help the transition period. Demonstrating that
thee is a market.

It's not enough to have a clear vision of what the final architecture
should be. One also needs to have an idea of how to get there from here
- via steps, each of which can be sold.

> Yes, thus "cache". Of course, if the metadata is
> hint-oriented, then its loss is less critical. As
> has been noted, some security features can be
> provided with hint-like semantics. (Security metadata
> that relied on compressibility would be more vulnerable
> to attacks. While lack of compressibility might be
> used as a clue that an attack is being attempted--and
> lack of security metadata could be recognized as less
> secure--, such explicitly vulnerable security metadata
> makes me a bit uncomfortable, even though only presented
> as a hint [in part because the degree of added security
> might be greatly overestimated by the user].)

Yes, this makes me uncomfortable.

I say "security is a hint" only if I want to run a security enabled
binary on a system that has no security support at all.

On a system that has security support, I want to get the same answer all
the time. If there are data patterns that can lead to breakins, the
hackers will find them.

I am comfortable with putting performance information - e.g. branch
prediction info, typical physical memory addresses for prefetch AT THE
MEMORY CONTROLLER - in such transient space, that might be eliminated by
recompression with different data,

Stephen Fuld

unread,

Apr 20, 2013, 2:18:03 AM4/20/13

to

On 4/14/2013 10:48 PM, Andy (Super) Glew wrote:
> On 4/10/2013 5:19 AM, Paul A. Clayton wrote:

>> On Apr 10, 3:00 am, Terje Mathisen <"terje.mathisen at tmsw.no">

>> wrote:
>>> Andy (Super) Glew wrote:

>>>> I am not so sure that 2^N sized data packets need always endure. But I
>>>> agree with Mitch that 2^N sized data packets are what we have now, and
>>>> are unlikely to change any time soon.
>>>
>>> The only real chance for tag bits have already been used for a while, in
>>> the form of specialized memory controllers that can steal a bit from ECC
>>> RAM, right?

>>
>> If one cannot widen a memory channel (because of commodity
>> memory chip width constraints), one could add a spare channel
>> as the last (?) Alpha system did (though it used such as an
>> optional mechanism, to provide "module-kill").
>
> David Wang says this is increasngly unlikely, as DRAM widths grow past
> 8b to 16b, 32b, etc. 1024b on chip stacking. Adding an extra channel
> for width oriented ECC also becomes a dead end.
>

> So long as DRAMs provide 2^n channel widths...

IBM's AS/400 (now called system P?) famously uses a 65 bit word, with
one bit for a tag. Since they control the CPU, that part is easy. But
do they use commodity DRAM or do they use some special DRAM chip that
only IBM produces?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Anne & Lynn Wheeler

unread,

Apr 20, 2013, 11:22:20 AM4/20/13

to

Stephen Fuld <SF...@alumni.cmu.edu.invalid> writes:
> IBM's AS/400 (now called system P?) famously uses a 65 bit word, with
> one bit for a tag. Since they control the CPU, that part is easy.
> But do they use commodity DRAM or do they use some special DRAM chip
> that only IBM produces?

i remember long arguments between the 801/risc group and the as/400
group over the addition of 65th bit during work on first 64bit 801/risc

note circa 1980 there was extensive efforts to converge large variety of
internal microprocessors to 801/risc ... including the processors used
in low&mid-range 370, the follow-on to s/38 (aka as/400, originally
referred to as Fort Knox)), bunch of processors used in controllers,
etc. the low&mid-range 370 and as/400 were to use 801/risc Iliad
chip. For various reasons all of these efforts floundared ... and
efforts returned to conventional cisc chip. as a result, there were some
number of 801/risc engineers leaving and going to other vendors to work
on risc. misc. past posts mentioning 801/risc
http://www.garlic.com/~lynn/subtopic.html#801

original as/400 introduced 1988
http://www-03.ibm.com/ibm/history/exhibits/rochester/rochester_4010.html

finally does move to 801/risc in 1995
http://en.wikipedia.org/wiki/IBM_System_i

with 64bit power/pc (as opposed to rios/power)
http://en.wikipedia.org/wiki/IBM_POWER_microprocessors

Iliad 801/risc was going to be chip design used for 370s & as/400. ROMP
was chip that was going to be used displaywriter followon. when that
was canceled, the group looked around for another market and decided on
unix workstation. they hired the group that had done AT&T unix port for
pc/ix to do one for ROMP ... becomes PC/RT. Followon is RIOS/POWER chip
used in rs/6000.

All the 801/risc have been non-cache coherent ... precluding most
multiprocessor design. Somerset/AIM (apple, ibm, motorola) is formed to
do single-chip 801/risc ... which becomes power/pc (6xx). Gets cache
coherent for multiprocessor support ... somewhat influence from motorola
88000 group (the guy we directly report to, leaves to become head of
Somerset).

precursor to as/400 was ibm s/38
http://en.wikipedia.org/wiki/IBM_System/38

above references that after the failure of future system effort, some
number of people retreat to rochester to do s/38. big part of failure of
future system was its horrible performance and enormous complexity
(contributed to horrible performance as well as questionable whether it
would ever be figured out). s/38 was enormous simplification as well as
many of the performance problems were critical in s/38 market (snide
reference in above).

For instance, an analysis was if FS machine was made of fastest
technology then available (370/195), applications would have throughput
of 370/145 (about 30 times factor slowdown). misc. past posts mentioning
FS
http://www.garlic.com/~lynn/submain.html#futuresys

s/38 one-level-store (characteristic of fs) was also scatter allocation
across disk pool. For system of 2-3 disks required system shutdown and
all disks backed up as single entity. Any single disk failure required
the inverse. This was untenable in typical large 370 operation with 300+
disks. The original RAID patent was at San Jose disk plant ... and
initial deployed in product for S/38 ... in part because the impact of
single disk failure was so immense.

a couple old email mentioning iliad:
http://www.garlic.com/~lynn/2006u.html#email810422
http://www.garlic.com/~lynn/2006u.html#email830420

total random trivia, topic drift.

I had a project called HSDT ... misc. posts
http://www.garlic.com/~lynn/subnetwork.html#hsdt

and was working with several of the institutions that would eventually
become NSFNET backbone ... operational precursor to modern internet
http://www.technologyreview.com/featuredstory/401444/grid-computing/

We were suppose to get $20M to hook all the NSF supercomputers together,
congress then cut the budget and a couple other things happened
... finally NSF released RFP ... but internal politics prevented us from
bidding. Director of NSF tried to help by writting the company a letter,
copying the CEO ... but that just made the internal politics worse (as
did references to what HSDT already had running was at least five years
ahead of all bid submissions). misc. old email mentioning NSFNET
http://www.garlic.com/~lynn/lhwemail.html#nsfnet

at the same time, I was working on effort to connect a large number of
801/risc chips together (not RP3) ... and meetings were being scheduled
with presentations I was suppose to be doing to director of NSF ... old
email
http://www.garlic.com/~lynn/2011b.html#email850314
http://www.garlic.com/~lynn/2007d.html#email850315
http://www.garlic.com/~lynn/2011b.html#email870315

a little later this morphs into cluster scaleup for the ha/cmp
product we were doing ... some past posts
http://www.garlic.com/~lynn/subtopic.html#hacmp

post about early jan1992 meeting in ellison's conference
room on cluster scaleup
http://www.garlic.com/~lynn/95.html#13

some old cluster scaleup email
http://www.garlic.com/~lynn/lhwemail.html#medusa

I've periodically mentioned that within hrs of the last email in above
(end Jan1992) cluster scaleup was transferred and we were told we
couldn't work on anything with more than four processors. within weeks
it was announced as supercomputer for scientific and numeric intensive
only ... press item from 17Feb1992
http://www.garlic.com/~lynn/2001n.html#6000clusters1

part of the issue was ha/cmp cluster scaleup including working with
national labs on scientific and numerical intensive ... as well as large
filesystem ... but also working with rdbms vendors on commercial. The
commercial scaleup was perceived as having large impact on the company's
mainframe market place. It turns out that some of the same executives
involved in blocking bidding on NSFNET backbone RFP ... also were
involved in transfer of cluster scaleup and announcement as
supercomputer (for scientific and numeric intensive only) ... another
press item later in spring, 11may1992
http://www.garlic.com/~lynn/2001n.html#6000clusters2

where they mentioned that clusters had totally caught them by *surprise*

--
virtualization experience starting Jan1968, online at home since Mar1970

Michael S

unread,

Apr 20, 2013, 3:34:22 PM4/20/13

to

Former IBM AS/400, former i-series, former system i, now appears to be
called IBM i.
For 15 or so years they use the same CPUs as system p (Power/AIX and
POWER/Linux), and for 5 or so years the same physical boxes.
So, although IBM, as a company, controls AS/400 CPUs, system i
division itself does not. They the have live with what provided to
them by their more important system p co-workers.

Anne & Lynn Wheeler

unread,

Apr 20, 2013, 5:04:35 PM4/20/13

to

Anne & Lynn Wheeler <ly...@garlic.com> writes:
> post about early jan1992 meeting in ellison's conference
> room on cluster scaleup
> http://www.garlic.com/~lynn/95.html#13
>
> some old cluster scaleup email
> http://www.garlic.com/~lynn/lhwemail.html#medusa

re:
http://www.garlic.com/~lynn/2013f.html#29 Delay between idea and implementation

almost 20yrs later ... ha/cmp rdbms cluster scaleup with 100+ systems
http://www.garlic.com/~lynn/2009p.html#43 From The Annals of Release No Software Before Its Time

rebranded "purescale". finally there is non-mainframe DB2 that scales
... at the time in early 90s, the company just had mainframe relational.
recent (long-winded) post mentioning some of the work with non-IBM RDBMS
vendors in the early 90s for scaling (before being told we couldn't work
with more than four systems)
http://www.garlic.com/~lynn/2013f.html#19 Where Does the Cloud Cover the Mainframe?

other topic drift, posts mentioning original relational/sql
implementation
http://www.garlic.com/~lynn/submain.html#systemr

Stephen Fuld

unread,

Apr 20, 2013, 5:48:41 PM4/20/13

to

I thought they used a variant of the Power PC architecture that had a
special mode capability to enable the 65th bit. Is that no longer true?
If it is still true, how is the DRAM arranged to support that extra
bit? If not, did they just rewrite their intermediate level software
not to require tag bits?

Michael S

unread,

Apr 20, 2013, 6:26:17 PM4/20/13

to

On Apr 21, 12:48 am, Stephen Fuld <SF...@alumni.cmu.edu.invalid>

I don't know all details, but I never heard about 65th bit in POWER
CPUs. It certainly does not appear in public ISA documents.

> If it is still true, how is the DRAM arranged to support that extra
> bit? If not, did they just rewrite their intermediate level software
> not to require tag bits?
>

Most likely.

Anne & Lynn Wheeler

unread,

Apr 20, 2013, 7:14:41 PM4/20/13

to

Michael S <already...@yahoo.com> writes:
> Former IBM AS/400, former i-series, former system i, now appears to be
> called IBM i.
> For 15 or so years they use the same CPUs as system p (Power/AIX and
> POWER/Linux), and for 5 or so years the same physical boxes.
> So, although IBM, as a company, controls AS/400 CPUs, system i
> division itself does not. They the have live with what provided to
> them by their more important system p co-workers.

re:
http://www.garlic.com/~lynn/2013f.html#29 Delay between idea and implementation

http://www.garlic.com/~lynn/2013f.html#32 Delay between idea and implementation

rs64, cobra released in 1995 for as/400
http://en.wikipedia.org/wiki/IBM_RS64

power/pc
http://en.wikipedia.org/wiki/PowerPC_600

as/400 was converg s/38 and s/36 ... about 2/3rd way down web page, list
various power chips starting 1995
http://en.wikipedia.org/wiki/AS/400

from above:

The IBM System i, then known as the AS/400, was the continuation of the
System/38 database machine architecture (announced by IBM in October
1978 and delivered in August 1979). The AS/400 removed capability-based
addressing.[4] The AS/400 added source compatibility with the System/36
combining the two primary computers manufactured by the IBM Rochester
plant. The System/36 was IBM's most successful mini-computer but the
architecture had reached its limit. The first AS/400 systems (known by
the development code names Silverlake and Olympic) were delivered in
1988 under the tag line "Best of Both Worlds" and the product line has
been refreshed continually since then. Guy Dehond from Inventive
Designers was one of the beta-testers of Silverlake.

... snip ...

as previously mentioned as/400 was originally suppose to use 801/risc
Iliad chip ... but when that effort floundered (as well as other
801/risc activities), Rochester quickly came up with CISC chip.

old post about capability-based ... FS, i432, S/38 ... and GNOSIS (from
tymshare), spun off as KeyKOS (when M/D bought tymshare) ... I was
brought in to audit GNOSIS as part of the spinoff
http://www.garlic.com/~lynn/2012k.html#57

GNOSIS/KeyKOS somewhat spawns eros, capros, coyotos
http://en.wikipedia.org/wiki/GNOSIS
http://en.wikipedia.org/wiki/KeyKOS
http://en.wikipedia.org/wiki/EROS_%28microkernel%29
http://en.wikipedia.org/wiki/CapROS
http://en.wikipedia.org/wiki/Coyotos