== Assumptions ==
Say you want to speed up a single thread of execution.
Say you have N processors to "spread" the execution across.
Or say that you have N "execution clusters" (e.g. see [[Multicluster Microarchitecture]], or [[MCMT]]).
For the moment, let's just forget issues and techniques such as [[SpMT]] or [[Control Independence]] or [[Fine Grained
Deinterleaving Multithreading of Single Threaded Execution]]. Forget [[Instruction Chain Scheduling for Multicluster]],
and the like.
== Cross-cluster dependencies, the killer issue ==
Even if you can figure out how to extract parallelism, such "single thread spreading" microarchitectures often suffer a
common weakness: cross-thread, cross-cluster, or cross-processor dependencies. If you can find completely independent
threads, or chains, or whatever, great. But if there are inter-chain or inter-thread dependencies, and the cost of
communicating such dependencies is high, then you have a weakness.
Without loss of generality, let us assume that we are talking about spreading a single thread of execution across
multiple clusters. The same applies to spreading it across multiple processors or processor cores.
Thought experiment: how can we minimize dependencies when spreading a single thread across multiple clusters, without
any special knowledge of the code? E.g. assume that the code we are executing has uniformly random
inter-instruction dependencies.
Given that assumption, it follows that the way to spread the code across clusters is to use as few "chunks" or
"batches" of code as possible. E.g. if each cluster is capable of holding 100 instructions, then start fetching
instructions. Feed the first 100 to the first cluster (plus maybe a few more, since some of them will have been
executed while the later instructions are being fetched). Feed the next batch of 100 to the second cluster. And so on,
if you have more than 2 clusters. When you use up all of the clusters, wrap around to the beginning.
This minimizes the cross cluster dependencies for arbitrary code with uniformly random dependencies. It delivers
contiguous batches of instructions to each cluster.
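Here is a minimal sketch of this dumb contiguous-batching policy. It is illustrative only: the 100-instruction window and 2 clusters are just the numbers used above, and the function name is mine.

<code>
# A minimal sketch of the contiguous batching policy described above.
# The window size (100) and cluster count (2) are illustrative defaults.

def assign_batches(n_instructions, n_clusters=2, batch_size=100):
    """Assign instruction i to (cluster, batch): contiguous batches,
    handed out to clusters round-robin, wrapping around at the end."""
    assignment = []
    for i in range(n_instructions):
        batch = i // batch_size            # which contiguous chunk i falls in
        cluster = batch % n_clusters       # round-robin / wrap-around
        assignment.append((cluster, batch))
    return assignment

alloc = assign_batches(250)
print(alloc[0], alloc[100], alloc[200])    # (0, 0) (1, 1) (0, 2)
</code>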
Sure, "smart" algorithms might be able to allocate non-contiguous groups of instructions to each cluster, for particular
dependence graphs. But such smart algorithms either fall back to the batches described above, or are worse for some codes.
Note: I am not saying that smart clustering algorithms are impossible. I am just saying that this stupid algorithm for
allocating contiguous batches of instructions to separate threads is not so bad, for the most challenging case. And it
is really simple. I suggest it as a foundation, to which embellishments can be added.
Since there are a fixed number of registers, there can only ever be a fixed number of register carried data
dependencies between batches, across cluster boundaries. Worst case, you could insert microinstructions to copy all
register values from one batch to the next batch on a different cluster. That's cheap for instruction sets with a small
number of registers; more registers make it more expensive.
You can do better if you can identify the truly live registers: either a priori (e.g. by an [[instruction to kill or
indicate liveness of registers]]), or a posteriori (pull registers from the earlier batch/cluster to the later one only
when the register is actually used).
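A minimal sketch of the two register hand-off strategies just described. The register list and function names are illustrative assumptions, not a particular ISA or design.

<code>
# Illustrative only: a small architectural register set, and two hand-off
# strategies for carrying register values across a batch/cluster boundary.

ARCH_REGS = ['eax', 'ebx', 'ecx', 'edx', 'esi', 'edi']

def push_all(old_cluster_regs):
    """Worst case: copy every architectural register at the batch boundary.
    Cost is bounded by the (small) register count."""
    return {r: old_cluster_regs[r] for r in ARCH_REGS}

def pull_on_use(reg, local_regs, old_cluster_regs):
    """A posteriori: fetch a register from the earlier batch's cluster only
    when the newer batch actually reads it and has not written it locally."""
    if reg in local_regs:                 # produced within this batch: no traffic
        return local_regs[reg]
    return old_cluster_regs[reg]          # cross-cluster pull, paid only when used
</code>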
Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
detection windows. Only in the best of times can you elide many such memory dependencies.
So here's what we have got:
* 2 or more "clusters", each with e.g. a scheduler, execution units, and maybe a cache; i.e. each cluster has its own instruction window
* allocate instructions to clusters in sequential, contiguous, "batches"
* a mechanism to carry registers between clusters/batches, e.g. explicit push, or pull, or ...
* a memory dependence mechanism
** perhaps a per-cluster L1 store buffer
** but definitely a master store-to-load conflict detection buffer shared between the clusters.
There are many designs for the inter-cluster Store-to-Load buffer. (TBD, describe. Point to papers.)
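Pending that writeup, here is a minimal structural sketch of the arrangement in the list above: a per-cluster L1 store buffer plus a master conflict-detection buffer shared between clusters. The class names and interface are mine and purely illustrative; no particular published design is implied.

<code>
# Illustrative sketch: per-cluster store buffer + shared conflict detection.

class ClusterStoreBuffer:
    """Per-cluster L1 store buffer, holding that cluster's stores in order."""
    def __init__(self):
        self.stores = []                        # (seq, addr_or_None, data)

    def add_store(self, seq, addr=None, data=None):
        self.stores.append((seq, addr, data))   # addr None = not yet computed

class MasterConflictBuffer:
    """Shared between clusters: tracks every in-flight store's address (or
    'unknown') so a load in any cluster can find older, possibly conflicting
    stores regardless of which cluster issued them."""
    def __init__(self):
        self.inflight = {}                      # seq -> addr (None if unknown)

    def note_store(self, seq, addr=None):
        self.inflight[seq] = addr

    def older_conflicts(self, load_seq, load_addr):
        # Older stores to the same address, or whose address is still unknown:
        # the load must forward from, wait for, or predict around these.
        return sorted(s for s, a in self.inflight.items()
                      if s < load_seq and (a is None or a == load_addr))
</code>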
== Now Back Off ==
Now let's back off from the assumption that all batches are of maximal size.
== Smaller Batches ==
How about allocating batches that are (approximately) half, or 1/4 the per-cluster instruction window?
This way, you get multiple batches in a cluster, some old, some new, rather than one cluster always holding the
strictly oldest instructions, which would tend to mean that first one cluster executes, then the next.
This way, if there is long distance parallelism, the batches spread across different clusters can find it.
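Reusing the illustrative assign_batches() sketch from above with a batch size of half the (assumed 100-entry) per-cluster window, each cluster ends up holding a mix of older and newer batches:

<code>
# With batch_size = 50, each cluster holds several batches of different ages,
# rather than one strictly-oldest contiguous chunk.
alloc = assign_batches(400, n_clusters=2, batch_size=50)
print(sorted({b for c, b in alloc if c == 0}))   # [0, 2, 4, 6]
print(sorted({b for c, b in alloc if c == 1}))   # [1, 3, 5, 7]
</code>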
== Loop Batching ==
Try to align the batch boundaries to loop boundaries. This way, you can get different loop bodies in different clusters.
This sounds a lot like Guri Sohi's MultiScalar thing of the late
1990s.
Mitch
It might not be surprising, since Guri was my advisor at UWisc.
But, oddly enough, I never thought of Multiscalar in this regard. Perhaps because when I was working with Guri, my head
was in the SpMT fork-on-call, etc. space. (I wasn't working on SpMT; I was assuming that Haitham Akkary would have
finished DMT, so I was trying to skip to what might come after DMT.)
Maybe I backed into Multiscalar... ?
Ah, no. Multiscalar was expandable split windows. Very much like SpMT, DMT, etc. The instruction window was not
contiguous. Or, rather, there was a large instruction window, but only discontiguous pieces of it were instantiated at
any time. It differs from SpMT and DMT mainly in details and mechanisms.
My batched instruction window is in some ways a retreat from discontiguous instruction windows.
Don't get me wrong: I still hope to see SpMT or the like get built. Maybe I'll even be lucky enough to build it. Or at
least publish some patents on it.
However, batched instruction window is a retreat. It doesn't look for places to skip to and run speculative threads. It
doesn't have multiple instruction pointers. It just seeks to paste together the instruction windows of two processors
or EU clusters, to make a single large instruction window.
Argument: in an MCMT machine you have the bandwidth to supply two threads. If only one is running, can you use
that extra ifetch bandwidth? Using an unrolled BTB, etc., it is not so hard to fetch (modulo branch mispredictions). Two
cases: (1) the execution window drains as fast as the 2X ifetch can supply it: cool, great! (2) the execution window
clogs up, in which case switching to the execution window of the second cluster in a 2-cluster MCMT *might* get you some
more performance by exposing more parallelism.
Basically, it's worth a shot. The cross cluster dependencies are minimized as described, etc.
Now, I admit that I have an idea in flight that is more like batch and multiscalar. I'm even allowed to post it. But I
want to give my employer a chance to patent it.
Another way this may have reminded you of multiscalar is that the easiest way to allocate the windows to clusters
would be in a ring. Multiscalar had a ring. I think that was one of multiscalar's biggest weaknesses.
>
> Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
> earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
> do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
> detection windows. Only in the best of times can you elide many such memory dependencies.
>
I am reminded of the famous cartoon with "and then a miracle happens"
in the middle of a complex proof.
One advantage that I can think of for your proposal is that it imposes
a natural order on memory references. The usual parallelism problem
of who got there first has a definite answer imposed by the ordering
of a serial instruction stream.
You can't retire any write instructions until all ambiguous pending
read instructions in preceding instruction blocks are complete. You
can't count on any read instructions until all ambiguous pending write
instructions in preceding instruction blocks are complete. You can
ignore the latter constraint and fix up, if necessary, but the thought
of doing it makes my head hurt. It's the same as hoisting a load
above a store. How often does that work?
I'll save you the trouble. "Aaargh!" But maybe others are as puzzled
about the details as I am.
Robert.
But that's exactly how all modern OOO machines with store to load forwarding buffers work.
See, e.g. store disambiguation, http://download.intel.com/technology/architecture/sma.pdf
It got into product fairly late, 2006 in the above publication.
But it is much older, mature (in the R&D sense) technology. E.g. it was one of the issues in the University of
Wisconsin / Intel settlement,
http://www.technologyinnovative.com/industry/research/intel-settles-wisconsin-universitys-patent-claim-over-chips.html.
The technique was discussed by us in the P6 development, and earlier by me at the UWisc.
When I am talking about batching, or any large instruction window, the only thing that is new is scaling up existing
mechanisms for store-to-load forwarding. Back in 2002 I thought these might be a challenge. But in 2003-2005 there
was a flurry of papers on how to do this effectively.
If this causes you to say "Aargh", Robert, then, umm, you really aren't qualified to be discussing advanced computer
architecture. This is the easy stuff.
I apologize for what seems to be an ad-hominem remark, but I really don't know how to say it otherwise. Store-to-load
forwarding is old hat; predictors to allow loads to get hoisted above a store are old hat.
It's fine for you to say you don't think existing mechanisms thereof will scale. Then we can talk about the 2005 era
papers on the topic, and other stuff.
But if your head hurts just from the thought of it... I don't really know how to address that.
"Aargh" is your expletive.
I anticipated such a response. I'm not wondering how a single
processor works with a single thread or a large instruction window. I
think I understand that well enough to follow the discussion. What's
got me stopped is: that doesn't seem to be what you're talking about.
Rather than having a single instruction window juggling instructions,
you have separate instruction windows potentially (or even ideally)
with instructions from far apart in the stream. If there is
parallelism, that's great, but that's sort of like saying, "Suppose
all the problems that (say) CSP was invented to address didn't
exist." I'm visualizing instruction windows clogged with instructions
waiting to complete because some far away memory reference might
change everything. If I don't understand something, it's why you
think it will work well enough to be useful, not how it would work in
a single instance.
Let me put it another way. It feels like you've magically split the
N^2 out of order problem into M*(N/M)^2 problems or N^2/M, where M is
the number of separate instruction windows. How did you do that?
Robert.
One possible emotionally neutral response to my rhetorical question,
"How often does that work?" is "More often than you think,
apparently."
Robert.
This is a better form of discussion.
First, I apparently was not explicit enough in my earlier post about batch scheduling for instruction windows:
It is *not* expandable split windows. It is a single, contiguous, instruction window, that happens to be split across
multiple clusters of execution units, or multiple processor cores. It takes advantage of the OOO instruction window
support that each cluster has - the scheduler or RS equivalent, etc. But it stitches them together to form a single
large instruction window.
In my earlier post I discussed how this stitching is not so hard for instruction sets with a small number of registers.
Worst case, you save/restore all of the registers at batch boundaries. That's not too hard for the historical 6
registers of x86. It's harder as you add more and more registers. Eventually, you want to switch to a pull strategy,
where you only transfer the registers across clusters when the new batch needs a value live-out of the old.
You are correct to say that memory, store-to-load forwarding, is a greater challenge. You could use the existing STLF
buffers that most OOO systems have within the clusters. But that leaves the problem of stitching together store buffers
between batches, between clusters.
The simplest strategy would be to make the 2 (or more) cluster STLFs look like a single one. E.g. send a signal from the
older cluster to the younger cluster (if the batch occupies the whole cluster) indicating that there is a store whose
address is unknown. That is sufficient to implement the P6-era strategy of blocking a load until all earlier stores have
become address-ready. But it is not sufficient to implement store-to-load forwarding.
A load that becomes address ready is first compared to all stores that are address ready in its local cluster STLF
buffer. If matched, great, no traffic to the outside world. If not matched, then you need to decide if you want to
send the load to the earlier cluster's STLF buffer. ... At this point, I probably need to shut up, and write up an
invention disclosure. It doesn't seem all that hard to me, I daresay it seems obvious to me, but I've been told many
times that what seems obvious to me isn't necessarily obvious to all. Witness this discussion.
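A minimal sketch of the lookup order just described. The STLF interface and all names here are illustrative assumptions of mine, not a specific design.

<code>
# Illustrative only: local-first load lookup across two per-cluster STLF buffers.

class STLF:
    """Per-cluster store-to-load forwarding buffer."""
    def __init__(self):
        self.stores = []                       # (seq, addr, data); addr None = unknown

    def add_store(self, seq, addr=None, data=None):
        self.stores.append((seq, addr, data))

    def lookup(self, addr, before_seq):
        """Youngest older address-ready store to the same address, if any."""
        hits = [(s, d) for s, a, d in self.stores if a == addr and s < before_seq]
        return max(hits)[1] if hits else None

    def has_unknown_older_store(self, before_seq):
        return any(a is None and s < before_seq for s, a, _ in self.stores)

def resolve_load(load_seq, load_addr, local_stlf, older_stlf):
    data = local_stlf.lookup(load_addr, load_seq)
    if data is not None:
        # Local hit: no cross-cluster traffic. (Forwarding past a still-unknown
        # older store would rely on later verification.)
        return data
    # P6-era conservative rule: block (or predict and later verify) while any
    # older store, in either cluster, still has an unknown address.
    if local_stlf.has_unknown_older_store(load_seq) or \
            older_stlf.has_unknown_older_store(load_seq):
        return 'WAIT_OR_PREDICT'
    data = older_stlf.lookup(load_addr, load_seq)  # now pay the cross-cluster trip
    return data if data is not None else 'LOAD_FROM_CACHE'
</code>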
Anyway, stitching together these store buffers - a pain, but I can see several, many, ways of doing so.
If this stitching isn't good enough, there are several proposals for multilevel store buffers extant, both mine and
those published in academia that I referred to in the previous post. As I have said earlier, I like an L1 STLF
associated with the cluster, and an L2 STLF shared between clusters.
Finally, my log based verify re-execution suffices to serve as a catch-all mechanism to ensure correctness. There are
several other similar proposals that use verification to ensure correctness.
We therefore have several different mechanisms to "stitch together" STLF buffers. So, it looks buildable. Is it of any
value?
---
Now, yes, I have in other posts discussed SpMT, expandable split windows. And, yes, the original post of this thread
started off with batched, contiguous, instruction windows, but segued into split windows. If that's what we want to
discuss, yes, let's do so.
However, in my log based SpMT, I drain instructions out of the window, and keep them in the log. Which really amounts to
an L2 instruction window, just one that is very cheap.
> but that's sort of like saying, "Suppose
> all the problems that (say) CSP was invented to address didn't
> exist." I'm visualizing instruction windows clogged with instructions
> waiting to complete because some far away memory reference might
> change everything. If I don't understand something, it's why you
> think it will work well enough to be useful, not how it would work in
> a single instance.
So, first, it is well-known that there is some potential in simply increasing instruction window size. The problem is
how costly is the hardware, esp. power wise, to do so.
E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for integer, going from instruction windows of 128 to 4K
instructions.
Cristal, A., Santana, O. J., Valero, M., and Martínez, J. F. 2004. Toward kilo-instruction processors. ACM Trans.
Archit. Code Optim. 1, 4 (Dec. 2004), 389-417. DOI= http://doi.acm.org/10.1145/1044823.1044825
Now, that is not a very good payoff. Definitely sublinear. But the whole point is not that we are trying to build
processors that are 4096/128 = 32X bigger. First, the OOO hardware is only a fraction of processor area, especially if
you have big vector units like some recent proposals. Second, more importantly, to build big effective instruction
windows you do not need to build all that hardware. As the kilo-instruction papers, and my own work, shows.
Now, the batched instruction window proposal is on the wimpy side of this. If it keeps the instruction window
contiguous, it is paying the full cost for every instruction in the window. Other proposals optimize that; batched doesn't.
But, the batched proposal is easy to build.
I think really what I am saying is that there are a number of techniques that help improve single thread performance.
Batching is one of the simplest ones. SpMT is more aggressive. I'm collecting those techniques, and hope to pick and
choose from such a toolkit if ever I am building an aggressive single thread microarchitecture again.
But, also, I am not pushing this as an alternative to improving parallel performance, such as via message-passing
techniques like CSP (I half expect Nick to say "CSP isn't message passing"), or shared memory parallelism. I'm open to
all such techniques to improve performance, and I don't denigrate any of them.
What I do denigrate is people saying "The only thing that anyone nowadays should be building is the simplest possible
in-order processors, many, many of them, to save power and ... " bet the farm on parallel programming. IMHO that does
not make sense for general purpose workloads. It might make sense for some supercomputers and other HPC, and for some
graphics. And I'll turn your point around: if you are saying the above, then you are saying "And a miracle happens..."
with respect to parallel programming. However, in truth, I don't think you are saying that.
> Let me put it another way. It feels like you've magically split the
> N^2 out of order problem into M*(N/M)^2 problems or N^2/M, where M is
> the number of separate instruction windows. How did you do that?
First, out-of-order is only N^2 in simplistic algorithms. I.e. OOO is N^2 the same way sorting is O(N^2) - i.e. it isn't.
See, for example, the Ultrascalar papers show logarithmic gate delays and sqrt wire length, essentially by showing that
OOO is a parallel prefix problem.
Henry, D. S., Kuszmaul, B. C., and Viswanath, V. 1999. The Ultrascalar Processor-An Asymptotically Scalable Superscalar
Microarchitecture. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI (March 21 - 24, 1999).
ARVLSI. IEEE Computer Society, Washington, DC, 256.
Arguably their results apply only to renaming of registers and memory, but I think they can extend to all OOO execution.
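To make the parallel-prefix claim concrete, here is a minimal sketch (mine, not the Ultrascalar design) of renaming as a prefix computation: each instruction contributes a partial rename map, and the combine step "apply the later map after the earlier one" is associative, so the prefixes could be built in logarithmic depth rather than serially.

<code>
# Illustrative sketch: renaming as a parallel prefix over "last writer" maps.

def combine(a, b):
    """Apply rename map b after rename map a (b's writes win). Associative."""
    m = dict(a)
    m.update(b)
    return m

def rename(instructions):
    """instructions: list of (dests, srcs) architectural-register tuples.
    Returns, per instruction, the producing instruction index of each source
    (None = live-in from before this window)."""
    prefix = {}            # rename map covering all earlier instructions
    producers = []
    # A sequential scan is shown for clarity; because combine() is associative,
    # the same prefixes could be computed by a log-depth tree (parallel prefix).
    for i, (dests, srcs) in enumerate(instructions):
        producers.append({s: prefix.get(s) for s in srcs})
        prefix = combine(prefix, {d: i for d in dests})
    return producers

# i2 reads r1 (produced by i0) and r3 (a live-in).
print(rename([(['r1'], ['r2']), (['r2'], ['r1']), (['r4'], ['r1', 'r3'])]))
# [{'r2': None}, {'r1': 0}, {'r1': 0, 'r3': None}]
</code>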
Now, I'm not talking ultrascalar - I don't think I could justify building a new microarchitecture - so it is fair to ask
how I am ducking the O(N^2) of simplistic OOO. But it is important to note the theoretical limit: we aren't violating
theory, just using O(N^2) algorithms up until it gets painful, and then switching to more scalable approaches.
Wrt the batched instruction window, above I started discussing how to stitch instruction windows together. Sorry, I
started discussing, but stopped. But the basic idea is that not all traffic needs to cross the cluster boundary. You
can filter out a lot that doesn't need to cross, and make predictions for some of the rest. And you are left with a lot less.
It was the stitching together that seemed like magic to me, and your
expanded discussion helps a lot.
If a register is written to within the visible instruction stream,
then you don't need to pull it from the outside world. That seemed
beautiful right away. In the case of registers, you can tell what you
might need to wait on. Too bad memory doesn't work that way.
Even though you didn't explicitly address this point, the part that
worried me so much--relying on incorrectly speculated addresses--
presumably won't happen because you won't rely on an address from the
outside world until it is known definitively, and the outside world
can speculate internally and still know which addresses can be passed
on as known without risk arising from speculation.
> ---
>
> Now, yes, I have in other posts discussed SpMT, expandable split windows. And, yes, the original post of this thread
> started off with batched, contiguous, instruction windows, but segued into split windows. If that's what we want to
> discuss, yes, let's do so.
>
> However, in my log based SpMT, I drain instructions out of the window, and keep them in the log. Which really amounts to
> an L2 instruction window, just one that is very cheap.
>
[Robert wrote]:
> > but that's sort of like saying, "Suppose
> > all the problems that (say) CSP was invented to address didn't
> > exist." I'm visualizing instruction windows clogged with instructions
> > waiting to complete because some far away memory reference might
> > change everything. If I don't understand something, it's why you
> > think it will work well enough to be useful, not how it would work in
> > a single instance.
>
> So, first, it is well-known that there is some potential in simply increasing instruction window size. The problem is
> how costly is the hardware, esp. power wise, to do so.
>
> E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for integer, going from instruction windows of 128 to 4K
> instructions.
>
> Cristal, A., Santana, O. J., Valero, M., and Martínez, J. F. 2004. Toward kilo-instruction processors. ACM Trans.
> Archit. Code Optim. 1, 4 (Dec. 2004), 389-417. DOI=http://doi.acm.org/10.1145/1044823.1044825
>
> Now, that is not a very good payoff. Definitely sublinear. But the whole point is not that we are trying to build
> processors that are 4096/128 = 32X bigger. First, the OOO hardware is only a fraction of processor area, especially if
> you have big vector units like some recent proposals. Second, more importantly, to build big effective instruction
> windows you do not need to build all that hardware. As the kilo-instruction papers, and my own work, shows.
>
> Now, the batched instruction window proposal is on the wimpy side of this. If it keeps the instruction window
> contiguous, it is paying the full cost for every instruction in the window. Other proposals optimize that; batched doesn't.
>
> But, the batched proposal is easy to build.
>
> I think really what I am saying is that there are a number of techniques that help improve single thread performance.
> Batching is one of the simplest ones. SpMT is more aggressive. I'm collecting those techniques, and hope to pick and
> choose from such a toolkit if ever I am building an aggressive single thread microarchitecture again.
>
> But, also, I am not pushing this as an alternative to improving parallel performance, such as via message-passing
> techniques like CSP (I half expect Nick to say "CSP isn't message passing"), or shared memory parallelism. I'm open to
> all such techniques to improve performance, and I don't denigrate any of them.
>
> What I do denigrate is people saying "The only thing that anyone nowadays should be building is the simplest possible
> in-order processors, many, many of them, to save power and ... " bet the farm on parallel programming. IMHO that does
> not make sense for general purpose workloads. It might make sense for some supercomputers and other HPC, and for some
> graphics. And I'll turn your point around: if you are saying the above, then you are saying "And a miracle happens..."
> with respect to parallel programming. However, in truth, I don't think you are saying that.
>
I'm not, for sure. I was merely invoking conservation of difficulty.
You may have misunderstood the intent of some of my previous posts, in
particular the paper that suggests there isn't much payoff from
microarchitectural innovation for OLTP workloads. I only want to
understand why we are where we are, not to insist that it is
inevitable. I'll admit, though, that I remain pessimistic, because I
can't identify a big market for significantly improved single-thread
performance, especially if it reduces performance per watt.
>
> First, out-of-order is only N^2 in simplistic algorithms. I.e. OOO is N^2 the same way sorting is O(N^2) - i.e. it isn't.
>
> See, for example, the Ultrascalar papers show logarithmic gate delays and sqrt wire length, essentially by showing that
> OOO is a parallel prefix problem.
>
> Henry, D. S., Kuszmaul, B. C., and Viswanath, V. 1999. The Ultrascalar Processor-An Asymptotically Scalable Superscalar
> Microarchitecture. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI (March 21 - 24, 1999).
> ARVLSI. IEEE Computer Society, Washington, DC, 256.
>
> Arguably their results apply only to renaming of registers and memory, but I think they can extend to all OOO execution.
>
> Now, I'm not talking ultrascalar - I don't think I could justify building a new microarchitecture - so it is fair to ask
> how I am ducking the O(N^2) of simplistic OOO. But it is important to note the theoretical limit: we aren't violating
> theory, just using O(N^2) algorithms up until it gets painful, and then switching to more scalable approaches.
>
> Wrt the batched instruction window, above I started discussing how to stitch instruction windows together. Sorry, I
> started discussing, but stopped. But the basic idea is that not all traffic needs to cross the cluster boundary. You
> can filter out a lot that doesn't need to cross, and make predictions for some of the rest. And you are left with a lot less.
I suspect that you are making some kind of assumption about data
locality in the instruction stream, so that a large fraction of
address references can be identified by examining the local LTSB.
That assumption isn't crucial. It's just that, as it becomes less
true, you have to work a lot harder.
Robert.
"Robert Myers" <rbmye...@gmail.com> wrote in message
news:72779549-89d9-4645...@d12g2000vbr.googlegroups.com...
>
> I'll admit, though, that I remain pessimistic, because I
> can't identify a big market for significantly improved single-thread
> performance, especially if it reduces performance per watt.
For the same process and circuit tricks, performance per watt will always
decrease as performance goes up (I think some information theory guy could
prove this).
Therefore, maximizing P/W is equivalent to minimizing performance (0 perf for 0 power
FTW).
Maybe you meant P^2 (or P^3) per Watt. Sorry, it is a hang-up of mine. A
lot of people say "performance per watt" and I just want to throttle them.
Another good metric is max perf for given power envelope (desktop, laptop,
contact lens, etc.).
Ned
One big selling point for the kinds of architectural innovations Andy
has been talking about is increased latency tolerance, and, there, I
think there is a market. Whether it's a big enough market to be
interesting is another question.
Robert.
I forgot to say that the Kilo-instruction paper speedups I report above make several assumptions that are probably
conservative. E.g. 4-wide.
>> First, out-of-order is only N^2 in simplistic algorithms. I.e. OOO is N^2 the same way sorting is O(N^2) - i.e. it isn't.
>>
>> See, for example, the Ultrascalar papers show logarithmic gate delays and sqrt wire length, essentially by showing that
>> OOO is a parallel prefix problem.
>>
>> Henry, D. S., Kuszmaul, B. C., and Viswanath, V. 1999. The Ultrascalar Processor-An Asymptotically Scalable Superscalar
>> Microarchitecture. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI (March 21 - 24, 1999).
>> ARVLSI. IEEE Computer Society, Washington, DC, 256.
>>
>> Arguably their results apply only to renaming of registers and memory, but I think they can extend to all OOO execution.
>>
>> Now, I'm not talking ultrascalar - I don't think I could justify building a new microarchitecture - so it is fair to ask
>> how I am ducking the O(N^2) of simplistic OOO. But it is important to note the theoretical limit: we aren't violating
>> theory, just using O(N^2) algorithms up until it gets painful, and then switching to more scalable approaches.
>>
>> Wrt the batched instruction window, above I started discussing how to stitch instruction windows together. Sorry, I
>> started discussing, but stopped. But the basic idea is that not all traffic needs to cross the cluster boundary. You
>> can filter out a lot that doesn't need to cross, and make predictions for some of the rest. And you are left with a lot less.
>
> I suspect that you are making some kind of assumption about data
> locality in the instruction stream, so that a large fraction of
> address references can be identified by examining the local LTSB.
> That assumption isn't crucial. It's just that, as it becomes less
> true, you have to work a lot harder.
The Ultrascalar papers make no such assumptions about locality.
For the batched instruction window, I do make such assumptions. And where it is not true, we would have to fall back to
a more aggressive strategy.
--
By the way, you mentioned OLTP.
Simple, TPC-C, OLTP - simple queries, of the form "update this account", "what is this account balance?" have little
intra-query parallelism. They basically walk through linked trees until they get to the end.
More complicated transaction processing, joins, etc., have oodles of parallelism. However, if you are joining on a key
that is indexed, for which the data table itself is not sorted, the parallelism is middle loop or outer loop, not inner
loop. The inner loop is the aforementioned tree or hash pointer chase.
Of course, such joins are also parallelizable.
> One big selling point for the kinds of architectural innovations Andy
> has been talking about is increased latency tolerance, and, there, I
> think there is a market. Whether it's a big enough market to be
> interesting is another question.
I don't think there is, or will be, such a market until we have mined the low hanging fruit of parallelism, multi-core,
and many-core.
>
> I forgot to say that the Kilo-instruction paper speedups I report above make several assumptions that are probably
> conservative. E.g. 4-wide.
>
It's a thought-provoking paper.
I really didn't mean to be insulting, just provocative. I wanted you
to talk more, and you did.
I was thinking of David Dinucci's version of coarse-grained
parallelism when I read your proposal. In that approach, you snip the
problem into carefully-chosen pieces ahead of time, carefully specify
what's coming in and what should go out, and leave it to a run-time
scheduler to put it all back together. Has some nice formal
properties. That was my context for trying to understand what you
were proposing.
I pictured you looking at his proposal and saying,"You want parallel
lumps? Easy." You take a meat cleaver to a serial instruction
stream, whack it into bite-sized chunks, and say, "There. What else
are you looking for?" There had to be some serious potential ugliness
hiding in what you were proposing. Or maybe the whole idea of trying
to identify the parallel pieces on the programming end was just wrong-
headed to begin with. I'm still thinking.
Thinking about how much you might throw into the pot at once and what
you might do with it once it's there is a useful exercise, whether it
leads to a commercially-viable architecture or not.
Robert.
Yes.
> E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for
> integer, going from instruction windows of 128 to 4K instructions.
>
> Now, that is not a very good payoff. Definitely sublinear. ....
Bluntly, it's dire. There are lots of other ways to optimise the
'floating-point' workloads to that scale, and a 50% improvement for
a 32-fold increase in window size (for integer) is horrible.
> But the whole point is not that we are trying to build processors that
> are 4096/128 = 32X bigger. First, the OOO hardware is only a fraction
> of processor area, especially if you have big vector units like some
> recent proposals. Second, more importantly, to build big effective
> instruction windows you do not need to build all that hardware. As the
> kilo-instruction papers, and my own work, shows.
I have downloaded the paper and will look at it, but the very fact
that people are considering such extreme measures shows that we are
running out of viable ideas. I am not denying that small speedups
can be obtained, expensively, but we have to rethink from a much more
basic starting point if we want large ones.
And there is no way that we will get there without pain - by which I
mean to the perpetrators of the current code! We have absolutely got
to simplify the hardware's task (which is one of the few points where
I agree with the Itanic's designers).
Regards,
Nick Maclaren.
But what are the alternatives? Computers deal with ever-larger
datasets, with ever-increasing numbers of processors, and the data
gradually slide off the latency-tolerance horizon of any given
processor. Faster memory? Chaa, right. Violate apparently basic
laws of physics? Maybe with some as yet undiscovered twist on the EPR
paradox.
Personally, I think speed-up is the wrong focus. Sooner or later, for
reasons that seem pretty fundamental to me, we will be back to
bragging, as Intel once did, about instructions in flight.
Robert.
Hang on. You have seen me ride my hobby-horse off into the sunset
often enough! We aren't going to get anywhere until we change the
current approaches to programming - not just the languages, but the
paradigms used. In the most extreme case, I take that as far as
saying that we need to change the way that mathematics and algorithms
are taught!
But let's not go that far today (it is getting late). We know that
we can do a lot better with far less difficulty than most people
will admit - and will get better RAS, too! Fortran has done a lot
of that, and that's a language from way back. But it assuredly
doesn't come without effort.
>Personally, I think speed-up is the wrong focus. Sooner or later, for
>reasons that seem pretty fundamental to me, we will be back to
>bragging, as Intel once did, about instructions in flight.
God help us, yes :-(
Regards,
Nick Maclaren.
Although I have been having supper once a month with David Dinucci, ever since the "Portland Geek Dinner/Friends of
Eugene Miya" started, I have not looked much at his work. Must do so.
Actually, I have spent much of the last twenty years trying to identify the parallel pieces of the program. SpMT is
part of that - restricting the pieces to start/stop at function boundaries (or loops, or IF/ENDIF). But while such
control flow techniques work pretty well, I have not yet figured out how to build hardware to automatically analyze the
data dependencies - or at least hardware that does such analysis in a way that is significantly cheaper than executing
the code itself. In my terminology, I say that such analysis is accomplished a posteriori, after forking the candidate
parallel threads, rather than a priori, beforehand.
For this reason I am very, very, happy if a programmer or compiler is willing to do so for me. Even if just in the form
of hints.
Now, by the way, I am reasonably happy with approaches that do such compiler-style analysis in "hardware", or, rather,
firmware or microcode. The log based mechanisms that I am so fond of even provide such firmware with much more data
than they would have if statically compiling, and more than most JIT code generators have access to. You can execute
the code once or a few times, record the log, and analyze the log to see where the dependences occur, and do not. Such
an approach is certainly as good as a compiler or programmer can be - since we would always give them HINT instructions
to control the fork mechanism themselves - and potentially better, since the firmware has more information to analyze
than does a compiler.
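A minimal sketch of the sort of log analysis being described, under illustrative assumptions about the log format (one record per dynamic instruction, listing registers and memory locations read and written) and with a simple fixed batch boundary standing in for candidate fork points:

<code>
# Illustrative only: scan an execution log for dependences that cross a
# proposed batch/thread boundary. Few crossers => good candidate fork point.

def cross_batch_dependences(log, batch_size):
    last_reg_writer = {}        # register       -> index of most recent writer
    last_mem_writer = {}        # memory address -> index of most recent writer
    crossers = []
    for i, entry in enumerate(log):
        for r in entry['regs_read']:
            w = last_reg_writer.get(r)
            if w is not None and w // batch_size != i // batch_size:
                crossers.append((w, i, 'reg', r))
        for a in entry['mem_read']:
            w = last_mem_writer.get(a)
            if w is not None and w // batch_size != i // batch_size:
                crossers.append((w, i, 'mem', a))
        for r in entry['regs_written']:
            last_reg_writer[r] = i
        for a in entry['mem_written']:
            last_mem_writer[a] = i
    return crossers
</code>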
But it has been my observation that a priori predictors always do better than a posteriori. The cost of making those
few initial wrong guesses can dominate, especially for code, such as transaction processing, that tends to thrash the
I-cache.
--
You are correct in observing that my batch scheduling across processor/cluster boundaries is a retreat from such
analysis. But, I emphasize: "batch" is just a technique for creating a larger contiguous instruction window. Its
strength lies in its simplicity. It does not have great upside, and it isn't supposed to.
I am just assembling a set of tools or components for a next generation out-of-order machine:
* Multicluster Multithreading
* => cheap thread forking and migration between clusters and processors
* Batch scheduling across cluster boundaries to use those clusters for single thread
* SpMT ... ditto ...
* Log based
All, or nearly all, of these mechanisms also being useful for parallel programming.
When did Intel ever brag about that?
I missed it. I would have clipped and framed such an ad.
>
> > Personally, I think speed-up is the wrong focus. Sooner or later, for
> > reasons that seem pretty fundamental to me, we will be back to
> > bragging, as Intel once did, about instructions in flight.
>
> When did Intel ever brag about that?
>
> I missed it. I would have clipped and framed such an ad.
I'm not sure if it was an ad or a glitzy web page on Intel's site. It
may have been both.
It was when the P4 had just been announced and its performance was
less than impressive.
Since Intel couldn't sell benefits (like performance), it had to sell
features, like the trace cache and a huge number of instructions in
flight.
You see how my education in computer microarchitecture has progressed?
I remember it because that's where I learned the phrase "instructions
in flight."
Robert.
>
> You are correct in observing that my batch scheduling across processor/cluster boundaries is a retreat from such
> analysis. But, I emphasize: "batch" is just a technique for creating a larger contiguous instruction window. Its
> strength lies in its simplicity. It does not have great upside, and it isn't supposed to.
>
A few random thoughts provoked by this exchange, for whatever they are
worth:
First, we may be experiencing a version of the RISC "revolution"
redux. The logic behind that non-revolution seemed so iron-clad and
at the same time transparent that it was hard to imagine how something
with as many warts as x86 would survive. We all know how that came
out.
It may seem that all the ways forward with run-time hardware are too
complicated and too power-hungry, but we've seen fairly amazing
progress with very low-cost solutions to branch prediction and memory
disambiguation. Rather than favoring the path forward as some
revolutionary new architecture, the odds favor picking off the easiest
pieces of a really smart run-time analyzer one-by-one. If we do have
a language revolution, it will probably happen the same way.
A second, unrelated thought, is that I think I know the answer to my
own question about how you finessed the theoretical N^2 complexity of
out-of-order operation. Inter-instruction dependencies have a
probability distribution, and if that probability distribution is
peaked near zero (like, say, a Gaussian) with a very small number of
dependencies outside some instruction strip W, then the complexity can
be arranged to be more like (N/W)*W^2=NW, because the dependency
matrix can be dealt with in blocks with some tolerable number of
outliers. If N=MW, where M is the number of batches, the complexity
comes out to be N^2/M, with some presumably small cost of dealing with
dependencies outside the statistically-likely dependency window. That
just formalizes what to me was the implicit assumption that
dependencies are localized. You may have always understood what I
just observed, either explicitly or implicitly.
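For concreteness, a quick check of that scaling with purely illustrative numbers:

<code>
# Numbers purely illustrative: the blocked cost matches N^2/M.
N, M = 1024, 8
W = N // M                   # 128-instruction batches
print(N * N)                 # 1048576  naive all-pairs comparisons
print((N // W) * W * W)      # 131072   blocked: (N/W)*W^2 = N*W = N^2/M
</code>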
A third thought, which you may not now care for, is that it's hard to
imagine the best solution coming out of a shop that doesn't control
both the compiler and the microarchitecture, because the answer to:
"should the parallelism come from the software end or the hardware
end?" is probably "both."
Robert.
I agree - or, rather, I strongly agreed back in 1991, and I overall agree now - although experience tends to suggest
that it ain't necessarily so.
Now, as for "should the parallelism come from the software end, or the hardware end?" You have created a false
dichotomy: it is really a triangle with three corners:
1. explicit programmer expressed parallelism
2. compiler expressed parallelism,
2.1. parallelization of non-parallel code from the programmer
2.2. compiler supported parallelism of code that the programmer has expressed as parallel
3. hardware supported parallelization
3.1. of the explicit parallelism of 1. and 2.2
3.2. of code that the programmer and compiler treat as non-parallel
Of the above, I am most heartily in favor of
Explicit parallelism all along the way: 1. + 2.2. + 3.1.
I have a lot of experience in 3.2. hardware parallelization (and in 3.1. hardware support for explicit parallelism).
I'm all in favor of 2.1, compiler parallelization of programmer expressed non-parallel code. But I am most suspicious.
E.g. Andrew Wolfe (author of "Optimizing Supercompilers for Supercomputers") taught that compilers had never really
gotten all that good at vector parallelism. Rather, humans started learning to write code in the idioms that compilers
could vectorize.
---
Nevertheless, I still like to work with compiler teams. History:
1991: at the start of P6 Intel's compilers were not that good - but Intel made a great effort to improve them. However,
much of that effort was devoted to optimizing P5 (in-order machines need a lot of optimization). Compiler folk were
somewhat frustrated by P6 running unoptimized code almost as fast as optimized code [*], although they were happy to
learn that many supposedly P6 specific optimizations improved P6 even more.
Overall, working with Intel's compiler team in the early days was fun and productive. But it did show me the benefits
of loose coupling between teams.
Towards 1995, the compiler team started getting sucked into the black hole of Itanium. I don't think we ever really saw
real dedication to optimizing for P6-style OOO.
I wasn't there, but I have heard that the compiler was instrumental in band-aiding the worst of the stupidities of
Willamette. That was also true for P6: the compiler really helped make up for the shortsighted decisions wrt partial
registers and memory. Overall a pattern: tight interaction between compilers and hardware really helps to make up for
hardware shortcomings. However, truly aggressive compiler optimization can often be done in a more loosely coupled fashion.
2006-9: I can't leave off without thanking the compiler guy who worked with me on my last project at Intel: Bob
Kushlis, the one man compiler SWAT team. This wasn't a parallelism project, but it was nevertheless a wonderful
example of compiler/hardware collaboration.
---
Getting back to parallelism:
I'm most hopeful about programmer expressed parallelism.
I think that one of the most important things for compilers will be to large amounts of programmer expressed parallelism
in an ideal machine - PRAM? CSP? - to whatever machine you have.
> ...
> Nevertheless, I still like to work with compiler teams. History:
>
> 1991: at the start of P6 Intel's compilers were not that good - but
> Intel made a great effort to improve them. However, much of that effort
> was devoted to optimizing P5 (in-order machines need a lot of
> optimization). Compiler folk were somewhat frustrated by P6 running
> unoptimized code almost as fast as optimized code [*], although they
> were happy to learn that many supposedly P6 specific optimizations
> improved P6 even more.
>
> Overall, working with Intel's compiler team in the early days was fun
> and productive. But it did show me the benefits of loose coupling
> between teams.
>
> Towards 1995, the compiler team started getting sucked into the black
> hole of Itanium. I don't think we ever really saw real dedication to
> optimizing for P6-style OOO.
>
> I wasn't there, but I have heard that the compiler was instrumental in
> band-aiding the worst of the stupidities of Willamette. That was also
> true for P6: the compiler really helped make up for the shortsighted
> decisions wrt partial registers and memory. Overall a pattern: tight
> interaction between compilers and hardware really helps to make up for
> hardware shortcomings. However, truly aggressive compiler optimization
> can often be done in a more loosely coupled fashion.
Thanks for that fascinating stuff, Andy. I'm wondering, where was
Microsoft while this was going on? Did they use Intel's compiler at
all, or did they "do it their way?"
Cheers, Mike.
---------------------------------------------------------------
Mike Hore mike_h...@OVE.invalid.aapt.net.au
---------------------------------------------------------------
Yes and no. Technically, I agree with you, and have been riding
that hobby-horse for nearly 40 years now! It is, after all, why
extreme HPC systems work so much better with Fortran codes than
C or C++ ones - the extra restrictions express parallelism that
the compiler is permitted to use.
Unfortunately, since the demise of Algol 68, the languages that
are favoured by the masses have been going in the other direction.
Fortran 90 has not, but it's now a niche market language. Worse,
the moves towards languages defining a precise abstract machine
are regarded as obsolete (though Java is an exception), so most
languages don't have one.
And, no, I do NOT mean the bolt-on extensions that are so common.
They sometimes work, just, but never well. You can't turn a
Model T Ford into something capable of maintaining 60 MPH, reliably,
by bolting on any amount of modern kit!
Regards,
Nick Maclaren.
This is fine for new code, but a showstopper for existing C(++) code. :-(
> Nevertheless, I still like to work with compiler teams. History:
>
> 1991: at the start of P6 Intel's compilers were not that good - but
> Intel made a great effort to improve them. However, much of that effort
> was devoted to optimizing P5 (in-order machines need a lot of
> optimization). Compiler folk were somewhat frustrated by P6 running
> unoptimized code almost as fast as optimized code [*], although they
> were happy to learn that many supposedly P6 specific optimizations
> improved P6 even more.
------------------------------------------ P5 specific --- I assume?
This was my experience as well: Since the P6 tries to execute the oldest
instructions first, good P5 scheduling tended to also make the P6 run
faster.
>
> Overall, working with Intel's compiler team in the early days was fun
> and productive. But it did show me the benefits of loose coupling
> between teams.
>
> Towards 1995, the compiler team started getting sucked into the black
> hole of Itanium. I don't think we ever really saw real dedication to
> optimizing for P6-style OOO.
>
> I wasn't there, but I have heard that the compiler was instrumental in
> band-aiding the worst of the stupidities of Willamette. That was also
Willamette, i.e. first-gen P4, right?
That chip had two simultaneous glass jaws: Both integer mul and shift
were slow, so only LEA remained as a fast way to calculate addresses.
> true for P6: the compiler really helped make up for the shortsighted
> decisions wrt partial registers and memory. Overall a pattern: tight
> interaction between compilers and hardware really helps to make up for
> hardware shortcomings. However, truly aggressive compiler optimization
> can often be done in a more loosely coupled fashion.
That's like the difference between micro-optimization and improving the
basic algorithm. :-)
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
news:s6smd7-...@ntp.tmsw.no...
>
> Willamette, i.e. first-gen P4, right?
Yes.
> That chip had two simultaneous glass jaws: Both integer mul and shift were
> slow, so only LEA remained as a fast way to calculate addresses.
The P4 optimization guide was funny:
1) Don't do multiplies, do shifts
2) Don't do shifts, do adds
3) Don't do loads
4) Don't do branches
5) Do adds
6) Don't do anything but adds and you will be ok.
Ned
"Mike Hore" <mike_h...@OVE.invalid.aapt.net.au> wrote in message
news:hua1cl$43v$1...@news.eternal-september.org...
>
> Thanks for that fascinating stuff, Andy. I'm wondering, where was
> Microsoft while this was going on? Did they use Intel's compiler at all,
> or did they "do it their way?"
AFAIK, MS has always done their own thing on their compiler.
Ned
>
> I agree - or, rather, I strongly agreed back in 1991, and I overall agree now - although experience tends to suggest
> that it ain't necessarily so.
>
> Now, as for "should the parallelism come from the software end, or the hardware end?" You have created a false
> dichotomy: it is really a triangle with three corners:
> 1. explicit programmer expressed parallelism
> 2. compiler expressed parallelism,
> 2.1. parallelization of non-parallel code from the programmer
> 2.2. compiler supported parallelism of code that the programmer has expressed as parallel
> 3. hardware supported parallelization
> 3.1. of the explicit parallelism of 1. and 2.2
> 3.2. of code that the programmer and compiler treat as non-parallel
>
> Of the above, I am most heartily in favor of
>
> Explicit parallelism all along the way: 1. + 2.2. + 3.1.
>
> I have a lot of experience in 3.2. hardware parallelization (and in 3.1. hardware support for explicit parallelism).
>
> I'm all in favor of 2.1, compiler parallelization of programmer expressed non-parallel code. But I am most suspicious.
>
> E.g. Andrew Wolfe (author of "Optimizing Supercompilers for Supercomputers") taught that compilers had never really
> gotten all that good at vector parallelism. Rather, humans started learning to write code in the idioms that compilers
> could vectorize.
>
> ---
<snip informative historical information>
> ---
>
> Getting back to parallelism:
>
> I'm most hopeful about programmer expressed parallelism.
>
> I think that one of the most important things for compilers will be to large amounts of programmer expressed parallelism
> in an ideal machine - PRAM? CSP? - to whatever machine you have.
To latch onto the vector code example, it works because three
different things work at the same time:
1. The hardware has a feature that is extremely advantageous if
utilized properly and useless otherwise.
2. The compiler gives the programmer reasonably transparent and
lightweight mechanisms (in the case of Cray Fortran, the compiler
directive $IVDEP, for ignore vector dependency, and vector intrinsics)
for exploiting that potentially advantageous feature.
3. Programmers working from a fairly primitive understanding of the
actual hardware implementation can understand what exactly needs to be
done and why and can successfully write correct code with a high
degree of probability of success. I assume that many programmers
wrote successful code knowing only heuristic rules and not having even
a primitive hardware model in mind.
So far as I know, the only "clever" thing that vectorizing compilers
learned how to do was to turn loop nesting inside out so that the
innermost loop would vectorize. That cleverness wasn't really
necessary, as programmers could easily learn to write in that idiom to
begin with. What *was* necessary was that the compiler could allow
the programmer to invoke vector operation with minimum pain.
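An illustrative sketch of that idiom (this is my example, not Cray Fortran; numpy stands in for a vector unit): keep the dependence-carrying loop outer, and let the independent inner dimension be the vector operation.

<code>
import numpy as np

# a[i][j] = a[i-1][j] + b[i][j]: the dependence runs along i, so the i loop
# stays outer and the independent j dimension is what gets vectorized.
def row_recurrence(b):
    a = np.empty_like(b)
    a[0] = b[0]
    for i in range(1, b.shape[0]):     # dependence-carrying loop kept outer
        a[i] = a[i - 1] + b[i]         # whole-row (vector) operation over j
    return a

print(row_recurrence(np.ones((4, 3))))  # rows of 1s, 2s, 3s, 4s
</code>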
This is an example exactly of your 1 + 2.2 + 3.1. It was that kind of
optimization that I had mostly in mind.
So far as I know, this all worked out because it was, at one point,
all under the control of Seymour Cray. No industry-wide committees.
When will we get (for example) truly lightweight threads? When
someone has enough clout to control the first implementation from
beginning to end: hardware support, compiler support, OS support, and
programmer education. Until then, we will be manufacturing papers and
conference presentations and little else.
Once again, my intent is to be provocative.
Robert.
Algol68g uses standard Posix threads to implement the 1968 Standard's
PARallel clauses on Linux, Unix, Mac OSX and Windows. Here is a code
snippet from the actual ALGOL 68 standards document, it demonstrates
the original multi-threading that is part of the ALGOL 68 language
definition.
<code>
proc void eat, speak;
sema mouth = level 1;
par begin
do
down mouth;
eat;
up mouth
od,
do
down mouth;
speak;
up mouth
od
end
</code>
Simply put a PAR before a BEGIN ~, ~, ~, ~ END block then all the ~
clauses in the block are executed in parallel.
It is revealing to think that eating and talking can actually require
parallel processing. Other real world examples would be cooking the
dishes for the sumptuous dinner in the first place. :-)
> >> On Jun 2, 12:15 am, Andy 'Krazy' Glew<ag-n...@patten-glew.net> wrote:
> [...] Worse,
> the moves towards languages defining a precise abstract machine
> are regarded as obsolete (though Java is an exception), so most
> languages don't have one.
Algol68 examples:
From Russia: "Virtual machine and fault-tolerant hardware platform
(SAMSON)"
We implemented an Algol 68 to the SAMSON virtual-machine code compiler
and VM interpreter for all the widespread microprocessors.
* http://www.at-software.com/industrial-electronics/virtual-machine-and-fault-tolerant-hardware-platform.html
From Netherlands:
Tanenbaum, A.S.: Design and Implementation of an Algol 68 Virtual
Machine
http://oai.cwi.nl/oai/asset/9497/9497A.pdf
From Cambridge:
S.R. Bourne, "ZCODE A simple Machine", ALGOL68C technical report,
From E. F. Elsworth. University of Aston in Birmingham
"Compilation via an intermediate language - ALGOL68C and ZCODE"
http://comjnl.oxfordjournals.org/cgi/reprint/22/3/226.pdf
> And, no, I do NOT mean the bolt-on extensions that are so common.
> They sometimes work, just, but never well. You can't turn a
> Model T Ford into something capable of maintaining 60 MPH, reliably,
> by bolting on any amount of modern kit!
To download Linux's Algol68 Compiler, Interpreter & Runtime:
* http://sourceforge.net/projects/algol68
N joy
NevilleDNZ
Most interactions in my day were with Intel's compiler team. Sure, we worked with Microsoft - but they had their own
priorities, and did not necessarily want to deliver technology on the schedule that Intel needed it. If at all.
During P6 development, Intel and Microsoft were not necessarily allied. Microsoft was porting Windows to RISCs. This
resulted in things such as putting MMX in the x87 FP registers - so that an OS change would not be necessary. I.e. so
that Intel could ship MMX without waiting on Microsoft to ship a next OS version.
Most of the time, Microsoft was interested in compiler optimizations that would produce real benefits for real
customers, while Intel was interested in optimizations that would get Intel the best benchmark scores. E.g. Intel
pushed aggressive loop unrolling, and other optimizations that increased code size, while Microsoft was really
interested in optimizations that reduced code size (a) to reduce the number of CDs that a program would ship on, but
also (b) because code size really did matter more than other optimizations, for the mass market that did not max memory out.
For example, Microsoft quite naturally wanted to optimize code for the machines that dominated the mass market. Since
previous generation machines dominate for 3-5 years after a new chip comes out, that makes sense. Intel, on the other
hand, wanted compilers that made the latest and greatest chips, maxed out with high speed memory, look good. You only
really get Microsoft mindshare once the developers have the machines on their desk, whereas Intel developers may
grudgingly work with simulators pre-silicon.
I.e. I don't think that Microsoft was wrong or shortsighted. I think that Intel and Microsoft have different business
interests.
On the other hand, my last project at Intel was much more aligned with Microsoft's and the mass market's interests.
And we had good cooperation from MS compilers.
Intel management always complains that Microsoft is not ready to support a great new chip feature when silicon comes
out. But this will always be the case - except when Microsoft also profits immediately and directly from the new feature.
It looks like my mind got ahead of my fingers when I typed the above. I should have said
I think that one of the most important things for compilers will be
to map the large amounts of programmer expressed parallelism in an ideal
machine - PRAM? CSP? - to whatever machine you have.
I.e. I am saying "code for thousands or millions of threads, and let the compiler or system map it to run on the mere
dozens or hundreds of physical processing elements you have."
Still agree, Nick?
> Unfortunately, since the demise of Algol 68
Aha! Nick, I keep hearing you complain about the sad state of modern tools. But you never give details. Finally, here
we have one such:
Algol 68 is also (was also) one of my favorite languages. What aspects did you like, Nick?
I have posted about how I think in parallel or concurrently, and how I have helped beginning programmers work out bugs in
their code (on sequential machines) by teaching them how to use concurrent assignment or par clauses, and then
automatically translate them to sequential.
Algol 68 was one of the most useful languages in teaching me such techniques.
Although I am not Nick........
This reminds me of why the pure dataflow machines never materialized.
It seems that once you have created the abstract model necessary to
express pure dataflow, your major task becomes compressing (managing)
it so that it does not swamp your necessarily limited resources.
Mitch