PAPER 11/01: Supporting Speculative Multithreading on Simultaneous Multithreaded Processors

Guofeng

Oct 26, 2007, 1:29:53 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu
and Pen-Chung Yew. "Supporting Speculative Multithreading on
Simultaneous Multithreaded Processors", in the Proc. International
Conference on High Performance Computing (HiPC'06), Bangalore, India,
December 18-21, 2006.

jun shen

Oct 31, 2007, 1:54:50 AM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Several simple questions:
1. What is a speculative thread? Sorry, I could not find a suitable
answer on the web.
2. In Section 3.2, non-speculative thread execution, why is the
speculative thread not squashed when there is a non-speculative read
to an SD cache line? In the two-thread model, is there at most one
speculative thread? Please give me the reason.
3. In Section 3.3, I suppose there is a problem with OW: suppose a word
is 16 bits; when there are 16 or more threads in a system, the cost of
OW is very high, so the extensibility of the model is dubious,
right? The SL bit has the same problem.
4. In Section 3.3, dependence detection, if the SL bit of a
non-successor speculative thread is set, that thread also needs to be
squashed, right?

Sugan Vinayagam

Oct 31, 2007, 3:12:34 AM

to asucse520-fall-07-advanc...@googlegroups.com

Hi Jun,

1. A thread becomes a speculative thread when, on encountering a branch instruction, it starts executing instructions from the likely path, i.e. it fetches instructions from the path chosen by the branch prediction algorithm and executes them before the outcome of the branch is known.

2. When the non-speculative thread reads an SD cache line, the speculative thread is not squashed, because the speculative thread is running ahead of the non-speculative one and the SD cache line may hold a future value that would be written to the cache (if the predicted branch is correct). The read is therefore treated as a miss by the non-speculative thread, which reads from the L2 cache instead.

There would be at most one speculative thread in a 2-thread SMT processor.
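
The read path in answer 2 can be sketched roughly as follows. This is only a toy model: the names `Line` and `nonspec_read` and the dictionary-based L1/L2 are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Line:
    state: str  # "SD" = speculative dirty; any other string = ordinary line
    value: int


def nonspec_read(addr, l1, l2):
    """Read issued by the non-speculative thread (toy model, hypothetical names)."""
    line = l1.get(addr)
    if line is not None and line.state != "SD":
        return line.value, "L1 hit"
    # Missing line or SD line: the SD copy holds a value written by the
    # speculative thread, so it is skipped and no thread is squashed.
    return l2[addr], "treated as miss, read from L2"


l1 = {0x40: Line("SD", 99)}  # speculative value buffered in L1
l2 = {0x40: 7}               # committed value in L2
print(nonspec_read(0x40, l1, l2))  # (7, 'treated as miss, read from L2')
```

The point of the sketch is that the non-speculative read neither consumes nor invalidates the speculative value; it simply bypasses it.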

3. I agree that scalability is an issue with 16 or more threads in one SMT processor. The processor would need to maintain 15 copies of speculative data and 1 copy of non-speculative data. Compared with other existing architectures (superscalar), though, this scheme fares well in a 4-thread system, and I hope (though I am not sure) it would fare well in an 8-thread system as well.

Every performance parameter then boils down to the size of the L1 cache. This architecture should fare better as the L1 cache grows, since a larger cache can buffer more speculative data.

4. If the thread is a speculative non-successor thread, then it (the thread that read the data and has its SLi bit set) is executing instructions that come earlier in program order than those of the thread trying to update the cache line. For example, the non-successor speculative thread is executing instructions at the top of the program (in layman's terms), while the thread trying to update the cache is executing instructions from the middle of the program (in instruction execution order).

So this is not a dependence violation.

Let me know if any of these answers is unclear. I am also guessing at the answers to a few of your questions, as they sound logical and consistent with the details in the paper.

Thank you,

Sugan.

ash...@gmail.com

Oct 31, 2007, 12:33:44 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Hi,

I just wanted to add a few things to what Sugan said, for your third
question (Scalability problem for 16 threads).

SMT processors usually restrict the number of threads that can be run
speculatively to a few (usually around 2 to 4). The reason is, for
each thread that you want to run speculatively, you need to duplicate
the execution contexts. Plus, as you can imagine, there are
complications due to the multiple independent commits of speculative
threads, dependency checking among the threads, etc.

So 16 threads would mean way too much functionality and it would be
difficult to get a significant improvement considering the extra cost
and power consumption of the components.

To relate this to actual implementations, the IBM POWER5 (which is
considered to be one of the most aggressive implementations of SMT)
supports only 2 threads per core, i.e. 4 threads on a dual-core chip.

Ashay

jun shen

Oct 31, 2007, 4:05:29 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Hi Sugan & Ashay,
Thank you for the quick answers. However, I still have some doubt
about Q4; I suppose I did not describe my problem clearly.
My understanding is this: even though the speculative (non-successor)
thread T is running ahead of schedule, if some other non-speculative
thread tries to perform a store to cache line A (let me call it that),
and T has A's SL bit set, then T's previous execution is now incorrect
and T should be squashed. Why is that not a dependence violation?

ash...@gmail.com

Oct 31, 2007, 4:44:41 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Hi Jun,

I think I understand your question now. You are asking whether, if a
non-speculative thread does a store to data that has already been
loaded by a speculative thread, the speculative thread should be
squashed. Am I understanding your question correctly?

If yes, I believe it is a case of dependence violation and the
successor thread should be squashed. On page 4 of the paper (for the
two-threaded scheme), it is mentioned that the speculative thread will
be squashed in such a case.

In the "dependence detection" part of Section 3.3, I guess the authors
are referring to a store executed by a speculative thread. In that
case, only the successors of the speculative thread have caused a
dependence violation, so only those would be squashed. Had the store
been executed by a non-speculative thread, the speculative thread
(whether successor or non-successor) would be squashed.

Since that part does not say whether the store is speculative or
non-speculative, I am assuming it is speculative. If it is a
non-speculative store, the rule given earlier for the two-thread
scheme would apply, and the speculative thread would be squashed.
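
The squash rule discussed above can be sketched as follows. This is a minimal illustration that assumes threads are tracked in program order with the non-speculative thread oldest; the function name `threads_to_squash` and the list/set representation are hypothetical, not from the paper. Under that ordering assumption every speculative thread is a successor of the non-speculative thread, so one rule ("squash successors of the writer whose SL bit is set") covers both the speculative and the non-speculative store.

```python
def threads_to_squash(writer, sl_readers, order):
    """Return the threads directly squashed by a store (toy model).

    order      -- thread ids, oldest first; order[0] is assumed non-speculative
    writer     -- id of the thread performing the store
    sl_readers -- ids of threads whose SL bit is set for the stored word
    """
    pos = order.index(writer)
    # Only threads later in program order can have loaded a stale value.
    successors = order[pos + 1:]
    return [t for t in successors if t in sl_readers]


order = ["T0", "T1", "T2", "T3"]  # T0 is the non-speculative thread

# Speculative store by T2: T1 is a non-successor, so only T3 is squashed.
print(threads_to_squash("T2", {"T1", "T3"}, order))  # ['T3']

# Non-speculative store by T0: every speculative reader is a successor.
print(threads_to_squash("T0", {"T1", "T3"}, order))  # ['T1', 'T3']
```

The sketch returns only the threads that directly violated the dependence; whether their own successors are squashed in turn is left out, since the discussion does not settle that.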

I hope that answers your question.

Ashay

jun shen

Nov 1, 2007, 11:59:43 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Jun Shen's critique: I don't know how to upload a file, so I am
pasting the critique here.

Comments on "Supporting Speculative Multithreading on Simultaneous
Multithreaded Processors"

Jun Shen 993992089 Nov 1st, 2007

1. Paper Outline
The paper tries to combine speculative threads with SMT to gain
better system throughput at the least cost. In addition, the paper
presents a cache-based scheme to support large threads, which
overcomes the shortcomings of the LSQ. The authors present 2-thread
and 4-thread architectures and, finally, the comparison results.

2. Contributions:
I. This paper shows how to speed up an SMT system by exploiting
speculative threading with comparatively little complexity. The
authors adopt a cache-based scheme in which only two states, "SD" and
"SV", are added, and two bits, "SL" and "SM", are added for each word.
II. This paper gives the details of two instances of the scheme: the
2-thread scheme and the 4-thread scheme. For each scheme, the authors
describe the additional work needed for speculative threads, such as
dependence violation checks, when to commit and squash a speculative
thread, and how the non-speculative thread should interact with
speculative ones.
III. This paper provides details of their experiments, i.e. how they
compare their architecture with existing architectures.

3. Weaknesses:
I. As I mentioned in the online discussion, the scalability of
speculative threads in SMT is dubious: each speculative thread needs
2 bits per word, so the cost grows quickly as the number of
speculative threads increases.
II. The authors should also explore the optimal dependence-checking
granularity. Although byte-level checking prevents false dependences,
its cost will be high.
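
To put rough numbers on weakness I, here is a back-of-the-envelope sketch. It takes the "2 bits per word per speculative thread" figure from the critique at face value; the 64 KB cache size, the 4-byte word, and the function name `spec_bit_overhead` are assumptions made only for illustration and do not come from the paper.

```python
def spec_bit_overhead(cache_bytes, word_bytes, spec_threads,
                      bits_per_word_per_thread=2):
    """Extra state bits and their size relative to the data array (toy model)."""
    words = cache_bytes // word_bytes
    extra_bits = words * spec_threads * bits_per_word_per_thread
    data_bits = cache_bytes * 8
    return extra_bits, extra_bits / data_bits


# Assumed configuration: 64 KB L1, 4-byte words; 1, 3 and 15 speculative threads.
for n in (1, 3, 15):
    bits, fraction = spec_bit_overhead(64 * 1024, 4, n)
    print(n, bits, fraction)
```

In this model the overhead grows linearly with the number of speculative threads, and with 15 speculative threads the extra bits approach the size of the data array itself.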

David.S....@asu.edu

Nov 3, 2007, 4:26:17 AM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Supporting Speculative Multithreading on SMT Paper Critique

This paper addresses schemes for improving processor performance via
speculative multithreading, a technique that allows the processor to
issue instructions (in parallel with other instructions) that will
likely be executed in the future and to roll back or "squash" those
instructions later if it turns out that they should not have been
executed.

The author presents Load/Store queue (LSQ) based architectures as well
as a proposed cache-based architecture. The advantage of the LSQ
architecture is that it does not require a significant change to
existing architectures. Modern out-of-order-execution CPUs already
implement LSQs to buffer "in-flight" out-of-order instructions, which
are committed in program order. This same architecture could be
reused to buffer the data produced by speculated instructions.

The author presents a scalability problem with the LSQ implementation,
specifically, the size of the queue. A large LSQ will require more
physical space, create more heat, and consume more (wasted) power.
The author favors the use of a shared L1 cache-based implementation
where speculative threads will all store their results rather than the
LSQ. Existing architectures would have to be re-designed to support
the transfer of speculated results to the L1 cache which is capable of
storing more data than the LSQ.

The author suggests that the sizes of the speculated threads can be
much larger in the cache based implementation which could potentially
improve performance.

Weaknesses:
What are the disadvantages of larger speculative threads? I did not
see any consideration of this. The cache-based implementation will be
able to store more result data from larger threads of instructions;
however, the thread size still needs to remain small when speculating
because mis-speculation becomes more likely as the number of
instructions per speculated thread increases.
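
The argument above can be illustrated with a toy probability model. It assumes, purely for illustration, that each instruction independently triggers a squash with the same small probability; real dependence violations are not independent, and the function name `p_misspeculation` and the value 0.0005 are hypothetical.

```python
def p_misspeculation(n_instructions, p_per_instruction):
    """Chance that at least one of n independent instructions triggers a squash."""
    return 1.0 - (1.0 - p_per_instruction) ** n_instructions


# Assumed per-instruction violation probability of 0.05%.
for n in (100, 1000, 10000):
    print(n, p_misspeculation(n, 0.0005))
```

Under this assumption the probability of a squash rises steadily with thread size, which is the effect the critique is worried about.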

The state transitions required to implement the author's scheme are
very complicated. Wouldn't the proposed state-transition algorithm be
difficult (expensive) to implement in hardware? It would also
require additional hardware support to transfer speculated data
to and from the cache, as well as to detect dependence violations.
Is there a cost analysis?

David.S....@asu.edu

Nov 3, 2007, 4:31:25 AM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Here is a good article that explains Speculative Multithreading.

http://www.intel.com/technology/magazine/research/speculative-threading-1205.htm

Pradnyesh Gudadhe

Nov 8, 2007, 5:23:13 AM

to asucse520-fall-07-advanc...@googlegroups.com

Please find the summary of the discussions attached with this email.
Thanks.

Regards,
Pradnyesh

http://www.geocities.com/paddyinpilani

Summary.pdf

Sugan Vinayagam

Nov 9, 2007, 1:25:58 PM

to asucse520-fall-07-advanc...@googlegroups.com

Hi Pradnyesh,

Here is my comment on one of the weaknesses listed in the summary.

"3) The paper fails to talk about disadvantage of having larger
speculative threads.
With the increasing number of instructions per thread, possibility of
mis-speculation increases which may lead to LSQ Based Speculation being
better than Cache based speculation."

LSQ based speculation technique does not support larger speculative
threads due to limitation on the LSQ size, whereas L1 cache is usually
much larger than LSQ and hence can support larger speculative threads
by buffering a large amount of speculative values.

Also, I fail to understand why an increase in the number of
instructions per thread increases the possibility of mis-speculation.
This cannot always be true unless there are a lot of branch
instructions among the instructions executed by the speculative
thread. I would rather say that the cost of mis-speculation is high
for large speculative threads.
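
A small sketch of the cost argument, assuming a deliberately simple model in which a squash throws away all the work of the thread and the squash probability is held fixed regardless of thread size. The function name `expected_wasted_instructions` and the 5% figure are hypothetical.

```python
def expected_wasted_instructions(thread_size, p_squash):
    """Expected discarded work if a squash discards the whole thread (toy model)."""
    return thread_size * p_squash


# Same assumed squash probability (5%) for every thread size.
for size in (100, 1000, 10000):
    print(size, expected_wasted_instructions(size, 0.05))
```

Even with the probability held constant, the expected wasted work grows in proportion to thread size, which matches the point that the cost, rather than the likelihood, is what rises for large speculative threads.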

Thanks,
Sugan.
