I just wanted to add a few things to what Sugan said about your third
question (the scalability problem with 16 threads).
SMT processors usually restrict the number of threads that can be run
speculatively to a few (usually around 2 to 4). The reason is, for
each thread that you want to run speculatively, you need to duplicate
the execution contexts. Plus, as you can imagine, there are
complications due to the multiple independent commits of speculative
threads, dependency checking among the threads, etc.
So supporting 16 threads would require far too much additional
hardware, and it would be difficult to get a significant improvement
given the extra cost and power consumption of those components.
To relate this to actual implementations, the IBM POWER5 (which is
considered to be one of the most aggressive implementations of SMT)
supports only two hardware threads per core, i.e. four threads on a
dual-core chip.
Ashay
> On 10/30/07, jun shen <jun.she...@asu.edu> wrote:
>
> > Several simple questions:
> > 1. What is a speculative thread? Sorry, I cannot find a suitable
> > answer on the web.
> > 2. In section 3.2, non-speculative thread execution, why is the
> > speculative thread not squashed when there is a non-speculative read
> > to an SD cache? In the two-thread model, is there at most one
> > speculative thread? Please give me the reason.
> > 3. In section 3.3, I suppose there is a problem with OW: suppose a
> > word is 16 bits; when there are 16 or more threads in a system, the
> > cost of OW is very high, hence the extensibility of the model is
> > dubious, right? The same problem applies to the SL bit.
> > 4. In section 3.3, dependence detection, if the SL bit of a
> > non-successor speculative thread is set, that thread also needs to be
> > squashed, right?
>
> > On Oct 27, 1:29 am, Guofeng <guofeng.d...@gmail.com> wrote:
> > > Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu
> > > and Pen-Chung Yew. "Supporting Speculative Multithreading on
> > > Simultaneous Multithreaded Processors", in the Proc. International
> > > Conference on High Performance Computing (HiPC'06), Bangalore, India,
> > > December 18-21, 2006.
>
> --
> Sugan Vinayagam
I think I understand your question now. You wanted to ask whether, if
a non-speculative thread stores to data that has already been loaded
by a speculative thread, the speculative thread should be squashed.
Am I understanding your question correctly?
If yes, I believe it is a case of dependence violation and the
successor thread should be squashed. On page 4 of the paper (for the
two-thread scheme), it is mentioned that the speculative thread will
be squashed in such a case.
In the "dependence detection" part of section 3.3, I guess the authors
are referring to a store executed by a speculative thread. In that
case, only the successors of that speculative thread can have caused a
dependence violation, so only those would be squashed. Had the store
been executed by a non-speculative thread, the speculative thread
(whether it is a successor or a non-successor) would be squashed.
Since that part does not say whether the store is speculative or
non-speculative, I am assuming it is a speculative store. If it is a
non-speculative store, the rule described earlier for the two-thread
scheme would apply: the speculative thread would be squashed.
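To make the reading above concrete, here is a minimal sketch of that squash rule. It models only the interpretation given in this post, not the paper's actual hardware; the `Thread` class, its fields, and `threads_to_squash` are hypothetical names, and the `loaded` set stands in for the per-word SL bits.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    tid: int                 # position in program order (0 = oldest)
    speculative: bool
    loaded: set = field(default_factory=set)  # addresses whose SL bit is set

def threads_to_squash(storer, address, threads):
    """Return ids of threads squashed when `storer` writes `address`."""
    victims = []
    for t in threads:
        # Only speculative threads that already loaded the address matter.
        if t is storer or not t.speculative or address not in t.loaded:
            continue
        # Speculative store: only successors (later threads) are violated.
        # Non-speculative store: any speculative thread that loaded it is.
        if not storer.speculative or t.tid > storer.tid:
            victims.append(t.tid)
    return victims

t0 = Thread(0, speculative=False)
t1 = Thread(1, speculative=True, loaded={0x10})
t2 = Thread(2, speculative=True)
t3 = Thread(3, speculative=True, loaded={0x10})
threads = [t0, t1, t2, t3]

print(threads_to_squash(t2, 0x10, threads))  # store by speculative t2
print(threads_to_squash(t0, 0x10, threads))  # store by non-speculative t0
```

In this toy setup, a store by the speculative thread `t2` affects only its successor `t3`, while a store by the non-speculative thread `t0` affects every speculative thread that has loaded the address.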
I hope that answers your question.
Ashay
> > > Sugan Vinayagam
Comments on "Supporting Speculative Multithreading on Simultaneous
Multithreaded Processors"
Jun Shen 993992089 Nov 1st, 2007
1. Paper Outline
The paper combines speculative threads with SMT to gain better system
throughput at the least cost. In addition, the paper presents a
cache-based scheme to support large threads. This scheme overcomes the
shortcomings of the LSQ. The authors describe 2-thread and 4-thread
architectures and close with a comparison of the results.
2. Contributions:
I. This paper shows how to speed up an SMT system by exploiting
speculative threads with comparatively little added complexity. The
authors adopt a cache-based scheme in which only two states, "SD" and
"SV", are added, along with two bits, "SL" and "SM", for each word.
II. This paper gives the details of two instances of the scheme: a
2-thread scheme and a 4-thread scheme. For each, the authors describe
the additional work that speculative threads require, such as
dependence violation checking, when to commit or squash a speculative
thread, and how the non-speculative thread should interact with the
speculative ones.
III. This paper provides details of the experiments, that is, how the
authors compare their architecture with existing architectures.
3. Weaknesses:
I. As I mentioned in the online discussion, the scalability of
speculative threads on SMT is doubtful: each speculative thread needs
2 bits per word, so the cost grows quickly as the number of
speculative threads increases.
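A back-of-the-envelope sketch of this overhead, taking the estimate above (2 tracking bits per word per speculative thread) at face value rather than from the paper. The function name and the 32-bit word size in the loop are assumptions chosen only for illustration.

```python
def tracking_overhead(word_bits, spec_threads, bits_per_thread=2):
    """Extra tracking bits per word, as a fraction of the word size.

    Assumes every speculative thread needs its own `bits_per_thread`
    bits for each word (this post's estimate, not a figure from the paper).
    """
    return bits_per_thread * spec_threads / word_bits

# Hypothetical 32-bit words with 1, 3 and 15 speculative threads.
for n in (1, 3, 15):
    print(n, tracking_overhead(32, n))
```

Under these assumptions the overhead grows linearly with the number of speculative threads, and with 16-bit words and 16 threads the tracking bits would take twice as much storage as the data itself.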
II. The authors should also explore the optimal granularity for
dependence checking. Although byte-level checking prevents false
dependences, the checking cost will be high.
Supporting Speculative Multithreading on SMT Paper Critique
This paper addresses schemes for improving processor performance via
speculative multithreading, a technique that allows the processor to
issue instructions that will likely be executed in the future (in
parallel with other instructions) and to roll back, or "squash", those
instructions later if it turns out that they should not have been
executed.
The author presents load/store queue (LSQ) based architectures as well
as a proposed cache-based architecture. The advantage of the LSQ
architecture is that it does not require a significant change to
existing architectures. Modern out-of-order CPUs already implement
LSQs to buffer "in-flight" out-of-order instructions, which are
committed in program order. The same structure could be reused to
buffer the data produced by speculated instructions.
The author presents a scalability problem with the LSQ implementation,
specifically, the size of the queue. A large LSQ will require more
physical space, create more heat, and consume more (wasted) power.
The author favors a shared L1 cache-based implementation, in which all
speculative threads store their results in the cache rather than in
the LSQ. Existing architectures would have to be redesigned to support
the transfer of speculated results to the L1 cache, which can hold
more data than the LSQ.
The author suggests that the sizes of the speculated threads can be
much larger in the cache based implementation which could potentially
improve performance.
Weaknesses:
What are the disadvantages of larger speculative threads? I did not
see any consideration of this. The cache-based implementation will be
able to store more result data from larger threads of instructions;
however, the thread size still needs to remain small when speculating,
because the likelihood of mis-speculation increases with the number of
instructions per speculated thread.
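One simple way to see this claim is a toy probability model, which is an assumption for illustration and not something taken from the paper: suppose each speculative load is violated independently with a small probability `p_violation` (a hypothetical parameter). Then a thread with `n_loads` loads is squashed with probability 1 - (1 - p)^n.

```python
def p_misspeculation(n_loads, p_violation):
    """Chance that at least one of n independent speculative loads is violated."""
    return 1.0 - (1.0 - p_violation) ** n_loads

# Hypothetical 1% violation chance per load, for growing thread sizes.
for n in (10, 100, 1000):
    print(n, round(p_misspeculation(n, 0.01), 3))
```

Under this independence assumption the squash probability rises toward 1 as the thread grows; real programs may behave very differently if dependences are rare or clustered.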
The state transitions required to implement the author's scheme are
very complicated. Wouldn't the proposed state-transition algorithm be
difficult (expensive) to implement in hardware? It would also require
additional hardware support to transfer speculated data to and from
the cache and to detect dependency violations. Is there a cost
analysis?
http://www.intel.com/technology/magazine/research/speculative-threading-1205.htm
Here is a good article that explains Speculative Multithreading.
My comment on one of the weaknesses listed in the summary:
"3) The paper fails to talk about disadvantage of having larger
speculative threads.
With the increasing number of instructions per thread, possibility of
mis-speculation increases which may lead to LSQ Based Speculation
being better than Cache based speculation."
The LSQ-based speculation technique does not support larger
speculative threads because of the limited LSQ size, whereas the L1
cache is usually much larger than the LSQ and hence can support larger
speculative threads by buffering a large amount of speculative values.
Also, I fail to understand why an increase in the number of
instructions per thread increases the possibility of mis-speculation.
This cannot always be true, unless there are a lot of branch
instructions among the instructions executed by the speculative
thread. I would rather say that the cost of mis-speculation is high in
large speculative threads.
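To illustrate the cost point with a rough sketch: even if the squash probability is held fixed regardless of thread size, the work thrown away per squash grows with the thread. The function below assumes, pessimistically, that a squash discards the whole thread; the name and the numbers are hypothetical.

```python
def expected_wasted_instructions(thread_len, p_squash):
    """Expected instructions discarded, assuming a squash throws away the whole thread."""
    return p_squash * thread_len

# Same hypothetical 5% squash probability for small and large threads.
for n in (100, 1_000, 10_000):
    print(n, expected_wasted_instructions(n, p_squash=0.05))
```

With the probability unchanged, the expected wasted work still scales linearly with thread length.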
Thanks,
Sugan.