This paper evaluates two existing approaches, Threaded MultiPath
Execution (TME) and Dynamically MultiThreaded Processors (DMT), to
implementing speculative multi-threading on a Simultaneously Multi
Threaded Processor in order to improve the performance of single
threaded programs. The author then defines a new approach called an
"Implicitly Multi Threaded Processor" (IMT) that is intended to
address the shortcomings of the TME and DMT approaches.
The Implicitly Multi Threaded Processor technique relies completely on
the compiler to find parallelism and to mark "spawn points" in the
generated code with special instructions. At runtime, when the
processor reaches a spawn point, it begins issuing instructions from
later in the program in parallel with the instructions that
immediately follow the spawn point. These future instructions run
"speculatively", which means that they may or may not turn out to be
operations that the sequential program would actually have performed.
The speculative instructions write all of their results to a temporary
buffer (the Load Store Queue) until it is determined that they were in
fact supposed to execute. Once the speculative instructions are
validated, their results are committed from the Load Store Queue to
the system's memory. The speedup comes from issuing future
instructions in parallel with current ones, which lets the processor
keep all of its functional units busy.
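
The buffer-then-commit behavior described above can be illustrated
with a small software model. This is only a sketch of the idea, not
the paper's hardware design: the class and method names are
hypothetical, and the rule that a speculative load sees the thread's
own buffered store first is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SpeculativeThread:
    """Toy model of one speculative thread's buffered stores (hypothetical names)."""
    pending_stores: dict = field(default_factory=dict)  # address -> value

    def store(self, address, value):
        # Speculative stores go to the buffer, never directly to memory.
        self.pending_stores[address] = value

    def load(self, memory, address):
        # Assumption: a load sees this thread's own buffered store first.
        if address in self.pending_stores:
            return self.pending_stores[address]
        return memory.get(address, 0)

    def commit(self, memory):
        # Speculation validated: drain the buffered stores into memory.
        memory.update(self.pending_stores)
        self.pending_stores.clear()

    def squash(self):
        # Speculation failed: discard the buffer; memory is left untouched.
        self.pending_stores.clear()


memory = {0x10: 1}
thread = SpeculativeThread()
thread.store(0x10, 42)
before_commit = memory[0x10]                # memory still holds the old value
seen_by_thread = thread.load(memory, 0x10)  # the thread sees its own store
thread.commit(memory)
after_commit = memory[0x10]                 # the store is now visible in memory
```

Calling squash() instead of commit() would leave memory exactly as it
was before the spawn point, which is what makes mis-speculation safe.
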
The author states that a weakness of the Threaded Multi Path Execution
approach is that the speculated threads are only issued when branch
instructions are encountered. The speculated thread is used to
execute the branch path that is not predicted by the processor. If
the processor mis-predicts the branch then the speculated thread is
allowed to commit. The author believes that, because branch
mis-predictions are uncommon, this restriction lessens the overall
performance gains. The author argues that the IMT approach allows the
processor to issue speculative threads in the more common case of
correct prediction, which should yield a larger performance
improvement.
The author states that the Dynamically MultiThreaded Processor
technique does not involve any compiler assistance. Instead, the DMT
model implements a complicated value prediction scheme in hardware.
Because this scheme usually has low accuracy, thread mis-speculation
occurs more often than correct speculation, which makes the DMT
approach less effective. The author argues that the compiler is much
better suited to finding effective spawn points in sequential code
because it has a much larger view of the program's memory accesses and
control flow.
Strengths:
The author points out that a "naive" implementation of IMT that does
not dynamically account for system resources, the limited number of
hardware thread contexts, and thread start-up latency will actually
degrade performance relative to a non-speculating processor.
The author shows that in order to observe a performance gain from the
IMT architecture, the processor must be able to select instructions
based upon the availability of the functional units needed by the
instruction.
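
A minimal sketch of what such resource-aware selection could look like
follows. It is an illustration under assumptions, not the paper's
mechanism: the function name, the unit names, and the policy of
scanning candidates in a fixed priority order are all hypothetical.

```python
from collections import Counter


def select_for_issue(candidates, free_units):
    """Pick the instructions whose functional unit is still available.

    candidates: list of (thread_id, unit) pairs in priority order (assumed).
    free_units: dict mapping unit name to the number of free units this cycle.
    """
    remaining = Counter(free_units)  # missing unit types count as zero
    issued = []
    for thread_id, unit in candidates:
        if remaining[unit] > 0:
            remaining[unit] -= 1
            issued.append((thread_id, unit))
        # Otherwise skip: the unit is busy, so this instruction waits.
    return issued


candidates = [(0, "alu"), (1, "alu"), (1, "fpu"), (2, "mem"), (2, "alu")]
free_units = {"alu": 2, "fpu": 1, "mem": 0}
issued = select_for_issue(candidates, free_units)
print(issued)
```

In this example the memory instruction from thread 2 is skipped
because no memory unit is free, and its ALU instruction is skipped
because both ALUs were already claimed by higher-priority candidates.
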
The author also suggests that each hardware context be multiplexed
(switched) between speculating threads in order to achieve more
instruction overlap from small threads.
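
One way to see why multiplexing helps small threads is to count how
many threads can be resident at once. The sketch below is a rough
model under stated assumptions: it treats a context as a fixed number
of instruction slots, places threads in program order, and uses
hypothetical names and sizes throughout. It does not reproduce the
paper's actual hardware.

```python
def threads_in_flight(thread_sizes, num_contexts, context_capacity, multiplex):
    """Count how many threads, taken in program order, can be resident at once.

    Assumption: each context has context_capacity instruction slots. Without
    multiplexing a context holds one thread; with it, a context accepts
    further threads while slots remain.
    """
    free = [context_capacity] * num_contexts
    occupied = [False] * num_contexts
    resident = 0
    ctx = 0
    for size in thread_sizes:
        # Advance to the next context that can accept this thread.
        while ctx < num_contexts and (
            size > free[ctx] or (occupied[ctx] and not multiplex)
        ):
            ctx += 1
        if ctx == num_contexts:
            break  # no context left; later threads must wait
        free[ctx] -= size
        occupied[ctx] = True
        resident += 1
    return resident


small_threads = [8, 8, 8, 8, 8, 8]  # hypothetical thread sizes in instructions
without_mux = threads_in_flight(small_threads, 2, 32, multiplex=False)
with_mux = threads_in_flight(small_threads, 2, 32, multiplex=True)
print(without_mux, with_mux)
```

With two contexts of 32 slots each, only two of the six small threads
are resident without multiplexing, while all six fit with it, so more
instructions from different threads can overlap.
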