PAPER 11/27: Hybrid multi-core architecture for boosting single-threaded performance

Guofeng

Nov 17, 2007, 12:56:20 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

@article{1241603,
author = {Jun Yan and Wei Zhang},
title = {Hybrid multi-core architecture for boosting single-threaded
performance},
journal = {SIGARCH Comput. Archit. News},
volume = {35},
number = {1},
year = {2007},
issn = {0163-5964},
pages = {141--148},
doi = {http://doi.acm.org/10.1145/1241601.1241603},
publisher = {ACM},
address = {New York, NY, USA},
}

jun shen

Nov 27, 2007, 9:18:51 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

critique by Jun Shen

1. Overview
The paper presents a compiler-driven heterogeneous multicore
architecture for boosting the performance of single-threaded
applications. The architecture combines a VLIW core with a superscalar
core, and possibly other core types. The paper also proposes shifting
part of the work from the hardware to the compiler.

2. Contributions:
I. The paper proposes a hybrid multicore architecture that can
improve the performance of single-threaded applications.
II. The hybrid architecture can achieve high ILP through the
cooperation of the VLIW core and the superscalar core. It harnesses
the advantages of each core type while offsetting their disadvantages.
III. The hybrid architecture is friendlier to compilers, which lowers
the hardware complexity of the architecture.
IV. The hybrid architecture is energy-efficient.
V. The architecture is useful for high-performance embedded
applications.

3. Weaknesses:
I. The hybrid architecture only targets the performance of
single-threaded applications; it is not suited to multi-threaded
applications.
II. Because it is a multi-ISA architecture, it is not easy to migrate
a thread from one core to another.

Tao (Tony) Liu

Nov 28, 2007, 2:27:12 AM

to asucse520-fall-07-advanc...@googlegroups.com

Hi all,

Please find my comments on this paper attached.
Thanks!

Regards,
Tao Liu

Hybrid_Paper_Comments_TAO_LIU.doc

David.S....@asu.edu

Nov 28, 2007, 3:08:54 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Critique by David Phillips

This paper asks how single-threaded applications can benefit from
multi-core architectures without having to be rewritten in a
multi-threaded style. The fundamental problem is that as more cores
are added to Chip Multi-Processors (CMPs), existing single-threaded
programs do not benefit the way they did from earlier uniprocessor
advances such as increased clock frequency and super-scalar
out-of-order execution.

The authors seek an architectural solution that lets single-threaded
applications run faster on multi-core hardware while still allowing
multi-threaded programs to gain performance benefits.

The authors propose a hybrid architecture that contains both a Very
Long Instruction Word (VLIW) execution core and a super-scalar
execution core.

VLIW cores are much less complex than super-scalar cores because they
do not implement dynamic out-of-order execution in hardware. A
by-product is that VLIW cores are much more power efficient, since
they avoid the hardware overhead that super-scalar cores require.
Instead, VLIW cores defer complexity to the compiler, whose role is to
discover parallelism in the code and generate instructions that take
full advantage of the hardware.

The advantage of the VLIW core is that the compiler can make
scheduling decisions about instructions while looking at the entire
program. The super-scalar core can only make scheduling decisions
based upon a limited instruction window. VLIW cores can also run at
faster clock rates because they have less hardware overhead. However,
the disadvantage is that when cache misses occur on the VLIW core, the
entire pipeline stalls as there is no dynamic execution logic in the
hardware to schedule more instructions while waiting.

The advantage of the super-scalar architecture is that it can better
deal with runtime events like cache misses and memory dependencies.
Furthermore, code does not have to be re-compiled for super-scalar
architectures like it has to be with VLIW architectures.
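
As a rough illustration of the trade-off described above, the toy
model below counts cycles for a core that stalls on every cache miss
and for one that hides part of each miss behind independent work. The
function names, the parameters, and the hidden_per_miss figure are
assumptions made for this sketch; none of it comes from the paper or
its simulator.

```python
def in_order_cycles(n_instr, issue_width, n_misses, miss_latency):
    """Toy VLIW-style core: each miss stalls the pipeline for its full latency."""
    return n_instr / issue_width + n_misses * miss_latency


def out_of_order_cycles(n_instr, issue_width, n_misses, miss_latency,
                        hidden_per_miss):
    """Toy super-scalar core: independent instructions in the window
    hide up to hidden_per_miss cycles of every miss (assumed value)."""
    exposed = max(0, miss_latency - hidden_per_miss)
    return n_instr / issue_width + n_misses * exposed


if __name__ == "__main__":
    # 1000 instructions, 4-wide issue, 10 misses of 100 cycles each.
    print(in_order_cycles(1000, 4, 10, 100))          # 1250.0
    print(out_of_order_cycles(1000, 4, 10, 100, 30))  # 950.0
```

With no misses the two models give the same count, which matches the
point that the VLIW core's weakness shows up only on runtime events.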

Assuming a dual-core system, the compiler will take a single-threaded
program and convert it into a 2-threaded program. The thread with the
higher ILP (predictable memory access patterns, branches, etc.) will
be scheduled to run on the VLIW core. The thread with the lower ILP
will be scheduled to run on the super-scalar core.
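
The assignment rule in the paragraph above can be sketched in a few
lines. This is only an illustration: the Thread class, the
predictability score, and assign_cores are hypothetical names, and the
paper's compiler presumably uses a much richer analysis than a single
number per thread.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    # Hypothetical 0..1 score standing in for "predictable memory
    # access patterns, branches, etc."; not a metric from the paper.
    predictability: float


def assign_cores(a, b):
    """Send the more predictable thread to the VLIW core and the
    other one to the super-scalar core."""
    if a.predictability >= b.predictability:
        vliw, superscalar = a, b
    else:
        vliw, superscalar = b, a
    return {"vliw": vliw.name, "superscalar": superscalar.name}


if __name__ == "__main__":
    print(assign_cores(Thread("loop_kernel", 0.9), Thread("pointer_chase", 0.3)))
```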

Strengths:
------------
This paper has a logical thesis and presents good arguments for the
pros and cons of VLIW versus super-scalar architectures. Furthermore,
the concept of combining the advantages of each architecture to
side-step their corresponding disadvantages has great potential.

This paper attempts to find ways of improving single-threaded program
behavior even as the research trend is highly focused on
multi-threading on multi-cores. Single-threaded programs are going to
account for the majority of programs for the foreseeable future, so
finding ways to improve their performance is very beneficial.

This paper presents a strategy for making CMPs more power efficient,
which is going to be very important as cores become more numerous on
the die.

Weaknesses:
-------------------
The paper only considers a dual-core system with 1 VLIW core and 1
super-scalar core. Further work could examine a many-core system.

The success of this architecture depends completely on the compiler's
ability to locate parallel instructions and schedule them accordingly
on the multiple cores. The authors do not discuss in much detail how
complex the compiler would be.

There is also the problem of programs having to be recompiled whenever
a new architecture is released. If the number of cores doubles, then
the compiler has to find even more parallelism and generate a thread
for each core. This requires existing code to be recompiled, and it
also raises the question of how much parallelism the compiler will be
able to find before it reaches diminishing returns.
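
One standard way to put numbers on that concern is Amdahl's law, which
the paper itself is not quoted as using here. The sketch below assumes
a fixed fraction of the program can be spread evenly over the cores;
the function name and the sample fraction are illustrative only.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Assumed example: the compiler parallelizes half of the program.
    for cores in (2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.5, cores), 3))
```

Under that assumption the speedup can never reach 2x no matter how
many cores are added, which is the diminishing-returns effect in
question.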

Overall, the solution seems very difficult to scale: each time more
cores are added, the compiler will have to be modified and existing
programs must be recompiled.

ash...@gmail.com

Nov 28, 2007, 9:44:51 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Here's my critique of this paper:

The paper describes a technique of utilizing different architectures
to get the best performance. It utilizes the capabilities of the
compiler to extract threads, further split them if necessary and
schedule them on specific processors. It also illustrates the
optimizations that can be performed by the compiler.

Strengths of the paper:
1) With the technique presented, different architectures can offset
each other's disadvantages in performance and power.
2) The technique can be extended to more cores to get the optimal
execution time.
3) The paper considers imbalance of thread sizes between the
superscalar and VLIW processors and accordingly presents a technique
to overcome it by splitting the thread and re-analyzing it.
4) The compiler also performs a number of optimizations (profitability
analysis) to determine the best combination.
5) The paper analyzes a range of L2 miss latencies and shows that,
with pre-execution, the L2 miss latency has little effect on the
execution time.
6) The technique of pre-execution yields good results for applications
that suffer from high L2 miss frequency.
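
To see why running ahead of the main thread could make execution time
nearly insensitive to L2 miss latency (points 5 and 6), here is a toy
cycle count. It is a sketch under strong assumptions that are mine,
not the paper's: every load misses in L2, memory can serve any number
of outstanding requests, and a helper requests data for loads up to
run_ahead positions ahead of the main thread. The name total_cycles
and its parameters are hypothetical.

```python
def total_cycles(n_accesses, miss_latency, run_ahead):
    """Cycles for the main thread to finish n_accesses loads."""
    ready_at = {}  # load index -> cycle at which its data arrives
    cycle = 0
    for i in range(n_accesses):
        # Request every not-yet-requested load within reach. With
        # run_ahead == 0 this is just the main thread's own demand miss.
        for j in range(i, min(i + run_ahead + 1, n_accesses)):
            if j not in ready_at:
                ready_at[j] = cycle + miss_latency
        stall = max(0, ready_at[i] - cycle)  # wait for the data
        cycle += stall + 1                   # one cycle to use it
    return cycle


if __name__ == "__main__":
    for latency in (5, 20):
        print(latency,
              total_cycles(1000, latency, 0),        # no pre-execution
              total_cycles(1000, latency, latency))  # helper runs ahead
    # prints "5 6000 1005" and then "20 21000 1020"
```

In this model, quadrupling the latency multiplies the time without
run-ahead by 3.5 but adds only 15 cycles when the helper runs far
enough ahead.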

Weaknesses:
1) There is no significant improvement in total execution time with
the use of pre-execution and pre-fetching.
2) The average utilization of the superscalar functional units is very
low: 4.2%.
3) Currently, the focus is on a superscalar processor and a VLIW
processor of pre-defined frequency. The compiler has to be finely
tuned to extract maximum performance considering different
configurations that can be used.

Ashay

Mike

Nov 29, 2007, 4:36:49 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

My Critique:
Contributions
This paper addresses the question of how best to continue to speed up
single threaded applications such as the previous 20 years of legacy-
ware by using multi-core processors. A heterogeneous processor is
suggested that takes advantage of both superscalar hardware that uses
pre-fetch functionality and aggressive compilers capable of doing
their own parallelization. It does so by having 2 cores, a superscalar
core and a VLIW core. The VLIW core is much faster because of its
simpler hardware, so if the compiler takes care of all the statement
reordering, programs would be sped up by this alone. Programs are
broken up into 2 threads, and the thread with the higher
predictability is given to the VLIW core.

Strengths
The authors make a good case for the heterogeneous processor they
propose, both in the need for this type of work and in the
complementary effect these cores can have on each other.

I personally would like to see more work done on this area, so I'll
call the topic choice alone a strength because any success in the area
will be widely applicable.

The paper uses a set of standard benchmarks, which makes it easier to
compare its results with those of other papers.

Weaknesses
The proposed scheme does not seem very effective in the authors' own
tests.

The authors do not address the difficulties of writing and maintaining
a compiler that takes full advantage of the VLIW core. Does the
compiler need to be rewritten every time a new socket type comes out?
What if more cores are added? Will code need to be recompiled before
it can run efficiently on a machine, assuming a compiler is available
that does everything the authors suggest it can?

peyman...@gmail.com

Dec 4, 2007, 1:05:50 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

I just had a comment on one of the anonymous critiques that I received
where he/she said:
> However, the presentation incorrectly mentions the Phase 1 of the paper as "exploiting speculative threads".

Exploiting speculative threads and extracting pre-execution threads
are the same thing, as stated on page 4 of the paper:

"While Phase 1 focuses on exploiting speculative threads or helper
threads to boost the performance of the main thread, the second phase
will center on extracting non-speculative multi-grain parallelism for
further performance improvement."

Sugan Vinayagam

Dec 4, 2007, 3:24:52 PM

to asucse520-fall-07-advanc...@googlegroups.com

Hi,

Please find attached the summary of the critiques for this paper.

Thanks,
Sugan.

Summary-Sugan-Vinayagam.pdf
