PAPER 11/13: Optimizing the Fast Fourier Transform on a Multi-core Architecture

Guofeng

unread,

Nov 13, 2007, 11:14:25 PM11/13/07

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Optimizing the Fast Fourier Transform on a Multi-core Architecture

Long Chen Ziang Hu Junmin Lin Gao, G.R.
Electr. & Comput. Eng. Dept., Delaware Univ., Newark, DE;

This paper appears in: Parallel and Distributed Processing Symposium,
2007. IPDPS 2007. IEEE International
Publication Date: 26-30 March 2007
On page(s): 1-8
ISBN: 1-4244-0910-1
INSPEC Accession Number: 9516876
Digital Object Identifier: 10.1109/IPDPS.2007.370639
Posted online: 2007-06-11 10:26:41.0

YI-HSI...@asu.edu

unread,

Nov 15, 2007, 6:02:26 AM11/15/07

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Paper Title：Optimizing the Fast Fourier Transform on a Multi-core
Architecture
Present Date：Nov. 13, 2007
Critic：Yi-hsin Tseng

This paper provides several points of view to optimize the Fast
Fourier Transform on a Multi-core Architecture, including problem
decomposition, load balancing, work distribution, and data-reuse. The
authors first introduce what is FFT and what is it influence and
importance to be applied in the architecture. Next, the paper
describes the FFT algorithm which to compute DFT, and the basic method
for the problem. Thirdly, the authors present the C64 chip
architecture and main features. Then is optimization and analysis for
their specific C64 architecture. Finally, the paper shows the
simulation examples and conclusion.

The major contributions of this paper is：
1) Successful optimization for C64- like large-scale multi-core
architectures and match them well with some key multi-core
architecture features
2) Provide quantitative evidence on the importance of each
optimization identified in contribution 1.
3) Automatic optimization by our compiler, the design and
implementation of which is guided by the feedbacks from contribution 1
and 2.

Strengths of this paper：
1) All FFTs are independent of each other, so they can be computed in
parallel. That is, they can take the advantages of Parallel
Implementation.
2) In load balancing issue, the paper balances the work for each
thread which can solve the problem of limitation of speedup
achievable. (Use fine-grain work)
3) The authors tailored the compiler such that it accounts for the
different latencies when applying instruction scheduling to solve the
problem of different delays of memory instructions.
4) Multi-side optimized techniques are used in this paper. And
achieving the goal of better performance uses less memory space.

Weakness of this paper：
1) The architecture feature of C64 has not been fully explored is the
fast scratchpad memory (SP) for each thread unit.
2) Did not mention about the point of when data cannot be fully stored
in on chip memories, how to larger FFT problem sizes.
3) The design and optimal parameters is specially focused on the C64
architecture; can other architecture use the method without a good
deal of modifications?

Saleel Kudchadker

unread,

Nov 15, 2007, 11:39:31 AM11/15/07

to asucse520-fall-07-advanc...@googlegroups.com

Hi

Critique attached herewith

Saleel

Graduate Student, School of Computing and Informatics
TA, Del E. Webb School of Construction
Arizona State University
Tempe, AZ, 85281

Critique_MulticoreFFT.pdf

Sugan Vinayagam

unread,

Nov 15, 2007, 12:22:21 PM11/15/07

to asucse520-fall-07-advanc...@googlegroups.com

Hi,

Please find my critique for the paper attached with this email.

Thanks,
Sugan.

Critique2-Optimizing-FFT-Multicore-Architecture-Sugan-Vinayagam.pdf

Aarul Jain

unread,

Nov 15, 2007, 2:14:38 PM11/15/07

to asucse520-fall-07-advanc...@googlegroups.com

Please find my critique of the paper attached.

Thanks

-Aarul

-Aarul Jain
Graduate Student, EE Dept.
Arizona State University, Tempe, US
Ph: 480-278-9230

critic-aarul.pdf

Sugan Vinayagam

unread,

Nov 15, 2007, 4:03:23 PM11/15/07

to asucse520-fall-07-advanc...@googlegroups.com

I mentioned about the following point already in the critic but still i thought i would put this is in a separate email to discuss it in the groups.

The authors have not explained why they used Decimation in Time(DIT) algorithm over Decimation in Frequency(DIF) algorithm. The authors have also mentioned that bit-reversal permutation takes 5.7% of the execution.

I thought of an idea to solve this problem. Why not go for a DIF algorithm and calculate the bit-reversed permutation during the fft computation with any of the idle threads and use the results to get the natural order fft at the end of the fft computation. I believe this would bring down the bit-reversal computation time to 0% as this operation is clubbed with the FFT computation.

Thank you,
Sugan.

--
Sugan Vinayagam

Tao Liu

unread,

Nov 15, 2007, 4:25:08 PM11/15/07

to asucse520-fall-07-advanc...@googlegroups.com

Hi, Sugan,

Thanks for your critique!

About the weakness 3 in your critique,

“The authors have not clearly mentioned about the speed-up that could be obtained if all the techniques that they mentioned are implemented together and the effect of one performance optimization technique over the other “

I think the authors did this.

The table 1 in the paper gives the 1D FFT incremental optimizations performance with 128 threads.

The column GLOPS indicates the performance achieved by combining certain number of techniques together.

i.e.

13.17Gflops is achieved by using Base implementation+ optimal work unit

16.92Gflops is achieved by using Base implementation+ optimal work unit + Special Handling of the first States

And so on and so forth.

So the best performance with all the techniques they mentioned is 20.72Glops

The effect of one performance optimization technique over the other is also can be seen from

the “Speedup Over Base Version” column in the table.

For 2D FFT, they did the same thing but no table was given. Please notice

The performance in this paper, 5.11Gflops->19.37->20.000Gflops.

Thanks,

-Tao

Raghu

unread,

Dec 4, 2007, 7:33:23 PM12/4/07

to asucse520-fall-07-advanc...@googlegroups.com, Guofeng

Hi all

Please find the summary of the paper attached. Sorry for the delay. I realised now that I missed the bounced email with my summary of this paper attached and sent on 11/20/07 due to oversight.

Regards

Sai Raghunath T

On 12/4/07, Guofeng <guofen...@gmail.com> wrote:

--
Regards
Sai Raghunath T

summary on optimising FFT on CMP.doc

Reply all

Reply to author

Forward