A partial FFT algorithm that is even faster than the build-in "fft" function

Lei

unread,

Mar 27, 2010, 2:42:06 AM3/27/10

to

The Matlab build-in function "fft" is a highly efficient algorithm for fast implementation of the Fourier transform. However, in some situations I do not need the calculate the full-point FFTs every time. Suppose I only need to calculate a subset of outputs, I have to develop a partial FFT (i.e. fractional FFT, or pruned FFT) algorithm. Actually I developed a partial FFT algorithm in an M-file, but the computation speed is far lower than the build-in "fft".

I need some tips on how to accelerate the computation speed of the partial FFT algorithm, it should be faster than the "fft" function. For example, if I only need to calculate the first output X(0) of a 1024-point FFT, my partial FFT algorithm should be faster than the standard "fft" algorithm. I need to explore possible code optimization for my partial FFT algorithm.

Is there someone who can help me cope with this problem? I can be reached at mike...@gmail.com
Thanks.

TideMan

unread,

Mar 27, 2010, 6:34:30 AM3/27/10

to

On Mar 27, 7:42 pm, "Lei " <mikej....@gmail.com> wrote:
> The Matlab build-in function "fft" is a highly efficient algorithm for fast implementation of the Fourier transform. However, in some situations I do not need the calculate the full-point FFTs every time. Suppose I only need to calculate a subset of outputs, I have to develop a partial FFT (i.e. fractional FFT, or pruned FFT) algorithm. Actually I developed a partial FFT algorithm in an M-file, but the computation speed is far lower than the build-in "fft".
>
> I need some tips on how to accelerate the computation speed of the partial FFT algorithm, it should be faster than the "fft" function. For example, if I only need to calculate the first output X(0) of a 1024-point FFT, my partial FFT algorithm should be faster than the standard "fft" algorithm. I need to explore possible code optimization for my partial FFT algorithm.
>

> Is there someone who can help me cope with this problem? I can be reached at mikej....@gmail.com
> Thanks.

And I want to spin straw into gold.
I have a spinning wheel and several bales of straw, but no matter how
hard I try it doesn't work.
Is there someone who can help me with this problem?
I can be reached at rumpels...@gmail.com

Wayne King

unread,

Mar 27, 2010, 8:15:09 AM3/27/10

to

"Lei " <mike...@gmail.com> wrote in message <hok9bu$dcu$1...@fred.mathworks.com>...

Hi Lei,
Have you tried

>>doc goertzel

You can use the

dft_data = goertzel(data,freq_indices)

syntax to compute the DFT at only the specified frequencies. If you have that type of application, the Goertzel algorithm will often give you a lot of savings.

Wayne

Doug Schwarz

unread,

Mar 27, 2010, 8:44:21 AM3/27/10

to

In article <hok9bu$dcu$1...@fred.mathworks.com>,
"Lei " <mike...@gmail.com> wrote:

Amazingly, The MathWorks has anticipated your need and has provided a
function to compute exactly the first element of the DFT. It is very
fast. Use it like this:

X0 = sum(x);

--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.

Matt J

unread,

Mar 27, 2010, 9:03:02 AM3/27/10

to

"Wayne King" <wmki...@gmail.com> wrote in message <hokssd$ntm$1...@fred.mathworks.com>...

> Have you tried
>
> >>doc goertzel
>
> You can use the
>
> dft_data = goertzel(data,freq_indices)
>
> syntax to compute the DFT at only the specified frequencies. If you have that type of application, the Goertzel algorithm will often give you a lot of savings.

==================

This must be something from the Signal Processing Toolbox. I have no doc on a GOERTZEL command, at any rate.

I've never encountered the Goertzel algorithm before, although from this wiki article, it has obviously been around for a long time,

http://en.wikipedia.org/wiki/Goertzel

However, I can't really see what advantages it conveys over direct evaluation of the DFT. The computational complexity, as discussed in teh article, seems to be the same for both.

Matt J

unread,

Mar 27, 2010, 9:25:24 AM3/27/10

to

Doug Schwarz <s...@sig.for.address.edu> wrote in message <see-3B9FF0.0...@news.frontiernet.net>...

> Amazingly, The MathWorks has anticipated your need and has provided a
> function to compute exactly the first element of the DFT. It is very
> fast. Use it like this:
>
> X0 = sum(x);

=======

Well, yes, but in fairness, the OP offered that only as an example.

For me, until I understand better the advantages of the Goertzel algorithm, the thing to do would be an explicit DFT computation

Result=exp(j*(-2*pi/N)*freqs(:)*(0:n-1))*X;

where freqs is a vector of zero-based frequency indices.

Bruno Luong

unread,

Mar 27, 2010, 10:14:05 AM3/27/10

to

"Matt J " <mattja...@THISieee.spam> wrote in message <hokvm6$3u6$1...@fred.mathworks.com>...

> This must be something from the Signal Processing Toolbox. I have no doc on a GOERTZEL command, at any rate.
>
> I've never encountered the Goertzel algorithm before, although from this wiki article, it has obviously been around for a long time,
>
> http://en.wikipedia.org/wiki/Goertzel
>
> However, I can't really see what advantages it conveys over direct evaluation of the DFT. The computational complexity, as discussed in teh article, seems to be the same for both.

I must agree with Matt here. Goertzel's is probably intesresting for realtime system. But on Intel processor, it can be beat with simple C program as showed below (Vista 32, 2010A, 2-year-old laptop)

N = 1e6;
data=rand(1,N)+1i*rand(1,N);
k = 1000; % fft component

tic
y1=goertzel(data,k);
t1=toc % Elapsed time is 0.0280 seconds.

tic
N = length(data);
a=exp(-1j*(2*pi*(k-1)/N));
v(1:N) = a;
v(1) = 1;
v = cumprod(v);
y2 = data*v(:);
t2=toc % Elapsed time is 0.0752 seconds.

tic
N = length(data);
a = exp(-1j*(2*pi*(k-1)/N));
y3 = get1fft(data, a); % FEX
t3=toc % Elapsed time is 0.0118 seconds.

/* MEXFILE get1fft.C */
#include "mex.h"

#define X prhs[0]
#define A prhs[1]
#define S plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[])
{
mwIndex N, n;
double ar, ai, anr, ani, tmp;
double sr, si;
double *xr, *xi, *xrn, *xin;
mwSize dim[2] = {1, 1};

N = mxGetN(X);
ar = *mxGetPr(A);
ai = *mxGetPi(A);
xr = mxGetPr(X);
xi = mxGetPi(X);
anr = 1;
ani = 0;

sr = si = 0;
xrn = xr;
if (xi==NULL)
{
for (n=0; n<N; n++)
{
sr += anr*(*xrn);
si += ani*(*xrn);
tmp = anr*ar-ani*ai;
ani = anr*ai+ani*ar;
anr = tmp;
xrn++;
}
} else
{
xin = xi;
for (n=0; n<N; n++)
{
sr += anr*(*xrn)-ani*(*xin);
si += ani*(*xrn)+anr*(*xin);
tmp = anr*ar-ani*ai;
ani = anr*ai+ani*ar;
anr = tmp;
xrn++;
xin++;
}
}

S = mxCreateNumericArray(2, dim, mxDOUBLE_CLASS, mxCOMPLEX);
*mxGetPr(S) = sr;
*mxGetPi(S) = si;

return;
}

% Bruno

Bruno Luong

unread,

Mar 27, 2010, 10:45:22 AM3/27/10

to

BTW thank you Wayne for the head up of goertzel's algorithm. I have leaned something new.

Bruno

Wayne King

unread,

Mar 27, 2010, 10:54:06 AM3/27/10

to

"Bruno Luong" <b.l...@fogale.findmycountry> wrote in message <hol5m2$rr0$1...@fred.mathworks.com>...

> BTW thank you Wayne for the head up of goertzel's algorithm. I have leaned something new.
>
> Bruno

No problem :) I thought that maybe in its second-order implementation, the Goertzel algorithm might help, but I see that the computational savings are not what I had thought.

Wayne

Lei

unread,

Mar 28, 2010, 2:11:05 AM3/28/10

to

"Bruno Luong" <b.l...@fogale.findmycountry> wrote in message <hol3rd$2h8$1...@fred.mathworks.com>...

Dear Bruno,
Thank you very much for your guidance. The Mexfunction approach is absolutely an brillant approach, actually I am just about to use this approach to optimize my code. I have conducted a few tests on the algorithm proposed by you, and it has huge advantages over the build-in "fft" when there are large numbers of input points (e.g. N = 2^20).
However in my application, I only need to calculate 1024-point partial FFTs. The sample points are 256, and the other input points are zeros. I did test the computational speed of this algorithm, as is shown below:

clear
N = 1024;
data1=rand(1,256)+1i*rand(1,256);
data2=zeros(1,N-256);
data = [data1 data2];
k = 2; % fft component
L = length(data);
a = exp(-1j*(2*pi*(k-1)/L));
tic
for lop = 1:1000
%y1 = p_ifft(data,N,1);
%y2 = pfft_lei(data);
y1 = get1fft(data, a); % FEX
end
t1=toc % Elapsed time is 0.0385 seconds.

tic
for lop2 = 1:1000
y2 = fft(data);
end
t2 = toc % Elapsed time is 0.0275 seconds.

It shows that under this circumstance the computational speed of the Mexfunction is slower than the build-in "fft" function. I am wondering how to solve this problem.

Before yesterday I had no idea on the Mexfunction approach, so I need to learn this programming skill at the first stage. The code below is my partial-FFT algorithm:

function Y =pfft_lei(data)
% This is a radix-2 decimation-in-time PARTIAL-FFT algorithm
N = length(data);
X=bitrevorder(data);
L=log2(N);
Y(N,L,2) = zeros;

bin1 = path_trace(N);

for q = 1:N/2 %Output index selection
k3=1; bin = bin1(q,:);
for i1=1:L %Iteration stage
for i2=bin(i1):k3:N
if (Y(i2,i1,2)==0)
k=N*(bin(i1)-1)/k3/2;
j=i2+k3;
W=complex(cos(2*pi*k/N),-sin(2*pi*k/N));
T=X(j)*W;
X(j)=X(i2)-T; X(i2)=X(i2)+T;
Y(i2,i1,1)=X(i2);
Y(j,i1,1)=X(j);
Y(i2,i1,2)=1;
Y(j,i1,2)=1;
else
X(i2)=Y(i2,i1,1);
end
end
k3=k3*2;
end
end

function bin=path_trace(N)
L = log2(N);
for s=1:L
bin(:,s)=repmat([1:2^(s-1)],1,N/2^(s-1));
end

Currently I don't know how the convert this program into an Mex file. Besides, my algorithm is quite time-comsuming and it does not have any value for practical application. I may need some help on optimizing my code by the Mex approach. Thanks.

Lei

unread,

Mar 28, 2010, 3:18:22 AM3/28/10

to

I have found that there are a lot of factors that can affect the computation time of the FFT algorithm. One factor I just found is the compiler. Firstly I compiled the source code "get1fft.c" by using the LCC compiler in Matlab, next time, I compiled the source code by using Microsoft Visual C++ 2005. I found the latter one is more efficient.

I did a few more tests and found the algorithm proposed by you is better than the build-in "fft" function. Thank you very much for your help. I am new in the Mex-file approach, so I may make a lot of mistakes.

Dave Robinson

unread,

Mar 28, 2010, 9:29:05 AM3/28/10

to

"Lei " <mike...@gmail.com> wrote in message <homvru$q4d$1...@fred.mathworks.com>...

> I have found that there are a lot of factors that can affect the computation time of the FFT algorithm. One factor I just found is the compiler. Firstly I compiled the source code "get1fft.c" by using the LCC compiler in Matlab, next time, I compiled the source code by using Microsoft Visual C++ 2005. I found the latter one is more efficient.
>
> I did a few more tests and found the algorithm proposed by you is better than the build-in "fft" function. Thank you very much for your help. I am new in the Mex-file approach, so I may make a lot of mistakes.

Remember to check your compiler options in Visual Studio. From memory I believe you have the option to compile for a minimum size code or for maximum performance. One step up from that, using the Intel compiler, which is Visual Studio compatible, this, as far as I am aware is capable of automatically generating optimized C code that will utilize the SIMD structure buried in your Pentium chip (assuming you are running an Intel based system of course), and can increase the performance of parallelizable(???) code by up to 8 times, even if you only have single core.

Hope that helps

Regards

Dave Robinson

Loren Shure

unread,

Mar 29, 2010, 8:14:44 AM3/29/10

to

In article <hok9bu$dcu$1...@fred.mathworks.com>, mike...@gmail.com
says...

Will the chirp z-transform help you out?
http://www.mathworks.com/access/helpdesk/help/toolbox/signal/czt.html

--
Loren
http://blogs.mathworks.com/loren
http://matlabwiki.mathworks.com/MATLAB_FAQ

Ahmed Fasih

unread,

Nov 5, 2010, 4:08:03 PM11/5/10

to

This question of limited output/input DFTs is very important and I recommend reading this 1993 paper by Sorensen and Burrus [1], who not only outline the Goertzel algorithm (which only is useful for extremely few output samples), but also pruning techniques (then state of the art), and their new transform decomposition method which simplifies limited output/input FFTs into a sequence of smaller FFTs. This will be very useful for Matlab users who'd like to utilize pre-existing routines, and has the advantage that it is faster than pruning techniques.

[1] http://dx.doi.org/10.1109/78.205723

Note that such techniques are also very easily applicable to multi-dimensional FFTs when you realize that these are just one-dimensional FFTs in all dimensions of the array. With a bit of work, you can string together two or three transform decomposition FFTs for 2D/3D FFTs.

The only disadvantage of their method is that it doesn't give any savings when you have limited input *and* output DFT points. They suggest using a combination of input-and-out pruning algorithms to accomplish this. I'm looking for a Matlab-compatible implementation of this, but if I can't find one, I'll just have to use their technique limiting either the input (more efficient it seems) or the output points.

Hope this helps.

Sam

unread,

May 30, 2012, 7:40:34 PM5/30/12

to

"Ahmed Fasih" <fasih....@osu.edu.nospam> wrote in message <ib1o73$719$1...@fred.mathworks.com>...

The mentioned paper is computationally very efficient (according to its results).
Does anyone have the Matlab function implementing this method? I appreciate if someone could give me an optimized Matlab function implementing this code.
I copy the Fortran code here:

CC Subroutine LOTOFT(X,Y.M,IA,L,D,E): cc
CC A transform-decomposition FFT f o r computing a limited CC
CC set of output points of a length N transform. cc
CC The number of calculated output points is L cc
CC and they are indexes through the array IA. cc
CC The algorith call a regular power-of-two algorithm cc
CC (or split-radix FFT) cc
cc cc
CC Input/output: cc
CC X Array of real part of input/output (length >= N) CC
CC Y Array of imaginary part of input/output (length >= N) CC
CC M Transform length is N=2**M cc
CC IA Index-array of desired output point cc
CC L Number of desired output points cc
CC 0 York array of length >= N cc
CC E York array of length >= N cc
cc cc
CC Calls: cc
cc cc
CC Author: cc
CC Henrik Sorensen, University of Pennsylvania, Sep. 1987 CC
CC Modified: cc
CC Henrik Sorensen, University of Pennsylvania, Feb. 1988 CC
CC Henrik Sorensen, University of Pennsylvania, Oct. 1988 CC
cc cc
CC CC
CC as this header is included cc
cc cc
cc===================================================--------------cc
CC FFT - A FFT program cc
cc Arpa address: hvsBee.upenn.edu cc
This program may be used and distributed freely as long
SUBROUTINE LOTDFT(X,Y,M.IA,L,D,E)
REAL X(l),Y(l),O(l),E(l)
REAL CC,CC2,SS,ANG,Al,A2,El,B2,T
INTEGER IA(1)
N = 2**M
R=INT(LOG(FLOAT(LI)/LOG(2.0)+0.5)
P = Z**R
9 = N/P
DO 30 J=O,Q-l
INTEGER N,M,P,R,Q,J,K,L,KO
C Choose P. Can also be input to program
DO 40 K=O,P-1
D(J*P+K+l) = X(KtQ+J+I)
E(J*P+K+l) = Y(K*Q+J+l)
CALL FFT(O(J*P+l),E(J*P+l),P,R)
40 CONTINUE
30 CONTINUE
DO 50 K=l,L
C The index array IA can be removed if only one output band is needed
C The band will be: [IOFF.IOFF+l, . . . IOFF+L-11
CCCCCCCCCCC KO = MOD(K+IOFF-l,P)+l
CCCCCCCCCCC ANG = 6.28318530718/N*(K+IOFF-11
KO = MOD(IA(K) ,P)+l
ANG = 6.28318530718/N*IA(K)
CC = COS(ANG)
SS = SIN(ANG1
cc2 = 2*CC
A2 = 0
E2 = 0
AI = D(KO+N-P)
E1 = E(KO+N-P)
DO 55 J=KO+N-Z*P,KO,-P
T = CC2*A1 - A2 + D ( J )
A2 = A1
A1 = T
T = CC2*Bl - 82 + E(J)
E2 = E1
E1 = T
55 CONTINUE
X(K) =AI - CCtA2 + SS*B2
Y(K) = E1 - SS*A2 - CC162
50 CONTINUE
RETURN
END

Sam

unread,

May 30, 2012, 7:41:26 PM5/30/12

to

"Ahmed Fasih" <fasih....@osu.edu.nospam> wrote in message <ib1o73$719$1...@fred.mathworks.com>...

Maarten Thomassen

unread,

Jun 17, 2016, 8:42:10 AM6/17/16

to

TideMan <mul...@gmail.com> wrote in message <15785874-b563-4e0c...@k36g2000prb.googlegroups.com>...

> And I want to spin straw into gold.
> I have a spinning wheel and several bales of straw, but no matter how
> hard I try it doesn't work.
> Is there someone who can help me with this problem?
> I can be reached at rumpels...@gmail.com

Wow, that is so mature!