I need some tips on how to accelerate the computation speed of the partial FFT algorithm, it should be faster than the "fft" function. For example, if I only need to calculate the first output X(0) of a 1024-point FFT, my partial FFT algorithm should be faster than the standard "fft" algorithm. I need to explore possible code optimization for my partial FFT algorithm.
Is there someone who can help me cope with this problem? I can be reached at mike...@gmail.com
Thanks.
And I want to spin straw into gold.
I have a spinning wheel and several bales of straw, but no matter how
hard I try it doesn't work.
Is there someone who can help me with this problem?
I can be reached at rumpels...@gmail.com
Hi Lei,
Have you tried
>>doc goertzel
You can use the
dft_data = goertzel(data,freq_indices)
syntax to compute the DFT at only the specified frequencies. If you have that type of application, the Goertzel algorithm will often give you a lot of savings.
Wayne
Amazingly, The MathWorks has anticipated your need and has provided a
function to compute exactly the first element of the DFT. It is very
fast. Use it like this:
X0 = sum(x);
--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.
> Have you tried
>
> >>doc goertzel
>
> You can use the
>
> dft_data = goertzel(data,freq_indices)
>
> syntax to compute the DFT at only the specified frequencies. If you have that type of application, the Goertzel algorithm will often give you a lot of savings.
==================
This must be something from the Signal Processing Toolbox. I have no doc on a GOERTZEL command, at any rate.
I've never encountered the Goertzel algorithm before, although from this wiki article, it has obviously been around for a long time,
http://en.wikipedia.org/wiki/Goertzel
However, I can't really see what advantages it conveys over direct evaluation of the DFT. The computational complexity, as discussed in teh article, seems to be the same for both.
> Amazingly, The MathWorks has anticipated your need and has provided a
> function to compute exactly the first element of the DFT. It is very
> fast. Use it like this:
>
> X0 = sum(x);
=======
Well, yes, but in fairness, the OP offered that only as an example.
For me, until I understand better the advantages of the Goertzel algorithm, the thing to do would be an explicit DFT computation
Result=exp(j*(-2*pi/N)*freqs(:)*(0:n-1))*X;
where freqs is a vector of zero-based frequency indices.
> This must be something from the Signal Processing Toolbox. I have no doc on a GOERTZEL command, at any rate.
>
> I've never encountered the Goertzel algorithm before, although from this wiki article, it has obviously been around for a long time,
>
> http://en.wikipedia.org/wiki/Goertzel
>
> However, I can't really see what advantages it conveys over direct evaluation of the DFT. The computational complexity, as discussed in teh article, seems to be the same for both.
I must agree with Matt here. Goertzel's is probably intesresting for realtime system. But on Intel processor, it can be beat with simple C program as showed below (Vista 32, 2010A, 2-year-old laptop)
N = 1e6;
data=rand(1,N)+1i*rand(1,N);
k = 1000; % fft component
tic
y1=goertzel(data,k);
t1=toc % Elapsed time is 0.0280 seconds.
tic
N = length(data);
a=exp(-1j*(2*pi*(k-1)/N));
v(1:N) = a;
v(1) = 1;
v = cumprod(v);
y2 = data*v(:);
t2=toc % Elapsed time is 0.0752 seconds.
tic
N = length(data);
a = exp(-1j*(2*pi*(k-1)/N));
y3 = get1fft(data, a); % FEX
t3=toc % Elapsed time is 0.0118 seconds.
/* MEXFILE get1fft.C */
#include "mex.h"
#define X prhs[0]
#define A prhs[1]
#define S plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[])
{
mwIndex N, n;
double ar, ai, anr, ani, tmp;
double sr, si;
double *xr, *xi, *xrn, *xin;
mwSize dim[2] = {1, 1};
N = mxGetN(X);
ar = *mxGetPr(A);
ai = *mxGetPi(A);
xr = mxGetPr(X);
xi = mxGetPi(X);
anr = 1;
ani = 0;
sr = si = 0;
xrn = xr;
if (xi==NULL)
{
for (n=0; n<N; n++)
{
sr += anr*(*xrn);
si += ani*(*xrn);
tmp = anr*ar-ani*ai;
ani = anr*ai+ani*ar;
anr = tmp;
xrn++;
}
} else
{
xin = xi;
for (n=0; n<N; n++)
{
sr += anr*(*xrn)-ani*(*xin);
si += ani*(*xrn)+anr*(*xin);
tmp = anr*ar-ani*ai;
ani = anr*ai+ani*ar;
anr = tmp;
xrn++;
xin++;
}
}
S = mxCreateNumericArray(2, dim, mxDOUBLE_CLASS, mxCOMPLEX);
*mxGetPr(S) = sr;
*mxGetPi(S) = si;
return;
}
% Bruno
Bruno
No problem :) I thought that maybe in its second-order implementation, the Goertzel algorithm might help, but I see that the computational savings are not what I had thought.
Wayne
Dear Bruno,
Thank you very much for your guidance. The Mexfunction approach is absolutely an brillant approach, actually I am just about to use this approach to optimize my code. I have conducted a few tests on the algorithm proposed by you, and it has huge advantages over the build-in "fft" when there are large numbers of input points (e.g. N = 2^20).
However in my application, I only need to calculate 1024-point partial FFTs. The sample points are 256, and the other input points are zeros. I did test the computational speed of this algorithm, as is shown below:
clear
N = 1024;
data1=rand(1,256)+1i*rand(1,256);
data2=zeros(1,N-256);
data = [data1 data2];
k = 2; % fft component
L = length(data);
a = exp(-1j*(2*pi*(k-1)/L));
tic
for lop = 1:1000
%y1 = p_ifft(data,N,1);
%y2 = pfft_lei(data);
y1 = get1fft(data, a); % FEX
end
t1=toc % Elapsed time is 0.0385 seconds.
tic
for lop2 = 1:1000
y2 = fft(data);
end
t2 = toc % Elapsed time is 0.0275 seconds.
It shows that under this circumstance the computational speed of the Mexfunction is slower than the build-in "fft" function. I am wondering how to solve this problem.
Before yesterday I had no idea on the Mexfunction approach, so I need to learn this programming skill at the first stage. The code below is my partial-FFT algorithm:
function Y =pfft_lei(data)
% This is a radix-2 decimation-in-time PARTIAL-FFT algorithm
N = length(data);
X=bitrevorder(data);
L=log2(N);
Y(N,L,2) = zeros;
bin1 = path_trace(N);
for q = 1:N/2 %Output index selection
k3=1; bin = bin1(q,:);
for i1=1:L %Iteration stage
for i2=bin(i1):k3:N
if (Y(i2,i1,2)==0)
k=N*(bin(i1)-1)/k3/2;
j=i2+k3;
W=complex(cos(2*pi*k/N),-sin(2*pi*k/N));
T=X(j)*W;
X(j)=X(i2)-T; X(i2)=X(i2)+T;
Y(i2,i1,1)=X(i2);
Y(j,i1,1)=X(j);
Y(i2,i1,2)=1;
Y(j,i1,2)=1;
else
X(i2)=Y(i2,i1,1);
end
end
k3=k3*2;
end
end
function bin=path_trace(N)
L = log2(N);
for s=1:L
bin(:,s)=repmat([1:2^(s-1)],1,N/2^(s-1));
end
Currently I don't know how the convert this program into an Mex file. Besides, my algorithm is quite time-comsuming and it does not have any value for practical application. I may need some help on optimizing my code by the Mex approach. Thanks.
I did a few more tests and found the algorithm proposed by you is better than the build-in "fft" function. Thank you very much for your help. I am new in the Mex-file approach, so I may make a lot of mistakes.
Remember to check your compiler options in Visual Studio. From memory I believe you have the option to compile for a minimum size code or for maximum performance. One step up from that, using the Intel compiler, which is Visual Studio compatible, this, as far as I am aware is capable of automatically generating optimized C code that will utilize the SIMD structure buried in your Pentium chip (assuming you are running an Intel based system of course), and can increase the performance of parallelizable(???) code by up to 8 times, even if you only have single core.
Hope that helps
Regards
Dave Robinson
Will the chirp z-transform help you out?
http://www.mathworks.com/access/helpdesk/help/toolbox/signal/czt.html
--
Loren
http://blogs.mathworks.com/loren
http://matlabwiki.mathworks.com/MATLAB_FAQ
[1] http://dx.doi.org/10.1109/78.205723
Note that such techniques are also very easily applicable to multi-dimensional FFTs when you realize that these are just one-dimensional FFTs in all dimensions of the array. With a bit of work, you can string together two or three transform decomposition FFTs for 2D/3D FFTs.
The only disadvantage of their method is that it doesn't give any savings when you have limited input *and* output DFT points. They suggest using a combination of input-and-out pruning algorithms to accomplish this. I'm looking for a Matlab-compatible implementation of this, but if I can't find one, I'll just have to use their technique limiting either the input (more efficient it seems) or the output points.
Hope this helps.