
Matlab speed: abs(x)^2 vs. x.*conj(x)


Alexandre Kampouris

Apr 30, 1997, 3:00:00 AM

Hello, World!

I'm interested in finding the energy contained in a given sequence, and
have played around to find the most efficient way to do it in Matlab. I
came to a surprising result, and would like other people's opinions on it.

So, I want to compute the power/energy of a signal z from its complex
representation. The computation is:

power = Re(z)*Re(z) + Im(z)*Im(z)

or

power = z*conjugate(z)

Instinctively, though, one might write in Fortran something like
(cabs(z)**2), corresponding to the mathematical notation |z|^2. One would
expect that to be less efficient in any case.
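
For concreteness, all three forms agree numerically on a simple case:

» z = 3 + 4i;
» [abs(z)^2, z*conj(z), real(z)^2+imag(z)^2]   % all three give 25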

I've tested the two approaches in Matlab, and came to the result that using
abs is actually *faster*, even though a useless square-root-then-squaring
operation takes place.

Here is a summary of the tests I did:
(Config: Matlab 5.0+WinNT 4.0+Pentium 120MHz+64MB)
*******************************************************
% Generate a complex test vector, and (hopefully) get Matlab to allocate
% storage for the results.
» nn = 300000;z=exp(2*j*rand(1,nn)); p=abs(z).^2;

% Everything fits in real storage, paging should not occur. Other
% applications turned off. Tests are then looped to improve timer resolution.
» whos p z
  Name      Size          Bytes  Class
  p         1x300000    2400000  double array
  z         1x300000    4800000  double array (complex)

Grand total is 600000 elements using 7200000 bytes

% Test 1: Calculate power the "dumb" way:
» tic; for i=1:20;p=abs(z).^2;end;toc;
elapsed_time = 10.4050

% Test 2: Calculate power the "smart" way:
» tic; for i=1:20;p=z.*conj(z);end;toc;
elapsed_time = 33.1480

% Test 3: Use the "smart" way, but with a longer notation
» tic; for i=1:20;p=real(z).*real(z)+imag(z).*imag(z);end;toc;
elapsed_time = 18.2970

% Test 4: Try a ridiculous way, for the sake of completeness
» tic; for i=1:20;p=exp(2*real(log(z)));end;toc;
elapsed_time = 113.3130

% Test 5: Check how much time "abs" takes by itself.
» tic; for i=1:20;p=abs(z);end;toc;
elapsed_time = 8.9730
*******************************************************

What? Huh? Am I missing something? Using "abs" with a superfluous
sqrt+squaring is *three* times faster than avoiding them? The "abs"
function must implicitly compute a sqrt somewhere, followed by squaring.
The fact that "abs" alone in test #5 takes less time than "abs^2" would
suggest that Matlab actually computes "sqrt(re^2+im^2).^2", without removing
anything.

Hypothesis: This probably comes down to the way memory is managed and code
is generated. All the operations of "abs" could be done within the FPU
stack, with only two memory accesses required per element.

Then, the computations might proceed as follows (I'm conjecturing here):

Test 1: abs(z) -> output real vector -> .^2 -> output real vector
        (The squaring would be done in a second pass.)
Test 2: conj(z) -> temp cmplx vector -> .*z -> output real vector
        (Is Matlab smart enough to detect at the outset that the output
        vector will be entirely real, or is the real-vs-complex test done
        at each element computation? Or are the real and imaginary parts
        stored in different memory areas, breaking cache performance?)
Test 3: real(z).*real(z) -> temp real vector; imag(z).*imag(z) -> output
        real vector; temp real vector + output real vector -> output
        real vector
        (Is this how it is done? I'm trying to find a reason for the
        improvement over method 2, although the expression looks more
        complex. Would there be one, two, or zero temp vectors?)

I'm puzzled, as the results are counterintuitive, and opposite to what I
would get in Pascal or Fortran.

Is there a Matlab function which directly computes "abs(z)^2"? I haven't
found anything. If there is not, why not create one called "abs2"? I guess
that the speed improvement could be more than 50% over abs(z).^2. According
to my old i486 programmer's reference, "Fsqrt" is about 4 times slower
than "Fmul" and "Fadd". (Probably applicable to Pentiums as well.) Fmul is
a tad slower than Fadd. Therefore, disregarding Fload and Fstore, the FPU
spends about half its time calculating this useless Fsqrt, which would
possibly have to be followed by "Fload/Dup. stack/Fmul/Fstore" + caching,
paging, and address calculation.

I think that |z|^2 is a basic operation in many, many fields, and its
computation might justify a special function. It can be used in industrial
quantities, with FFTs and signal analysis, and would benefit from an
optimised BLAS-like routine. (Is there anything of this sort in BLAS?)
Otherwise, the Matlab interpreter could be made to recognize this special
case.
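
For illustration, a minimal sketch of such an "abs2" as a plain M-file
(the name and the routine are hypothetical, not an existing Matlab
function):

function p = abs2(z)
%ABS2  Squared magnitude of the elements of z (hypothetical helper).
%   Computes real(z).^2 + imag(z).^2 directly, avoiding the square
%   root that abs(z).^2 computes and then undoes.
p = real(z).^2 + imag(z).^2;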

What do you think? Have I overlooked something? Please feel free to flame
me if I wrote something dumb. :-) (But not too hard, please...)

Greetings,

Alexandre


Cleve Moler

Apr 30, 1997, 3:00:00 AM

In article <01bc557f$656cadc0$f060fdcf@crouic>,
Alexandre Kampouris <junk...@videotron.ca> wrote:

Hi, Alexandre (and "World")

Very interesting question. It's an excellent example for a discussion
of MATLAB execution time performance.

As they say in the Dodge commercial, "The Rules Have Changed". In the
good old days, say 20 years ago, we could analyze performance of technical
computations by just counting floating point multiplications. Divisions
and square roots were so slow that we did everything we could to keep
them out of the inner loop, and all other operations were much faster
than multiplication.

With today's computers, and especially in the MATLAB environment,
arithmetic is almost infinitely fast. The most important factor in
execution time performance is the number of memory touches, i.e. loads
and stores. In fact, we really should consider different levels of
memory -- floating point registers, L1 cache, L2 cache, main memory,
virtual memory, etc. -- but let's not get into that here.

After memory touches, arithmetic operations are still important. The
complex absolute value, abs(z), is particularly important in this example.
If z = x+i*y, we can't just compute sqrt(x^2+y^2) because the intermediate
results may over- or underflow, even though the final result shouldn't.
So a numerically robust abs(z) also involves some testing and scaling that
makes it more work. But then you square the result, so overflow protection
isn't needed in this case.
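
A quick illustration of that overflow hazard, with values chosen so that
|z| is representable but the naive intermediate squares are not:

» z = 1e200 + 1e200i;
» abs(z)                        % robust: about 1.4142e+200
» sqrt(real(z)^2 + imag(z)^2)   % naive: real(z)^2 overflows, gives Inf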

There are several ways to do your computation

1) abs(z).^2
2) abs(z).*abs(z)
3) real(z).^2+imag(z).^2
4) real(z).*real(z)+imag(z).*imag(z)
5) z.*conj(z)

It is by no means obvious which of these is the fastest, or the slowest
for that matter. In fact, it can depend upon the particular machine,
and on the length of z.
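
A simple harness one could use to rank the five variants on a given
machine (a sketch; the vector length and loop count are arbitrary, and
the loop only compensates for the coarse tic/toc resolution):

% Time the five equivalent expressions for the squared magnitude.
nn = 300000; z = exp(2*j*rand(1,nn));
tic; for k = 1:20, p = abs(z).^2;                           end; t1 = toc;
tic; for k = 1:20, p = abs(z).*abs(z);                      end; t2 = toc;
tic; for k = 1:20, p = real(z).^2 + imag(z).^2;             end; t3 = toc;
tic; for k = 1:20, p = real(z).*real(z) + imag(z).*imag(z); end; t4 = toc;
tic; for k = 1:20, p = z.*conj(z);                          end; t5 = toc;
[t1 t2 t3 t4 t5]   % elapsed times for methods 1-5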

Number 5, z.*conj(z), is not the fastest because MATLAB computes a
complex result whose imaginary part is all zero, and then checks that
the result is actually real. The memory references and arithmetic
involved in computing all these zero imaginary parts double the time
required. (On some modern machines, with fused multiply-add instructions,
the imaginary part ain't zero, but that's another story.)

Number 2, abs(z).*abs(z), is slower than number 1 because abs(z) is
done twice.

Number 4, real(z).*real(z)+imag(z).*imag(z), is slower than number 3
because it involves more memory references.

So it comes down to comparing abs(z).^2 with real(z).^2+imag(z).^2.
It turns out these are comparable in speed; abs(z) does more arithmetic,
but involves fewer memory references, than real(z).^2+imag(z).^2.
The ultimate winner depends upon the ratio of arithmetic speed to
memory reference speed of a particular machine. One is faster on
a Sparc, the other is faster on a PC.

Great question. Thanks for asking.

-- Cleve Moler
mo...@mathworks.com

Troy Richards

Apr 30, 1997, 3:00:00 AM

On 30 Apr 1997, Cleve Moler wrote:

> In article <01bc557f$656cadc0$f060fdcf@crouic>,
> Alexandre Kampouris <junk...@videotron.ca> wrote:
> >Instinctively, though, one might write in Fortran something like
> >(cabs(z)**2), corresponding to the mathematical notation |z|^2. One would
> >expect that to be less efficient in any case.
> >
> >I've tested the two approaches in Matlab, and came to the result that using
> >abs is actually *faster*, even though a useless square-root-then-squaring
> >operation takes place.
> >

> [Cleve Moler's response:]


> So it comes down to comparing abs(z).^2 with real(z).^2+imag(z).^2.
> It turns out these are comparable in speed; abs(z) does more arithmetic,
> but involves fewer memory references, than real(z).^2+imag(z).^2.
> The ultimate winner depends upon the ratio of arithmetic speed to
> memory reference speed of a particular machine. One is faster on
> a Sparc, the other is faster on a PC.

Considering how often an abs(z)^2 function is used, I must agree with
Alexandre that TMW should implement another function, say abs2, that saves
the sqrt and square operations. JMHO

***********************************************************************
Troy Richards                              rich...@edrd.dnd.ca
Esquimalt Defence Research Detachment      Voice (250)363-2079
CFB Esquimalt, Building 199                FAX   (250)363-2856
PO Box 17000 Stn Forces
Victoria, BC V9A 7N2        ***** Address and Area Code recently changed
***********************************************************************


SimGraphics

Apr 30, 1997, 3:00:00 AM

In article <01bc557f$656cadc0$f060fdcf@crouic>,
Alexandre Kampouris <junk...@videotron.ca> wrote:
>According to my old i486 programmer's reference, "Fsqrt" is about 4 times
>slower than "Fmul" and "Fadd". (Probably applicable to Pentiums as well.)
>Fmul is a tad slower than Fadd. Therefore, disregarding Fload and Fstore,
>the FPU spends about half its time calculating this useless Fsqrt, which
>would possibly have to be followed by "Fload/Dup. stack/Fmul/Fstore" +
>caching, paging, and address calculation.

>What do you think? Have I overlooked something? Please feel free to flame
>me if I wrote something dumb. :-) (But not too hard, please...)

Hello Alexandre!

I can't give you a straight answer, as my configuration of Matlab is very
different from yours. The biggest mistake you are making is to disregard
"fld" and "fst". That assumption is very untrue for any modern computer. In
fact, you may as well disregard everything else *except* the memory access
instructions.

Your machine has four levels of memory: primary cache (L1), secondary
cache (L2), main RAM, and the swap file. Measuring elapsed time alone is a
sort of blind search. You need to pay attention to cache misses and paging
behaviour.

I don't know how to do it on the Pentium under Windows NT, but I know that
there are people doing this type of measurement. The Pentium has special
hardware counters that can be set up to count cache misses, and the Windows
NT Resource Kit has tools to gather data about minor and major page faults.

The point I'm trying to make is that you are going to gain more insight
into how to speed up your program by watching memory references than by
counting arithmetic operations.
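
One crude, Matlab-level way to watch the memory hierarchy at work (my own
sketch, not from the measurements above): time the same per-element
computation over increasing vector lengths and watch the cost per element
jump as the working set falls out of each cache level.

% Keep total work roughly constant; report time per element.
for nn = [1e3 1e4 1e5 1e6]
  z = exp(2*j*rand(1,nn));
  reps = ceil(1e6/nn);
  tic; for k = 1:reps, p = abs(z).^2; end; t = toc;
  fprintf('n = %7d: %8.2f ns/element\n', nn, 1e9*t/(nn*reps));
end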

Good luck,

Sylvester

Tom Bryan

May 1, 1997, 3:00:00 AM

Good suggestion. All product enhancement suggestions can be emailed
directly to sug...@mathworks.com.

..Tom

In article <Pine.PMDF.3.91.9704301...@edrd.dnd.ca>,
Troy Richards <rich...@edrd.dnd.ca> wrote:

> Considering how often an abs(z)^2 function is used, I must agree with
> Alexandre that TMW should implement another function, say abs2, that
> saves the sqrt and square operations. JMHO

+=== Tom Bryan ========================== tbr...@mathworks.com ===+
|    The MathWorks, Inc.                  in...@mathworks.com     |
|    24 Prime Park Way                    http://www.mathworks.com|
|    Natick, MA 01760-1500                ftp.mathworks.com       |
+=== Tel: 508-647-7669 ==== Fax: 508-647-7002 ====================+

Tom Bryan

May 1, 1997, 3:00:00 AM

In article <tbryan-0105...@tbryan.mathworks.com>,
tbr...@mathworks.com (Tom Bryan) wrote:

> Good suggestion. All product enhancement suggestions can be emailed
> directly to sug...@mathworks.com.
>
> ..Tom
>

I forgot to mention that suggestions and bug reports can also be entered
on our Web site:

http://www.mathworks.com

Follow the "Contact us" link (which leads to
http://www.mathworks.com/contact1.shtml).

..Tom

Don Orofino

May 1, 1997, 3:00:00 AM

Troy Richards wrote:
>
> Considering how often an abs(z)^2 function is used, I must agree with
> Alexandre that TMW should implement another function, say abs2, that
> saves the sqrt and square operations. JMHO

Just to add to the fray, the DSP Blockset V2.0 ("release date not yet
available") has a "Magnitude Squared" block that operates on complex data.
This block arose from the desire to eliminate the "square root followed by
square" operation that regularly occurred when one computed the squared
magnitude of a complex result.

Mats Bengtsson

May 13, 1997, 3:00:00 AM

A related note: at least judging from the flop counts, the built-in norm
function seems very inefficiently implemented (Matlab 4.2c). For a vector,
norm(x) uses twice as many flops as
sqrt(sum(real(x).^2+imag(x).^2)).
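
To reproduce the comparison (a sketch using the flops counter, which
existed in Matlab 4.x and 5.x but was removed from later releases):

x = randn(1,1e5) + j*randn(1,1e5);
flops(0); n1 = norm(x);                            f1 = flops;
flops(0); n2 = sqrt(sum(real(x).^2+imag(x).^2));   f2 = flops;
[f1 f2]   % flop counts for the two approaches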

/Mats


Cleve Moler

May 13, 1997, 3:00:00 AM

In article <goovi4n...@spitfire.e.kth.se>, Mats Bengtsson wrote:

Hi.

Computing norm(x) with

sqrt(sum(real(x).^2+imag(x).^2))

is susceptible to overflow and underflow, even though the final
result is in range. For example, x = [1.e200 1.e200]. The
built-in norm function does "extra" arithmetic to avoid this difficulty.
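
Cleve's example, spelled out:

» x = [1e200 1e200];
» norm(x)                              % robust: about 1.4142e+200
» sqrt(sum(real(x).^2+imag(x).^2))     % naive: x.^2 overflows, gives Inf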

-- Cleve Moler
mo...@mathworks.com

wybi...@gmail.com

Jan 4, 2018, 7:05:46 PM
I find that r2 = real(z).^2 + imag(z).^2 is the fastest on my machine by a
long way (about a factor of 10 faster than abs(z).^2). They do, however,
give slightly different results (on the order of eps), as one might expect.

Note that I find f(x).*f(x) is only marginally slower than f(x).^2; this
may have something to do with Matlab's interpreter being smart enough to
realise they are the same thing.
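
On recent Matlab releases (R2013b and later), timeit typically gives more
reliable numbers than hand-rolled tic/toc loops; a minimal sketch:

% Benchmark the main variants; timeit handles warm-up and repetition.
z = exp(2j*rand(1, 3e5));
t_abs  = timeit(@() abs(z).^2);
t_reim = timeit(@() real(z).^2 + imag(z).^2);
t_conj = timeit(@() z.*conj(z));
fprintf('abs^2: %.3g s  re^2+im^2: %.3g s  z.*conj(z): %.3g s\n', ...
        t_abs, t_reim, t_conj);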