>> z = x.^y;
The variables are 5000x100 (or so) arrays, and I do this operation
thousands of times. [But it cannot be vectorized over that
dimension.]
I have never really used CMEX before, but I am wondering if I should
expect a speedup by replacing this line with a C function that
essentially just uses "pow(x,y)" in C.
No. Operations like these are very well optimized already. Do you
know anything in general about x and y?
--
Peter Boettcher <boet...@ll.mit.edu>
MIT Lincoln Laboratory
MATLAB FAQ: http://www.mit.edu/~pwb/cssm/
It depends on how this is implemented in Matlab. I believe the
expression as stated above is as vectorized as it can be. However, if
it is a compound expression like
z = (x1+x2).^(y1+y2);
you might expect cache problems. In this case I would do something like
yy = y1 + y2;
xx = x1 + x2;
z = xx.^yy;
But then, the computation is about as simple an example as possible
for a CMEX file, so if you know a little bit about CMEX files, it will
take you almost no time to implement it.
If you do have a go at a CMEX file, make sure you access the numbers
sequentially in memory, to reduce the number of cache misses.
Maybe it is possible to speed things up by a factor of 2 or so. I would
be surprised if there is much more to be gained.
Rune
> "the cyclist" <thecy...@gmail.com> writes:
>
>> In the application I am currently working on, the bottleneck is
a
>> line that is effectively
>>
>>>> z = x.^y;
<*snip*>
> Do you know anything in general about x and y?
"x" is a real number, almost always in the range (1,1.1).
"y" is an integer in the range (0,360).
If x and y are as you describe, you might want to implement a binomial
expansion solution, if you are willing to sacrifice a little bit of
precision.
(1+x)^n = 1+ n x + n*(n-1)/2 x^2 + ....
If you limit it to first-order terms, the implementation should become
really fast.
Raja
<*snip*>
> Maybe it is possible to speed things up with a factor 2 or so. I
> would be surprised if there is much more to be gained.
A factor of 2 would be a godsend! I'll try CMEXing if there is even
a small hope of that.
But I am left wondering why one poster thinks a factor of two may be
possible, and another thinks it is highly optimized. 8-)
Because we are talking about matlab. Here is something I posted
a couple of years ago,
http://groups.google.com/group/comp.soft-sys.matlab/msg/6c19f6fb4d0ac5d3
to give an impression of what I have learned to expect of matlab.
In that perspective a potential improvement of a factor of 2 qualifies
as "highly optimized."
For the record: Matlab has improved significantly since I posted
that test (I think the test posted above was done with matlab v 6.0).
Nevertheless, Matlab still has a long way to go before it can be
compared to C/C++ in any serious way.
Rune
If you have an Intel processor, try the Intel MKL VML (Vector Math
Library), which has a good and fast approximation of the power function.
Best wishes,
MM
In my experience it's rare to get beyond a factor of 2 without
cheating in some way, or eliminating known Matlab overheads. In your
example, I'd probably preallocate my result (z) array and reuse it
each time by passing it to the MEX function to be overwritten. In
the expression: z=x.^y there is a lot of overhead in recreating a z
array on each call and then (probably?) copying the intermediate
result into it. If you write a MEX function that's called like this:
inplacepow(x,y,z);
you can get a speedup of over 5 for your array sizes, so long as z is
created once only and reused each time. Here's my MEX file. It's
pretty simple and contains absolutely no error checking. But because
it works on existing, preallocated memory, it's quick. In fact I'd
be surprised if it could be done quicker.
/*************************************/
#include <math.h>
#include <mex.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *x = mxGetPr(prhs[0]);
    double *y = mxGetPr(prhs[1]);
    double *z = mxGetPr(prhs[2]);
    int n = mxGetM(prhs[0]) * mxGetN(prhs[0]);
    int i;
    for (i = 0; i < n; i++)
        z[i] = pow(x[i], y[i]);
}
/*************************************/
My test script:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
z1 = zeros(5000,100);
x = rand(5000,100);
y = rand(5000,100);
tic
for count = 1:10
    z2 = x.^y;
end
toc
tic
for count = 1:10
    inplacepow(x,y,z1);
end
toc
max(max(abs(z1-z2)))
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Its output (1.7GHz P4 Win2k, MSVC++, Matlab 7.1):
Elapsed time is 9.945057 seconds.
Elapsed time is 1.922279 seconds.
ans =
1.1102e-016
(Note that the max difference is eps/2. Not too bad.) I can't
imagine a little bit of error checking would slow it down too much
either (like at least counting the inputs and comparing their sizes,
making sure z isn't a reference to any other variable, etc.).
[snip]
> Its output (1.7GHz P4 Win2k, MSVC++, Matlab 7.1):
>
> Elapsed time is 9.945057 seconds.
> Elapsed time is 1.922279 seconds.
>
> ans =
>
> 1.1102e-016
>
> (Note that the max difference is eps/2. Not too bad.) I can't
> imagine a little bit of error checking would slow it down too much
> either (like at least counting the inputs and comparing their sizes,
> making sure z isn't a reference to any other variable, etc.).
Nah, argument checking does not take much time, on this kind
of system maybe a couple of hundred nanoseconds per call
to the mex function. Entry-point argument checking is not the
place to save run-time.
The main issue is to get streamlined memory access up and running,
to reduce cache misses. If there is more to be gained, one would
have to look into the code of the pow function, as others already
have suggested.
Rune
But it does take more than "a couple of hundred nanoseconds" to write
that sort of code though.
Unfortunately, the second dimension (length=100 here) is not constant
over every call to the function. (In your example, it is not
constant in the loop over "count").
Last night I wrote my own C code (unsurprisingly, it is quite similar
to yours), and I got a speedup of about 40%, which is definitely
significant. Thanks for the ideas, everyone.
That's great. Now, if you want to squeeze another factor of 1.5-2 out
of it, you loop over the data sets prior to calling your function and
allocate a work-area matrix of size max(N) by max(M), much like Steve
suggested. Then you implement some bookkeeping to handle the different
matrix sizes...
Rune
Just 40%? That's what I'd expect from a first-level MEXing. Do you
really need a dynamically varying LHS (i.e. "z" in your example)? Or
can you pass one to fill? I still think your bottleneck is
intermediate variables and their allocation.
Sorry, my reading at home isn't as clear as it is at work, so the
previous reply is probably irrelevant. However, the sentiment is
true: if you can pass your result array to your MEX file, time will
be saved. Avoid creating new arrays wherever possible.
OK, so when you finish all the improvements connected with memory
operations, go to Intel VML to speed up the pow function.
Best wishes,
MM