Matlab faster than C++ ???

Kevin

May 17, 2005, 6:31:23 PM

Hi all,

I was tasked to translate a Matlab script over to C++.

The Matlab script is a simple wrapper (basically for loops and some
easy vector manipulations) that calls 2 external C++ functions
compiled into DLL's.

All I had to do was translate the for loops and turn the vectorized
code into loops in C++ and call the native C++ functions that were
originally called through DLL in MATLAB.

Now here's the weird thing... On the same computer, running Windows
XP, the Matlab script took about 35 seconds to perform the job (timed
using internal time functions), whereas the compiled Windows
executable took about 90 seconds.... I've tried and tried to optimize
the C++ code but I just can't beat the Matlab script's run time...
What gives??? (I may not be an expert C++ programmer but my coding
skills are not that bad)

BTW, I'm using Matlab version 7.0.1.24704 (R14) Service Pack 1. My
C++ compiler is MS Visual C++ .NET 2003 (generating a Win32 console
application).

I would greatly appreciate it if someone could shed some light on this
issue. I'm very confused...

Kevin

spasmous

May 17, 2005, 7:15:23 PM

Kevin wrote:
> snip

>
> I would greatly appreciated it someone can shed some light on this
> issue. I'm very confused...
>

MATLAB is in many instances a wrapper around libraries, many of them
Fortran, that have been optimized intensively (LAPACK, FFTW, etc.).

Kevin

May 17, 2005, 7:16:50 PM

I just ran a profiler on my C++ code and it seems that most of the
time is spent in the external functions (i.e. the ones that were
originally written in C but compiled into a DLL to be called from
Matlab).

So my question is this: Why is it faster for Matlab to call the C
routines??? It seems that compiling the C routines along with a C
wrapper should definitely be faster. I'm very confused...

Kevin

Brien Alkire

May 17, 2005, 8:19:27 PM

As someone else pointed out, many Matlab routines depend upon highly
optimized libraries. For instance, the linear algebra routines come from
BLAS and LAPACK. The fast Fourier transforms come from the FFTW library
(Fastest Fourier Transform in the West). These libraries are specifically
tuned to the CPU architecture to take advantage of fast memory, pipelining, etc.

So it's not that the scripted Matlab language is faster than the C code,
it's that Matlab is calling highly optimized libraries and your C/C++ code
is not.

However, you can use highly optimized libraries too. I think there's even a
way to link to the BLAS and LAPACK libraries that come with Matlab, but I've
never done that. But another option is to use the Math Kernel Library
available from Intel. I think that's what MathWorks uses.

Steve Amphlett

May 18, 2005, 4:12:18 AM

Your experience is not something I'm used to seeing. I normally see
a speedup going from ML to MEX (even for vectorized ML code). Can
you describe what your code is actually doing (are you rewriting BLAS
functions)? Better still, a snippet that shows the problem would be
interesting.

- Steve

Murphy O'Brien

May 18, 2005, 5:35:18 AM

So you say that

a) In the fast version Matlab is just a wrapper, virtually all the
time is spent inside the external C code.

b) In the slow version, the C wrapper calls the same C functions, but
these identical functions take nearly 3 times as long.

Are you absolutely sure that both code versions do the same thing,
the same number of times?

Are you sure that the Matlab code doesn't have anything substantial
to do (Like matrix operations, FFTs etc.)?

Murphy

Kevin

May 18, 2005, 8:53:05 PM

Hi,

Thank you all for the replies. Here is a summary of what my code
does. For those that are familiar with communication systems: it's
basically running performance tests (BER curves) for a convolutional
decoder.

(Original Setup)

1. Decoder functions written in C and compiled into DLLs so that they
can be called from Matlab (I didn't do this part and I don't know
much about MEX things so I can't elaborate more on that)

2. Matlab wrapper generates random bits (x = rand(1,N)), encodes them
using convenc(x,trellis) and then adds AWGN (y = x + randn(1,N)). Then
this new vector is passed to the C decoding function (out =
decode(y)) and then the Matlab script counts the number of bit errors.

(New Setup)

1. Same decoder functions as above (I have the C source).

2. Write a C version of the Matlab test wrapper (call the C "rand"
function, write my own "randn" function using Box-Muller technique,
write my own "convenc" using table lookup)

3. Compile into EXE and run.
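
Step 2 of the new setup mentions a hand-written "randn" based on Box-Muller. The thread never shows that code, so the following is only a minimal sketch of what such a generator could look like. The function name box_muller_randn, the seed parameter and the use of std::mt19937 instead of the C "rand" function are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not the code from the thread): returns n samples
// that are approximately standard normal, using the basic Box-Muller
// transform on pairs of uniform samples.
std::vector<double> box_muller_randn(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);  // [0, 1)
    const double two_pi = 6.283185307179586;

    std::vector<double> out;
    out.reserve(n + 1);
    while (out.size() < n) {
        const double u1 = 1.0 - uni(gen);  // shift to (0, 1] so log is finite
        const double u2 = uni(gen);
        assert(u1 > 0.0);
        const double r = std::sqrt(-2.0 * std::log(u1));
        // Each pair of uniforms yields two normal samples.
        out.push_back(r * std::cos(two_pi * u2));
        out.push_back(r * std::sin(two_pi * u2));
    }
    out.resize(n);  // drop the extra sample when n is odd
    return out;
}
```

Each pair of outputs costs a log, a sqrt, a sin and a cos, so a generator like this may well be noticeably slower per sample than a uniform generator. Timing it in isolation would show how much of the wrapper's run time it accounts for.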

So as you can see, there are no fancy vector/matrix manipulation
routines in the original Matlab script (i.e. no FFT, SVD, matrix
inverse...).

I profiled both the Matlab script and the C code. For both cases,
the majority of the running time (over 90%) is spent in the "decode"
function (which is written in C).

On my slow desktop running Windows XP, Matlab run time is about 33
seconds. On the same machine, compiled C EXE (using Visual C++ with
no optimization options) takes about 90 seconds. I also tried to
compile the code on a fast Linux Mandrake 10 box (turning on the
optimization using g++ -O) and it takes about 37 seconds to run.

I'm still confused...

Kevin

Rune Allnor

May 19, 2005, 1:35:49 AM

Kevin wrote:

> On my slow desktop running Windows XP, Matlab run time is about 33
> seconds. On the same machine, compiled C EXE (using Visual C++ with
> no optimization options) takes about 90 seconds.

The speed difference is a factor of 3 or so. Could this be due to
compiler optimizations? I've seen similar numbers when I ran an
optimizer on a C++ program compiled under HP-UX a few years ago.
Run-time improved from 50 s to 14 s when I switched from the
default compiler optimization option -O to -O3. This was with
the native HP compiler, though, not g++.
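
One way to check the compiler-optimization theory is to time the same code built with and without optimization (for example cl /Od versus cl /O2 with Visual C++, or g++ -O0 versus g++ -O2). The following is a minimal timing sketch. The workload function is a hypothetical stand-in for the real decoder loop, and both function names are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real workload (e.g. the decoder loop).
double workload(const std::vector<double>& v, int repeats) {
    double acc = 0.0;
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < v.size(); ++i)
            acc += v[i] * 0.5;
    return acc;
}

// Runs the workload once and returns the elapsed wall-clock time in seconds.
// The result is stored through 'result' so that an optimizing compiler
// cannot discard the computation being timed.
double time_workload(const std::vector<double>& v, int repeats,
                     double* result) {
    assert(result != nullptr);
    const auto t0 = std::chrono::steady_clock::now();
    *result = workload(v, repeats);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Building this twice with different optimization flags and comparing the returned times gives a rough idea of how much the compiler settings alone contribute on a given machine.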

> I also tried to
> compile the code on a fast Linux Mandrake 10 box (turning on the
> optimization using g++ -O) and it takes about 37 seconds to run.
>
> I'm still confused...

Other than that, I'd look into whether complex numbers occur
anywhere in these computations. Matlab does things a bit
differently than usual, in that the real and imaginary parts
are stored in different arrays, not side by side in memory.
Such issues could have an impact on performance; you may
have to shuffle data around to fit with whatever format the
MEX file needs. I can't, on the other hand, see why complex
numbers should appear in this particular application.

Rune

Murphy O'Brien

May 19, 2005, 1:54:17 PM

I think the answer could lie in either randn or convenc. The Matlab
versions are very fast. For example, randn only takes about twice as
long as rand. If you substitute rand in the C code (just as a test),
does it speed it up much?

Murphy

Rajeev

May 19, 2005, 5:23:53 PM

Kevin wrote:
> Hi,
>
> Thank you all for the replies. Here is a summary of what my code
> does. For those that are familiar with communication systems: it's
> basically running performance tests (BER curves) for a convolutional
> decoder.
>
> (Original Setup)
>
> 1. Decoder functions written in C and compiled into DLL's so that it
> can be called from Matlab (I didn't do this part and I don't know
> much about MEX things so I can't elaborate more on that)
>
> 2. Matlab wrapper generates random bits (x = rand(1:N)) encodes using
> (convenc(x,trellis)) and then adds AWGN (y = x + randn(1,N)). Then
> this new vector is passed to the C decoding function (out =
> decode(y)) and then Matlab script counts the number of bit errors.
>
> (New Setup)
>
> 1. Same decoder functions as above (I have the C source).
>
> 2. Write a C version of the Matlab test wrapper (call the C "rand"
> function, write my own "randn" function using Box-Muller technique,
> write my own "convenc" using table lookup)

Have you pondered the memory footprint of the codes?

How big is the table used by your version of "convenc"? How much
memory do you have in your machine? The Matlab convenc may have
a very small memory requirement, while your convenc might possibly
be blowing away the cache. Does your timing run execute once through
or is there an overall loop?

If you want to pursue this idea further, take a look at the source code
for a memory bandwidth test such as STREAM, where you will find explicit
code whose function is to blow away the cache between tests.

http://www.cs.virginia.edu/stream/ref.html
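
To put the table-size question in perspective, here is a minimal sketch of a table-lookup convolutional encoder. The thread does not say which code Kevin used, so the rate-1/2, constraint-length-3 code with octal generators 7 and 5 is purely an assumption, as are the names Transition, build_table and convenc_75. For a code this small the table has only 8 entries, so cache pressure would only become plausible with a much larger table.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// One table entry: the two output bits and the next encoder state.
struct Transition {
    std::uint8_t out0, out1, next_state;
};

// Parity (XOR of all bits) of a small value.
static std::uint8_t parity(unsigned x) {
    std::uint8_t p = 0;
    while (x) {
        p ^= static_cast<std::uint8_t>(x & 1u);
        x >>= 1;
    }
    return p;
}

// Builds the 8-entry table. Index = (state << 1) | input_bit, where the
// 2-bit state holds the two previous input bits (assumed generators 7, 5).
std::array<Transition, 8> build_table() {
    std::array<Transition, 8> t{};
    for (unsigned state = 0; state < 4; ++state) {
        for (unsigned bit = 0; bit < 2; ++bit) {
            const unsigned reg = (bit << 2) | state;  // newest bit on the left
            t[(state << 1) | bit] = { parity(reg & 7u), parity(reg & 5u),
                                      static_cast<std::uint8_t>(reg >> 1) };
        }
    }
    return t;
}

// Encodes a bit vector starting from the all-zero state; two output bits
// are produced per input bit.
std::vector<std::uint8_t> convenc_75(const std::vector<std::uint8_t>& bits) {
    static const std::array<Transition, 8> table = build_table();
    std::vector<std::uint8_t> out;
    out.reserve(bits.size() * 2);
    unsigned state = 0;
    for (std::uint8_t b : bits) {
        assert(b <= 1);
        const Transition& tr = table[(state << 1) | b];
        out.push_back(tr.out0);
        out.push_back(tr.out1);
        state = tr.next_state;
    }
    return out;
}
```

With this layout the whole table occupies 24 bytes, so the per-bit cost is one small lookup and two appends.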

> 3. Compile into EXE and run.
>
> So as you can see, there are no fancy vector/matrix manipulation
> routines in the original Matlab script (i.e. no FFT, SVD, matrix
> inverse...).
>
> I profiled both the Matlab script and the C code. For both cases,
> the majority of the running time (over 90%) is spent in the "decode"
> function (which is written in C).
>
> On my slow desktop running Windows XP, Matlab run time is about 33
> seconds. On the same machine, compiled C EXE (using Visual C++ with
> no optimization options) takes about 90 seconds. I also tried to
> compile the code on a fast Linux Mandrake 10 box (turning on the
> optimization using g++ -O) and it takes about 37 seconds to run.

You wouldn't by chance have the Matlab time on the fast Linux box?
It would also be of interest to know the processor, its speed, and
how much memory each system has.

> I'm still confused...
>
> Kevin

And as an aside, I have been preferring to use C for time-sensitive code
since I figure there is less risk of unintended overhead, i.e. to my
way of thinking C is closer to what-you-write-is-what-you-get.

Good luck,
-rajeev-

Tian

May 15, 2012, 3:48:07 AM

Kevin <cufo...@hotmail.com> wrote in message

Of course, Matlab cannot be faster than C++, provided that they run the same algorithm. I tested a simple multiplication of two matrices of dimensions 1023 x 1024 and 1024 x 1023 respectively.
%m file running on MATLAB R2010b
a = rand(1023, 1024);
b = rand(1024, 1023);
c = zeros(1023, 1023);
for i = 1 : 1023
    for j = 1 : 1023
        for p = 1 : 1024
            c(i,j) = c(i,j) + a(i,p) * b(p,j);
        end
    end
end
%This program completes in approximately 26 seconds

//C++ file and run its executable file
//...
for(int i = 0; i < 1023; i++)
    for(int j = 0; j < 1023; j++)
        for(int p = 0; p < 1024; p++)
            c[i][j] += a[i][p] * b[p][j];
//This one completes in 20 seconds
However, the statement c = a * b in MATLAB completed in less than 1 second.
The only reasonable explanation seems to be that MATLAB applies highly optimized algorithms in its '*' operator.
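
Part of what an optimized library does differently is the order in which it touches memory. As a small illustration, the sketch below shows the same triple loop in two orders for row-major storage. The function names and the flat std::vector layout are assumptions, and this is far simpler than what a tuned BLAS actually does (blocking, vector instructions, threading).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matrices are stored row-major in flat vectors: a is m x k, b is k x n,
// c is m x n. The helper names are illustrative only.

// i-j-p order, as in the C++ loops above: the inner loop reads b with
// stride n, i.e. it jumps a whole row ahead on every iteration.
void matmul_ijp(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c,
                std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (std::size_t p = 0; p < k; ++p)
                s += a[i * k + p] * b[p * n + j];
            c[i * n + j] = s;
        }
    }
}

// i-p-j order: the inner loop walks both b and c with stride 1.
void matmul_ipj(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c,
                std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j)
            c[i * n + j] = 0.0;
        for (std::size_t p = 0; p < k; ++p) {
            const double aip = a[i * k + p];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aip * b[p * n + j];
        }
    }
}
```

Both functions compute the same product. On matrices around 1024 x 1024 the second order is typically friendlier to the cache, although how much it helps depends on the compiler and the machine.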

dpb

May 15, 2012, 7:18:10 AM

On 5/15/2012 2:48 AM, Tian wrote:
> Kevin <cufo...@hotmail.com> wrote in message

> Of course, Matlab cannot be faster than C++,...

Where does the "of course" come into play?

--

Nasser M. Abbasi

May 15, 2012, 9:40:37 AM

It really all depends on what is being done in terms
of use of external math libraries or not.

Matlab, on Intel hardware (maybe on AMD also), will
use Intel's MKL, a very highly optimized math library
http://en.wikipedia.org/wiki/Math_Kernel_Library

"Intel's Math Kernel Library (MKL) is a library of
optimized, math routines for science, engineering,
and financial applications. Core math functions
include BLAS, LAPACK, ScaLAPACK, Sparse Solvers,
Fast Fourier Transforms, and Vector Math."

So, if someone on Matlab is doing say x=A\b or A*B,
then Matlab will do this as fast as anybody else,
since it will end up making a call to the MKL.

(Actually Matlab's A*B is faster than Fortran's
MATMUL(A,B) based on what I read on the net.
http://gcc.gnu.org/onlinedocs/gfortran/MATMUL.html

You can google this.)

But for generic operations, i.e. pure m code,
I would expect compiled C++ code that runs
an equivalent algorithm to be faster than Matlab's.

That is why it is good to always try to use Matlab's
built-in functions. They most likely end up making
calls to such libraries as MKL and will be much
faster than home grown pure m code.

--Nasser

James Tursa

May 15, 2012, 10:11:07 AM

"Nasser M. Abbasi" <n...@12000.org> wrote in message <jotmcm$r2r$1...@speranza.aioe.org>...

>
> (Actually Matlab's A*B is faster than Fortran's
> MATMULT(A,B) based on what I read on the net.
> http://gcc.gnu.org/onlinedocs/gfortran/MATMUL.html

One example of this is my Intel Fortran compiler on 32-bit Windows XP. The intrinsic matmul is pitifully slow. Simply downloading the DGEMM source code from netlib.org and using that instead beats the intrinsic matmul. IMO, the intrinsic matmul is only useful for problems that you *know* are small and will not dominate the run-time if matmul is used.

James Tursa

dpb

May 24, 2012, 9:46:02 PM

That is a "quality of implementation" issue; not a fundamental quality
of the Fortran intrinsic since the Fortran Standard doesn't specify
anything other than the arguments and expected results--it's up to the
compiler vendor for the implementation.

That said, since vendors are aware of the availability of LINPACK and
friends, it's hardly likely they will spend much effort in the area of
MATMUL and it's more beneficial to users of their compilers that they
concentrate on the stuff that only they as the compiler vendor _can_ do.

--

Ben

Jun 15, 2012, 2:24:07 PM

I'm glad this thread re-opened fairly recently as I'm dealing with similar issues. Basically I have some code that is very computationally complex - operating over all partitions of a set and things like that. My job this summer is to improve the code's performance mostly by vectorization (so many unnecessary for-loops!) and by creating MEX files. I've never created MEX files before so I've been doing some tests.

My main question was about the speed of vectorization vs. C++ loops. I wrote this test:

function mex_vs_vectorize_test

test_vector = randperm(1000000)/1000000;

count = 0;
tic
for i = 1:size(test_vector,2)
    count = count + test_vector(i);
end
toc

tic
count = sum(test_vector);
toc

tic
count = mexTest(test_vector);
toc

The function mexTest calls a MEX-file that runs this function:

void sumUp(double *values, mwSize valuesSize, double *sum) {
    *sum = 0;
    for(mwSize i = 0; i < valuesSize; i++) {
        *sum = *sum + *values;
        values++;
    }
}

Like some of the observations mentioned earlier, the fastest of the three is MATLAB's vectorized computation.

So my question to the community is this: Ultimately is it worth just writing everything in C++ and ignoring MATLAB's advantage in this specific case? Should I consider using the MATLAB Engine to send certain vectorizable for-loops back to MATLAB (my guess is that this isn't the case)... or maybe I should be using the libraries mentioned above in my C++ code... is this possible when creating MEX-files? I will definitely have some for-loops that cannot be vectorized and which call functions that may contain for-loops that can be vectorized...

...advice ...thoughts?

Thanks!

James Tursa

Jun 15, 2012, 2:49:08 PM

"Ben " <bms...@columbia.edu> wrote in message <jrfuk7$d6k$1...@newscl01ah.mathworks.com>...

> I'm glad this thread re-opened fairly recently as I'm dealing with similar issues. Basically I have some code that is very computationally complex - operating over all partitions of a set and things like that. My job this summer is to improve the code's performance mostly by vectorization (so many unnecessary for-loops!) and by creating MEX files. I've never created MEX files before so I've been doing some tests.
>
> My main question was about the speed of vectorization vs. C++ loops. I wrote this test:
>
> function mex_vs_vectorize_test
>
> test_vector = randperm(1000000)/1000000;
>
> count = 0;
> tic
> for i = 1:size(test_vector,2)
> count = count + test_vector(i);
> end
> toc
>
> tic
> count = sum(test_vector);
> toc
>
> tic
> count = mexTest(test_vector);
> toc
>
> The function mexTest calls a MEX-file that runs this function:
>
> void sumUp(double *values, mwSize valuesSize, double *sum) {
> *sum = 0;
> for(mwSize i = 0; i < valuesSize; i++) {
> *sum = *sum + *values;
> values++;
> }
> }
>
> Like some of the observations mentioned earlier, the fastest of the three is MATLAB's vectorized computation.

My guess is that the sum function is parallelized in the background, whereas your C++ loop is not. If you were to wrap your loop in an OMP directive with a reduction, you could probably match MATLAB for speed in this particular case. Caveat: This is just a guess ... I haven't actually run any code to verify this yet.
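
The OpenMP reduction idea could be sketched as follows. This is a minimal illustration rather than tested MEX code: the function name sum_reduction is an assumption, and the signed loop index is used because older OpenMP implementations (such as the OpenMP 2.0 support in Visual C++) require a signed index in a parallel for loop.

```cpp
#include <cassert>
#include <cstddef>

// Sums n doubles. When the file is compiled with OpenMP enabled (for
// example g++ -fopenmp or cl /openmp) the loop is split across threads and
// the per-thread partial sums are combined at the end. Without OpenMP the
// pragma is ignored and the loop runs serially.
double sum_reduction(const double* values, std::ptrdiff_t n) {
    assert(values != nullptr || n == 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += values[i];
    }
    return sum;
}
```

Because the partial sums are combined in a different order than in a serial loop, the floating-point result can differ in the last few bits from the serial one.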

> So my question to the community is this: Ultimately is it worth just writing everything in C++ and ignoring MATLAB's advantage in this specific case? Should I consider using the MATLAB Engine to send certain vectorizable for-loops back to MATLAB (my guess is that this isn't the case)... or maybe I should be using the libraries mentioned above in my C++ code... is this possible when creating MEX-files? I will definitely have some for-loops that cannot be vectorized and which call functions that may contain for-loops that can be vectorized...

The answer is ... it depends. Several MATLAB functions have C++ code in the background, and many of these are parallelized. The list of which functions do what is constantly changing from release to release. And how well the JIT optimizes your m-code is also changing from release to release. So you might be able to beat it in a previous release, but not the current release. I would first use the profiler to point me to where the majority of time is being spent, work on improving that first, and *then* worry about a mex routine.

That being said, some things to keep in mind are:

- Run times can often be significantly affected by memory access patterns, sometimes even dominated by memory access patterns. Try to use algorithms that drag memory through the high level cache only once if possible, and in a linear fashion.

- Variable slicing can be pretty poor performance on the MATLAB side of things. E.g., every time you do X(:,:,n) on MATLAB it causes a data copy to take place for the 2D slice. This can be avoided in a mex routine by accessing the data directly without a copy. If you have a lot of this going on in your code, then a mex routine can be faster.

- The MATLAB Engine will often result in a speed *loss* because of memory copy issues. To operate on a variable using the Engine, you have to *copy* the entire variable to the Engine first, then *copy* the result back to your program. This can easily triple the memory usage, at least temporarily, and rack up a tremendous amount of time in all the memory allocation & copying. So it will drag the overall performance as a result. These data copies can often be avoided in a mex routine.

- Basic linear algebra things like sum & matrix multiply will probably be hard to beat in a mex routine because MATLAB uses highly optimized third-party libraries for this, although in some specialized cases it can be done. E.g., doing lots of 2D array slice matrix multiplies can be done a lot quicker in a mex routine ... not because the underlying math is faster in a mex routine but because all the data slice copies can be avoided.
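
The point about avoiding slice copies can be illustrated with plain pointer arithmetic. MATLAB stores a rows x cols x pages array in column-major order, so page n is one contiguous block. The sketch below assumes that layout. The names PageView, page_view and sum_page are hypothetical, and in a real mex routine the base pointer would come from the mxArray (for example via mxGetPr) rather than from a local array.

```cpp
#include <cassert>
#include <cstddef>

// Read-only view of one 2D page of a column-major rows x cols x pages
// array. Only a pointer and two sizes are stored; no data is copied.
struct PageView {
    const double* data;
    std::size_t rows, cols;
    double at(std::size_t r, std::size_t c) const {
        return data[r + c * rows];
    }
};

// Returns a view of page n (zero-based), the analogue of X(:,:,n+1).
PageView page_view(const double* x, std::size_t rows, std::size_t cols,
                   std::size_t pages, std::size_t n) {
    assert(x != nullptr && n < pages);
    return PageView{ x + n * rows * cols, rows, cols };
}

// Example consumer: sums one page in a single linear pass over memory.
double sum_page(const PageView& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.rows * v.cols; ++i)
        s += v.data[i];
    return s;
}
```

Looping over pages with page_view touches each element once and in order, which also fits the earlier advice about linear memory access.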

James Tursa

Nasser M. Abbasi

Jun 15, 2012, 3:02:27 PM

On 6/15/2012 1:49 PM, James Tursa wrote:

> - Basic linear algebra things like sum & matrix multiply will probably be
> hard to beat in a mex routine because MATLAB uses highly optimized third-party
> libraries for this

The reason for this is Matlab's use of the Intel Math Kernel Library (MKL),
and only on Intel hardware. I do not know of any other third-party libraries
that cause this speed advantage.

Some performance numbers are posted here :

http://julialang.org/

looking at the chart, Matlab is faster than C for the
random matrix multiplication test. It says below that:

"Matlab’s ability to beat both C and Julia on random
matrix multiplication comes from its use of the proprietary
Intel Math Kernel Library, which has extremely optimized
code for matrix multiplication on Intel platforms"

But for non-linear-algebra tests that do not take
advantage of the MKL, the speed advantage seems to go
away.

--Nasser

James Tursa

Jun 15, 2012, 3:50:08 PM

"James Tursa" wrote in message <jrg034$jno$1...@newscl01ah.mathworks.com>...

>
> My guess is that the sum function is parallelized in the background, whereas your C++ loop is not. If you were to wrap your loop in an OMP directive with a reduction, you could probably match MATLAB for speed in this particular case. Caveat: This is just a guess ... I haven't actually run any code to verify this yet.

Follow-up: I should probably add to the list that you should get a good compiler if your goal is to get as much speed as possible. E.g., I just ran some sum tests with R2012a on an Intel Core 2 Duo. Using the supplied LCC compiler a mex routine took 3 times as long as the intrinsic MATLAB sum function. Using Microsoft Visual Studio gave times that were roughly 5-10% better than the MATLAB sum function. Using MSVC with OpenMP gave the same timing result (indicating to me that there was already some parallelization going on in the background from the compiler even when OpenMP was not used).

James Tursa