[Numpy-discussion] numpy speed question


Jean-Luc Menut

Nov 25, 2010, 5:13:49 AM
to numpy-di...@scipy.org
Hello all,

I have a little question about the speed of numpy vs IDL 7.0. I did a
very simple check by computing just a cosine in a loop. I was quite
surprised to see an order of magnitude of difference between numpy and
IDL; I would have thought that for such a basic function, the speed
would be approximately the same.

I suppose that some of the difference may come from the default data
type of 64 bits in numpy and 32 bits in IDL. Is there a way to change
the numpy default data type (without recompiling)?

And I'm not an expert at all; maybe there is a better explanation, like
better use of multiple CPU cores by IDL?

I'm working with Windows 7 64-bit on a Core i7.

Any hint is welcome.
Thanks.

Here is the IDL code:
Julian1 = SYSTIME( /JULIAN, /UTC )
for j=0,9999 do begin
  for i=0,999 do begin
    a = cos(2*!pi*i/100.)
  endfor
endfor
Julian2 = SYSTIME( /JULIAN, /UTC )
print, (Julian2-Julian1)*86400.0
end

result:
% Compiled module: $MAIN$.
2.9999837


The Python code:
from numpy import *
from time import time

time1 = time()
for j in range(10000):
    for i in range(1000):
        a = cos(2*pi*i/100.)
time2 = time()
print time2-time1

result:
In [2]: run python_test_speed.py
24.1809999943


Sebastian Walter

Nov 25, 2010, 5:38:06 AM
to Discussion of Numerical Python
Using math.cos instead of numpy.cos should be much faster.
I believe this is a known issue with numpy.
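For example, the original loop changes only in where the cosine comes
from (a sketch of the swap, nothing else modified):

from math import cos, pi
from time import time

time1 = time()
for j in range(10000):
    for i in range(1000):
        a = cos(2*pi*i/100.)   # math.cos on a plain Python float
time2 = time()
print time2-time1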

Jean-Luc Menut

Nov 25, 2010, 5:49:14 AM
to numpy-di...@scipy.org
On 25/11/2010 11:38, Sebastian Walter wrote:
> Using math.cos instead of numpy.cos should be much faster.
> I believe this is a known issue with numpy.

You're right: with math.cos, the code takes 4.3 s to run. Not as fast
as IDL, but a lot better.

Ernest Adrogué

Nov 25, 2010, 5:51:07 AM
to numpy-di...@scipy.org
Hi,

25/11/10 @ 11:13 (+0100), thus spake Jean-Luc Menut:


> I suppose that some of the difference may come from the default data
> type of 64 bits in numpy and 32 bits in IDL. Is there a way to change
> the numpy default data type (without recompiling)?

This is probably not the issue.
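As far as I know, numpy has no global default-type switch anyway; you
choose the precision per array, e.g. (a sketch):

import numpy as np

a64 = np.arange(1000)/100.                    # float64, the default
a32 = np.arange(1000, dtype=np.float32)/100.  # stays float32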

> And I'm not an expert at all; maybe there is a better explanation, like
> better use of multiple CPU cores by IDL?

I'm not an expert either, but the basic idea you have to get is that
"for" loops in Python are slow, and Numpy is not going to change this.
Instead, Numpy allows you to work with "vectors" and "arrays" so that
you don't need to put loops in your code. So you have to change the way
you think about things; it takes a little getting used to at first.
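For instance, the inner loop from the original post collapses into a
single array expression (a sketch):

import numpy as np

a = np.cos(2*np.pi*np.arange(1000)/100.)  # all 1000 cosines in one call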

Cheers,

--
Ernest

Dave Hirschfeld

Nov 25, 2010, 5:49:57 AM
to numpy-di...@scipy.org
Jean-Luc Menut <jeanluc.menut <at> free.fr> writes:
>
> I have a little question about the speed of numpy vs IDL 7.0.
>
> Here the IDL result:

> % Compiled module: $MAIN$.
> 2.9999837
>
> The python code:
> from numpy import *
> from time import time
> time1 = time()
> for j in range(10000):
>     for i in range(1000):
>         a = cos(2*pi*i/100.)
> time2 = time()
> print time2-time1
>
> result:
> In [2]: run python_test_speed.py
> 24.1809999943
>

Whilst you've imported everything from numpy, you're not really using
numpy - you're still using a slow Python (double) loop. The power of
numpy comes from vectorising your code, i.e. applying functions to whole
arrays of data.

The example below demonstrates an 80-fold increase in speed by
vectorising the calculation:

import numpy as np
from numpy import empty, cos, pi

def method1():
    a = empty([1000, 10000])
    for j in range(10000):
        for i in range(1000):
            a[i, j] = cos(2*pi*i/100.)
    return a

def method2():
    ij = np.repeat((2*pi*np.arange(1000)/100.)[:, None], 10000, axis=1)
    return np.cos(ij)


In [46]: timeit method1()
1 loops, best of 3: 47.9 s per loop

In [47]: timeit method2()
1 loops, best of 3: 589 ms per loop

In [48]: allclose(method1(), method2())
Out[48]: True

Jean-Luc Menut

Nov 25, 2010, 5:55:24 AM
to numpy-di...@scipy.org
On 25/11/2010 11:51, Ernest Adrogué wrote:
> I'm not an expert either, but the basic idea you have to get is that
> "for" loops in Python are slow, and Numpy is not going to change this.
> Instead, Numpy allows you to work with "vectors" and "arrays" so that
> you don't need to put loops in your code. So you have to change the way
> you think about things; it takes a little getting used to at first.

Yes, I know, but IDL shares this characteristic with numpy, and
sometimes you cannot avoid loops. Anyway, it was just a test to compare
the speed of the cosine function in IDL and numpy.

Alan G Isaac

Nov 25, 2010, 9:00:57 AM
to Discussion of Numerical Python
On 11/25/2010 5:55 AM, Jean-Luc Menut wrote:
> it was just a test to compare the speed of
> the cosine function in IDL and numpy

The point others are trying to make is that
you *instead* tested the speed of creation
of a certain object type. To test the *function*
speeds, feed both functions large arrays.

>>> import math
>>> import numpy as np
>>> type(0.5)
<type 'float'>
>>> type(math.cos(0.5))
<type 'float'>
>>> type(np.cos(0.5))
<type 'numpy.float64'>
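With a large array the comparison flips; something along these lines
shows it (a sketch; timings will vary by machine):

import math
import numpy as np
from time import time

x = np.linspace(0, 2*np.pi, 1000000)

t0 = time()
y = np.cos(x)                 # one vectorised call over the whole array
t1 = time()
z = [math.cos(v) for v in x]  # per-element Python-level loop
t2 = time()
print t1-t0, t2-t1            # the vectorised call wins by a wide margin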

hth,
Alan Isaac

David Cournapeau

Nov 25, 2010, 5:31:13 PM
to Discussion of Numerical Python
On Thu, Nov 25, 2010 at 7:55 PM, Jean-Luc Menut <jeanlu...@free.fr> wrote:

> Yes I know but IDL share this characteristics with numpy, and sometimes
> you cannot avoid loop. Anyway it was just a test to compare the speed of
> the cosine function in IDL and numpy.

No, you compared IDL looping and Python looping; you did not even use
numpy. Loops are slow in Python, and will remain so in the near future.
OTOH, there are many ways to deal with this issue in Python compared to
IDL (Cython being a fairly popular one).

David

Gökhan Sever

Nov 25, 2010, 4:34:24 PM
to Discussion of Numerical Python

The vectorised numpy version already blows the other results away.

Here is what I get using the IDL version (with IDL v7.1):

IDL> .r test_idl
% Compiled module: $MAIN$.
4.0000185

I[10]: time run test_python
43.305727005

and using a Cythonized version:

from math import pi

cdef extern from "math.h":
    float cos(float)

cpdef float myloop(int n1, int n2, float n3):
    cdef float a
    cdef int i, j
    for j in range(n1):
        for i in range(n2):
            a = cos(2*pi*i/n3)
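A minimal setup.py for building this, assuming the Cython source above
is saved as mycython.pyx (the usual Cython/distutils recipe of the time,
sketched here, not necessarily the exact file used):

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("mycython", ["mycython.pyx"])],
)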

Compiling with "python setup.py build_ext --inplace" and importing the
function into IPython:

from mycython import myloop

I[6]: timeit myloop(10000, 1000, 100.0)
1 loops, best of 3: 2.91 s per loop


--
Gökhan

Bruce Sherwood

Nov 26, 2010, 11:48:39 AM
to Discussion of Numerical Python
Although this was mentioned earlier, it's worth emphasizing that if
you need to use functions such as cosine with scalar arguments, you
should use math.cos(), not numpy.cos(). The numpy versions of these
functions are optimized for handling array arguments and are much
slower than the math versions for scalar arguments.
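The scalar overhead is easy to see by timing a single call both ways (a
sketch; exact numbers vary by machine):

import math
import numpy as np
from time import time

n = 1000000
t0 = time()
for k in xrange(n):
    math.cos(0.5)   # thin wrapper around the C library call
t1 = time()
for k in xrange(n):
    np.cos(0.5)     # ufunc machinery plus numpy.float64 wrapping
t2 = time()
print t1-t0, t2-t1  # np.cos is markedly slower on scalars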

Bruce Sherwood

Francesc Alted

Nov 26, 2010, 1:03:03 PM
to Discussion of Numerical Python
On Thursday 25 November 2010 11:13:49, Jean-Luc Menut wrote:

> Hello all,
>
> I have a little question about the speed of numpy vs IDL 7.0. I did a
> very simple check by computing just a cosine in a loop. I was quite
> surprised to see an order of magnitude of difference between numpy and
> IDL; I would have thought that for such a basic function, the speed
> would be approximately the same.
>
> I suppose that some of the difference may come from the default data
> type of 64 bits in numpy and 32 bits in IDL. Is there a way to change
> the numpy default data type (without recompiling)?
>
> And I'm not an expert at all; maybe there is a better explanation, like
> better use of multiple CPU cores by IDL?

As others have already pointed out, you should make sure that you use
numpy.cos with arrays in order to get good performance.

I don't know whether IDL is using multiple cores or not, but if you are
looking for ultimate performance, you can always use Numexpr, which
makes use of multiple cores. For example, using a machine with 8 cores
(w/ hyperthreading), we have:

>>> from math import pi
>>> import numpy as np
>>> import numexpr as ne
>>> i = np.arange(1e6)
>>> %timeit np.cos(2*pi*i/100.)
10 loops, best of 3: 85.2 ms per loop
>>> %timeit ne.evaluate("cos(2*pi*i/100.)")
100 loops, best of 3: 8.28 ms per loop

If you don't have a machine with a lot of cores but still want good
performance, you can link Numexpr against Intel's VML (Vector Math
Library). For example, using Numexpr+VML with only one core (on another
machine):

>>> %timeit np.cos(2*pi*i/100.)
10 loops, best of 3: 66.7 ms per loop
>>> ne.set_vml_num_threads(1)
>>> %timeit ne.evaluate("cos(2*pi*i/100.)")
100 loops, best of 3: 9.1 ms per loop

which also gives a pretty good speedup. Curiously, Numexpr+VML is not
that good at using multiple cores in this case:

>>> ne.set_vml_num_threads(2)
>>> %timeit ne.evaluate("cos(2*pi*i/100.)")
10 loops, best of 3: 14.7 ms per loop

I don't really know why Numexpr+VML is taking more time using 2 threads
than only one, but it is probably due to Numexpr requiring better
fine-tuning in combination with VML :-/

--
Francesc Alted

Jean-Luc Menut

Dec 1, 2010, 5:23:22 AM
to Discussion of Numerical Python
On 26/11/2010 17:48, Bruce Sherwood wrote:
> Although this was mentioned earlier, it's worth emphasizing that if
> you need to use functions such as cosine with scalar arguments, you
> should use math.cos(), not numpy.cos(). The numpy versions of these
> functions are optimized for handling array arguments and are much
> slower than the math versions for scalar arguments.


Yes, I understand that. I just want to stress that it was not a
benchmark (nor a criticism) but a test to find out whether it was worth
translating IDL code directly into python/numpy before trying to
optimize it (I know python better than IDL). I expected approximately
the same speed for both, was surprised by the result, and wanted to know
if there was an obvious reason besides the lack of optimization for
scalars.
