f2py vs Cython


Alex van Houten

Sep 22, 2011, 12:38:30 PM
to cython...@googlegroups.com
I ran into this post:
http://technicaldiscovery.blogspot.com/2011/06/speeding-up-python-numpy-cython-and.html
It is about solving Laplace's equation in 2D.
I was looking at the comments by Ondřej Čertík.
He used "vectorized" Fortran.
(Btw, what does "vectorized" mean in this context?)

He was getting a speed increase of a factor of 6.5 compared to
the slower Cython version, which would translate to a speed
increase of a factor of 6.5*0.94/1.28=4.8 over "fast" Cython.
I was amazed by this result, although Travis did not reproduce
exactly this number.
I am now considering translating some of my Cython code to Fortran
and using f2py.
Can anyone comment on this?
Are speed benefits this large actually attainable?

Alex.

Dag Sverre Seljebotn

Sep 22, 2011, 1:59:12 PM
to cython...@googlegroups.com
On 09/22/2011 06:38 PM, Alex van Houten wrote:
> I ran into this post
> http://technicaldiscovery.blogspot.com/2011/06/speeding-up-python-numpy-cython-and.html
> It is about solving the Laplace's equation in 2D.
> I was looking at the comments by Ondřej Čertík.
> He used "vectorized" Fortran.
> (btw, what does "vectorized" mean in this context?)

It means the loop was detected to do the same operation elementwise on all
elements, e.g., A = C + D in NumPy. This enables the use of many
optimizations, most importantly SSE (google it), which gives a 2x speedup
for double precision and 4x for single precision.


http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
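
As a toy illustration (my own sketch, not from the blog post): the
explicit Python loop below works one element at a time, while the single
array expression is what "vectorized" refers to -- the whole operation is
expressed elementwise, so the compiled loop underneath can use SSE.

import numpy as np

C = np.random.rand(1000)
D = np.random.rand(1000)
A = np.empty(1000)

# Scalar loop: one element per iteration.
for i in range(1000):
    A[i] = C[i] + D[i]

# "Vectorized" form: one elementwise statement, executed in a compiled
# loop that the C compiler can map onto SSE instructions.
A = C + D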

>
> He was getting a speed increase of a factor 6.5 compared to
> the slower Cython version which would translate to a speed
> increase of a factor 6.5*0.94/1.28=4.8 over "fast" Cython.
> I was amazed by this result. Although Travis did not reproduce
> exactly this number.
> I am now considering translating some of my Cython code to Fortran
> and using f2py.
> Can anyone comment on this?
> Are speed benefits this large actually attainable?

Yes, but only with vectorizable code. (You should Google that for more
info).

You should also try numexpr and Theano for vectorizable code, for
comparison; they are libraries that can be used from pure Python to get
impressive speedups in similar cases.
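
For instance, a minimal numexpr sketch (my own example, just to show the
idea):

import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# numexpr compiles the string expression and evaluates it in blocked,
# multithreaded loops, avoiding the large temporaries NumPy would create.
result = ne.evaluate("2*a + 3*b")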

Dag Sverre

Chris.Barker

Sep 22, 2011, 5:43:03 PM
to cython...@googlegroups.com
On 9/22/11 10:59 AM, Dag Sverre Seljebotn wrote:
>> He used "vectorized" Fortran.
>> (btw, what does "vectorized" mean in this context?)

>> I am now considering translating some of my Cython code to Fortran
>> and using f2py.

If you do go the Fortran route, you might want to look at fwrap rather than f2py:

http://fortrancython.wordpress.com/

I'm not exactly sure of its status, but it may be a better option than f2py if
you are already using Cython.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Alex van Houten

Sep 23, 2011, 4:43:44 AM
to cython...@googlegroups.com
Dag Sverre Seljebotn <d.s.seljebotn <at> astro.uio.no> writes:

> You should also try numexpr and Theano for vectorizable code for
> comparison, they are libraries that can be used from pure Python to get
> impressive speedups in similar cases.
>

Hmmm. It seems hard to make this work in my case. I use Cython's "nogil" to release the GIL in
Python threads, but I guess one cannot use Theano or Numexpr within a "with
nogil" block.
Or is that still possible?

I was looking at work by James Bergstra
https://bitbucket.org/jaberg/theano-cython/overview
in particular at the basic subtensor example
https://bitbucket.org/jaberg/theano-cython/changeset/5cf01ca6b24c
He uses a decorator
"
from theano.gof import local_optimizer
.....
@register_specialize
@local_optimizer()
def replace_Subtensor_with_CythonSubtensor(node):
.....
"

Now would that also work like this:
"
@register_specialize
@local_optimizer()
def replace_Subtensor_with_CythonSubtensor(node):
    ....
    with nogil:
        number crunching
    ...
"
?

Alex


Francesc Alted

Sep 23, 2011, 11:00:33 AM
to cython...@googlegroups.com
2011/9/23 Alex van Houten <sparr...@yahoo.com>:

> Dag Sverre Seljebotn <d.s.seljebotn <at> astro.uio.no> writes:
>
>> You should also try numexpr and Theano for vectorizable code for
>> comparison, they are libraries that can be used from pure Python to get
>> impressive speedups in similar cases.
>>
> Hmmm. It seems hard to fix this. I use cython "nogil" to release the GIL in
> Python threads, but I guess one cannot use any Theano or Numexpr within a "with
> nogil" block.
> Or is this still possible?

Well, I cannot speak for Theano, but at least when using numexpr, you
don't need to release the GIL explicitly, because the GIL is released
internally as soon as computation goes into C space. So, you can
call Numexpr in your Cython code without worrying about the GIL at
all.

--
Francesc Alted

Rajeev Singh

Sep 23, 2011, 9:57:28 AM
to cython...@googlegroups.com

Hi,

Let me comment on the mentioned benchmark. First of all, the
calculation done by Cython is not the same as that done by NumPy and
vectorized Fortran: the difference is a parallel update (Jacobi
iteration in the case of NumPy and Fortran) versus a sequential update
(Gauss-Seidel iteration in the case of Cython). The Cython code is not
vectorized (in the SIMD sense) in this study, which leads to a loss of a
factor of about 2 to 3.
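
To illustrate the distinction (my own sketch, not the benchmark code):
the Jacobi update reads only old values from a copy, so every point is
independent and the sweep can be vectorized, while Gauss-Seidel reuses
freshly updated neighbours and therefore has to run sequentially.

import numpy as np

def jacobi_step(u, dx2, dy2):
    # Parallel update: new values depend only on the old array `old`,
    # so the whole sweep is a single elementwise (vectorizable) expression.
    old = u.copy()
    u[1:-1, 1:-1] = ((old[2:, 1:-1] + old[:-2, 1:-1]) * dy2 +
                     (old[1:-1, 2:] + old[1:-1, :-2]) * dx2) / (2.0 * (dx2 + dy2))

def gauss_seidel_step(u, dx2, dy2):
    # Sequential update: u[i-1, j] and u[i, j-1] were already updated in
    # this sweep, so iterations depend on each other and cannot be
    # vectorized in the same straightforward way.
    for i in range(1, u.shape[0] - 1):
        for j in range(1, u.shape[1] - 1):
            u[i, j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                       (u[i, j+1] + u[i, j-1]) * dx2) / (2.0 * (dx2 + dy2))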

That being said, I benchmarked it again, and the following are some relevant results:

3 : 0.3609 seconds Vectorized Fortran (pure, intel)
4 : 0.2883 seconds Vectorized C (pure, intel)
6 : 0.5839 seconds Cython with parallel update
9 : 0.2115 seconds Vectorized Fortran (f2py, intel)
10 : 2.4876 seconds NumPy
11 : 1.4110 seconds Cython

The difference between the best I could get with Cython and f2py with
the Intel compiler was a factor of 3. I have put the benchmark codes here:

http://dl.dropbox.com/u/5939736/laplace.zip

If you have sage just type -

sage laplace.py

in the laplace directory and it will compile and run all the codes. If
you have the Intel compiler, it can be included in the benchmark by
editing one line in the file laplace.py (just set intel=True).

Rajeev

Alex van Houten

Sep 23, 2011, 4:47:08 PM
to cython...@googlegroups.com
Rajeev Singh <rajs2010 <at> gmail.com> writes:

> Let me comment on the mentioned benchmark. First of all the
> calculation done by cython is not the same as that by numpy and
> vectorized fortran. The difference being parallel update (jacobi
> iteration in case of numpy and fortran) and sequential update
> (gauss-seidel iteration in case of cython). The cython code is not
> vectorized (in SIMD sense) in this study. That leads to a loss of a
> factor of about 2 to 3.
>
> That being said I benchmarked it again and following are some relevant results
>

> 3 : 0.3609 seconds Vectorized Fortran (pure, intel)
> 4 : 0.2883 seconds Vectorized C (pure, intel)
> 6 : 0.5839 seconds Cython with parallel update
> 9 : 0.2115 seconds Vectorized Fortran (f2py, intel)
> 10 : 2.4876 seconds NumPy
> 11 : 1.4110 seconds Cython
>
> The difference between the best I could get with Cython and f2py with
> intel compiler was a factor of 3. I have put the benchmark codes here
> -
>
> http://dl.dropbox.com/u/5939736/laplace.zip
>
> If you have sage just type -
>
> sage laplace.py
>

Thanks. Great job. So the conclusion remains that Cython cannot keep up with C
or Fortran because of the vectorization (SIMD) issue. That is pretty serious. I
tried Numexpr instead of Cython, but that gave too much overhead; it actually
made things run slower. I might still try Theano.
What is the outcome if you use GNU instead of Intel? Can you still apply SSE
when compiling?

Btw, I could not reproduce your results; "sage laplace.py" gave
"gnu: no Fortran 90 compiler found
.....
don't know how to compile Fortran code on platform 'posix'"
but gfortran is installed (in /usr/bin).
I installed g95 from http://www.g95.org/ too, but that didn't make a difference.
Any clue?

Alex.


Robert L Cloud

Sep 23, 2011, 5:21:05 PM
to cython-users
I don't think that is a valid conclusion to draw. It is really
not a fair comparison. Gauss-Seidel iteration is very difficult to
vectorize with SSE intrinsics because it requires data from the
current, partially updated iteration as well as data from the previous
one. One can't just pull in vectors and run the code; it must be done
sequentially. While there are some parallel schemes for Gauss-Seidel,
e.g., red-black coloring, they are inherently less parallel than Jacobi.

This really has nothing to do with Cython. It also isn't a fair test,
because the number of iterations required by each scheme varies
widely. I've whipped up a simple Python script using Gauss-Seidel with
successive over-relaxation, and for a 30 x 30 domain it takes 460
iterations to converge to a steady state, while pure Jacobi
takes 1467.

Gauss Seidel can be found here:
http://dl.dropbox.com/u/33476113/code/python_scripts/gauss_seidel_sequential_or_2d.py
jacobi here: http://dl.dropbox.com/u/33476113/code/python_scripts/jacobi_2d_straight_python.py

As for your question about GNU vs Intel, I've done some tests; see
here: http://bit.ly/oHuk0I for the relative speed of Cython compiled
with the Intel vs GNU compilers.

There was also a recent thread in here about applying SSE intrinsics
to Cython. You may want to have a look at it.
> I installed g95 from http://www.g95.org/ too, but that didn't make a difference.
> Any clue?
>
> Alex.

Robert Bradshaw

Sep 23, 2011, 5:40:49 PM
to cython...@googlegroups.com
On Fri, Sep 23, 2011 at 1:47 PM, Alex van Houten <sparr...@yahoo.com> wrote:
> Rajeev Singh <rajs2010 <at> gmail.com> writes:
>
>> Let me comment on the mentioned benchmark. First of all the
>> calculation done by cython is not the same as that by numpy and
>> vectorized fortran. The difference being parallel update (jacobi
>> iteration in case of numpy and fortran) and sequential update
>> (gauss-seidel iteration in case of cython). The cython code is not
>> vectorized (in SIMD sense) in this study. That leads to a loss of a
>> factor of about 2 to 3.
>>
>> That being said I benchmarked it again and following are some relevant results
>>
>>   3 :   0.3609 seconds  Vectorized Fortran (pure, intel)
>>   4 :   0.2883 seconds  Vectorized C (pure, intel)
>>   6 :   0.5839 seconds  Cython with parallel update
>>   9 :   0.2115 seconds  Vectorized Fortran (f2py, intel)
>>  10 :   2.4876 seconds  NumPy
>>  11 :   1.4110 seconds  Cython
>>
>> The difference between the best I could get with Cython and f2py with
>> intel compiler was a factor of 3. I have put the benchmark codes here
>> -
>>
>> http://dl.dropbox.com/u/5939736/laplace.zip
>>
>> If you have sage just type -
>>
>> sage laplace.py
>>
> Thanks. Great job. So the conclusion remains that Cython cannot keep up with C
> or Fortran because of the vectorization (SIMD) issue. That is pretty serious.

But there's no reason one couldn't write the "Vectorized C" version in
Cython and compile it with the Intel compiler, right? That would seem
to be a fairer comparison (not to mention the difference in
algorithms).

- Robert

Rajeev Singh

Sep 23, 2011, 5:46:28 PM
to cython...@googlegroups.com

Hi,

@Alex - For me it worked with gfortran. Can you play around with the
Makefile and make it work?

@Robert Cloud - As one can see from the benchmark codes, I run each
simulation for a fixed number of steps. I get different
answers for the two algorithms, but to be fair I only compare performance
within the same algorithm.

@Robert Bradshaw - In the list of benchmarks I have a vectorized C
version in Cython, which I have called "Cython with parallel update",
and its performance is still 3 times slower than the f2py vectorized version.

Rajeev

Pauli Virtanen

Sep 23, 2011, 6:18:53 PM
to cython...@googlegroups.com
On Sat, 24 Sep 2011 03:16:28 +0530, Rajeev Singh wrote:
> @Robert Bradshaw - In the list of benchmarks I have a vectorized C
> version in Cython which I have called "Cython with parallel update"
> whose performance is still 3 times slower than f2py vectorized.

You also need

#cython: cdivision=True

as otherwise the inner loop contains a branch.

--
Pauli Virtanen

Pauli Virtanen

Sep 23, 2011, 6:33:23 PM
to cython...@googlegroups.com
On Fri, 23 Sep 2011 20:47:08 +0000, Alex van Houten wrote:
[clip]

> Thanks. Great job. So the conclusion remains that Cython cannot keep up
> with C or Fortran because of the vectorization (SIMD) issue. That is
> pretty serious.

This statement is more or less equivalent to "C is slower than Fortran",
as the generated Cython code is pretty much what you would write in C.
(Provided you set boundscheck=False, cdivision=True, etc.)

C does not have vectorized statements, so the burden falls on the
compiler to detect vectorizable loops and do the optimization (at
least gcc 4.5.1 usually does not manage to pull this off).

Also, "C is slower than Fortran" really depends on the problem.
For arrays that don't fit into the CPU cache, for example, the speed
difference can vanish completely because everything is waiting
for CPU <-> memory transfers.

Rajeev Singh

Sep 23, 2011, 6:37:12 PM
to cython...@googlegroups.com

Hi,

Division is an interesting point altogether. I discovered that, of all
the examples I have here, only the pure Fortran programs are not affected
(I believe due to the compiler). All other versions become 2 to 3 times
faster if I replace the division by a multiplication by the inverse in the
innermost loop.
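
In other words, something along these lines (a sketch, not the actual
benchmark code):

def relax(u, out, dx2, dy2):
    # Division in the innermost statement: the divisor is loop-invariant,
    # but the generated code may still perform a (slow) divide per element.
    for i in range(1, u.shape[0] - 1):
        for j in range(1, u.shape[1] - 1):
            out[i, j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                         (u[i, j+1] + u[i, j-1]) * dx2) / (2.0 * (dx2 + dy2))

def relax_nodiv(u, out, dx2, dy2):
    # Same arithmetic, but the reciprocal is hoisted out of the loop and
    # the loop body only multiplies.
    inv = 1.0 / (2.0 * (dx2 + dy2))
    for i in range(1, u.shape[0] - 1):
        for j in range(1, u.shape[1] - 1):
            out[i, j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                         (u[i, j+1] + u[i, j-1]) * dx2) * inv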

> #cython: cdivision=True

This doesn't affect the performance.

"C slower than fortran" is not true for us as we have examples in pure
C and pure fortran and the performance is below -


0 : 0.8382 seconds Vectorized Fortran (pure)
1 : 0.4082 seconds Vectorized C (pure)
2 : 0.3880 seconds Vectorized C (pure, dyanamic array)
3 : 0.3442 seconds Vectorized Fortran (pure, intel)
4 : 0.2780 seconds Vectorized C (pure, intel)
5 : 0.3081 seconds Vectorized C (pure, dyanamic array, intel)

Rajeev

Rajeev Singh

Sep 23, 2011, 6:48:45 PM
to cython...@googlegroups.com

Hi,

I would like to make another observation. On a different machine, the
performance of the different versions is not very different:

7 : 2.2209 seconds Cython with parallel update
8 : 1.1973 seconds Cython vectorized
9 : 1.2073 seconds Vectorized Fortran (f2py)
10 : 1.0501 seconds Vectorized Fortran (f2py, intel)

I don't know why this happens.

Rajeev

Pauli Virtanen

Sep 23, 2011, 6:55:37 PM
to cython...@googlegroups.com
On Sat, 24 Sep 2011 04:07:12 +0530, Rajeev Singh wrote:
[clip]

> > #cython: cdivision=True
>
> This doesn't affect the performance.

Hmm, true. For some reason the cdivision directive doesn't work as
a comment. However, adding

@cython.cdivision(True)

to the routine does, and after that I get performance identical
to the "Vectorized C" code.

After that, we're down to "Fortran is faster than C", or not :)

--
Pauli Virtanen

Pauli Virtanen

Sep 23, 2011, 8:33:24 PM
to cython...@googlegroups.com
On Fri, 23 Sep 2011 22:55:37 +0000, Pauli Virtanen wrote:
[clip]

> After that, we're down to "Fortran is faster than C", or not :)

cimport cython
cimport numpy as np
import numpy as np

@cython.cdivision(True)
def cy_update(np.ndarray[double, ndim=2] out, double dx2, double dy2):
    cdef unsigned int i, j
    cdef np.ndarray[double, ndim=2] u

    u = out.copy()

    for i in xrange(1, u.shape[0]-1):
        for j in xrange(1, u.shape[1]-1):
            out[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                        (u[i, j+1] + u[i, j-1]) * dx2) * (1.0 / (2*(dx2+dy2)))


Note that the memory access pattern matters a lot: if I write "u = out"
instead of "u = out.copy()", the time taken doubles.

My guess is that this is a CPU cache prediction issue. In any case,
it has nothing to do with Cython: this also explains the difference
between "Looped fortran" and "Vectorized fortran".

0 : 0.1067 seconds Cython
1 : 0.0711 seconds Vectorized C
2 : 0.0506 seconds Vectorized Fortran
3 : 0.1328 seconds Looped Fortran (in-place)
4 : 0.0585 seconds Looped Fortran (copy)

The remaining difference between C and Cython comes from strided
indexing of Numpy arrays -- the C compiler does not know that the
array is C-contiguous. Finally, Fortran is slightly faster than C.

There's a factor of 2x difference between more or less straightforward
Cython code, and the equivalent (!) Fortran code.

However, as seen above, even in Fortran you can lose factors of
2 and more quite easily for completely counterintuitive reasons ---
which probably have to do with the memory architecture of your
machine.

(Also, it becomes pretty clear that making fair benchmarks is
quite difficult.)

--
Pauli Virtanen

Robert Bradshaw

Sep 23, 2011, 8:47:22 PM
to cython...@googlegroups.com

This can be specified; add mode="c" to your array declaration.
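
For example (a sketch of the declaration; the function itself is just an
illustration, not the benchmark code):

cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def scale(np.ndarray[double, ndim=2, mode="c"] a, double factor):
    # mode="c" promises a C-contiguous buffer, so indexing compiles down
    # to simple pointer arithmetic instead of general strided access.
    cdef unsigned int i, j
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            a[i, j] *= factor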

> Finally, Fortran is slightly faster than C.

I wonder if the restrict keyword could help here (not yet supported in
Cython, but could be added in C).

Alex van Houten

Sep 24, 2011, 2:55:32 AM
to cython...@googlegroups.com
Rajeev Singh <rajs2010 <at> gmail.com> writes:


> I would like to make another observation. On a different machine the
> performance is not very different -
>
> 7 : 2.2209 seconds Cython with parallel update
> 8 : 1.1973 seconds Cython vectorized
> 9 : 1.2073 seconds Vectorized Fortran (f2py)
> 10 : 1.0501 seconds Vectorized Fortran (f2py, intel)
>

What do you mean by Cython vectorized here?
Does that involve the Intel compiler?
Would gcc also work?
Which compiler arguments do you use?
Are you now able to explain the Cython difference in performance for the two
machines?

Alex

Rajeev Singh

Sep 24, 2011, 3:50:49 AM
to cython...@googlegroups.com
On Sat, Sep 24, 2011 at 12:25 PM, Alex van Houten <sparr...@yahoo.com> wrote:
> Rajeev Singh <rajs2010 <at> gmail.com> writes:
>
>
>> I would like to make another observation. On a different machine the
>> performance is not very different -
>>


>>   7 :   2.2209 seconds  Cython with parallel update
>>   8 :   1.1973 seconds  Cython vectorized
>>   9 :   1.2073 seconds  Vectorized Fortran (f2py)
>>  10 :   1.0501 seconds  Vectorized Fortran (f2py, intel)
>>

> What do you mean by Cython vectorized here?

It is bad naming. In "Cython vectorized" the arrays over which the
iterations run are declared in C style:

cdef double u[size_mat][size_mat], u1[size_mat][size_mat]

where size_mat is defined in a .h file with the line '#define size_mat
150'. I don't know how to do it within Cython!
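(Perhaps a DEF compile-time constant would do the same job inside the
.pyx itself -- a sketch, I haven't tried it:)

DEF size_mat = 150

def make_grid():
    # Fixed-size C arrays sized by the compile-time constant, no .h needed.
    cdef double u[size_mat][size_mat]
    cdef double u1[size_mat][size_mat]
    cdef int i, j
    for i in range(size_mat):
        for j in range(size_mat):
            u[i][j] = 0.0
            u1[i][j] = 0.0
    return u[0][0] + u1[0][0]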

> Does that involve the Intel compiler?
> Would gcc also work?
> Which compiler arguments do you use?

I compile Cython with gcc only. In fact, I use Sage for this job, like this:

sage -c "from sage.all import cython_create_local_so;
cython_create_local_so('_laplace.pyx')"

I hope nothing funny happens because of this.

> Are you now able to explain the Cython difference in performance  for the two
> machines?

No

Rajeev

Rajeev Singh

Sep 24, 2011, 4:12:19 AM
to cython...@googlegroups.com

Hi,

Cython compiled with Intel is the new winner on both machines:

machine 1
8 : 0.9249 seconds Cython vectorized
10 : 1.0121 seconds Vectorized Fortran (f2py, intel)
machine 2
7 : 0.1873 seconds Cython vectorized
9 : 0.2100 seconds Vectorized Fortran (f2py, intel)

The difference from before was that I used icc instead of gcc. The
compiler did emit some warnings, however.

Rajeev

Sturla Molden

Sep 24, 2011, 3:34:34 PM
to cython...@googlegroups.com
On 24.09.2011 02:47, Robert Bradshaw wrote:
> I wonder if the restrict keyword could help here (not yet supported in
> Cython, but could be added in C).

I actually added support for restrict and volatile qualifiers to
Cython's NumPy syntax a couple of years ago. That is, as qualifiers on the
buffer rather than as two new keywords in Cython.

Sturla


Dag Sverre Seljebotn

Sep 24, 2011, 3:40:16 PM
to cython...@googlegroups.com

Yes. I looked at your changes again this spring, but since there were
some problems (I don't remember exactly, but I think it broke our test
suite) they didn't get integrated at the time. If anybody wants to
continue on that, it is here:

https://github.com/dagss/cython/tree/restrict

That's based on Cython 0.11.3, and needs to be rebased on current trunk.

Dag Sverre

Sturla Molden

Sep 24, 2011, 4:01:35 PM
to cython...@googlegroups.com
On 24.09.2011 10:12, Rajeev Singh wrote:
> Cython compiled with intel is the new winner on both machines -
>

You need to show scalability, not just performance on fixed-size
datasets, as well as the mean and standard deviation over multiple runs.
Using the same algorithm for all compilers is mandatory, as is knowing
Cython's compile flags (or decorators) and using them correctly.

Comparing with Fortran is also misleading, because of the huge
differences between Fortran compilers.

Sturla

Alex van Houten

Sep 24, 2011, 4:38:38 PM
to cython...@googlegroups.com
Rajeev Singh <rajs2010 <at> gmail.com> writes:


> The difference from before was that I used icc instead of gcc. The
> compiler complained with some warnings however.

Do you mean gcc instead of icc?

Alex.

Alex van Houten

Sep 24, 2011, 4:44:26 PM
to cython...@googlegroups.com
Rajeev Singh <rajs2010 <at> gmail.com> writes:


> >>>   7 :   2.2209 seconds  Cython with parallel update
> >>>   8 :   1.1973 seconds  Cython vectorized
> >>>   9 :   1.2073 seconds  Vectorized Fortran (f2py)
> >>>  10 :   1.0501 seconds  Vectorized Fortran (f2py, intel)
> >>>
> >
> >> What do you mean by Cython vectorized here?
> > It is bad naming. In cython vectorized the arrays over which the
> > iterations run are declared in C style -
> >
> > cdef double u[size_mat][size_mat], u1[size_mat][size_mat]
> >
> > where size_mat is defined in a .h file with line '# define size_mat
> > 150'. I don't know how to do it within cython!

> Cython compiled with intel is the new winner on both machines -


>
> machine 1
> 8 : 0.9249 seconds Cython vectorized
> 10 : 1.0121 seconds Vectorized Fortran (f2py, intel)
> machine 2
> 7 : 0.1873 seconds Cython vectorized
> 9 : 0.2100 seconds Vectorized Fortran (f2py, intel)
>

What is it that I should conclude from this,
in terms of speed benefits, I mean?
Is it important to declare the arrays over which the iterations run
in C style instead of NumPy style?
That seems to buy you almost a factor of 2!
Correct?

Alex.


Sturla Molden

Sep 24, 2011, 6:49:25 PM
to cython...@googlegroups.com
On 24.09.2011 22:44, Alex van Houten wrote:
> Is it important to declare the arrays over which the iterations run in
> C style? Instead of Numpy style?

If you declare a NumPy array as C or Fortran contiguous, i.e. with
mode="c" or mode="fortran", and specify ndim=1 or ndim=2, it does not
differ in performance from a C array. Remember to use:

@cython.wraparound(False)
@cython.boundscheck(False)

To compete with plain C or Fortran performance-wise, we also need

@cython.cdivision(True)
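
Putting those together, something like this (a sketch of the pattern,
not your benchmark code):

cimport cython
cimport numpy as np
import numpy as np

@cython.wraparound(False)
@cython.boundscheck(False)
@cython.cdivision(True)
def cy_update_fast(np.ndarray[double, ndim=2, mode="c"] out,
                   double dx2, double dy2):
    cdef unsigned int i, j
    # .copy() returns a C-contiguous array, so it matches mode="c".
    cdef np.ndarray[double, ndim=2, mode="c"] u = out.copy()
    for i in range(1, u.shape[0] - 1):
        for j in range(1, u.shape[1] - 1):
            out[i, j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                         (u[i, j+1] + u[i, j-1]) * dx2) / (2.0 * (dx2 + dy2))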


Sturla


Alex van Houten

Sep 25, 2011, 1:59:29 AM
to cython...@googlegroups.com

My question was "Are speed benefits this large actually attainable?",
and Dag Sverre replied that SSE was the important factor.
Does that mean, after all the optimizations described above,
and after we have seen that we can make Cython run as fast as Fortran,
that SSE can be and now is applied when we compile?

I understood from a similar discussion on this forum
http://comments.gmane.org/gmane.comp.python.cython.user/4401
that it is not trivial to apply SSE when compiling Cython code, and Dag Sverre
replied there that Numexpr and Theano are the way to go if one wants to
use SSE:
http://permalink.gmane.org/gmane.comp.python.cython.user/4402

So I'm still confused, since neither Numexpr nor Theano has been part of the
present discussion.

Alex.


Rajeev Singh

Sep 25, 2011, 3:48:49 AM
to cython...@googlegroups.com
Hi,

> Do you mean gcc instead of icc?

No, I mean icc - the Intel C compiler.

> What is it that I should conclude from this?
> In terms of speed benefits I mean.

> Is it important to declare the arrays over which the iterations run


> in C style? Instead of Numpy style?

> That seems to buy you almost a factor 2!
> Correct?

I cannot say; it is a very machine-dependent thing. What are you
getting on your machine?

Rajeev

Rajeev Singh

Sep 25, 2011, 3:51:03 AM
to cython...@googlegroups.com
On Sun, Sep 25, 2011 at 4:19 AM, Sturla Molden <sturla...@yahoo.no> wrote:
> On 24.09.2011 22:44, Alex van Houten wrote:
>>
>> Is it important to declare the arrays over which the iterations run in C
>> style? Instead of Numpy style?
>

Hi,


> If you declare a NumPy array as C or Fortran contiguous, i.e. with mode="c"
> or mode="fortran", and specify ndim=1 or ndim=2, it does not differ in
> performance from a C array. Remember to use:
>
>   @cython.wraparound(False)
>   @cython.boundscheck(False)

We are using all this.

>
> To compete with plain C or Fortran performance wise, we also need
>
>   @cython.cdivision(True)
>

This one doesn't seem to make a difference.

Rajeev

Francesc Alted

Sep 25, 2011, 4:17:09 AM
to cython...@googlegroups.com
2011/9/25 Alex van Houten <sparr...@yahoo.com>:

> My question was "Are speed benefits this large actually attainable?".
> And Dag Sverre replied that SSE was the important factor.
> Does that mean, after all the optimizations described above,
> and after we have seen that we can make Cython run as fast as Fortran,
> that SSE can now and is now applied when we compile?

If you had read through this thread carefully, you would already have
discovered that things are not so easy. You cannot get the best
performance just by separately using SSE, good compilers, optimized
memory access, a specific programming language, a certain hardware
architecture or whatever. Getting the best performance is more a holistic
experience, where all the above elements count.

Also, depending on your problem, you may find that one or two of these
factors are the critical ones for your case. Identifying your
bottleneck is very important in order to get the best solution for
you. So questions like "Which is the best compiler?", "Does SSE
matter?" or "Which is the best platform?" are really nonsense.
Instead, what you should ask is: "I have this problem; which
techniques/tools should I use to solve it optimally?"

Specific benchmarks (like the one that has been discussed in this
thread) are only useful for evaluating techniques/tools in *this*
scenario, but unfortunately, there are many others where these
apparently 'optimal' techniques do not apply at all. The moral of all
of this is that we have to be modest and specific enough in our
requirements if we really want to get the most out of our systems.

--
Francesc Alted

Alex van Houten

Sep 25, 2011, 5:40:54 AM
to cython...@googlegroups.com
Francesc Alted <faltet <at> pytables.org> writes:

> Should you have read through this thread carefully, you should already
> discovered that things are not so easy. You cannot get best
> performance by just using separately SSE, good compilers, optimizing
> memory access, a specific programming language, a certain hardware
> architecture or whatever. Getting best performance is more a holistic
> experience, where all the above elements count.
>

Thanks, I'll try to remember that "holistic experience".
What I'll take away from this discussion is that for this particular problem
(solving the 2D Laplace equation) there is nothing fundamental
that makes Cython code significantly slower than, e.g.,
vectorized Fortran.
Would you agree with that?
Or can we not draw that conclusion yet?

Still curious if SSE was applied when compiling the Cython or Fortran code.

Alex.


Pauli Virtanen

Sep 25, 2011, 8:26:08 AM
to cython...@googlegroups.com
On Sun, 25 Sep 2011 09:40:54 +0000, Alex van Houten wrote:
[clip]
> Thanks, I'll try to remember that "holistic experience". What I'll take
> away from this discussion is that for this particular problem (solving
> the 2D Laplace equation) there is no fundamental limit to the speed of
> Cython code making it significantly slower than e.g. vectorized Fortran.
> Would you agree on that?

There's no reason why Cython would be slower than equivalent C code.
C vs. Fortran then depends very much on the compilers. More often than
not, Fortran compilers manage to be somewhat smarter than C compilers
when dealing with multidimensional arrays. How much this matters in
practice varies.

Remember also that there were confounding factors in the benchmark
results posted here, and the results are misleading if you are
not careful:

- some of the algorithms were in-place, some used a copy
- some of the algorithms had strided indexing, some contiguous arrays
- some had an additional branch inside the loop (due to cdivision)

The numbers you get do not directly answer the question
"language X is faster than Y".

> Still curious if SSE was applied when compiling the Cython or Fortran
> code.

Inspect the assembler output from gcc/icc to find out.

However, speed-wise, my guess would be that SSE does not make
much of a difference here. The computations are mostly multiplications
and additions, which should be pretty fast, so the bottleneck is probably
memory-to-CPU transfer speed, and the cache usage pattern.

--
Pauli Virtanen

Sturla Molden

Sep 25, 2011, 3:24:55 PM
to cython...@googlegroups.com
On 25.09.2011 11:40, Alex van Houten wrote:
> Still curious if SSE was applied when compiling the Cython or Fortran
> code. Alex.

At least the GCC vectorizer has a verbose mode.


Sturla

Alex van Houten

Sep 26, 2011, 8:10:06 AM
to cython...@googlegroups.com
Sturla Molden <sturlamolden <at> yahoo.no> writes:

> > Still curious if SSE was applied when compiling the Cython or Fortran
> > code. Alex.
>
> At least the GCC vectorizer has a verbose mode.

Ok, I took Rajeev's _laplace.pyx from here
http://dl.dropbox.com/u/5939736/laplace.zip
and I commented out the include statements at the top.

I used the "-ftree-vectorizer-verbose=5" flag as used by Sturla here:
http://www.mail-archive.com/cytho...@codespeak.net/msg07442.html
I added this to setup.py:
extra_compile_args=["-O3","-msse4","-ftree-vectorizer-verbose=5"]
With the original settings, i.e. without "-O3" or without "-msse4", there is no
vectorization, apparently.
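
For reference, the relevant part of my setup.py looks roughly like this
(simplified; the exact file in the zip may differ):

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

ext = Extension("_laplace", ["_laplace.pyx"],
                include_dirs=[numpy.get_include()],
                # Ask gcc to auto-vectorize and report what it did.
                extra_compile_args=["-O3", "-msse4",
                                    "-ftree-vectorizer-verbose=5"])

setup(cmdclass={"build_ext": build_ext}, ext_modules=[ext])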

If you run this, you get the message "LOOP VECTORIZED" 6 times.
Now, by commenting out chunks of code in _laplace.pyx I found that
cy_update does not allow for any vectorization, while the other functions
do, with the exception of cy_pointer.

It seems that there is a necessary condition on any array whose
elements appear on the right-hand side of the equation inside the loop:
in order for gcc to vectorize the loop, these arrays have to be declared
in C style, i.e. like this for some double-precision array u:


cdef double u[size_mat][size_mat]

Alex.

Robert Bradshaw

Sep 26, 2011, 2:02:04 PM
to cython...@googlegroups.com

So gcc can't vectorize loops where the size is not fixed at compile
time? This seems rather restrictive.

- Robert
