He was getting a speedup of a factor of 6.5 compared to
the slower Cython version, which would translate to a speedup
of a factor of 6.5*0.94/1.28 = 4.8 over "fast" Cython.
I was amazed by this result, although Travis did not reproduce
exactly this number.
I am now considering translating some of my Cython code to Fortran
and using f2py.
Can anyone comment on this?
Are speed benefits this large actually attainable?
Alex.
That the loop was detected to perform the same operation elementwise on
all elements, e.g., A = C + D in NumPy. This enables many
optimizations, most importantly SSE (google it), which gives a 2x speedup
for double precision and 4x for single precision.
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
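To illustrate the kind of elementwise operation meant here, a minimal Python sketch (the actual SSE speedup factors are machine- and compiler-dependent):

```python
import numpy as np

n = 1_000_000
c = np.arange(n, dtype=np.float64)
d = np.arange(n, dtype=np.float64)

# Elementwise A = C + D: one NumPy call instead of an explicit
# Python loop; the underlying C loop is the kind of code a
# compiler can map onto SSE instructions.
a = c + d

# The equivalent explicit loop (much slower in Python), shown on
# a small slice only, for comparison:
a_loop = np.empty(10)
for i in range(10):
    a_loop[i] = c[i] + d[i]

assert np.allclose(a[:10], a_loop)
```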
>
> He was getting a speedup of a factor of 6.5 compared to
> the slower Cython version, which would translate to a speedup
> of a factor of 6.5*0.94/1.28 = 4.8 over "fast" Cython.
> I was amazed by this result, although Travis did not reproduce
> exactly this number.
> I am now considering translating some of my Cython code to Fortran
> and using f2py.
> Can anyone comment on this?
> Are speed benefits this large actually attainable?
Yes, but only with vectorizable code. (You should Google that for more
info).
You should also try numexpr and Theano for vectorizable code, for
comparison; they are libraries that can be used from pure Python to get
impressive speedups in similar cases.
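For example, a numexpr call is a drop-in way to evaluate such an expression (a sketch; it assumes numexpr is installed, with a plain-NumPy fallback so it runs either way):

```python
import numpy as np

a = np.linspace(0.0, 1.0, 500_000)
b = np.linspace(1.0, 2.0, 500_000)

expected = 2.0 * a + 3.0 * b  # plain NumPy reference result

try:
    import numexpr as ne
    # numexpr compiles the string expression and evaluates it
    # blockwise, using SIMD and multiple threads internally.
    result = ne.evaluate("2.0 * a + 3.0 * b")
except ImportError:
    # Fallback so the sketch still runs without numexpr installed.
    result = expected

assert np.allclose(result, expected)
```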
Dag Sverre
>> I am now considering translating some of my Cython code to Fortran
>> and using f2py.
if you do do fortran, you might want to look at fwrap, rather than f2py:
http://fortrancython.wordpress.com/
I'm not exactly sure of its status, but it may be a better option than
f2py if you are already using Cython.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
> You should also try numexpr and Theano for vectorizable code, for
> comparison; they are libraries that can be used from pure Python to get
> impressive speedups in similar cases.
>
Hmmm. It seems hard to fix this. I use Cython's "nogil" to release the GIL in
Python threads, but I guess one cannot call Theano or numexpr within a "with
nogil" block.
Or is this still possible?
I was looking at work by James Bergstra
https://bitbucket.org/jaberg/theano-cython/overview
in particular at the basic subtensor example
https://bitbucket.org/jaberg/theano-cython/changeset/5cf01ca6b24c
He uses a decorator
"
from theano.gof import local_optimizer
.....
@register_specialize
@local_optimizer()
def replace_Subtensor_with_CythonSubtensor(node):
.....
"
Now would that also work like this:
"
@register_specialize
@local_optimizer()
def replace_Subtensor_with_CythonSubtensor(node):
....
with nogil:
number crunching
...
"
?
Alex
Well, I cannot speak for Theano, but at least when using numexpr, you
don't need to release the GIL explicitly, because the GIL is released
internally as soon as computation goes into C space. So you can
call numexpr in your Cython code without worrying about the GIL at
all.
--
Francesc Alted
Hi,
Let me comment on the benchmark mentioned. First of all, the
calculation done by Cython is not the same as that done by NumPy and
vectorized Fortran: NumPy and Fortran use a parallel update (Jacobi
iteration) while Cython uses a sequential update (Gauss-Seidel
iteration). The Cython code is not vectorized (in the SIMD sense) in
this study, which leads to a loss of a factor of about 2 to 3.
That being said, I benchmarked it again and the following are some relevant results:
3 : 0.3609 seconds Vectorized Fortran (pure, intel)
4 : 0.2883 seconds Vectorized C (pure, intel)
6 : 0.5839 seconds Cython with parallel update
9 : 0.2115 seconds Vectorized Fortran (f2py, intel)
10 : 2.4876 seconds NumPy
11 : 1.4110 seconds Cython
The difference between the best I could get with Cython and f2py with
the Intel compiler was a factor of 3. I have put the benchmark codes
here:
http://dl.dropbox.com/u/5939736/laplace.zip
If you have sage just type -
sage laplace.py
in the laplace directory and it will compile and run all the codes. If
you have the Intel compiler, it can be included in the benchmark by
editing one line in the file laplace.py (just set intel=True).
Rajeev
> Let me comment on the benchmark mentioned. First of all, the
> calculation done by Cython is not the same as that done by NumPy and
> vectorized Fortran: NumPy and Fortran use a parallel update (Jacobi
> iteration) while Cython uses a sequential update (Gauss-Seidel
> iteration). The Cython code is not vectorized (in the SIMD sense) in
> this study, which leads to a loss of a factor of about 2 to 3.
>
> That being said, I benchmarked it again and the following are some relevant results
>
> 3 : 0.3609 seconds Vectorized Fortran (pure, intel)
> 4 : 0.2883 seconds Vectorized C (pure, intel)
> 6 : 0.5839 seconds Cython with parallel update
> 9 : 0.2115 seconds Vectorized Fortran (f2py, intel)
> 10 : 2.4876 seconds NumPy
> 11 : 1.4110 seconds Cython
>
> The difference between the best I could get with Cython and f2py with
> intel compiler was a factor of 3. I have put the benchmark codes here
> -
>
> http://dl.dropbox.com/u/5939736/laplace.zip
>
> If you have sage just type -
>
> sage laplace.py
>
Thanks, great job. So the conclusion remains that Cython cannot keep up with C
or Fortran because of the vectorization (SIMD) issue. That is pretty serious. I
tried numexpr instead of Cython, but that added too much overhead; it actually
made things run slower. I might still try Theano.
What is the outcome if you use gnu instead of Intel? Can you still apply SSE
when compiling?
Btw, I could not reproduce your results, "sage laplace.py" gave
"gnu: no Fortran 90 compiler found
.....
don't know how to compile Fortran code on platform 'posix'"
but gfortran is installed (in /usr/bin)
I installed g95 from http://www.g95.org/ too, but that didn't make a difference.
Any clue?
Alex.
But there's no reason one couldn't write the "Vectorized C" version in
Cython and compile it with the intel compiler, right? This would seem
to be a fairer comparison (not to mention the difference in
algorithms).
- Robert
Hi,
@Alex - For me it worked with gfortran. Can you play around with the
Makefile and make it work?
@Robert Cloud - As one can see from the benchmark codes, I run each
simulation for a fixed number of steps. I get different answers for the
two algorithms, but to be fair I compare performance within the same
algorithm.
@Robert Bradshaw - In the list of benchmarks I have a vectorized C
version in Cython, which I have called "Cython with parallel update",
whose performance is still 3 times slower than the f2py vectorized version.
Rajeev
You also need
#cython: cdivision=True
as otherwise the inner loop contains a branch.
--
Pauli Virtanen
This statement is more or less equivalent to "C is slower than Fortran",
as the generated Cython code is pretty much what you would write in C.
(Provided you set boundscheck=False, cdivision=True, etc.)
C does not have vectorized statements, so it becomes the burden of the
compiler to detect the vectorizable loops and do the optimization (at
least gcc 4.5.1 usually does not manage to pull this off).
Also, "C is slower than Fortran" really depends on the problem.
For arrays that don't fit into the CPU cache, for example, the speed
difference can vanish completely because everything is waiting
for CPU <-> memory transfers.
Hi,
Division is an interesting point altogether. I discovered that, of all
the examples I have here, only the pure Fortran programs are unaffected
(I believe due to the compiler). All other methods become 2 to 3 times
faster if I replace division by multiplication by the inverse in the
innermost loop.
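The replacement in question is just hoisting the reciprocal out of the loop. A plain-Python sketch of the transformation (illustrative sizes and values only):

```python
import numpy as np

dx2, dy2 = 0.01, 0.01
u = np.random.rand(50, 50)
out_div = u.copy()
out_mul = u.copy()

# Version with a division inside the innermost loop:
for i in range(1, 49):
    for j in range(1, 49):
        out_div[i, j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                         (u[i, j+1] + u[i, j-1]) * dx2) / (2 * (dx2 + dy2))

# Same update with the division hoisted out of the loop as a
# multiplication by the precomputed inverse:
inv = 1.0 / (2 * (dx2 + dy2))
for i in range(1, 49):
    for j in range(1, 49):
        out_mul[i, j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                         (u[i, j+1] + u[i, j-1]) * dx2) * inv

assert np.allclose(out_div, out_mul)
```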
> #cython: cdivision=True
This doesn't affect the performance.
"C slower than fortran" is not true for us as we have examples in pure
C and pure fortran and the performance is below -
0 : 0.8382 seconds Vectorized Fortran (pure)
1 : 0.4082 seconds Vectorized C (pure)
2 : 0.3880 seconds Vectorized C (pure, dyanamic array)
3 : 0.3442 seconds Vectorized Fortran (pure, intel)
4 : 0.2780 seconds Vectorized C (pure, intel)
5 : 0.3081 seconds Vectorized C (pure, dyanamic array, intel)
Rajeev
Hi,
I would like to make another observation. On a different machine the
performance of the various methods is not very different:
7 : 2.2209 seconds Cython with parallel update
8 : 1.1973 seconds Cython vectorized
9 : 1.2073 seconds Vectorized Fortran (f2py)
10 : 1.0501 seconds Vectorized Fortran (f2py, intel)
I don't know why this happens.
Rajeev
Hmm, true. For some reason the cdivision directive doesn't work as
a comment. However, adding
@cython.cdivision(True)
to the routine does, and after that I get identical performance
to the "Vectorized C" code.
After that, we're down to "Fortran is faster than C", or not :)
--
Pauli Virtanen
@cython.cdivision(True)
def cy_update(np.ndarray[double, ndim=2] out, double dx2, double dy2):
    cdef unsigned int i, j
    cdef np.ndarray[double, ndim=2] u
    u = out.copy()
    for i in xrange(1, u.shape[0]-1):
        for j in xrange(1, u.shape[1]-1):
            out[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                        (u[i, j+1] + u[i, j-1]) * dx2) * (1.0 / (2*(dx2+dy2)))
Note that memory access pattern matters a lot: if I write "u = out"
instead of "u = out.copy()", the time taken doubles.
My guess is that this is a CPU cache prediction issue. In any case,
it has nothing to do with Cython: this also explains the difference
between "Looped fortran" and "Vectorized fortran".
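For reference, the same Jacobi update can be written in pure NumPy with slicing, which is handy for cross-checking the numerics (a sketch with illustrative parameter values):

```python
import numpy as np

def np_update(out, dx2, dy2):
    # Parallel (Jacobi) update: read all neighbors from a copy,
    # then write the interior of out, matching cy_update above.
    u = out.copy()
    out[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                       (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) * (1.0 / (2 * (dx2 + dy2)))

u = np.zeros((10, 10))
u[0, :] = 1.0          # a simple boundary condition for the demo
np_update(u, 0.01, 0.01)
# The point next to the hot boundary picks up (0+1)*dy2 * 1/(2*(dx2+dy2)) = 0.25
assert np.isclose(u[1, 1], 0.25)
```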
0 : 0.1067 seconds Cython
1 : 0.0711 seconds Vectorized C
2 : 0.0506 seconds Vectorized Fortran
3 : 0.1328 seconds Looped Fortran (in-place)
4 : 0.0585 seconds Looped Fortran (copy)
The remaining difference between C and Cython comes from strided
indexing of Numpy arrays -- the C compiler does not know that the
array is C-contiguous. Finally, Fortran is slightly faster than C.
There's a factor-of-2 difference between more or less straightforward
Cython code and the equivalent (!) Fortran code.
However, as seen above, even in Fortran you can lose factors of
2 and more quite easily for completely counterintuitive reasons ---
which probably have to do with the memory architecture of your
machine.
(Also, it becomes pretty clear that making fair benchmarks is
quite difficult.)
--
Pauli Virtanen
This can be specified: add mode="c" to your array declaration.
> Finally, Fortran is slightly faster than C.
I wonder if the restrict keyword could help here (not yet supported in
Cython, but could be added in C).
> I would like to make another observation. On a different machine the
> performance is not very different -
>
> 7 : 2.2209 seconds Cython with parallel update
> 8 : 1.1973 seconds Cython vectorized
> 9 : 1.2073 seconds Vectorized Fortran (f2py)
> 10 : 1.0501 seconds Vectorized Fortran (f2py, intel)
>
What do you mean by Cython vectorized here?
Does that involve the Intel compiler?
Would gcc also work?
Which compiler arguments do you use?
Are you now able to explain the Cython difference in performance for the two
machines?
Alex
>> 7 : 2.2209 seconds Cython with parallel update
>> 8 : 1.1973 seconds Cython vectorized
>> 9 : 1.2073 seconds Vectorized Fortran (f2py)
>> 10 : 1.0501 seconds Vectorized Fortran (f2py, intel)
>>
> What do you mean by Cython vectorized here?
It is bad naming. In "Cython vectorized" the arrays over which the
iterations run are declared in C style:
cdef double u[size_mat][size_mat], u1[size_mat][size_mat]
where size_mat is defined in a .h file with the line '#define size_mat
150'. I don't know how to do that within Cython!
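For what it's worth, Cython has compile-time constants via the DEF statement, which might allow the same declaration without the .h file. An untested sketch:

```cython
# Untested sketch: DEF defines a compile-time constant that Cython
# substitutes before generating C, so no .h file is needed.
DEF SIZE_MAT = 150

def work():
    # Stack-allocated C arrays with a compile-time size, equivalent
    # to the '#define size_mat 150' approach described above:
    cdef double u[SIZE_MAT][SIZE_MAT]
    cdef double u1[SIZE_MAT][SIZE_MAT]
    u[0][0] = 1.0
    u1[0][0] = u[0][0]
```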
> Does that involve the Intel compiler?
> Would gcc also work?
> Which compiler arguments do you use?
I compile the Cython code with gcc only. In fact I use Sage for this job, as:
sage -c "from sage.all import cython_create_local_so;
cython_create_local_so('_laplace.pyx')"
I hope nothing funny happens because of this.
> Are you now able to explain the Cython difference in performance for the two
> machines?
No
Rajeev
Hi,
Cython compiled with Intel is the new winner on both machines:
machine 1
8 : 0.9249 seconds Cython vectorized
10 : 1.0121 seconds Vectorized Fortran (f2py, intel)
machine 2
7 : 0.1873 seconds Cython vectorized
9 : 0.2100 seconds Vectorized Fortran (f2py, intel)
The difference from before is that I used icc instead of gcc. The
compiler did emit some warnings, however.
Rajeev
I actually added support for restrict and volatile qualifiers to
Cython's NumPy buffer syntax a couple of years ago. That is, qualifiers
on the buffer instead of two new keywords in Cython.
Sturla
Yes. I looked at your changes again this spring, but since there were
some problems (I don't remember exactly, but I think it broke our test
suite) they didn't get integrated at the time. If anybody wants to
continue on that, it is here:
https://github.com/dagss/cython/tree/restrict
That is based on Cython 0.11.3 and needs to be rebased on current trunk.
Dag Sverre
You need to show scalability, not just performance on fixed-sized
datasets, as well as mean and standard deviation for multiple runs.
Using the same algorithm for all compilers is mandatory. As is knowing
Cython's compile flags (or decorators) and using them correctly.
Comparing with Fortran is also misleading, because of the huge
difference between Fortran compilers.
Sturla
> The difference from before was that I used icc instead of gcc. The
> compiler complained with some warnings however.
Do you mean gcc instead of icc?
Alex.
> >>> 7 : 2.2209 seconds Cython with parallel update
> >>> 8 : 1.1973 seconds Cython vectorized
> >>> 9 : 1.2073 seconds Vectorized Fortran (f2py)
> >>> 10 : 1.0501 seconds Vectorized Fortran (f2py, intel)
> >>>
> >
> >> What do you mean by Cython vectorized here?
> > It is bad naming. In "Cython vectorized" the arrays over which the
> > iterations run are declared in C style:
> >
> > cdef double u[size_mat][size_mat], u1[size_mat][size_mat]
> >
> > where size_mat is defined in a .h file with the line '#define size_mat
> > 150'. I don't know how to do that within Cython!
> Cython compiled with intel is the new winner on both machines -
>
> machine 1
> 8 : 0.9249 seconds Cython vectorized
> 10 : 1.0121 seconds Vectorized Fortran (f2py, intel)
> machine 2
> 7 : 0.1873 seconds Cython vectorized
> 9 : 0.2100 seconds Vectorized Fortran (f2py, intel)
>
What should I conclude from this, in terms of speed benefits?
Is it important to declare the arrays over which the iterations run
in C style instead of NumPy style?
That seems to buy you almost a factor of 2!
Correct?
Alex.
If you declare a NumPy array as C or Fortran contiguous, i.e. with
mode="c" or mode="fortran", and specify ndim=1 or ndim=2, it does not
differ in performance from a C array. Remember to use:
@cython.wraparound(False)
@cython.boundscheck(False)
To compete with plain C or Fortran performance wise, we also need
@cython.cdivision(True)
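Put together, the declarations described here might look like this (an untested sketch; the function name and body are just placeholders):

```cython
cimport cython
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def update(np.ndarray[double, ndim=2, mode="c"] u not None,
           double dx2, double dy2):
    # mode="c" tells Cython the buffer is C-contiguous, so indexing
    # compiles down to plain pointer arithmetic, as with a C array.
    cdef Py_ssize_t i, j
    for i in range(1, u.shape[0] - 1):
        for j in range(1, u.shape[1] - 1):
            u[i, j] = (u[i, j + 1] + u[i, j - 1]) / 2.0
```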
Sturla
My question was "Are speed benefits this large actually attainable?".
And Dag Sverre replied that SSE was the important factor.
Does that mean, after all the optimizations described above,
and after we have seen that we can make Cython run as fast as Fortran,
that SSE can be, and now is, applied when we compile?
I understood from a similar discussion on this forum
http://comments.gmane.org/gmane.comp.python.cython.user/4401
that it is not trivial to apply SSE to compiling Cython code and Dag Sverre
there replied that Numexpr and Theano are the way to go if one wants to
use SSE:
http://permalink.gmane.org/gmane.comp.python.cython.user/4402
So I'm still confused, since neither numexpr nor Theano has been part
of the present discussion.
Alex.
> Do you mean gcc instead of icc?
No I mean icc - the intel c compiler.
> What is it that I should conclude from this?
> In terms of speed benefits I mean.
> Is it important to declare the arrays over which the iterations run
> in C style? Instead of Numpy style?
> That seems to buy you almost a factor 2!
> Correct?
I cannot say; it is a very machine-dependent thing. What are you
getting on your machine?
Rajeev
Hi,
> If you declare a NumPy array as C or Fortran contiguous, i.e. with mode="c"
> or mode="fortran", and specify ndim=1 or ndim=2, it does not differ in
> performance from a C array. Remember to use:
>
> @cython.wraparound(False)
> @cython.boundscheck(False)
We are using all this.
>
> To compete with plain C or Fortran performance wise, we also need
>
> @cython.cdivision(True)
>
This one doesn't seem to make a difference.
Rajeev
If you have read through this thread carefully, you should already
have discovered that things are not so easy. You cannot get the best
performance just by separately using SSE, good compilers, optimized
memory access, a specific programming language, a certain hardware
architecture, or whatever. Getting the best performance is more of a
holistic experience, where all the above elements count.
Also, depending on your problem, you may find that one or two of these
factors are the critical ones for your case. Identifying your
bottleneck is very important in order to find the best solution for
you. So questions like "Which is the best compiler?", "Does SSE
matter?" or "Which is the best platform?" make no sense in isolation.
Instead, what you should ask is: "I have this problem; which
techniques/tools should I use to solve it optimally?"
Specific benchmarks (like the one discussed in this thread) are only
useful for evaluating techniques/tools in *this* scenario, but
unfortunately there are many other scenarios where these apparently
'optimal' techniques do not apply at all. The moral of all of this is
that we have to be modest and specific enough in our requirements if we
really want to get the most out of our systems.
--
Francesc Alted
> If you have read through this thread carefully, you should already
> have discovered that things are not so easy. You cannot get the best
> performance just by separately using SSE, good compilers, optimized
> memory access, a specific programming language, a certain hardware
> architecture, or whatever. Getting the best performance is more of a
> holistic experience, where all the above elements count.
>
Thanks, I'll try to remember that "holistic experience".
What I'll take away from this discussion is that for this particular problem
(solving the 2D Laplace equation) there is nothing fundamental
that makes Cython code significantly slower than e.g.
vectorized Fortran.
Would you agree on that?
Or can we not draw that conclusion yet?
Still curious if SSE was applied when compiling the Cython or Fortran code.
Alex.
There's no reason why Cython would be slower than equivalent C code.
C vs. Fortran then depends very much on the compilers. More often than
not, Fortran compilers manage to be somewhat smarter than C compilers
when dealing with multidimensional arrays. How much this matters in
practice, varies.
Remember also that there were confounding factors in the benchmark
results posted here, and the results are misleading if you are
not careful:
- some of the algorithms were in-place, some used a copy
- some of the algorithms had strided indexing, some contiguous arrays
- some had an additional branch inside the loop (due to cdivision)
The numbers you get do not directly answer the question
"is language X faster than language Y".
> Still curious if SSE was applied when compiling the Cython or Fortran
> code.
Inspect the assembler output from gcc/icc to find out.
However, speed-wise, my guess would be that SSE does not make
much of a difference here. The computations are mostly multiplications
and additions, which should be pretty fast, so the bottleneck is probably
memory-to-CPU transfer speed, and the cache usage pattern.
--
Pauli Virtanen
At least the GCC vectorizer has a verbose mode.
Sturla
> > Still curious if SSE was applied when compiling the Cython or Fortran
> > code. Alex.
>
> At least the GCC vectorizer has a verbose mode.
Ok, I took Rajeev's _laplace.pyx from here
http://dl.dropbox.com/u/5939736/laplace.zip
and I commented out the include statements at the top.
I used the "-ftree-vectorizer-verbose=5" as used by Sturla, here:
http://www.mail-archive.com/cytho...@codespeak.net/msg07442.html
I added this to setup.py:
extra_compile_args=["-O3","-msse4","-ftree-vectorizer-verbose=5"]
In its original setting, without "-O3" or without "-msse4", there is
apparently no vectorization.
If you run this, you get the message "LOOP VECTORIZED" 6 times.
Now, by commenting out chunks of code in _laplace.pyx, I found that
cy_update does not allow for any vectorization, while the other modules
do, with the exception of cy_pointer.
It seems there is a necessary condition on any array whose elements
appear on the right-hand side of the assignment inside the loop: for
gcc to vectorize the loop, these arrays have to be declared in C style.
So like this, for some double-precision array u:
cdef double u[size_mat][size_mat]
Alex.
So gcc can't vectorize loops where the size is not fixed at compile
time? This seems rather restrictive.
- Robert