http://gcc.gnu.org/projects/tree-ssa/vectorization.html
example15: condition in nested loop:
for (j = 0; j < M; j++)
{
  x = x_in[j];
  curr_a = a[0];
  for (i = 0; i < N; i++)
  {
    next_a = a[i+1];
    curr_a = x > c[i] ? curr_a : next_a;
  }
  x_out[j] = curr_a;
}
I am having a hard time vectorizing an equivalent loop in Cython.
Anyone?
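For what it's worth, my Cython attempt looks roughly like this (a simplified
sketch, not my actual code; the buffer types are just an example):

import numpy as np
cimport numpy as np

def example15(np.ndarray[np.float64_t, ndim=1] x_in,
              np.ndarray[np.float64_t, ndim=1] c,
              np.ndarray[np.float64_t, ndim=1] a,
              np.ndarray[np.float64_t, ndim=1] x_out):
    # Same structure as the C loop above: c has N elements, a has N+1,
    # x_in and x_out have M.
    cdef int i, j
    cdef int M = x_in.shape[0]
    cdef int N = c.shape[0]
    cdef double x, curr_a, next_a
    for j in range(M):
        x = x_in[j]
        curr_a = a[0]
        for i in range(N):
            next_a = a[i + 1]
            # the conditional assignment that gcc won't vectorize for me
            curr_a = curr_a if x > c[i] else next_a
        x_out[j] = curr_a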
Alex.
That's not really possible currently. I don't know why you seem so
intent on SIMD though, but have a look here:
http://wiki.cython.org/enhancements/simd . If you want to implement
that, raise an issue on the cython-dev mailing list to discuss the
details and implementation.
> That's not really possible currently. I don't know why you seem so
> intent on SIMD though,
For speed benefits, obviously. Linking with Intel's svml made it possible to
vectorize a Cython loop with a couple of transcendental functions, which gave
a factor-2 speed improvement.
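The kind of loop I mean is roughly this (a simplified sketch, not the real
code), compiled with gcc using -O3 -ffast-math -mveclibabi=svml and linked
against Intel's svml:

cimport cython
import numpy as np
cimport numpy as np
from libc.math cimport exp, log

@cython.boundscheck(False)
@cython.wraparound(False)
def transcend(np.ndarray[np.float64_t, ndim=1, mode='c'] u,
              np.ndarray[np.float64_t, ndim=1, mode='c'] v):
    # with -mveclibabi=svml and -ffast-math, gcc can replace the exp/log
    # calls in the generated C loop with calls into Intel's vector math
    # library
    cdef int i
    cdef int n = u.shape[0]
    for i in range(n):
        v[i] = exp(u[i]) * log(1.0 + u[i])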
> but have a look here:
> http://wiki.cython.org/enhancements/simd . If you want to implement
> that, raise an issue on the cython-dev mailing list to discuss the
> details and implementation.
Vectorizing an inner loop containing an if statement is what I'm after, but
that won't buy me another factor of 2. There are other, somewhat slower
solutions that involve temp arrays.
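Something along these lines, for example, using a temporary mask array (a
rough numpy sketch, not the code I actually use):

import numpy as np

def example15_temparrays(x_in, c, a):
    # Vectorized rewrite of the loop above with temp arrays instead of an
    # if inside the inner loop. The scalar loop keeps curr_a while
    # x > c[i] and otherwise overwrites it with a[i+1], so the survivor
    # is a[k+1] with k the last index where x <= c[i], or a[0] if the
    # condition never fails.
    fail = x_in[:, None] <= c[None, :]               # (M, N) temporary
    # index of the last failing i per row (meaningless when no i fails,
    # but np.where discards it in that case)
    last = (len(c) - 1) - np.argmax(fail[:, ::-1], axis=1)
    return np.where(fail.any(axis=1), a[last + 1], a[0])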
I'll have a look at it. CEP 517 looks VERY interesting.
Alex.
I get that. I'm just saying that I'm sceptical of your desperate need
for this performance. If performance was that desperately needed,
you'd likely be coding SSE by hand or using Fortran with vectorized
operations.
It would of course be great to get this functionality in Cython, it's
just not there at the moment and apparently (as you have seen)
compilers have a hard time auto-vectorizing the loops Cython
generates. There's just nothing that a cython user can do to fix this,
unless said user fixes Cython :).
The easiest thing is really to use your already working vectorized C
code and just link it in and call it. There is of course no guarantee
that other users compiling the same code will also get it vectorized
with their compiler, platform and compiler version. Depending on what
you want to do, that might just be enough.
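To sketch what I mean (made-up names; it assumes you already have a working,
vectorized C routine with a header):

import numpy as np
cimport numpy as np

# my_kernel.h / my_kernel are placeholders for your existing C code
cdef extern from "my_kernel.h":
    void my_kernel(float *data, int n)

def run_kernel(np.ndarray[np.float32_t, ndim=1, mode='c'] data):
    # hand the raw buffer to the C routine, compiled by whatever
    # compiler vectorizes it for you
    my_kernel(<float*>data.data, <int>data.shape[0])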
>
> On 13 October 2011 06:27, Alex van Houten <sparrow2867 <at> yahoo.com> wrote:
> > mark florisson <markflorisson88 <at> gmail.com> writes:
> >
> >> That's not really possible currently. I don't know why you seem so
> >> intent on SIMD though,
> > For speed benefits, obviously. Linking with Intel's svml made it possible to
> > vectorize a Cython loop with a couple of transcendental functions, which gave
> > a factor-2 speed improvement.
>
> I get that. I'm just saying that I'm sceptical of your desperate need
> for this performance. If performance was that desperately needed,
> you'd likely be coding SSE by hand or using Fortran with vectorized
> operations.
>
Let me try to explain what I'm trying to achieve. My company develops software
for medical applications which will run on Windows. Speed is a VERY important
issue since datasets are large and physicians want test results fast. My
colleagues have been developing C++ code for decades. They compile
using Intel compilers.
As part of improving the readability and maintainability of code I have been
advocating Python. My colleagues seem reluctant to develop new code in Python,
but are willing to do so if there are no speed drawbacks. So we designed a
benchmark for a typical computation which involves the processing of a 3D
volume, including a variety of calculations such as transcendental functions,
convolution and interpolation.
So the speed of the Python/Cython code (compiled with gcc) is compared to
equivalent C++ code compiled with icl (Intel).
At first the C++ code was significantly faster because of SIMD, but after I
managed to vectorize a pivotal Cython loop containing transcendental functions
(by linking against Intel's svml), the Python/Cython code was (slightly) faster.
So getting SIMD to work on Cython code was REALLY important to me.
What I'm trying to get across here is that Cython will always be benchmarked
against older compiled languages (Fortran, C, C++) because it is relatively new
and relatively unknown. In order to build a larger user community, Cython will
have to keep proving that it is fast; it cannot afford to be a factor of 2 or
more slower than other compiled languages because of SIMD and vectorization
issues. This is pivotal!
Perhaps I should copy/paste all this on the Cython-dev forum.
Alex.
I see. I agree that it would be great to be as fast as (SIMD) Fortran,
C or C++. However, you have to realize that Cython is an open source
project with only a handful of regular contributors who are doing this
in their free time, whenever they feel like it. So for us it's not
really a necessity, but more something that would be great to have.
Cython code can already be up to 800 times as fast as regular Python, and
that's mostly what we compare to. I'm not sure the goal is to make these
lower-level languages obsolete, although I think it would be great if we
could get there.
If speed is really very important and if it's possible (e.g. you have
no data dependencies between iterations) you could also try to utilize
multiple cores using cython.parallel.prange.
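For example, something like this (an untested sketch with made-up names; it
assumes the iterations are independent and that you compile and link with
-fopenmp):

cimport cython
from cython.parallel import prange
import numpy as np
cimport numpy as np
from libc.math cimport exp

@cython.boundscheck(False)
@cython.wraparound(False)
def compute_exp_parallel(np.ndarray[np.float32_t, ndim=2, mode='c'] u,
                         np.ndarray[np.float32_t, ndim=2, mode='c'] v):
    cdef int i, j
    cdef int nx = u.shape[0]
    cdef int ny = u.shape[1]
    # each row is independent, so the outer loop can run on multiple cores
    for i in prange(nx, nogil=True):
        for j in range(ny):
            v[i, j] = exp(u[i, j])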
Numpy itself and various other things (like Theano, I've heard), pyopencl,
etc. might also give you tremendous speedups, depending on your code and on
the available hardware. And really, you don't have to use Cython for
everything; it's easy enough to write the really performance-critical parts
in another language and do the less performance-critical stuff in Python and
Cython. It's really just a tool; it's not supposed to solve every problem.
Personally, I mostly compare against Python for untyped code, but
compare it against C for typed code. I would certainly say one of our
goals is that users not be forced to use C/C++/Fortran/assembly for
performance reasons alone, i.e. it should be close enough where it
matters that performance is a non-issue (for those willing to put in
the work on both sides). A factor of 2 is good enough for most people
(vs the costs of using a lower-level language), but certainly not all,
and it really hurts for "marketing" even if that last x% doesn't
really matter. Thanks, Alex, for sharing where you're coming from.
Emitting C code amenable to auto-vectorization is certainly a worthy
goal, but as mentioned we are just a handful of volunteers with lots
of other constraints on our time. Until someone (you? :-) sits down
and figures out how to do it, a big selling point can be that Cython
also makes it trivial to wrap C routines if your inner loops aren't
fast enough (try it first, it might be :-) without having to inflict
the burdens of lower level languages on the rest of your application.
The existence of a lifeline back to legacy code (systems, whatever) is
often an important factor in getting people to make the jump
(especially to a less-proven platform) even if the majority of the
people end up so happy in the new world that they never use it.
- Robert
> Emitting C code amenable to auto-vectorization is certainly a worthy
> goal, but as mentioned we are just a handful of volunteers with lots
> of other constraints on our time. Until someone (you? :-) sits down
> and figures out how to do it,
I'd love to, seriously, if my life were less hectic than it presently is.
> a big selling point can be that Cython
> also makes it trivial to wrap C routines if your inner loops aren't
> fast enough (try it first, it might be :-) without having to inflict
> the burdens of lower level languages on the rest of your application.
> The existence of a lifeline back to legacy code (systems, whatever) is
> often an important factor in getting people to make the jump
> (especially to a less-proven platform) even if the majority of the
> people end up so happy in the new world that they never use it.
>
That is very convenient indeed. I googled a bit and found an example that
uses Cython to expose Numpy arrays to a pure C++ program.
http://stackoverflow.com/questions/3046305/simple-wrapping-of-c-code-with-cython
It might be worth including such an example in the Cython docs, preferably with
pure C instead of C++.
This indeed reduces the need to adjust the Cython compiler to generate
C code that auto-vectorizes easily.
Alex.
Yes, we welcome contributions :).
> This indeed reduces the need to adjust the Cython compiler to generate
> C code that auto-vectorizes easily.
I'd still like to generate auto-vectorization-friendly C code, but with
this approach it's at least not a blocker until someone finds the time
to do it.
- Robert
> > It might be worth including such an example in the Cython
> > docs,
> > preferably with pure C instead of C++.
>
> Yes, we welcome contributions :).
>
> - Robert
Here is a working example with a pure C loop containing a condition,
i.e. the kind of loop that won't vectorize in Cython. I got some help
from Mark Florisson to make it work. The C file is called "cmodule.c".
It is a stupid example, but it shows how it works.
"
#include "math.h"
void compute_exp(float u[256][256], int sizex, int sizey,
float v[256][256])
{ int i, j;
for (i=0; i<sizex; i++)
for (j=0; j<sizey; j++)
/* insert condition here.*/
if (i>7)
{
v[i][j] = exp(u[i][j]);}
}
"
This Cython snippet "call_cfunction.pyx" provides the
interface between C and Python
"
import numpy as np
cimport numpy as np

# Declare the external C function
cdef extern from "cmodule.c":
    void compute_exp(float **u, int sizex, int sizey,
                     float **v)

def expon_twoD(np.ndarray[np.float32_t, ndim=2, mode='c'] A,
               np.ndarray[np.float32_t, ndim=2, mode='c'] B):
    cdef int sizex = A.shape[0]
    cdef int sizey = A.shape[1]
    assert sizex == B.shape[0]
    assert sizey == B.shape[1]
    # Pass the raw data pointers; the C function treats them as the
    # start of the 256x256 arrays.
    cdef np.float32_t** Aptr = <np.float32_t**>A.data
    cdef np.float32_t** Bptr = <np.float32_t**>B.data
    compute_exp(Aptr, sizex, sizey, Bptr)
"
This is the "setup.py" for
"python setup.py build_ext --inplace":
"
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# I added "include_dirs=C:\Program Files\Paive\Lib\site-packages\numpy\core\include"
# to my distutils.cfg.
ext_modules = [Extension(
    name="call_cfunction",
    sources=["call_cfunction.pyx"],
    extra_compile_args=["-O3", "-std=c99", "-ffast-math",
                        "-mveclibabi=svml", "-march=native", "-msse",
                        "-ftree-vectorizer-verbose=5"],
    extra_link_args=["-LC:\Program Files\Intel\Composer XE\compiler\lib\ia32",
                     "-lsvml_dispmt", "-llibirc"]
)]

setup(
    name = 'call_cfunction',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)
"
And finally the Python script for validation
"test_cfunction.py":
"
import numpy as np
from call_cfunction import expon_twoD

input = np.array(np.random.rand(256, 256), 'f', order='C')
check = np.exp(input)
output = np.zeros((256, 256), 'f', order='C')
expon_twoD(input, output)
print 'This should be zero:', \
    np.sum(np.abs(output[8:, :])) - np.sum(np.abs(check[8:, :]))
"
Alex
> Here is a working example with a pure C loop containing a condition,
> i.e. the kind of loop that won't vectorize in Cython. I got some help
> from Mark Florisson to make it work. The C file is called "cmodule.c".
> It is a stupid example, but it shows how it works.
> "
> #include "math.h"
>
> void compute_exp(float u[256][256], int sizex, int sizey,
>                  float v[256][256])
> {
>     int i, j;
>     for (i = 0; i < sizex; i++)
>         for (j = 0; j < sizey; j++)
>             /* insert condition here. */
>             if (i > 7)
>                 v[i][j] = exp(u[i][j]);
> }
> "
I just noticed that you get a crash when running this on Windows machines
with a Nehalem type of processor. It is a gcc bug described here:
http://sourceforge.net/tracker/index.php?func=detail&aid=3098320&group_id=202880&atid=983354
The advantage of pure C code over Cython code on Windows machines is that
you can easily compile it with an Intel compiler as well.
The Intel compiler obviously does not have this bug.
So you proceed by having the Intel compiler generate a separate
object file. Compile with:
icl -c cmodule.c -Ipath_to_include_dir_with_math.h
which produces cmodule.obj.
Now adjust your setup.py to include the object file generated by the
Intel compiler:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

ext_modules = [Extension(
    name="call_cfunction",
    sources=["call_cfunction.pyx"],
    extra_objects=["cmodule.obj"],
    extra_compile_args=["-O3", "-std=c99", "-ffast-math", "-mveclibabi=svml",
                        "-march=native", "-msse", "-ftree-vectorizer-verbose=5"],
    extra_link_args=["-LC:\Program Files\Intel\Composer XE\compiler\lib\ia32",
                     "-lsvml_dispmt", "-llibirc"],
    include_dirs=[numpy.get_include()]
)]

setup(
    name = 'call_cfunction',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)
Works fine! Apparently gcc-compiled and icl-compiled object files link together without problems.
Alex.
Apparently one has to add a header file cmodule.h containing the declaration
"
void compute_exp(float u[256][256], int sizex, int sizey,
                 float v[256][256]);
"
to get this working, add
#include "cmodule.h"
to cmodule.c,
comment out these lines in call_cfunction.pyx:
cdef extern from "cmodule.c":
    void compute_exp(float **u, int sizex, int sizey,
                     float **v)
and insert these instead:
"
cdef extern from "cmodule.h":
    void compute_exp(float **u, int sizex, int sizey,
                     float **v)
"
Alex.