[cython-users] Cython slower than python with numpy


neurino

unread,
May 3, 2010, 4:01:01 PM5/3/10
to cython-users
I've written a little Python/NumPy sudoku solver and ported it to
Cython, both to make the code faster and to get familiar with the tool.

I've applied the basic techniques from the Cython docs, such as early
static typing and cimporting numpy.

I know the .pyx file can be optimized much further, but I didn't think
it would end up slower than the standard Python code.

I timed it with IPython's timeit, using the _sudoku module (sudoku.py,
standard Python code) and the sudoku module (sudoku.pyd, Cython compiled).

Here are the results for solving one grid (named "hardest"):

# Standard Python
In [5]: timeit -n10 _sudoku.PySudoku(hardest)
10 loops, best of 3: 165 ms per loop

# Cython code
In [6]: timeit -n10 sudoku.PySudoku(hardest)
10 loops, best of 3: 195 ms per loop

Sources can be found here: http://code.google.com/p/pysu/downloads/list

Thanks for the support

SevenThunders

unread,
May 6, 2010, 2:58:40 PM5/6/10
to cython-users
Hmm, just a quick glance at your sudoku.pyx file shows that it's still
mostly pure Python code. Your classes are initialized with __init__
instead of __cinit__.
When you declare something like

cdef np.ndarray grid, solved

the generated code will still be mostly Python/NumPy calls. You need to
declare the dimensions and types too.

Something like:

np.ndarray[np.float64_t, ndim=2] grid

See for example
http://docs.cython.org/src/tutorial/numpy.html

Make sure all your loop indices are declared as integer types via
cdef. Also, I notice constructs like this for your loops:

for row, col in np.ndindex(self.size, self.size):

My guess is that this is not optimized very well, if at all. I think
only certain loop constructs are optimized. Something like

for row in range(maxrows):
    for col in range(maxcols):

might work better.

Also array slicing is not optimized per se, though things may have
improved. The experts can guide you better.
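
Putting those pieces together, a minimal sketch of the kind of typed
function described above (names like count_filled and DTYPE_t are
illustrative, not taken from neurino's code) might look like this:

```cython
cimport numpy as np
import numpy as np

ctypedef np.int_t DTYPE_t

def count_filled(np.ndarray[DTYPE_t, ndim=2] grid):
    # Because the dtype and ndim are declared, grid[r, c] compiles to a
    # direct C array access instead of a Python __getitem__ call.
    cdef int r, c, filled = 0
    for r in range(grid.shape[0]):
        for c in range(grid.shape[1]):
            if grid[r, c] != 0:
                filled += 1
    return filled
```

With the loop indices declared as cdef ints, the range() loops are
translated into plain C for loops.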

neurino

unread,
May 6, 2010, 5:02:55 PM5/6/10
to cython...@googlegroups.com
You're right,

I know most things can be optimized. np.ndindex is written in pure Python, so I was thinking about replacing it with nested xranges, even though I'd lose the convenience of exiting the loop with a simple "break".
Haven't tried __cinit__ though.

I also tried declaring dimensions and types, but the pure Python version was still faster (the typing gained me just a few milliseconds).

All indices (in fact, all variables) are cdef-declared; I started right from the link you posted.

About the convenient slicing through NumPy (getting all the numbers in a 3x3 block with a single slice is great): if I abandon that too just to break even with pure Python code, then why should I use NumPy at all? I'd use simple arrays. And why should I use Python? I'd use C directly.

Anyway, I'm here to learn something, and any advice is much appreciated. Thanks!

Greetings



2010/5/6 SevenThunders <matt...@earthlink.net>

SevenThunders

unread,
May 7, 2010, 12:00:06 AM5/7/10
to cython-users
Optimization is always a pain in the arse and full of surprises. In
theory you should be able to get arbitrarily close to C with enough
cython annotations. The convenience of cython is that you can drop
back into Python when it makes sense.

Anything in inner loops should be as close to C as possible for
performance reasons. So yeah you can drop back into C if you'd like
or write C-like cython code to get you there as well.

Keep in mind that some numpy calls are going to be pretty well
optimized. I think they call the MKL libraries, which should take
full advantage of multiple cores and SSE instructions. Hopefully the
GIL is suspended when they make those calls, but that is beyond my
knowledge base.

So what I'm saying is don't be afraid to use built in numpy calls
where appropriate. I've gotten very good performance out of large
element-by-element array multiplies in NumPy. In fact, I wouldn't be
surprised if that is one reason for the discrepancy: ordinary
optimized C code will be slower than a call into an Intel-optimized
BLAS routine.
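
As a quick, runnable illustration of that point (plain NumPy, nothing
Cython-specific; the function name is made up for the example): a single
vectorized call dispatches the whole loop to optimized native code,
while the handwritten Python loop pays interpreter overhead per element.
Both compute the same result:

```python
import numpy as np

def multiply_loop(a, b):
    # Per-element Python loop: one interpreter round-trip per item.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] * b[i, j]
    return out

a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.ones((2, 3)) * 2.0

# Single vectorized call: the loop runs in optimized native code.
vectorized = a * b
assert np.array_equal(vectorized, multiply_loop(a, b))
```

On arrays of any real size, the vectorized form wins by a wide margin.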

By the way, Cython has a pretty good profiler these days, easier to
use than some of the standard C profilers. That alone is worth the
price of admission, IMHO.
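
For context (a sketch, not from the thread): Cython's profiling support
works by adding a "# cython: profile=True" directive at the top of the
.pyx file, after which the compiled functions show up in Python's
ordinary cProfile output. A pure-Python stand-in for the workflow, with
a hypothetical hot-spot function:

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for a solver hot spot; in a .pyx module you
# would put "# cython: profile=True" at the top of the file so that
# the compiled functions appear in the profile as well.
def count_candidates(cells):
    return sum(1 for v in cells if v != 0)

def profile_solver():
    pr = cProfile.Profile()
    pr.enable()
    for _ in range(1000):
        count_candidates([0, 3, 0, 7, 1])
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()

report = profile_solver()
```

The report names each function with its call count and cumulative time,
which is usually enough to find where the milliseconds are going.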

Now, if you really need some performance boosts, you can start looking
into running massively parallel ops on your graphics card, à la CUDA or
OpenCL. That's what I've been doing these days, and I use Cython to
call my CUDA kernels. There is also some very interesting Python
support for that via PyCUDA, PyOpenCL, Clyther, and probably other
things.


neurino

unread,
May 7, 2010, 12:12:04 PM5/7/10
to cython...@googlegroups.com
I have no extreme optimization needs, it's just a sudoku grid ^^

Anyway, I updated the pyx file in the repository (http://code.google.com/p/pysu/source/browse/trunk/sudoku.pyx):
- replacing np.ndindex with nested xranges,
- checking for numbers != 0 myself instead of using a[a > 0],
- changing various "if n" or "if not n" to "if n != 0" and "if n == 0" (does that matter?),
- using 4 more variables to avoid repeating some calculations (Cython version only, not in the Python code).

result:
cython
In [4]: timeit -n10 sudoku.PySudoku(hardest)
10 loops, best of 3: 166 ms per loop

python

In [5]: timeit -n10 _sudoku.PySudoku(hardest)
10 loops, best of 3: 184 ms per loop

I guess there's more room for improvement if I substitute things like

(candidates[row, col] > 0).sum()

with

v = 0
for n in xrange(size): v += 1 if candidates[row, col, n] != 0 else 0

or

v = 0
for n in xrange(size):
    if candidates[row, col, n] != 0: v += 1

i.e. going back from the Pythonic way to the C way.

Or maybe not?
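
A quick runnable check of that substitution (plain Python/NumPy; the
array contents here are made up for the example): the explicit loop and
the vectorized count agree, so the only question is which is cheaper on
such small arrays.

```python
import numpy as np

size = 9
candidates = np.zeros((size, size, size), dtype=np.int_)
candidates[0, 0, 2] = 1
candidates[0, 0, 5] = 1
candidates[0, 0, 8] = 1

row, col = 0, 0

# Vectorized version: allocates a temporary boolean array, then sums it.
v_numpy = int((candidates[row, col] > 0).sum())

# Unrolled version: no temporaries, just a counter -- this is the form
# Cython can translate into a plain C loop once the types are declared.
v_loop = 0
for n in range(size):
    if candidates[row, col, n] != 0:
        v_loop += 1

assert v_numpy == v_loop == 3
```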

Thanks for your support


2010/5/7 SevenThunders <matt...@earthlink.net>

Robert Bradshaw

unread,
May 7, 2010, 1:58:49 PM5/7/10
to cython...@googlegroups.com
On May 7, 2010, at 9:12 AM, neurino wrote:

> I have no excessive optimizations needs, it's just a sudoku schema ^^
>
> Anyway I updated the pyx file in the repository (http://code.google.com/p/pysu/source/browse/trunk/sudoku.pyx
> ),
> - replacing np.ndindex with nested xranges,
> - checking numbers != 0 by myself instead of a[a > 0]
> - changing various "if n" or "if not n" with "if n != 0" and "if n
> == 0" (does matter?)

No, that won't matter.

> - using 4 more variables to avoid some calculation repetitions
> (cython version only, not in python code)
>
> result:
> cython
> In [4]: timeit -n10 sudoku.PySudoku(hardest)
> 10 loops, best of 3: 166 ms per loop
>
> python
> In [5]: timeit -n10 _sudoku.PySudoku(hardest)
> 10 loops, best of 3: 184 ms per loop
>
> I guess there's more room for improvement as I substitute things like
>
> (candidates[row, col] > 0).sum()
>
> with
>
> v = 0
> for n in xrange(size): v += 1 if candidates[row, col, n] != 0 else 0
>
> or
>
> v = 0
> for n in xrange(size):
> if candidates[row, col, n] != 0: v += 1
>
> i.e. back from a pythonic way to C way.
>
> Or maybe not?

Take a look at

http://sage.math.washington.edu/home/robertwb/cython/sudoku.html

All your loops are bright yellow, so I think you're trying to optimize
in the wrong places. For example, your very inner loops consist of

n in grid[subslice]

or

unique(n, grid[subslice])

or

(grid[subslice] == 0).sum()

which is doing slow Python-level work that Cython is not going to be
able to optimize. Also, typing grid as a 2-D array would help a lot, as
currently Cython can't even take advantage of the fact that you've typed it.

As for your original version being slower once you cythonized it, I'm
not sure why that would be, but my guess would be that it was doing
lots of extra type checking because you declared types (and never took
advantage of them).

- Robert

neurino

unread,
May 7, 2010, 4:39:00 PM5/7/10
to cython...@googlegroups.com
> All your loops are bright yellow, so I think you're trying to optimize in the wrong places. For example, your very inner loops consist of
>   n in grid[subslice]

I know, that's why I'm using NumPy: to take advantage of subslicing; otherwise I'd use standard arrays directly...

I always checked the html "cython -a" creates.

I also tried typing grid and candidates as np.ndarray[np.int_t, ndim=2] and np.ndarray[np.int_t, ndim=3], but I get this error:

E:\pysu>cython -a sudoku.pyx

Error converting Pyrex file to C:
------------------------------------------------------------
...
        self.tries = 0
        self.solved = self.__solve(self.grid)

    #@cython.cdivision(True)
    #@cython.boundscheck(False)
    cdef np.ndarray[DTYPE_t, ndim=2] __solve(self, np.ndarray[DTYPE_t, ndim=2] grid):
                                                                          ^
------------------------------------------------------------

E:\pysu\sudoku.pyx:24:75: Expected ']'

and also later:

'ndarray' is not a type identifier

I made a commit with the code that generates the errors; if you can, try it and help me understand what's wrong.

You wrote that I typed variables but did not take advantage of them; please explain how to do that, since I'm quite sure I haven't understood it.

Greetings



2010/5/7 Robert Bradshaw <robe...@math.washington.edu>

neurino

unread,
May 7, 2010, 4:44:11 PM5/7/10
to cython...@googlegroups.com
Sorry, I forgot the link to the code: http://code.google.com/p/pysu/source/browse/trunk/sudoku.pyx

2010/5/7 neurino <neu...@gmail.com>

Robert Bradshaw

unread,
May 7, 2010, 6:59:10 PM5/7/10
to cython...@googlegroups.com
On May 7, 2010, at 1:39 PM, neurino wrote:

> > All your loops are bright yellow, so I think you're trying to
> optimize in the wrong places. For example, your very inner loops
> consist of
> > n in grid[subslice]
>
> I know, that's why I'm using numpy: to take advantage of subslicing,
> otherwise I'd use directly standard arrays...

Even if NumPy slices were optimized, there's no getting around the
fact that you're allocating new objects here (though they do share
data memory). Then when you call .sum() on it you're allocating
another Python object. As it is, to even invoke the slice you have to
allocate a tuple and indices as Python objects too. When your arrays
have hundreds, or millions, of items, slicing is cheap compared to
iterating over the array. When they just contain a dozen or two, that
fixed cost is relatively expensive.

It's conceivable that the compound statement could be optimized, but
we're not to that point yet. As you mentioned, you could unroll these
manually (though it would impact the clarity of your code). This could
be a bit cleaner once we have closures and/or buffers in cdef functions.
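
To make that fixed cost concrete (an illustrative sketch, not Robert's
code, with hypothetical function names): every slice below allocates a
new ndarray view plus the index objects, which dwarfs the actual work on
a 3x3 block, while the unrolled loop touches the same nine cells with no
allocations.

```python
import numpy as np

def zeros_in_block_sliced(grid, r0, c0):
    # Allocates: the slice objects, a new view object, a boolean
    # temporary array, and a Python integer for the result.
    return int((grid[r0:r0 + 3, c0:c0 + 3] == 0).sum())

def zeros_in_block_unrolled(grid, r0, c0):
    # Same nine reads, no intermediate objects (and in typed Cython
    # this loop would compile down to plain C).
    count = 0
    for r in range(r0, r0 + 3):
        for c in range(c0, c0 + 3):
            if grid[r, c] == 0:
                count += 1
    return count

grid = np.zeros((9, 9), dtype=np.int_)
grid[0, 0] = 5
assert zeros_in_block_sliced(grid, 0, 0) == zeros_in_block_unrolled(grid, 0, 0) == 8
```

On million-element arrays the slice's fixed cost disappears into the
noise; on a nine-cell block it dominates.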

> I always checked the html "cython -a" creates.
>
> I also tried to type grid and candidates as np.ndarray[np.int_t,
> ndim=2] and np.ndarray[np.int_t, ndim=3] but get this error:
>
> E:\pysu>cython -a sudoku.pyx
>
> Error converting Pyrex file to C:
> ------------------------------------------------------------
> ...
> self.tries = 0
> self.solved = self.__solve(self.grid)
>
> #@cython.cdivision(True)
> #@cython.boundscheck(False)
> cdef np.ndarray[DTYPE_t, ndim=2] __solve(self,
> np.ndarray[DTYPE_t, ndim=2] grid):
> ^
> ------------------------------------------------------------
>
> E:\pysu\sudoku.pyx:24:75: Expected ']'

You can't use buffer types in cdef functions (yet), just make it a def
function (as it's not called that often relative to all else that's
going on). The same is true of class members.

>
> and also later:
>
> 'ndarray' is not a type identifier
>
> I made a commit with the code that generates errors, if you can try
> it and help me to understand what's wrong.
>
> You wrote I did not take advantage of the fact I typed variables and
> did not take advantage, please explain me how to do it since I'm
> quite sure I did not understand that too.

Because the array type was undeclared, all your indexes were still
passing through Python. BTW, just running your original .py file
through Cython, no changes, gave me a 5% speedup on my machine. I
added a couple of types for a 25% speedup:

Python 0.20190050602
Cython 0.151217198372

http://sage.math.washington.edu/home/robertwb/cython/pysu.0.0.1/sudoku.diff

There's no getting a 10x speedup without avoiding all that slow
slicing though.

- Robert

Stefan Behnel

unread,
May 8, 2010, 12:40:54 AM5/8/10
to cython...@googlegroups.com
Robert Bradshaw, 07.05.2010 19:58:
> For example, your very inner loops consist of
>
> n in grid[subslice]
>
> or
>
> unique(n, grid[subslice])
>
> or
>
> (grid[subslice] == 0).sum()
>
> which is doing slow stuff that is not going to be able to be optimized
> by Cython.

I agree with the latter two, but the first could run in plain C. Cython
knows that 'n' is an int, and that 'grid' is a 3D int8 buffer that is
getting sliced. It could just generate a loop for that, just like Cython
0.13 will for non-literal bytes/unicode strings (literals will give you a
switch statement).

An 'in' test on a buffer slice is a pretty obvious thing to do, IMHO.
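
The loop Cython could generate for such an 'in' test is straightforward
(an illustrative sketch; contains() is a made-up name, and NumPy's own
'in' on an array slice already behaves like an elementwise any-match):

```python
import numpy as np

def contains(grid, n, rows, cols):
    # The membership loop a compiler could emit for "n in grid[subslice]":
    # scan the block, early-exit on the first hit, no temporaries.
    for r in range(*rows):
        for c in range(*cols):
            if grid[r, c] == n:
                return True
    return False

grid = np.arange(81, dtype=np.int8).reshape(9, 9)
sub = grid[0:3, 0:3]

# NumPy's "in" on the slice agrees with the explicit scan.
assert (20 in sub) == contains(grid, 20, (0, 3), (0, 3))  # present
assert (8 in sub) == contains(grid, 8, (0, 3), (0, 3))    # absent
```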

Stefan

Dag Sverre Seljebotn

unread,
May 8, 2010, 6:09:48 AM5/8/10
to cython...@googlegroups.com
All of this *could* be done, and none of it is very hard, but then my
thesis started sucking up all my time..

Dag Sverre

Dag Sverre Seljebotn

unread,
May 8, 2010, 6:15:11 AM5/8/10
to cython...@googlegroups.com
Also, part of why Cython seems so badly suited here is that the
people using Cython for numerics tend to work on arrays that are a
couple of hundred megabytes in size...

So Cython currently isn't targeted for these kinds of problems, and if
one wants speed for these kinds of things one needs to write C-like code.

Dag Sverre

neurino

unread,
May 8, 2010, 9:30:47 AM5/8/10
to cython...@googlegroups.com

2010/5/8 Robert Bradshaw <robe...@math.washington.edu>

You can't use buffer types in cdef functions (yet), just make it a def function (as it's not called that often relative to all else that's going on). The same is true of class members.

 
Sorry, but I already tried different combinations (def, cdef, class members only...) and I always get the same errors. Could someone try it and post working code?
_______________

Sudoku is just for plain testing; as usual, I have no particular need for speed.
In my work it's typical to have many, many formulas in single, Python-readable defs that are continuously checked and changed (since the scientific part of the team doesn't have very clear ideas... ^_^ so readability is a key factor). These defs usually work on small amounts of numbers (not hundreds of MB).

Surely it could all be condensed into a few big C-like functions, but then making targeted modifications would become very difficult.

In these cases Cython probably does not improve performance dramatically.
It's important for me to know that, because spending time cythonizing a desktop application just to speed up the user experience by, say, 100 ms per click is not worthwhile when normal Python code already gives satisfactory usability.

Of course, all of this would change if I had big amounts of data to process...

 

Dag Sverre Seljebotn

unread,
May 8, 2010, 9:36:13 AM5/8/10
to cython...@googlegroups.com
neurino wrote:
>
> 2010/5/8 Robert Bradshaw <robe...@math.washington.edu
> <mailto:robe...@math.washington.edu>>
>
> You can't use buffer types in cdef functions (yet), just make it a
> def function (as it's not called that often relative to all else
> that's going on). The same is true of class members.
>
>
> Sorry but already tried different combinations: def, cdef, class
> members only... but always get same errors could someone try and post
> right code?

def func(np.ndarray[double, ndim=2] arr):
    ...


If you must use "cdef", do

cdef func(arr):
    cdef np.ndarray[double, ndim=2] arr_fast = arr
    ...

However, the "arr_fast = arr" conversion does take some time. This is
simply a missing feature.

>
> In these cases probably Cython does not improve performaces dramatically.
> It's important to me to know that because to spend time to cythonize a
> desktop application just to speed up user experience of, say, 100ms
> each click, is not worthwhile as normal python code already gives a
> satisfactory usability.
Even though I'm one of the Cython developers, I only use Cython on
perhaps 10% of my code, where I know it will make a difference. And yes,
it gets hairy if one is not processing large amounts of data -- that's
just life, currently, until there's more time to work on it.

Dag Sverre

Dag Sverre Seljebotn

unread,
May 8, 2010, 9:45:30 AM5/8/10
to cython...@googlegroups.com

I don't remember whether I pointed it out to you earlier, but this:

http://conference.scipy.org/proceedings/SciPy2009/paper_2/

mentions something about what is supported or not. Virtually everything
in "future work" isn't done yet (or sits in a branch which in all
likelihood won't be merged until roughly October, if I'm honest with
myself...)

Dag Sverre

neurino

unread,
May 8, 2010, 9:47:22 AM5/8/10
to cython...@googlegroups.com
2010/5/8 Dag Sverre Seljebotn <da...@student.matnat.uio.no>

def func(np.ndarray[double, ndim=2] arr):
  ...


If you must use "cdef", do

cdef func(arr):
  cdef np.ndarray[double, ndim=2] arr_fast = arr
  ...

However the "arr_fast = arr" does take some time. This is simply a missing feature.


Seems to me the only place where I can use it without getting errors is

def __solve(self, np.ndarray[DTYPE_t, ndim=2] grid):

Other places in code give errors.

Anyway thanks for your support, it's much appreciated!


 

Robert Bradshaw

unread,
May 8, 2010, 1:01:52 PM5/8/10
to cython...@googlegroups.com
On May 8, 2010, at 3:09 AM, Dag Sverre Seljebotn wrote:

> Stefan Behnel wrote:
>> Robert Bradshaw, 07.05.2010 19:58:
>>> For example, your very inner loops consist of
>>>
>>> n in grid[subslice]
>>>
>>> or
>>>
>>> unique(n, grid[subslice])
>>>
>>> or
>>>
>>> (grid[subslice] == 0).sum()
>>>
>>> which is doing slow stuff that is not going to be able to be
>>> optimized
>>> by Cython.
>>
>> I agree with the latter two, but the first could run in plain C.
>> Cython knows that 'n' is an int, and that 'grid' is a 3D int8
>> buffer that is getting sliced. It could just generate a loop for
>> that, just like Cython 0.13 will for non-literal bytes/unicode
>> strings (literals will give you a switch statement).
>>
>> An 'in' test on a buffer slice is a pretty obvious thing to do, IMHO.
> All of this *could* be done,

If we assume buffer objects support broadcasting.

> and none of it is very hard, but then my thesis started sucking up
> all my time..

I *totally* understand :)

- Robert


Dag Sverre Seljebotn

unread,
May 8, 2010, 1:06:38 PM5/8/10
to cython...@googlegroups.com
Robert Bradshaw wrote:
> On May 8, 2010, at 3:09 AM, Dag Sverre Seljebotn wrote:
>
>> Stefan Behnel wrote:
>>> Robert Bradshaw, 07.05.2010 19:58:
>>>> For example, your very inner loops consist of
>>>>
>>>> n in grid[subslice]
>>>>
>>>> or
>>>>
>>>> unique(n, grid[subslice])
>>>>
>>>> or
>>>>
>>>> (grid[subslice] == 0).sum()
>>>>
>>>> which is doing slow stuff that is not going to be able to be optimized
>>>> by Cython.
>>>
>>> I agree with the latter two, but the first could run in plain C.
>>> Cython knows that 'n' is an int, and that 'grid' is a 3D int8 buffer
>>> that is getting sliced. It could just generate a loop for that, just
>>> like Cython 0.13 will for non-literal bytes/unicode strings
>>> (literals will give you a switch statement).
>>>
>>> An 'in' test on a buffer slice is a pretty obvious thing to do, IMHO.
>> All of this *could* be done,
>
> If we assume buffer objects support broadcasting.

I'd never want to support this for np.ndarray[...]. I was really hinting
at the int[:,:] syntax, where we are free to make any assumptions we
want. Not that I necessarily think those particular examples should
work, just that this *kind* of programming -- operating on small sets of
data with slices -- could be well supported that way.

Dag Sverre

Robert Bradshaw

unread,
May 8, 2010, 1:15:11 PM5/8/10
to cython...@googlegroups.com
On May 8, 2010, at 6:30 AM, neurino wrote:

> 2010/5/8 Robert Bradshaw <robe...@math.washington.edu>
> You can't use buffer types in cdef functions (yet), just make it a
> def function (as it's not called that often relative to all else
> that's going on). The same is true of class members.
>
>
> Sorry but already tried different combinations: def, cdef, cass
> members only... but always get same errors could someone try and
> post right code?

I did:

http://sage.math.washington.edu/home/robertwb/cython/pysu.0.0.1/

> _______________
>
> Sudoku is just for plain testing, I have no particular needs for
> speed as usual.
> In my work it's typical to have many and many formulas in single,
> python readable, defs that are continously checked and changed
> (since the scientific part of the team has not so clear ideas... ^_^
> so readability is a key factor). These defs usually work on small
> amounts of numbers (not 100s of Mb).
>
> Surely all could be summed in a few big C-like code functions but
> then make aimed modifications would become very difficult.
>
> In these cases probably Cython does not improve performaces
> dramatically.
> It's important to me to know that because to spend time to cythonize
> a desktop application just to speed up user experience of, say,
> 100ms each click, is not worthwhile as normal python code already
> gives a satisfactory usability.

Yes, that's one of the beauties of Cython. You can write your code in
Python, and if you discover a certain piece isn't fast enough, you can
sit down and optimize that bit, to whatever level you care about,
with Cython, rather than having to rewrite it all in C.

> Of course all would change if I had big amounts of data to process...

Yep, that's where the big gains are, because 100 ms vs. 1 ms doesn't
matter the way 100 hours vs. 1 does.

- Robert


Stefan Behnel

unread,
May 8, 2010, 2:23:11 PM5/8/10
to cython...@googlegroups.com
Robert Bradshaw, 08.05.2010 19:01:
> On May 8, 2010, at 3:09 AM, Dag Sverre Seljebotn wrote:
>
>> Stefan Behnel wrote:
>>> Robert Bradshaw, 07.05.2010 19:58:
>>>> For example, your very inner loops consist of
>>>>
>>>> n in grid[subslice]
>>>> [...]
>>>> which is doing slow stuff that is not going to be able to be optimized
>>>> by Cython.
>>>
>>> Cython knows that 'n' is an int, and that 'grid' is a 3D int8 buffer
>>> that is getting sliced. It could just generate a loop for that
>>>
>>> An 'in' test on a buffer slice is a pretty obvious thing to do, IMHO.
>> All of this *could* be done,
>
> If we assume buffer objects support broadcasting.

What do you mean? If we slice into a declared buffer object, the result
should be clear and the addressing obvious. I thought that that was what
the buffer syntax was there for. If we can bypass __getitem__ on index
access, why not on slice access?

Stefan

Robert Bradshaw

unread,
May 8, 2010, 2:32:12 PM5/8/10
to cython...@googlegroups.com
I was referring to statements like (v[subslice] == 0).sum() where
v is a buffer object. In any case, I like Dag's point that for the
int[:,:] concept we can make many more assumptions. It's probably safe
to implement the in operator, though a first step (and more generic/
useful?) would be efficient n-ary looping.

- Robert
