Performance of C-style vs general memory view access


Richard Shadrach

Mar 8, 2025, 8:56:24 AM
to cython-users
Doing some work in pandas' groupby code, I noticed we had mixed usage of `const int64_t[:, :] mask` and `const int64_t[:, ::1] mask` in our Cython functions. Looking to rectify these, I expected to see a performance increase when switching from the general `[:, :]` memory layout to `[:, ::1]`. However, I'm seeing the former perform just slightly (but consistently) better. Is this expected?

An example:

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def foo(
    const int64_t[:, ::1] mask,
) -> int:
    cdef:
        int64_t result = 0
        Py_ssize_t N, K

    N, K = (<object>mask).shape

    with nogil:
        for i in range(N):
            for j in range(K):
                result += mask[i, j]
    return result

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def bar(
    const int64_t[:, :] mask,
) -> int:
    cdef:
        int64_t result = 0
        Py_ssize_t N, K

    N, K = (<object>mask).shape

    with nogil:
        for i in range(N):
            for j in range(K):
                result += mask[i, j]
    return result

%timeit foo(arr)
# 8.85 ms ± 93.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bar(arr)
# 8.78 ms ± 25.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Best,
Richard
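
For readers less familiar with the two declarations in the question: `[:, :]` accepts any strided 2-D buffer, whereas `[:, ::1]` promises that the last axis is contiguous, so the inner stride is known to equal the item size. Below is a minimal pure-Python sketch of that address arithmetic, using the standard-library `memoryview` as a stand-in for the NumPy array. The helper names `offset_general` and `offset_contig` are hypothetical and only illustrate the idea; they are not what Cython actually generates.

```python
# Stand-in for a small C-contiguous int64 array of shape (2, 3).
buf = memoryview(bytearray(8 * 2 * 3)).cast("q", shape=[2, 3])

def offset_general(i, j, strides):
    # [:, :] layout: both strides are only known at run time.
    return i * strides[0] + j * strides[1]

def offset_contig(i, j, row_stride, itemsize):
    # [:, ::1] layout: the inner stride is fixed to the item size.
    return i * row_stride + j * itemsize

row_stride, inner_stride = buf.strides
# For a C-contiguous buffer both formulas agree on every element.
same = all(
    offset_general(i, j, buf.strides)
    == offset_contig(i, j, row_stride, buf.itemsize)
    for i in range(buf.shape[0])
    for j in range(buf.shape[1])
)
print(buf.c_contiguous, buf.strides, same)
```

Whether the fixed inner stride turns into a measurable speed-up depends on what the C compiler does with the loop, which is what the rest of the thread discusses.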

Stefan Behnel

Mar 8, 2025, 9:17:56 AM
to cython...@googlegroups.com
Richard Shadrach wrote on 08.03.25 at 13:25:
That's not what I see.

$ python3.12 cythonize.py contig_array.pyx \
--setup 'import numpy; from contig_array import foo, bar;
arr = numpy.empty((10000,10000), dtype=numpy.int64)' \
--timeit 'foo(arr)'
20 loops, best of 9: 18.450 msec per loop (median: 18.838 msec)

$ python3.12 cythonize.py contig_array.pyx \
--setup 'import numpy; from contig_array import foo, bar;
arr = numpy.empty((10000,10000), dtype=numpy.int64)' \
--timeit 'bar(arr)'
5 loops, best of 9: 49.602 msec per loop (median: 50.227 msec)


I also tried it with very small arrays and got about the same ratio.
Are you sure that you used properly optimising CFLAGS etc.?

Stefan


(PS: "cythonize --timeit" is a new Cython 3.1 feature that only just landed
in master, so you probably won't find it in your local installation yet.)
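
Since the two posters report opposite results, it may help to compare the same statistics on both machines. The output above reports a best-of-N and a median; here is a rough sketch of how to get comparable numbers with only the standard-library `timeit` module. The `compare` helper and the placeholder workloads are assumptions for illustration; in practice one would pass `lambda: foo(arr)` and `lambda: bar(arr)`.

```python
import statistics
import timeit

def compare(funcs, repeat=9, number=20):
    """Return {name: (best, median)} in seconds per call for each callable."""
    results = {}
    for name, func in funcs.items():
        # Each entry in totals is the time for `number` calls.
        totals = timeit.repeat(func, repeat=repeat, number=number)
        per_call = [t / number for t in totals]
        results[name] = (min(per_call), statistics.median(per_call))
    return results

# Placeholder workloads standing in for foo(arr) and bar(arr).
data = list(range(10_000))
stats = compare(
    {"sum": lambda: sum(data), "len": lambda: len(data)},
    repeat=5,
    number=10,
)
for name, (best, median) in stats.items():
    print(f"{name}: best {best * 1e3:.3f} ms, median {median * 1e3:.3f} ms")
```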

Johannes Fischer

Mar 8, 2025, 12:07:05 PM
to cython-users
I had the same experience a couple of days ago. Anaconda, Python 3.12 / last week's pip install version of Cython.

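Here is the code:

https://github.com/hansalemaos/curso_de_cython/blob/main/aula3/meumodulo/uiev.pyx

Here is the compile script:

https://github.com/hansalemaos/curso_de_cython/blob/main/aula3/meumodulo/uiev_compile.py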

In the past, it gave me a 5-10% speed up.

Is there something that's wrong with the compile script?

da-woods

Mar 8, 2025, 2:11:17 PM
to cython...@googlegroups.com
On 08/03/2025 15:26, Johannes Fischer wrote:
> I had the same experience a couple of days ago. Anaconda, Python 3.12
> / last week's pip install version of Cython.
>
> Here is the code:
>
> https://github.com/hansalemaos/curso_de_cython/blob/main/aula3/meumodulo/uiev.pyx
>
> Here is the compile script:
>
> https://github.com/hansalemaos/curso_de_cython/blob/main/aula3/meumodulo/uiev_compile.py
>
> In the past, it gave me a 5-10% speed up.
>
> Is there something that's wrong with the compile script?


I'd be wary of over-specifying the compile script. Most of the options
have defaults for a reason. Especially the macros - I wouldn't change
them unless you have very good reason to (I know you've commented them
out but it gives the impression that this is the sort of thing you want
to strictly control).

With that said, I can't think of anything that should cause problems with
this specific issue.

GCC is `-fopenmp` not `-openmp`, isn't it?

I'm just not convinced that your code looks particularly parallel - I'd
be worried that it'll spend a lot of time fighting for locks and that
this might give you odd results.
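
On the `-fopenmp` point above: a build script that has to work on both Windows and Linux could pick the flag by compiler family. This is only a sketch under stated assumptions: `openmp_args` is a hypothetical helper (not part of setuptools or Cython), MSVC is assumed on Windows, and a GCC-style driver is assumed everywhere else.

```python
import sys

def openmp_args(compiler):
    """Return (extra_compile_args, extra_link_args) for a compiler family."""
    if compiler == "msvc":
        # MSVC enables OpenMP with /openmp and needs no separate link flag.
        return ["/openmp"], []
    # GCC-style drivers take -fopenmp for both compiling and linking.
    return ["-fopenmp"], ["-fopenmp"]

# Assumption: MSVC on Windows, a GCC-style compiler otherwise.
compiler = "msvc" if sys.platform == "win32" else "unix"
compile_args, link_args = openmp_args(compiler)
print(compile_args, link_args)
```

The two lists would then be passed as `extra_compile_args` and `extra_link_args` on the `Extension` object.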

Stefan Behnel

Mar 9, 2025, 1:00:49 AM
to cython...@googlegroups.com
da-woods wrote on 08.03.25 at 20:11:
> On 08/03/2025 15:26, Johannes Fischer wrote:
>> https://github.com/hansalemaos/curso_de_cython/blob/main/aula3/meumodulo/uiev_compile.py
>>
>> Is there something that's wrong with the compile script?
>
> I'd be wary of over-specifying the compile script. Most of the options have
> defaults for a reason. Especially the macros - I wouldn't change them
> unless you have very good reason to (I know you've commented them out but
> it gives the impression that this is the sort of thing you want to strictly
> control).

I second that. Most options are "just there", partly (especially the C
macro flags) for internal adaptation, partly because they seemed a good
idea at the time, mostly "because we can" provide them for special needs.
Most user code doesn't have special needs. And if it does, then it's one or
two options that bring a benefit and that a whole project would probably
enable globally for the build, once and for all.

The long list also completely ignores the place where directives are used,
for example. Several of them shouldn't be used globally but at a
per-function level or even more fine-grained, to keep their impact under
control.

The link above seems to refer to a "Cython course". That's probably the
worst place for teaching users to care about all those options. It really
gives a wrong impression of importance. Users should not care about them,
most of the time. And if it really comes to fine-tuning the last couple of
percent of the code, users can still look through the available (then
up-to-date documented) options to try the ones that appear related, and
then keep the (again) one or two that prove to make a difference.
Everything else is just a distracting heap of bitrotting irrelevance.

Stefan

Johannes Fischer

Mar 9, 2025, 1:26:07 PM
to cython-users
Thanks for your response! The macros are all commented out, it is just an overview for myself, and for people who are interested in it. I am not going to talk about them in my videos, because I don't know when to use most of them. Here is the video, by the way, and the point where I test it: https://youtu.be/S3SfG4AJgLM?si=eFJ_iPqIyWsZQy_B&t=3582
The code ran in around 90 ms at the beginning, and 96 ms after adding ":1".
I had no clue why this was happening, since it made my code run faster in the past.

Thanks, D. Woods for your correction! I forgot that one letter! I use mostly Windows (like in the video)! I am going to change that now! 



Richard Shadrach

Mar 30, 2025, 1:56:40 AM
to cython...@googlegroups.com
Found some time to get back to this. First, apologies for that awfully incomplete reproducer. I extracted my example from the pandas build and redid it from the ground up. Note that I'm seeing the exact opposite of the timings that Stefan posted.

> Are you sure that you used properly optimising CFLAGS etc.?

No - I am not at all. Below should be everything, any suggestions would be greatly appreciated.

Best,
Richard

Calling code:

import numpy as np
import main
arr = np.empty((10000,10000), dtype=np.int64)
%timeit main.foo(arr)
# 41.1 ms ± 25.5 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit main.bar(arr)
# 20.8 ms ± 94.1 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

------

setup.py

from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("main.pyx"),
    include_dirs=[np.get_include()],
)

-------

Compilation (using NumPy 1.26.4 currently)

python setup.py build_ext --inplace
running build_ext
building 'main' extension
x86_64-linux-gnu-gcc -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -fPIC -I/home/richard/dev/venvs/pandas/lib/python3.12/site-packages/numpy/core/include -I/home/richard/dev/venvs/pandas/include -I/usr/include/python3.12 -c main.c -o build/temp.linux-x86_64-cpython-312/main.o
In file included from /home/richard/dev/venvs/pandas/lib/python3.12/site-packages/numpy/core/include/numpy/ndarraytypes.h:1929,
                 from /home/richard/dev/venvs/pandas/lib/python3.12/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
                 from /home/richard/dev/venvs/pandas/lib/python3.12/site-packages/numpy/core/include/numpy/arrayobject.h:5,
                 from main.c:1240:
/home/richard/dev/venvs/pandas/lib/python3.12/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
   17 | #warning "Using deprecated NumPy API, disable it with " \
      |  ^~~~~~~
creating build/lib.linux-x86_64-cpython-312
x86_64-linux-gnu-gcc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 build/temp.linux-x86_64-cpython-312/main.o -L/usr/lib/x86_64-linux-gnu -o build/lib.linux-x86_64-cpython-312/main.cpython-312-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-cpython-312/main.cpython-312-x86_64-linux-gnu.so ->
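
The flags in the log above (`-g -O2 -Wall` and so on) are not set anywhere in `setup.py`; on Linux they most likely come from the configuration of the Python build itself, which setuptools reuses by default. A small sketch for inspecting them with the standard-library `sysconfig` module follows. The helper `has_optimisation_flag` is hypothetical and only does a naive string check.

```python
import sysconfig

# Flags recorded when this Python interpreter was built. On Windows these
# config variables are typically None.
for name in ("CC", "OPT", "CFLAGS"):
    print(name, "=", sysconfig.get_config_var(name))

def has_optimisation_flag(flags):
    """True if the flag string contains a GCC/Clang -O level other than -O0."""
    return any(
        f.startswith("-O") and f != "-O0" for f in (flags or "").split()
    )

print(has_optimisation_flag(sysconfig.get_config_var("CFLAGS")))
```

One thing that might be worth trying, as an experiment rather than a recommendation, is rebuilding with the `CFLAGS` environment variable set (for example `CFLAGS="-O3"`) and checking whether the relative timings of `foo` and `bar` change.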

--------

main.pyx

cimport cython
from cython cimport Py_ssize_t
import numpy as np
cimport numpy as cnp
from numpy cimport int64_t

cnp.import_array()