vectorization of looping on an array from cython

258 views
Skip to first unread message

Antony Lee

unread,
May 27, 2015, 2:29:31 PM5/27/15
to cython...@googlegroups.com
(crossposted from StackOverflow)

Consider the following example of doing an inplace-add on a Cython memoryview:

    #cython: boundscheck=False, wraparound=False, initializedcheck=False, nonecheck=False, cdivision=True
    from libc.stdlib cimport malloc, free
    from libc.stdio cimport printf
    cimport numpy as np
    import numpy as np


    cdef extern from "time.h":
        int clock()


    cdef void inplace_add(double[::1] a, double[::1] b):
        cdef int i
        for i in range(a.shape[0]):
            a[i] += b[i]


    cdef void inplace_addlocal(double[::1] a, double[::1] b):
        cdef int i, n = a.shape[0]
        for i in range(n):
            a[i] += b[i]


    def main(int N):
        cdef:
            int rep = 1000000, i
            double* pa = <double*>malloc(N * sizeof(double))
            double* pb = <double*>malloc(N * sizeof(double))
            double[::1] a = <double[:N]>pa
            double[::1] b = <double[:N]>pb
            int start
        start = clock()
        for i in range(N):
            a[i] = b[i] = 1. / (1 + i)
        for i in range(rep):
            inplace_add(a, b)
        printf("loop %i\n", clock() - start)
        print(np.asarray(a)[:4])
        start = clock()
        for i in range(N):
            a[i] = b[i] = 1. / (1 + i)
        for i in range(rep):
            inplace_addlocal(a, b)
        printf("loop_local %i\n", clock() - start)
        print(np.asarray(a)[:4])

With these Cython directives, the seemingly equivalent `inplace_add` and `inplace_addlocal` both compile to tight C loops.  But for `N=128` (the approximate size I'm expecting) `inplace_addlocal` is twice(!) faster than `inplace_add`, after compilation with `gcc -Ofast` (and directly writing a C function taking a (int, double*, double*) is more or less as fast as `addlocal`, with or without `#openmp simd`).  Passing `-fopt-info` to `gcc` shows that `inplace_addlocal` gets vectorized, but not `inplace_add`.

Is this an issue with the C code that Cython generates (i.e., gcc truly cannot infer whatever guarantees it needs to vectorize the code), or with gcc (i.e., some optimization is missing), or something else?

Thanks.

Antony

Chris Barker

unread,
May 28, 2015, 11:37:17 AM5/28/15
to cython-users
On Wed, May 27, 2015 at 11:29 AM, Antony Lee <anton...@berkeley.edu> wrote:
    cdef void inplace_add(double[::1] a, double[::1] b):
        cdef int i
        for i in range(a.shape[0]):
            a[i] += b[i]

    cdef void inplace_addlocal(double[::1] a, double[::1] b):
        cdef int i, n = a.shape[0]
        for i in range(n):
            a[i] += b[i]
 
With these Cython directives, the seemingly equivalent `inplace_add` and `inplace_addlocal` both compile to tight C loops. 

did you look at the actual for loop generated? In addlocal,  Cyton knows that i and n are both ints, and can therefore write a plain old C for loop. but in add, the argument to the range() function is an arbitrary python type, so I'd expect it to write a less optimized C for loop, and so the C compiler may not know for sure that this is a simple loop through an array.

Anyway, what really is the question here? Cython's type inference is very limited -- you are generally going to get the best performance if you type everything -- particularly the indexes in a loop. So I'm not surprised a bit.

-CHB






 
But for `N=128` (the approximate size I'm expecting) `inplace_addlocal` is twice(!) faster than `inplace_add`, after compilation with `gcc -Ofast` (and directly writing a C function taking a (int, double*, double*) is more or less as fast as `addlocal`, with or without `#openmp simd`).  Passing `-fopt-info` to `gcc` shows that `inplace_addlocal` gets vectorized, but not `inplace_add`.

Is this an issue with the C code that Cython generates (i.e., gcc truly cannot infer whatever guarantees it needs to vectorize the code), or with gcc (i.e., some optimization is missing), or something else?

Thanks.

Antony

--

---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Sturla Molden

unread,
May 28, 2015, 12:28:03 PM5/28/15
to cython...@googlegroups.com
Chris Barker <chris....@noaa.gov> wrote:

> did you look at the actual for loop generated? In addlocal, Cyton knows
> that i and n are both ints, and can therefore write a plain old C for loop.
> but in add, the argument to the range() function is an arbitrary python
> type, so I'd expect it to write a less optimized C for loop, and so the C
> compiler may not know for sure that this is a simple loop through an array.

Cython should know the type of the .shape attribute of a typed memoryview.
If it does not, it qualifies as a bug.


Sturla

Antony Lee

unread,
May 28, 2015, 12:52:07 PM5/28/15
to cython...@googlegroups.com
The answer has been posted on SO: http://stackoverflow.com/questions/30490310/vectorization-of-looping-on-an-array-from-cython/30492873#30492873
I had looked at the C code myself but overlooked that addlocal implicitly casts a.shape[0] from Py_ssize_t to int (the same as i), whereas for add, a.shape[0] and thus the upper bound of range is Py_ssize_t (64 bit), whereas the index is 32 bit.  Thus, it is technically correct that vectorization of the second C code is impossible as a.shape[0] could be higher than the largest integer that i can be set to (in which case the C output is an infinite loop).  Declaring i as Py_ssize_t (or juste size_t) solves the issue.
Given the huge difference in performance here (as well as the theoretical risk of a bug, for containers of size > 2**32), perhaps this should be mentioned in the docs more "positively" than "Purists could use "Py_ssize_t" which is the proper Python type for array indices."? ("Python for Numpy users")  On the other hand making this an error is certainly way too backwards incompatible.  A warning, perhaps?
Antony

Chris Barker

unread,
May 28, 2015, 10:58:30 PM5/28/15
to cython-users
addlocal implicitly casts a.shape[0] from Py_ssize_t to int (the same as i), whereas for add, a.shape[0] and thus the upper bound of range is Py_ssize_t (64 bit), whereas the index is 32 bit.  Thus, it is technically correct that vectorization of the second C code is impossible as a.shape[0] could be higher than the largest integer that i can be set to (in which case the C output is an infinite loop).  Declaring i as Py_ssize_t (or juste size_t) solves the issue.

2015-05-28 9:27 GMT-07:00 Sturla Molden <sturla...@gmail.com>:
Cython should know the type of the .shape attribute of a typed memoryview.
If it does not, it qualifies as a bug.

apparently it did -- but maybe not the best type -- I've lost track a bit, but .shape will always be returning what could be an index, so it should get the right type for that: size_t? (maybe it did?)

Perhaps this should be mentioned in the docs more "positively" than "Purists could use "Py_ssize_t" which is the proper Python type for array indices."? ("Python for Numpy users")  On the other hand making this an error is certainly way too backwards incompatible.  A warning, perhaps?

that might be a good idea.

While using "int" all over the place is pretty standard practice, I think in both Cython and C, people should be a lot more careful about the type used for array indexes.

-Chris


Antony Lee

unread,
May 28, 2015, 11:16:07 PM5/28/15
to cython...@googlegroups.com
2015-05-28 19:57 GMT-07:00 Chris Barker <chris....@noaa.gov>:


addlocal implicitly casts a.shape[0] from Py_ssize_t to int (the same as i), whereas for add, a.shape[0] and thus the upper bound of range is Py_ssize_t (64 bit), whereas the index is 32 bit.  Thus, it is technically correct that vectorization of the second C code is impossible as a.shape[0] could be higher than the largest integer that i can be set to (in which case the C output is an infinite loop).  Declaring i as Py_ssize_t (or juste size_t) solves the issue.

2015-05-28 9:27 GMT-07:00 Sturla Molden <sturla...@gmail.com>:
Cython should know the type of the .shape attribute of a typed memoryview.
If it does not, it qualifies as a bug.

apparently it did -- but maybe not the best type -- I've lost track a bit, but .shape will always be returning what could be an index, so it should get the right type for that: size_t? (maybe it did?)

It knows correctly that it is a Py_ssize_t, so there's no issue there.

Antony

Reply all
Reply to author
Forward
0 new messages