Cython and Fortran+f2py


Andrea Gavana

Apr 1, 2021, 3:47:14 PM
to cython-users
Dear Cython Users,

I am a very inexperienced Cython user, and I am trying to understand how to speed up a little function. I seem to have tried everything I know, but my Cython implementation is still between 5 and 50 times slower than a dumb Fortran+f2py routine I have written.

This little function simply finds the row and column indices and the nonzero values of a 2D matrix, which is relatively sparse.

That said, both the Cython and Fortran+f2py routines receive a dense, single-precision, Fortran-ordered 2D matrix of varying dimensions. Neither routine returns anything (on purpose: this is a very simplified version of my actual problem), but I believe I have taken steps to prevent the compiler from optimizing the work away.
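For reference, the operation being benchmarked can be written in pure NumPy; this is just a sketch of what the routines compute, not the code under test (the helper name find_nnz_numpy and the small threshold argument are mine, not from the thread):

```python
import numpy as np

def find_nnz_numpy(matrix, small=1e-10):
    """Return (row_indices, col_starts, values) for entries with |v| >= small.

    col_starts is 1-based, mirroring the Fortran-style convention in the
    thread's routines; row_indices stays 0-based, as in the Cython version.
    Note: np.nonzero scans in row-major (C) order, unlike the column-major
    loops in the thread, so the ordering of the triples differs.
    """
    rows, cols = np.nonzero(np.abs(matrix) >= small)
    return rows, cols + 1, matrix[rows, cols].astype(np.float64)

# Example on a tiny Fortran-ordered single-precision matrix:
m = np.array([[0.0, 2.0], [3.0, 0.0]], dtype=np.float32, order='F')
r, c, v = find_nnz_numpy(m)
```

This is far from the fastest option for large matrices (it makes several temporaries), but it gives a correctness baseline to compare the compiled routines against.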

If I run both routines with the attached sample, I get this type of elapsed times:

                     Elapsed Times (ms)        
M      N      NNZ    Cython     Fortran + F2PY    Ratio
1661   2608   5833   25.0584    2.7079            9.3
2828   3512   7191   40.0933    6.4342            6.2
780    985    1657   6.7442     0.4403            15.3
799    1558   2772   10.8066    0.7142            15.1
399    540    908    3.1496     0.1289            24.4
10545  10486  23787  271.3768   70.9125           3.8
3369   3684   6962   42.6007    7.9357            5.4
2052   5513   6822   39.4725    7.2816            5.4
224    628    1555   4.8283     0.0944            51.1
553    1475   1849   7.4616     0.4713            15.8

Here M and N are the numbers of rows and columns of the matrix, NNZ is the number of nonzero elements; the next two columns give the elapsed times in ms, followed by the Cython/Fortran ratio.

It may easily be that passing a single-precision, Fortran-ordered matrix to Cython is not the best idea, but this is the input I have to live with.

Any suggestion for making the Cython implementation more performant is most welcome - I attach all the scripts at the end of the email. Thank you in advance.

Andrea.

# CYTHON - cython_test_nnz.pyx
import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

cdef float SMALL = 1e-10
cdef int MAXNNZ = 1000000

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
@cython.nonecheck(False)
cpdef find_nnz(np.float32_t[::1, :] matrix):

    cdef int n, m, i, j, nza
    cdef float val1

    col_starts  = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    row_indices = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    x = cvarray(shape=(MAXNNZ,), itemsize=sizeof(double), format='d')

    m = matrix.shape[0]
    n = matrix.shape[1]

    nza = 0
    for i in range(n):
        for j in range(m):
            val1 = matrix[j, i]

            if abs(val1) < SMALL:
                continue

            col_starts[nza] = i+1
            row_indices[nza] = j
            x[nza] = <double>val1
            nza += 1


# Cython setup.py file
from setuptools import setup
from setuptools import Extension

from Cython.Build import cythonize
import numpy as np

if __name__ == '__main__':

    extra_compile_args = ['-O3', '-ffast-math', '-funroll-loops', '-flto', '-ftree-vectorize']
    extra_link_args = ['-flto'] + extra_compile_args

    extensions = [Extension('cython_test_nnz', ['cython_test_nnz.pyx'],
                            extra_compile_args=extra_compile_args,
                            extra_link_args=extra_link_args,
                            include_dirs=[np.get_include()],
                            define_macros=[('WIN32', 1)])]

    # build the core extension(s)
    setup_kwargs = {'ext_modules': cythonize(extensions)}

    setup(**setup_kwargs)


# Fortran implementation
subroutine find_nnz(matrix, m, n)

    implicit none
    
    integer, intent(in):: m, n
    real(4), intent(in), dimension(m, n) :: matrix
    
    integer nza, i, j, maxnnz
    real(4) small, val1
    
    parameter(maxnnz=1000000)
    
    double precision, dimension(maxnnz):: x
    integer, dimension(maxnnz):: row_indices, col_starts
    
    small = 1e-10
    nza = 1
    
    do i = 1, n
        do j = 1, m
            val1 = matrix(j, i)

            if (abs(val1) < small) then
                cycle
            endif

            col_starts(nza) = i+1
            row_indices(nza) = j
            x(nza) = dble(val1)
            nza = nza + 1

        enddo
    enddo
    
!   Avoid compiler optimizing away stuff
    x(1) = x(1) + 1.0
    col_starts(1) = col_starts(1) + 1
    row_indices(1) = row_indices(1) + 1
    
end subroutine find_nnz

# Compile the fortran routine

f2py --fcompiler=gfortran --opt="-O3 -ffast-math -funroll-loops -flto -ftree-vectorize" -c fortran_test_nnz.f90 -m fortran_test_nnz


# Test Script
import os
import time
import numpy as np
from scipy.sparse import rand as srand

import cython_test_nnz
import fortran_test_nnz

SIZES = [(1661, 2608)  , (2828, 3512), (780, 985)  , (799, 1558), (399, 540),
         (10545, 10486), (3369, 3684), (2052, 5513), (224, 628) , (553, 1475)]
NNZ = [5833, 7191, 1657, 2772, 908, 23787, 6962, 6822, 1555, 1849]

np.random.seed(123456)

def tester():

    for i in range(len(NNZ)):
        m, n = SIZES[i]
        nnz = NNZ[i]
        density = nnz/float(m*n)
        matrix = srand(m, n, density=density, format='coo', dtype=np.float32)
        matrix = matrix.todense(order='F')

        print('%-6d %-6d %-6d' % (m, n, nnz), end=' ')

        for implementation in ['Cython', 'Fortran']:

            function = cython_test_nnz.find_nnz if implementation == 'Cython' else fortran_test_nnz.find_nnz
            time_setup = []

            for k in range(10):
                start = time.perf_counter()  # time.clock() was removed in Python 3.8
                function(matrix)
                elapsed = time.perf_counter() - start
                time_setup.append(elapsed * 1e3)

            print('%-10.4f' % np.mean(time_setup), end=' ')

        print('')

if __name__ == '__main__':
    tester()




da-woods

Apr 1, 2021, 4:13:43 PM
to cython...@googlegroups.com
1) Have you tried getting annotated output from Cython (cython -a filename.pyx)? It'll produce a colour-coded HTML file that gives you a clue about what's not optimized.

2) I suspect you need to type col_starts, row_indices, and x as typed memoryviews (to speed up indexing them)

3) Fortran and f2py are actually pretty quick. You should be able to get close, but I don't think you'd expect to beat them by huge margins.
--

---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cython-users/b4cca13b-6c0a-455f-8ac3-eb2e03cc06b9n%40googlegroups.com.


Andrea Gavana

Apr 1, 2021, 4:24:32 PM
to cython-users
Hi da-woods,

Thank you for your answer and your interest.

On Thursday, April 1, 2021 at 10:13:43 PM UTC+2 da-woods wrote:
1) Have you tried getting annotated output from Cython (cython -a filename.pyx)? It'll produce a colour-coded HTML file that gives you a clue about what's not optimized.

I have added the type annotations to the memoryviews - and most of the annotated output is white (I am attaching a screenshot). The only yellow remaining is around the typed memoryviews (maybe I have done something wrong).

It is indeed much better: the ratio went down to between 1.5 and 4.2.
 

2) I suspect you need to type col_starts, row_indices, and x as typed memoryviews (to speed up indexing them)

3) Fortran and f2py are actually pretty quick. You should be able to get close, but I don't think you'd expect to beat them by huge margins.

I was afraid that would be the answer... I guess nothing more can be done to the modified version below?

import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

cdef float SMALL = 1e-10
cdef int MAXNNZ = 1000000

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
@cython.nonecheck(False)
cpdef find_nnz(np.float32_t[::1, :] matrix):

    cdef int n, m, i, j, nza
    cdef float val1

    cdef int [:] col_starts  = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    cdef int [:] row_indices = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    cdef double [:] x = cvarray(shape=(MAXNNZ,), itemsize=sizeof(double), format='d')

    m = matrix.shape[0]
    n = matrix.shape[1]

    nza = 0
    for i in range(n):
        for j in range(m):
            val1 = matrix[j, i]

            if abs(val1) < SMALL:
                continue

            col_starts[nza] = i+1
            row_indices[nza] = j
            x[nza] = <double>val1
            nza += 1


Thank you again.

Andrea.


[Attachment: cython_annotations.PNG]

da-woods

Apr 1, 2021, 4:55:39 PM
to cython...@googlegroups.com
There's probably a small gain to be had by specifying contiguous memoryviews (e.g. `int [::1]`).

Fortran-ordered arrays shouldn't be a problem. You should usually iterate over them contiguous dimension first, but it looks like you're doing that.

There may be a gain from trying not to overallocate too much memory (although I get that the size is unknown up front). E.g. start with something like 5000 elements, but be prepared to reallocate to double the size if/when it overflows. The cost will depend on how often you get it wrong.
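That doubling strategy can be sketched in plain Python with NumPy (the helper name append_grow is mine; in Cython the same idea works with realloc() or by copying into a fresh, larger cvarray):

```python
import numpy as np

def append_grow(buf, count, value):
    """Append value at index count, doubling the buffer when it is full.

    Returns the (possibly reallocated) buffer and the new count. Amortized
    cost stays O(1) per append because each element is copied at most
    O(log n) times across all doublings.
    """
    if count >= buf.shape[0]:
        new_buf = np.empty(2 * buf.shape[0], dtype=buf.dtype)
        new_buf[:count] = buf   # copy the existing elements over
        buf = new_buf
    buf[count] = value
    return buf, count + 1

# Start deliberately small; the fifth append triggers a resize to 8.
x = np.empty(4, dtype=np.float64)
n = 0
for v in [1.0, 2.0, 3.0, 4.0, 5.0]:
    x, n = append_grow(x, n, v)
```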

Other than that I have no real suggestions. Unfortunately Fortran+f2py is sometimes my emergency solution when I can't get something fast enough in anything more friendly.

Peter Schay

Apr 1, 2021, 7:50:11 PM
to cython-users
Hi,
You could just change the declarations of the arrays to:

cdef int col_starts[MAXNNZ]
cdef int row_indices[MAXNNZ]
cdef double x[MAXNNZ]

Using C arrays will require a change to the declaration of MAXNNZ:
cdef enum: MAXNNZ = 1000000

Is that faster than the Fortran?

Regards,
Pete


da-woods

Apr 2, 2021, 3:23:38 AM
to cython...@googlegroups.com
It's probably faster if it works, but there's a limit to how much you can allocate on the stack, and this has to be pushing that limit. You may just find it crashes.

Andrea Gavana

Apr 2, 2021, 3:36:55 AM
to cython...@googlegroups.com
Hi Peter, da-woods,

Thank you for your answers. I tried Peter’s suggestion and at first I got ridiculously low elapsed times, so I imagined the compiler was optimizing away everything. 

So I added a "return nza" at the end of the Cython function - I get more reasonable timings, and it's indeed faster than the Fortran version. Whether that single return line fooled the compiler, I have no idea.

That said, it works in the toy function I posted: however, the moment I integrate that change into a method inside a Cython class in a script that wraps a large C library, it bombs out - no error messages, nothing. Simply bombs.

So I guess da-woods might be correct, and I can see only two alternatives:

- Reduce MAXNNZ to a lower value
- Increase the stack size (no idea how to do that)
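On Unix-like systems the stack size can often be raised: in a shell with ulimit -s (before launching Python), or from Python itself via the standard resource module. A hedged sketch of the latter; raising the soft limit may fail depending on the hard limit and the OS, hence the try/except, and it only helps frames entered after the change (it is a Unix-only module, not available on Windows):

```python
import resource

# Current soft/hard stack limits in bytes (hard may be RLIM_INFINITY, i.e. -1).
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

try:
    # Raise the soft limit up to whatever the hard limit allows.
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
except (ValueError, OSError):
    # Not permitted on this platform/configuration; fall back to heap
    # allocation (malloc/PyMem_Malloc) or a smaller MAXNNZ instead.
    pass
```

For worker threads, threading.stack_size() can set the stack size of threads created afterwards; the main thread's stack is fixed at process start.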

Thank you again.

Andrea.


Peter Schay

Apr 2, 2021, 2:56:14 PM
to cython-users
Hi Andrea, hi David,
OK; thanks for the feedback on my slightly-too-simple idea.
I tried a few more experiments and I think I learned a bit more about this test.  
(I am on vacation and have wanted to learn more about typed memoryviews for a while now, so thanks for posting this excellent test.)
It seems this is one of those games where whoever goes first loses :-)
The test.py loop appears to provide a cache advantage to the second implementation, which is most noticeable for large m and n.

Here are some test runs, please let me know if I've understood this correctly and if these are reproducible in your env...

pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Fortran: 80.9882   
10545  10486  23787  Cython2: 66.8081
10545  10486  23787  Cython : 66.7317 
pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Cython : 78.1318   
10545  10486  23787  Fortran: 70.2815   
10545  10486  23787  Cython2: 66.7154   
pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Cython2: 78.2976   
10545  10486  23787  Fortran: 70.2670   
10545  10486  23787  Cython : 66.8833   

The 'Cython' implementation uses the typed memoryview for the 3 arrays, e.g.:  cdef int [:] col_starts = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
'Cython2' uses the contiguous version with [::1]

Cython3 takes another pure-C approach, this time without breaking the stack (one could just as easily malloc or PyMem_Malloc this memory, and perhaps that might make a difference for the smaller m, n tests, which I did not experiment with much):
cdef int [::1] col_starts_arr = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
cdef int *col_starts = &col_starts_arr[0]

pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Cython3: 78.2551   
10545  10486  23787  Fortran: 70.7756   
10545  10486  23787  Cython : 66.9182   

Taking just the first row from the above test runs, it seems Cython memoryviews are as advertised - just as fast as C!
10545  10486  23787  Fortran: 80.9882   
10545  10486  23787  Cython : 78.1318   
10545  10486  23787  Cython2: 78.2976   
10545  10486  23787  Cython3: 78.2551   

Please let me know if I've overlooked anything and how it looks.


