Cython and Fortran+f2py


Andrea Gavana

Apr 1, 2021, 3:47:14 PM
to cython-users
Dear Cython Users,

I am a very inexperienced Cython user, and I am trying to understand how to speed up a little function. I seem to have tried everything I know, but my Cython implementation is still between 5 and 50 times slower than a dumb Fortran+f2py routine I have written.

This little function simply finds the row and column indices and the nonzero values of a 2D matrix, which is relatively sparse.

That said, both the Cython and Fortran+f2py routines receive a dense, single-precision, Fortran-ordered 2D matrix of varying dimensions. Neither routine returns anything (on purpose: this is a very simplified version of my actual problem), but I believe I have taken steps to prevent the compiler from optimizing the work away.
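For reference, the operation being benchmarked can be written in pure NumPy; this is just a sketch of what the routines compute, not the code under test (the helper name find_nnz_numpy and the small threshold argument are mine, not from the thread):

```python
import numpy as np

def find_nnz_numpy(matrix, small=1e-10):
    """Return (row_indices, col_starts, values) for entries with |v| >= small.

    col_starts is 1-based, mirroring the Fortran-style convention in the
    thread's routines; row_indices stays 0-based, as in the Cython version.
    Note: np.nonzero scans in row-major (C) order, unlike the column-major
    loops in the thread, so the ordering of the triples differs.
    """
    rows, cols = np.nonzero(np.abs(matrix) >= small)
    return rows, cols + 1, matrix[rows, cols].astype(np.float64)

# Example on a tiny Fortran-ordered single-precision matrix:
m = np.array([[0.0, 2.0], [3.0, 0.0]], dtype=np.float32, order='F')
r, c, v = find_nnz_numpy(m)
```

This is far from the fastest option for large matrices (it makes several temporaries), but it gives a correctness baseline to compare the compiled routines against.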

If I run both routines with the attached sample, I get this type of elapsed times:

                     Elapsed Times (ms)        
M      N      NNZ    Cython     Fortran + F2PY    Ratio
1661   2608   5833   25.0584    2.7079            9.3
2828   3512   7191   40.0933    6.4342            6.2
780    985    1657   6.7442     0.4403            15.3
799    1558   2772   10.8066    0.7142            15.1
399    540    908    3.1496     0.1289            24.4
10545  10486  23787  271.3768   70.9125           3.8
3369   3684   6962   42.6007    7.9357            5.4
2052   5513   6822   39.4725    7.2816            5.4
224    628    1555   4.8283     0.0944            51.1
553    1475   1849   7.4616     0.4713            15.8

Here M and N are the numbers of rows and columns of the matrix, NNZ is the number of nonzero elements; the next two columns give the elapsed times in ms, followed by the Cython/Fortran ratio.

It may easily be that passing a single-precision, Fortran-ordered matrix to Cython is not the best idea, but this is the input I have to live with.

Any suggestion for making the Cython implementation more performant is most welcome - I attach all the scripts at the end of the email. Thank you in advance.

Andrea.

# CYTHON - cython_test_nnz.pyx
import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

cdef float SMALL = 1e-10
cdef int MAXNNZ = 1000000

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
@cython.nonecheck(False)
cpdef find_nnz(np.float32_t[::1, :] matrix):

    cdef int n, m, i, j, nza
    cdef float val1

    col_starts  = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    row_indices = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    x = cvarray(shape=(MAXNNZ,), itemsize=sizeof(double), format='d')

    m = matrix.shape[0]
    n = matrix.shape[1]

    nza = 0
    for i in range(n):
        for j in range(m):
            val1 = matrix[j, i]

            if abs(val1) < SMALL:
                continue

            col_starts[nza] = i+1
            row_indices[nza] = j
            x[nza] = <double>val1
            nza += 1


# Cython setup.py file
from setuptools import setup
from setuptools import Extension

from Cython.Build import cythonize
import numpy as np

if __name__ == '__main__':

    extra_compile_args = ['-O3', '-ffast-math', '-funroll-loops', '-flto', '-ftree-vectorize']
    extra_link_args = ['-flto'] + extra_compile_args

    extensions = [Extension('cython_test_nnz', ['cython_test_nnz.pyx'],
                            extra_compile_args=extra_compile_args,
                            extra_link_args=extra_link_args,
                            include_dirs=[np.get_include()],
                            define_macros=[('WIN32', 1)])]

    # build the core extension(s)
    setup_kwargs = {'ext_modules': cythonize(extensions)}

    setup(**setup_kwargs)


# Fortran implementation
subroutine find_nnz(matrix, m, n)

    implicit none
    
    integer, intent(in):: m, n
    real(4), intent(in), dimension(m, n) :: matrix
    
    integer nza, i, j, maxnnz
    real(4) small, val1
    
    parameter(maxnnz=1000000)
    
    double precision, dimension(maxnnz):: x
    integer, dimension(maxnnz):: row_indices, col_starts
    
    small = 1e-10
    nza = 1
    
    do i = 1, n
        do j = 1, m
            val1 = matrix(j, i)

            if (abs(val1) < small) then
                cycle
            endif

            col_starts(nza) = i+1
            row_indices(nza) = j
            x(nza) = dble(val1)
            nza = nza + 1

        enddo
    enddo
    
!   Avoid compiler optimizing away stuff
    x(1) = x(1) + 1.0
    col_starts(1) = col_starts(1) + 1
    row_indices(1) = row_indices(1) + 1
    
end subroutine find_nnz

# Compile the fortran routine

f2py --fcompiler=gfortran --opt="-O3 -ffast-math -funroll-loops -flto -ftree-vectorize" -c fortran_test_nnz.f90 -m fortran_test_nnz


# Test Script
import os
import time
import numpy as np
from scipy.sparse import rand as srand

import cython_test_nnz
import fortran_test_nnz

SIZES = [(1661, 2608)  , (2828, 3512), (780, 985)  , (799, 1558), (399, 540),
         (10545, 10486), (3369, 3684), (2052, 5513), (224, 628) , (553, 1475)]
NNZ = [5833, 7191, 1657, 2772, 908, 23787, 6962, 6822, 1555, 1849]

np.random.seed(123456)

def tester():

    for i in range(len(NNZ)):
        m, n = SIZES[i]
        nnz = NNZ[i]
        density = nnz/float(m*n)
        matrix = srand(m, n, density=density, format='coo', dtype=np.float32)
        matrix = matrix.todense(order='F')

        print('%-6d %-6d %-6d' % (m, n, nnz), end=' ')

        for implementation in ['Cython', 'Fortran']:

            function = cython_test_nnz.find_nnz if implementation == 'Cython' else fortran_test_nnz.find_nnz
            time_setup = []

            for k in range(10):
                start = time.perf_counter()  # time.clock() was removed in Python 3.8
                function(matrix)
                elapsed = time.perf_counter() - start
                time_setup.append(elapsed * 1e3)

            print('%-10.4f' % np.mean(time_setup), end=' ')

        print('')

if __name__ == '__main__':
    tester()




da-woods

Apr 1, 2021, 4:13:43 PM
to cython...@googlegroups.com
1) Have you tried getting annotated output from Cython (cython -a filename.pyx)? It'll produce a colour-coded HTML file that gives you a clue about what's not optimized.

2) I suspect you need to type col_starts, row_indices, and x as typed memoryviews (to speed up indexing them)

3) Fortran and f2py are actually pretty quick. You should be able to get close, but I don't think you'd expect to beat them by huge margins.
--

---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cython-users/b4cca13b-6c0a-455f-8ac3-eb2e03cc06b9n%40googlegroups.com.


Andrea Gavana

Apr 1, 2021, 4:24:32 PM
to cython-users
Hi da-woods,

Thank you for your answer and your interest.

On Thursday, April 1, 2021 at 10:13:43 PM UTC+2 da-woods wrote:
1) Have you tried getting annotated output from Cython (cython -a filename.pyx)? It'll produce a colour-coded HTML file that gives you a clue about what's not optimized.

I have added the type annotations to the memoryviews - and most of the annotated output is white (I am attaching a screenshot). The only yellow remaining is around the typed memoryviews (maybe I have done something wrong).

It is indeed much better: the ratio went down to between 1.5 and 4.2.
 

2) I suspect you need to type col_starts, row_indices, and x as typed memoryviews (to speed up indexing them)

3) Fortran and f2py are actually pretty quick. You should be able to get close, but I don't think you'd expect to beat them by huge margins.

I was afraid that would be the answer... I guess nothing more can be done to the modified version below?

import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

cdef float SMALL = 1e-10
cdef int MAXNNZ = 1000000

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
@cython.nonecheck(False)
cpdef find_nnz(np.float32_t[::1, :] matrix):

    cdef int n, m, i, j, nza
    cdef float val1

    cdef int [:] col_starts  = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    cdef int [:] row_indices = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
    cdef double [:] x = cvarray(shape=(MAXNNZ,), itemsize=sizeof(double), format='d')

    m = matrix.shape[0]
    n = matrix.shape[1]

    nza = 0
    for i in range(n):
        for j in range(m):
            val1 = matrix[j, i]

            if abs(val1) < SMALL:
                continue

            col_starts[nza] = i+1
            row_indices[nza] = j
            x[nza] = <double>val1
            nza += 1


Thank you again.

Andrea.


[Attachment: cython_annotations.PNG]

da-woods

Apr 1, 2021, 4:55:39 PM
to cython...@googlegroups.com
There's probably a small gain to be had by specifying contiguous memoryviews (e.g. `int [::1]`).

Fortran-ordered arrays shouldn't be a problem. You should usually iterate over them contiguous dimension first, but it looks like you're doing that.

There may be a gain from trying not to overallocate too much memory (although I get that the size is unknown up front). E.g. start with something like 5000 elements, but be prepared to reallocate to double the size if/when it overflows. The cost will depend on how often you get it wrong.
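That doubling strategy can be sketched in plain Python with NumPy (the helper name append_grow is mine; in Cython the same idea works with realloc() or by copying into a fresh, larger cvarray):

```python
import numpy as np

def append_grow(buf, count, value):
    """Append value at index count, doubling the buffer when it is full.

    Returns the (possibly reallocated) buffer and the new count. Amortized
    cost stays O(1) per append because each element is copied at most
    O(log n) times across all doublings.
    """
    if count >= buf.shape[0]:
        new_buf = np.empty(2 * buf.shape[0], dtype=buf.dtype)
        new_buf[:count] = buf   # copy the existing elements over
        buf = new_buf
    buf[count] = value
    return buf, count + 1

# Start deliberately small; the fifth append triggers a resize to 8.
x = np.empty(4, dtype=np.float64)
n = 0
for v in [1.0, 2.0, 3.0, 4.0, 5.0]:
    x, n = append_grow(x, n, v)
```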

Other than that I have no real suggestions. Unfortunately Fortran+f2py is sometimes my emergency solution when I can't get something fast enough in anything more friendly.

Peter Schay

Apr 1, 2021, 7:50:11 PM
to cython-users
Hi,
You could just change the declarations of the arrays to:

cdef int col_starts[MAXNNZ]
cdef int row_indices[MAXNNZ]
cdef double x[MAXNNZ]

Using C arrays will require a change to the declaration of MAXNNZ:
cdef enum: MAXNNZ = 1000000

Is that faster than the Fortran?

Regards,
Pete


da-woods

Apr 2, 2021, 3:23:38 AM
to cython...@googlegroups.com
It's probably faster if it works, but there's a limit to how much you can allocate on the stack, and this has to be pushing that limit. You may just find it crashes.

Andrea Gavana

Apr 2, 2021, 3:36:55 AM
to cython...@googlegroups.com
Hi Peter, da-woods,

Thank you for your answers. I tried Peter’s suggestion and at first I got ridiculously low elapsed times, so I imagined the compiler was optimizing away everything. 

So I added a "return nza" at the end of the Cython function - I get more reasonable timings, and it's indeed faster than the Fortran version. Whether that single return line fooled the compiler, I have no idea.

That said, it works in the toy function I posted: however, the moment I integrate that change into a method inside a Cython class in a script that wraps a large C library, it bombs out - no error messages, nothing. Simply bombs.

So I guess da-woods might be correct, and I can see only two alternatives:

- Reduce MAXNNZ to a lower value
- Increase the stack size (no idea how to do that)
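On Unix-like systems the stack size can often be raised: in a shell with ulimit -s (before launching Python), or from Python itself via the standard resource module. A hedged sketch of the latter; raising the soft limit may fail depending on the hard limit and the OS, hence the try/except, and it only helps frames entered after the change (it is a Unix-only module, not available on Windows):

```python
import resource

# Current soft/hard stack limits in bytes (hard may be RLIM_INFINITY, i.e. -1).
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

try:
    # Raise the soft limit up to whatever the hard limit allows.
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
except (ValueError, OSError):
    # Not permitted on this platform/configuration; fall back to heap
    # allocation (malloc/PyMem_Malloc) or a smaller MAXNNZ instead.
    pass
```

For worker threads, threading.stack_size() can set the stack size of threads created afterwards; the main thread's stack is fixed at process start.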

Thank you again.

Andrea.


Peter Schay

Apr 2, 2021, 2:56:14 PM
to cython-users
Hi Andrea, hi David,
OK; thanks for the feedback on my slightly-too-simple idea.
I tried a few more experiments and I think I learned a bit more about this test.  
(I am on vacation and have wanted to learn more about typed memoryviews for a while now, so thanks for posting this excellent test.)
It seems this is one of those games where whoever goes first loses :-)
The test.py loop appears to provide a cache advantage to the second implementation, which is most noticeable for large m and n.

Here are some test runs, please let me know if I've understood this correctly and if these are reproducible in your env...

pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Fortran: 80.9882   
10545  10486  23787  Cython2: 66.8081
10545  10486  23787  Cython : 66.7317 
pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Cython : 78.1318   
10545  10486  23787  Fortran: 70.2815   
10545  10486  23787  Cython2: 66.7154   
pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Cython2: 78.2976   
10545  10486  23787  Fortran: 70.2670   
10545  10486  23787  Cython : 66.8833   

The 'Cython' implementation uses the typed memoryview for the 3 arrays, e.g.:  cdef int [:] col_starts = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
'Cython2' uses the contiguous version with [::1]

Cython3 takes another pure-C approach, this time without breaking the stack (one could just as easily malloc or PyMem_Malloc this memory, and perhaps that might make a difference for the smaller m, n tests, which I did not experiment with much):
cdef int [::1] col_starts_arr = cvarray(shape=(MAXNNZ,), itemsize=sizeof(int), format='i')
cdef int *col_starts = &col_starts_arr[0]

pete-mbp2:testnnz pete$ python test.py | grep 10486
10545  10486  23787  Cython3: 78.2551   
10545  10486  23787  Fortran: 70.7756   
10545  10486  23787  Cython : 66.9182   

Taking just the first row from the above test runs, it seems Cython memoryviews are as advertised - just as fast as C!
10545  10486  23787  Fortran: 80.9882   
10545  10486  23787  Cython : 78.1318   
10545  10486  23787  Cython2: 78.2976   
10545  10486  23787  Cython3: 78.2551   

Please let me know if I've overlooked anything and how it looks.


