Parallelism for multiple nested loops in Cython

53 views

Skip to first unread message

Toni Barth

unread,

May 9, 2018, 1:01:26 PM5/9/18

to cython...@googlegroups.com

Hi,

i'm currently facing a situation where I need to calculate a matrix of
data which can become rather large.

I need to fill a matrix which should be 1280 x 1280 in size with vector
distances. Each vector is also very large, about 960 for now. Distances
get calculated using the euclidian distance, which itself requires a
loop, summing up to 3 nested loops, which cycle for a total of 1280 x
1280 x 960 times, making a total of 1572864000 cycles.

Calculating this matrix takes some time (about 5 seconds for now).

I want to try to speed this up using parallelism with Cython and OpenMP,
but I don't seem to get started correctly, at least this is my first
time using real parallelism here.

I can provide my single-threaded code below. I already tried to
multithread it, but my only working attempt, using with nogil,
parallel(): and three nested prange-loops ended up even slower than the
single-threaded attempt, so I guess there is definitely something going
wrong there.

Maybe someone skilled with parallelism in Cython can give me some hint
into the right direction?

Here's the single-threaded stuff:

graph.pyx:

cdef class Graph(Node):

cdef double[:, :] distances

cdef void build_distances_matrix(Graph self, int[:, :] dataset):
    cdef double * m = <double*>malloc(dataset.shape[0] *
dataset.shape[0] * sizeof(double))
    cdef unsigned int i, j
    if m == NULL:
      raise TreeError("memory allocation error")
    self.distances = <double[:dataset.shape[0], :dataset.shape[0]]>m
    for i in xrange(dataset.shape[0]):
      for j in xrange(dataset.shape[0]):
        if i == j:
          self.distances[i,j] = .0
        self.distances[i,j] = Graph.get_euklid_distance(dataset[i],
dataset[j])

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@staticmethod
cdef inline double get_euklid_distance(int[:] l, int[:] r):
    cdef double d = 0
    cdef int i
    for i in xrange(l.size):
      d += pow(r[i]-l[i], 2)
    return sqrt(d)

Any help would be appriciated.

Best Regards.

Toni Barth

pEpkey.asc

Stefan Behnel

unread,

May 9, 2018, 3:26:37 PM5/9/18

to cython...@googlegroups.com

I would assign "dataset.shape[0]" to a local (size_t) variable to make it
visible to the C compiler that it will not change during the execution.

>     cdef unsigned int i, j
>     if m == NULL:
>       raise TreeError("memory allocation error")
>     self.distances = <double[:dataset.shape[0], :dataset.shape[0]]>m
>     for i in xrange(dataset.shape[0]):

What happens if you replace this xrange() with Cython's prange()?

>       for j in xrange(dataset.shape[0]):
>         if i == j:
>           self.distances[i,j] = .0

This looks like a good place for an "else:".

> self.distances[i,j] = Graph.get_euklid_distance(dataset[i],
> dataset[j])

Slicing the memory views might be more effort than you'd want to pay here.
If you inline the method manually, and use direct indexing instead of
slicing+indexing, it should become visibly faster.

Basically, when nesting loops over non-trivial data, avoid any overhead
whatsoever in the inner(most) loops.

> @cython.boundscheck(False)
> @cython.wraparound(False)
> @cython.nonecheck(False)
> @staticmethod
> cdef inline double get_euklid_distance(int[:] l, int[:] r):
>     cdef double d = 0
>     cdef int i
>     for i in xrange(l.size):
>       d += pow(r[i]-l[i], 2)
>     return sqrt(d)