prange and OpenMP performance


Craig Warren

Feb 22, 2016, 1:38:59 PM
to cython-users
I am working on a Finite-Difference Time-Domain code for electromagnetics (https://github.com/gprMax/gprMax). I'm using Cython for the FDTD engine.

I have noticed poor performance of the code when I compile/link with OpenMP. The scaling of threads versus speedup is reasonable, but it starts from a much slower baseline than the serial code: it takes 4 threads just to recover the performance of the serial version. I have created a stripped-back example to test with, which is essentially a main (Python) time loop that calls update functions that have been Cythonized. I use NumPy for the arrays and typed memoryviews to access them in my Cythonized update functions. Here is an example showing one (of six) FDTD update functions:

Main time loop in gprMax.py (Python)

import numpy as np

from gprMax.fields_update import update_ex


def main():
    nthreads = 4
    nx = 100
    ny = 100
    nz = 100
    iterations = 1000

    ID = np.ones((6, nx + 1, ny + 1, nz + 1), dtype=np.uint32)
    Ex = np.zeros((nx, ny + 1, nz + 1), dtype=np.float64)
    Hy = np.zeros((nx, ny + 1, nz), dtype=np.float64)
    Hz = np.zeros((nx, ny, nz + 1), dtype=np.float64)
    updatecoeffsE = np.zeros((10, 5), dtype=np.float64)

    for timestep in range(iterations):
        update_ex(nx, ny, nz, nthreads, updatecoeffsE, ID, Ex, Hy, Hz)


fields_update.pyx module (Cythonized)

import numpy as np
cimport numpy as np
from cython.parallel import prange


cpdef update_ex(int nx, int ny, int nz, int nthreads, np.float64_t[:, :] updatecoeffsE, np.uint32_t[:, :, :, :] ID, np.float64_t[:, :, :] Ex, np.float64_t[:, :, :] Hy, np.float64_t[:, :, :] Hz):
    """This function updates the Ex field components.

    Args:
        nx, ny, nz (int): Grid size in cells
        nthreads (int): Number of threads to use
        updatecoeffsE, ID, Ex, Hy, Hz (memoryviews): Access to update coefficients, ID, and field component arrays
    """
    cdef int i, j, k, listIndex

    for i in prange(0, nx, nogil=True, schedule='static', chunksize=1, num_threads=nthreads):
        for j in range(1, ny):
            for k in range(1, nz):
                listIndex = ID[0, i, j, k]
                Ex[i, j, k] = updatecoeffsE[listIndex, 0] * Ex[i, j, k] + updatecoeffsE[listIndex, 2] * (Hz[i, j, k] - Hz[i, j - 1, k]) - updatecoeffsE[listIndex, 3] * (Hy[i, j, k] - Hy[i, j, k - 1])


I can't see what I'm doing that might be causing such a slowdown. Does the main time loop need to be Cythonized because the setup and teardown of the OpenMP environment on every call to the update functions is so expensive? I'd rather not do this, as in the actual code there is a lot of other access to Python objects in the main time loop.
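One rough way to check whether per-call overhead (such as OpenMP thread-pool setup) dominates is to time the same total amount of work split into many small calls versus a few large ones. This is a sketch in pure Python with a NumPy elementwise update standing in for the compiled kernel, since the real update_ex module isn't built here:

```python
import time
import numpy as np

def timed_calls(n_calls, work_per_call):
    """Time n_calls invocations of a stand-in update function.

    A real test would call the compiled update_ex; here a NumPy
    elementwise update stands in for the kernel body.
    """
    field = np.zeros((100, 100), dtype=np.float64)
    start = time.perf_counter()
    for _ in range(n_calls):
        for _ in range(work_per_call):
            field += 1.0  # stand-in for one field update
    return time.perf_counter() - start

# Same total work, split differently: if the many-small-calls case is
# much slower, fixed per-call overhead is significant and moving the
# loop into the compiled extension would help.
t_many_small = timed_calls(n_calls=1000, work_per_call=1)
t_few_large = timed_calls(n_calls=10, work_per_call=100)
```

If the two timings are close, the slowdown is unlikely to come from call overhead alone.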

Here is a quick graph of performance (0 threads represents the serial version, i.e. compiled/linked without -fopenmp):


Craig Warren

Feb 23, 2016, 8:31:50 AM
to cython-users
So it seems the cause of the slowdown of the OpenMP code relative to the serial code was the '-fno-strict-aliasing' compiler flag, which was a default flag in my Miniconda environment. When I added '-fstrict-aliasing' in my setup.py and recompiled, all the threaded code was faster, and the threaded code run with a single thread was only a touch slower than the serial version (as expected).
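For reference, a minimal setup.py sketch showing where such flags go (module name, source file, and the GCC-style '-fopenmp'/'-O3' flags are illustrative, not taken from the gprMax build):

```python
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

# Force strict aliasing and enable OpenMP when building the extension.
ext = Extension(
    'fields_update',
    sources=['fields_update.pyx'],
    extra_compile_args=['-fstrict-aliasing', '-fopenmp', '-O3'],
    extra_link_args=['-fopenmp'],
)

setup(ext_modules=cythonize([ext]))
```

Flags appended via extra_compile_args come after the distribution's defaults, so '-fstrict-aliasing' here overrides an earlier '-fno-strict-aliasing'.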

Any thoughts on what is happening with the Cython generated C code that might cause this?

Craig