Puzzling performance differences between operating systems

19 views
Skip to first unread message

Pedro Camargo

unread,
Feb 13, 2023, 11:19:16 AM2/13/23
to Cython Users
Hello,
          I come to you with a question/problem that might not be narrow enough, so I do apologize in advance.


Context:

1. We have recently funded some performance improvements on AequilibraE (an open-source package for transportation modelling) focused on a new data structure (4-ary) for the path-finding algorithm (Dijkstra), and part of the work involved some pretty comprehensive benchmarking for different model sizes, platforms and hardware configurations.

2. The problem of traffic assignment involves computing one-to-many shortest paths many thousands of times (which AequilibraE does in a parallelized way by releasing the GIL) and doing very little processing involving the shortest path tree after its computation

3. We wanted to find out what was the performance penalty of doing a little bit more post-processing of these shortest-path trees when we implemented a new feature called select-link analysis 


Puzzling result:

When benchmarking the code on Windows (in different machines), we saw a performance penalty of 11 to 14%, which was in line with our expectations, but in Linux we got a penalty of 100% to 110%, which sounds absolutely ludicrous.

Running this code on the same PC under Windows and WSL2, we get this same discrepancy, with the base case being almost a perfect match (no meaningful difference between Linux and Windows). PC is a Threadripper 3970X, using all 64 threads.

This same excessive performance penalty was found on machines running Linux as their only operating system.


Question:

Is there some extra magic needed either in compilation or on the code itself (both pasted below)?

The code entire code can be found on its repo.


Thanks,
Pedro




The "extra" piece of code that is being run is this:


@cython.wraparound(False)
@cython.embedsignature(True)
@cython.boundscheck(False)
cdef void sl_network_loading(
long long [:, :] selected_links,
double [:, :] demand,
long long [:] pred,
long long [:] conn,
double [:, :] link_loads,
double [:, :, :] sl_od_matrix,
double [:, :, :] sl_link_loading,
unsigned char [:] has_flow_mask,
long classes) nogil:
# VARIABLES:
# selected_links: 2d memoryview. Each row corresponds to a set of selected links specified by the user
# demand: The input demand matrix for a given origin. The first index corresponds to destination,
# second is the class
# pred: The list of predecessor nodes, i.e. given a node, referencing that node's index in pred
# yields the previous node in the minimum spanning tree
# conn: The list of links which connect predecessor nodes. referencing it by the predecessor yields
# the link it used to connect the two nodes
# link_loads: Stores the loading on each link. Equivalent to link_loads in network_loading
# temp_sl_od_matrix: Stores the OD matrix for each set of selected links sliced for the given origin
# The indices are: set of links, destination, class
# temp_sl_link_loading: Stores the loading on the Selected links, and the paths which use the selected links
# The indices are: set of links, link_id, class)
# has_flow_mask: An array which acts as a flag for which links were used in retracing a given OD path
# classes: the number of subclasses of vehicles for the given TrafficClass
#
# Executes regular loading, while keeping track of SL links
cdef:
int i, j, k, l, dests = demand.shape[0], xshape = has_flow_mask.shape[0]
long long predecessor, connection, lid, link
bint found
for j in range(dests):
memset(&has_flow_mask[0], 0, xshape * sizeof(unsigned char))
connection = conn[j]
predecessor = pred[j]

# Walk the path and mark all used links in the has_flow_mask
while predecessor >= 0:
for
k in range(classes):
link_loads[connection, k] += demand[j, k]
has_flow_mask[connection] = 1
connection = conn[predecessor]
predecessor = pred[predecessor]
# Now iterate through each SL set and see if any of their links were marked
for i in range(selected_links.shape[0]):
# We check to see if the given OD path marked any of our selected links
found = 0
l = 0
while l < selected_links.shape[1] and found == 0:
# Checks to see if the current set of selected links has finished
# NOTE: -1 is a default value for the selected_links array. It cannot be a link id, if -1 turns up
# There is either a serious bug, or the program has reached the end of a set of links in SL.
# This lets us early exit from the loop without needing to iterate through the rest of the array
# Which just has placeholder values
if selected_links[i][l] == -1:
break
if
has_flow_mask[selected_links[i][l]] != 0:
found = 1
l += 1
if found == 0:
continue
for
k in range(classes):
sl_od_matrix[i, j, k] = demand[j, k]
connection = conn[j]
predecessor = pred[j]
while predecessor >= 0:
for
k in range(classes):
sl_link_loading[i, connection, k] += demand[j, k]
connection = conn[predecessor]
predecessor = pred[predecessor]




Our compilation code is the following:

import platform

import numpy as np
import pyarrow as pa
from Cython.Build import cythonize

try:
from setuptools import setup
from setuptools import Extension
except ImportError:
from distutils.core import setup
from distutils.extension import Extension

if "WINDOWS" in platform.platform().upper():
ext_modules = [
Extension(
"AoN",
["AoN.pyx"],
extra_compile_args=["/openmp"],
extra_link_args=["/openmp"],
define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
include_dirs=[np.get_include(), pa.get_include()],
)
]
else:
ext_modules = [
Extension(
"AoN",
["AoN.pyx"],
extra_compile_args=["-fopenmp", "-std=c++17"], # do we want -Ofast?
extra_link_args=["-fopenmp"],
define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
include_dirs=[np.get_include(), pa.get_include()],
)
]

setup(name="AoN", ext_modules=cythonize(ext_modules))


D Woods

unread,
Feb 17, 2023, 3:25:24 PM2/17/23
to cython-users
Can't see anything obviously wrong I'm afraid - your compile/link args look fine for both Windows and linux. I'd start by doing a thorough check of Cython/Python/compiler versions. But I suspect you'll already done that.

Sorry that isn't likely to be much help
Reply all
Reply to author
Forward
0 new messages