Python 3.8 significantly faster than Python 3.9 and 3.10 when Cythonizing

Mayur Ranchod

Jun 20, 2023, 12:17:18 PM
to cython-users
Hi,

My environment is as follows:

Windows 10
Python 3.8.16
Cython 0.29.35
Numpy 1.24.3

I have cythonized my code, but I noticed that when I cythonize using Python 3.8, the code executes significantly faster than when I cythonize using Python 3.9 or Python 3.10. I cythonized the exact same code and even created new conda environments with these different Python versions to ensure that no other packages are causing the issue, but the issue persists. For context, these are the timings that I am achieving (averaged over 10 executions):

Python 3.8: 13s
Python 3.9: 51s
Python 3.10: 60s

The stack trace when cythonizing is identical.
I also noticed that the cython code produced when using python 3.8 is 95KB whereas the file size is 92KB when using python 3.9 or python 3.10. My setup.py file is as follows:

from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize(
        "optimized_ADQ1.pyx",
        compiler_directives={"language_level": "3"},
    ),
    include_dirs=[numpy.get_include()],
)

My code is quite lengthy. If it would help in any way, I am happy to add it here as well if requested.

Any assistance would be highly appreciated.

Kind Regards,
Mayur

Peter Schay

Jun 20, 2023, 2:00:16 PM
to cython...@googlegroups.com
Hi,
Is it using more cpu time as well, or just more elapsed time?  If it's the latter, that might indicate something different in Python features the code is using.  If it's cpu time, profiling would be very helpful to understand the behavior and find out if Cython is causing the differences.
Does the code use threads or multiprocessing?

Regards,
Pete
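
Peter's distinction between CPU time and elapsed time can be checked with the standard library alone. The following is a minimal sketch, not code from this thread: `measure` and `busy_work` are hypothetical names, and `busy_work` only stands in for the real call (for example `optimized_ADQ1.detectDQ_JPEG(coeffArray)`). `time.perf_counter()` reports wall-clock time, while `time.process_time()` reports the CPU time used by the current process.

```python
import time


def measure(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) time
    cpu_start = time.process_time()    # CPU time of this process only
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def busy_work(n):
    """Hypothetical CPU-bound stand-in; replace with the real function."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result, wall, cpu = measure(busy_work, 200_000)
print(f"wall: {wall:.3f}s  cpu: {cpu:.3f}s")
```

If the two numbers are close, the work is CPU-bound. If the elapsed time is much larger than the CPU time, the process is mostly waiting on something else (I/O, locks, another process).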

da-woods

Jun 20, 2023, 2:06:14 PM
to cython...@googlegroups.com
It might be worth trying the Cython 3 alpha versions - there are a few speed-ups that are disabled on recent versions of Python in the 0.29.35 branch because they were difficult to backport. I'd expect a much smaller slow-down, though.

> I also noticed that the cython code produced when using python 3.8 is 95KB whereas the file size is
> 92KB when using python 3.9 or python 3.10.

The .c file or the .so file? Presumably the .so file?

Peter Schay's suggestion of looking at CPU time vs elapsed time is a good one. Beyond that you need to use a profiler to get more information about which bits are slow.

David
--

---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cython-users/a43a9273-4a00-4ab8-8d17-946296afa4f0n%40googlegroups.com.


Mayur Ranchod

Jun 20, 2023, 3:00:13 PM
to cython-users
Hi all,

Thank you for your suggestions.

Both the CPU time and the elapsed time increase when using Python 3.9 or Python 3.10 compared to Python 3.8. I will profile the code tomorrow and will provide a further update based on my findings. I usually use PProfile to perform line-by-line profiling. Would this be an adequate tool, or do you recommend another way?

The code does not use multithreading or multiprocessing.

I will try with Cython 3 alpha versions tomorrow (most probably alpha 9) and will provide a further update then.

I am referring to the .so file, not the .c file, apologies for the ambiguity.

Stefan Behnel

Jun 20, 2023, 3:11:27 PM
to cython...@googlegroups.com
Mayur Ranchod wrote on 20.06.23 at 20:18:
> I will try with Cython 3 alpha versions tomorrow (most probably alpha 9)

I'm not sure why you'd want to try that version (a9). Can't you use the
latest beta or the master branch from github?

Stefan

da-woods

Jun 20, 2023, 3:18:30 PM
to cython...@googlegroups.com
I think this was my bad advice - I was the person to mention "alpha"
here. Stefan's suggestion is what I meant (but didn't write).


For profiling, I've had some success finding performance issues using
"perf". It's possibly more useful for C level details but this issue may
well come down to that. But start with the tools you know.

Mayur Ranchod

Jun 20, 2023, 4:02:59 PM
to cython...@googlegroups.com
Understood.

Thanks, I will try with the latest versions (either beta version or the master branch) and will revert with an update tomorrow.
Thanks, I will see whether I am able to make any progress using PProfile and if not, I will look into "perf".


Mayur Ranchod

Jun 21, 2023, 7:24:18 AM
to cython-users
Hi all,

I have tried the recommended solutions, i.e., using Cython 3.0.0b and profiling; however, the issue persists.
Specifically, I noticed that now, the .so file produced (with Python 3.0) is 105KB which is larger than before.
The timing is pretty much the same as before.

To profile my code, I followed the tutorial here, https://cython.readthedocs.io/en/latest/src/tutorial/profiling_tutorial.html, however, I am unable to deduce any useful information. This is what is produced:
Wed Jun 21 09:58:36 2023    Profile.prof

         1526258 function calls in 53.321 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   51.120   51.120   53.321   53.321 <string>:1(<module>)
   190530    0.942    0.000    0.942    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   190530    0.418    0.000    1.736    0.000 fromnumeric.py:71(_wrapreduction)
   190512    0.290    0.000    2.059    0.000 fromnumeric.py:2177(sum)
   190551    0.246    0.000    0.246    0.000 {built-in method builtins.getattr}
   190530    0.102    0.000    0.102    0.000 fromnumeric.py:72(<dictcomp>)
   190512    0.052    0.000    0.052    0.000 fromnumeric.py:2172(_sum_dispatcher)
       45    0.051    0.001    0.051    0.001 {method 'sort' of 'numpy.ndarray' objects}
   190527    0.043    0.000    0.043    0.000 {built-in method builtins.isinstance}
   190530    0.029    0.000    0.029    0.000 {method 'items' of 'dict' objects}
       15    0.018    0.001    0.018    0.001 {method 'reshape' of 'numpy.ndarray' objects}
       45    0.003    0.000    0.003    0.000 {method 'copy' of 'numpy.ndarray' objects}
       14    0.003    0.000    0.003    0.000 numeric.py:136(ones)
       90    0.001    0.000    0.001    0.000 {method 'searchsorted' of 'numpy.ndarray' objects}
       15    0.001    0.000    0.057    0.004 histograms.py:678(histogram)
       75    0.000    0.000    0.000    0.000 {built-in method numpy.asarray}
      412    0.000    0.000    0.000    0.000 fromnumeric.py:1980(shape)
       45    0.000    0.000    0.001    0.000 histograms.py:454(_search_sorted_inclusive)
       15    0.000    0.000    0.000    0.000 {built-in method numpy.fft._pocketfft_internal.execute}
       15    0.000    0.000    0.001    0.000 histograms.py:360(_get_bin_edges)
       15    0.000    0.000    0.000    0.000 function_base.py:1324(diff)
       14    0.000    0.000    0.000    0.000 {built-in method numpy.empty}
       45    0.000    0.000    0.054    0.001 fromnumeric.py:865(sort)
       30    0.000    0.000    0.000    0.000 fromnumeric.py:3176(ndim)
       39    0.000    0.000    0.018    0.000 fromnumeric.py:53(_wrapfunc)
      412    0.000    0.000    0.000    0.000 fromnumeric.py:1976(_shape_dispatcher)
       15    0.000    0.000    0.000    0.000 histograms.py:283(_ravel_and_check_weights)
        1    0.000    0.000   53.321   53.321 {built-in method builtins.exec}
       15    0.000    0.000    0.000    0.000 fromnumeric.py:2322(any)
       15    0.000    0.000    0.000    0.000 _pocketfft.py:122(fft)
       24    0.000    0.000    0.000    0.000 fromnumeric.py:1140(argmax)
       15    0.000    0.000    0.018    0.001 fromnumeric.py:200(reshape)
       24    0.000    0.000    0.000    0.000 {method 'argmax' of 'numpy.ndarray' objects}
       75    0.000    0.000    0.000    0.000 fromnumeric.py:3218(size)
       15    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
       15    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       15    0.000    0.000    0.000    0.000 function_base.py:1320(_diff_dispatcher)
       14    0.000    0.000    0.000    0.000 multiarray.py:1080(copyto)
       15    0.000    0.000    0.000    0.000 fromnumeric.py:2317(_any_dispatcher)
       24    0.000    0.000    0.000    0.000 fromnumeric.py:1136(_argmax_dispatcher)
       15    0.000    0.000    0.000    0.000 _pocketfft.py:118(_fft_dispatcher)
       15    0.000    0.000    0.000    0.000 fromnumeric.py:195(_reshape_dispatcher)
        3    0.000    0.000    0.000    0.000 fromnumeric.py:2974(_prod_dispatcher)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Stefan Behnel

Jun 21, 2023, 10:02:16 AM
to cython...@googlegroups.com
Mayur Ranchod wrote on 21.06.23 at 10:01:
> To profile my code, I followed the tutorial here,
> https://cython.readthedocs.io/en/latest/src/tutorial/profiling_tutorial.html,
> however, I am unable to deduce any useful information. This is what is
> produced:
> Wed Jun 21 09:58:36 2023 Profile.prof
>
> 1526258 function calls in 53.321 seconds
>
> Ordered by: internal time
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 1 51.120 51.120 53.321 53.321 <string>:1(<module>)
> 190530 0.942 0.000 0.942 0.000 {method 'reduce' of
> 'numpy.ufunc' objects}

Yes, this looks unhelpful. Please make sure that you enabled profiling
support for your Cython module. The profile above doesn't seem to list any
Cython compiled modules.

Could you show us the commands that you used for building and
executing/profiling your code?

Stefan
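
One way to check Stefan's observation is to list which source files appear in a saved profile. This is a sketch under assumptions: `profiled_files` and `demo` are hypothetical names, and the demo profiles a plain Python function only so that the block runs on its own; for the real run one would pass "Profile.prof" instead. It relies on the `stats` dictionary of `pstats.Stats`, whose keys are (filename, line number, function name) tuples.

```python
import cProfile
import os
import pstats
import tempfile


def profiled_files(prof_path):
    """Return the set of base file names recorded in a cProfile dump."""
    stats = pstats.Stats(prof_path)
    # Keys of stats.stats are (filename, lineno, funcname); built-ins use "~".
    return {os.path.basename(filename) for filename, _lineno, _func in stats.stats}


def demo():
    """Hypothetical workload used only to produce a small profile."""
    return sum(i * i for i in range(1000))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.prof")
    # Same call style as in the thread: statement, globals, locals, output file.
    cProfile.runctx("demo()", {"demo": demo}, {}, path)
    files = profiled_files(path)

print(sorted(files))
```

If profiling is enabled in the compiled module, one would expect an entry whose file name refers to the .pyx file. In the output posted above there is none, only `<string>` and NumPy's own files.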

Mayur Ranchod

Jun 21, 2023, 11:17:09 AM
to cython-users
Sure, I made the following changes:

1) Added # cython: linetrace=True to the first line of my .pyx file
2) Added @cython.profile(True) to the top of my function in my .pyx file
3) Created a python script called profiling.py containing the following:
    import pstats, cProfile
    import optimized_ADQ1
    import jpegio as jio

    impath = "RGB.jpg"

    im = jio.read(impath)
    coeffArray = im.coef_arrays[0]
    OutputMap = optimized_ADQ1.detectDQ_JPEG(coeffArray)

    cProfile.runctx("optimized_ADQ1.detectDQ_JPEG(coeffArray)", {"coeffArray": coeffArray}, locals(), "Profile.prof")

    s = pstats.Stats("Profile.prof")
    s.strip_dirs().sort_stats("time").print_stats()
4) I then executed the profiling.py script using python profiling.py.

I hope this addresses your query.

Stefan Behnel

Jun 22, 2023, 7:29:41 AM
to cython...@googlegroups.com
Mayur Ranchod wrote on 21.06.23 at 17:11:
> I made the following changes:
>
> 1) Added # cython: linetrace=True to the first line of my .pyx file
> 2) Added @cython.profile(True) to the top of my function in my .pyx file

Hmm, I never tried enabling profiling only selectively for a single
function and not globally. Line tracing isn't the same as profiling, and
you want profiling here. Just to be sure, could you add the option
"profile=True" at the top of your file and try again?

Alternatively, you could use the "line_profiler" package. Or use a C level
profiler (as suggested before). The latter will most likely be much faster.



> 3) Created a python script called profiling.py containing the following:
> import pstats, cProfile
> import optimized_ADQ1
> import jpegio as jio
>
> impath = "RGB.jpg"
>
> im = jio.read(impath)
> coeffArray = im.coef_arrays[0]
> OutputMap = optimized_ADQ1.detectDQ_JPEG(coeffArray)
>
>
> cProfile.runctx("optimized_ADQ1.detectDQ_JPEG(coeffArray)", {"coeffArray":
> coeffArray}, locals(), "Profile.prof")

I assume that detectDQ_JPEG() is your Cython function that you want to
profile? I think this should generally work, although I usually just write
"python -m cProfile …" instead. It rarely hurts to have a couple of
unrelated setup functions in the profile, as long as they don't take ages
to run.

Stefan

da-woods

Jun 24, 2023, 8:59:48 AM
to cython...@googlegroups.com
> Mayur Ranchod wrote on 21.06.23 at 17:11:

[...]

Hi Mayur,

If you have a version of your code that demonstrates the problem, that
you're prepared to share publicly, and isn't too hard to get running,
I'd be happy to have a look. I'd just like to know if there's a
significant regression in Cython that we should be worrying about.

It might be a week or two before I have time though, so you'll have to
be a bit patient.

David

Mayur Ranchod

Jun 28, 2023, 10:17:19 AM
to cython-users
Hi Stefan and David,

Thank you for your responses.

Unfortunately, I have not had an opportunity to work on my problem in the past few days. As soon as I am able to, I will post an update.

Apologies for the delay.

Mayur Ranchod

Jun 28, 2023, 11:03:09 AM
to cython-users
Hi Stefan and David,

Just an update from my side: I have added profile=True at the top of the script. This produced a new profile result, but it still does not give any insight into the performance of the Cython code.

To understand my current situation, I have shared a link to the code and trust that this would be helpful in diagnosing what could be the issue.

Environment setup
1.) Install numpy and cython
2.) Install jpegio in your conda environment here
3.) Build the cython code using  python setup.py build_ext --inplace
4.) Run python main.py (It may also be of interest to execute this script when using different python versions to witness the vast discrepancy in performance).
Unfortunately, I am not able to share the image that I am using, but the issue manifests itself for any image of the same size or larger. I am using an image with size 4032x3024.

Please let me know if there is any other way that I can assist.

Kind Regards,
Mayur

Mayur Ranchod

Jun 28, 2023, 11:04:23 AM
to cython-users
I forgot, the pyIFD package also needs to be installed:

pip install git+https://github.com/eldritchjs/pyIFD




da-woods

Jul 1, 2023, 10:28:14 AM
to cython...@googlegroups.com
So, having tried it (with Python 3.8, Python 3.10, and using Cython 3.0.0b3 to compile your code, but possibly an earlier version to compile jpegio), I get basically the same speed.

With 3.8:
Time Taken: 8.309803009033203s
With 3.10:
Time Taken: 8.414199829101562s

This is with a 6060 x 4360 image, which is just an arbitrary photo I had that I tiled a few times.

I had to make a minor change to your code to get it to run (cnp.int32_t rather than int_t):
ctypedef cnp.int32_t INT_DTYPE_t

Since there's nothing interesting to see I haven't done any detailed profiling. My suspicion is that the difference you see is unrelated to Cython and is some other environmental difference - maybe what BLAS library Numpy is linked to or something similar.
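
To compare environments along the lines suggested here, it can help to print the same report under each interpreter and diff the results. A minimal sketch: `describe_environment` is a hypothetical helper, and NumPy is treated as optional so that the block runs without it.

```python
import platform
import sys


def describe_environment():
    """Collect basic interpreter and platform details for comparison."""
    info = {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    try:
        import numpy
    except ImportError:
        # NumPy is optional here; record its absence instead of failing.
        info["numpy"] = None
    else:
        info["numpy"] = numpy.__version__
    return info


env_info = describe_environment()
for key, value in env_info.items():
    print(f"{key}: {value}")
```

Calling `numpy.show_config()` in each environment additionally prints NumPy's build information, including the BLAS/LAPACK libraries it was linked against.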

If I were you, I'd strip the Cython out of your code (most of it looks like pure Python, with a few buffer types, so it should be easy enough to revert to being Python) and profile it normally as pure Python code. I suspect you'll find the culprit lies in some library that you're using rather than your own code.

David
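
The suggestion of profiling a pure Python version can be sketched with `cProfile` and `pstats`. The functions below (`slow_part`, `fast_part`, `pipeline`, `profile_call`) are hypothetical stand-ins, not the code from this thread. The point is only that every Python-level function then shows up in the report, sorted by cumulative time.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """Hypothetical hot spot: a Python-level loop."""
    return sum(i * i for i in range(n))


def fast_part(n):
    """Hypothetical cheap step: a closed-form sum."""
    return n * (n - 1) // 2


def pipeline(n):
    """Hypothetical entry point standing in for the real function."""
    return slow_part(n) + fast_part(n)


def profile_call(func, *args, top=10, **kwargs):
    """Profile one call and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(pipeline, 20_000)
print(report)
```

Running the same script under each Python version and comparing the top entries should show whether the extra time is spent in one's own functions or inside a library call.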


Mayur Ranchod

Jul 3, 2023, 7:40:46 AM
to cython-users
Hi David,

Thank you for your response. I will take your advice and revert with a follow-up as soon as I have implemented it.
Initially, I also suspected that the issue might be due to an inconsistency in the environment. However, when I tested this, I created new environments with the bare minimum requirements to get the minimal working code to run, and with the same package versions (where possible), the only difference being the Python version, yet the issue persisted.

I will continue to investigate and if possible, will also investigate whether I get the same issue when executing the code on another machine.

Thanks!