[Python-Dev] Python 3.10 vs 3.8 performance degradation

aivazia...@gmail.com

Dec 19, 2021, 12:20:38 PM
to pytho...@python.org
Hello,

Being a programmer myself, I realise that a report on performance degradation should ideally contain a small test program that clearly reproduces the problem. Unfortunately, I do not have the time at present to isolate the issue to a small test case. The good news (or bad news, I suppose) is that the problem appears to be reasonably general: it happens with two completely different programs.

Anyway, what I am claiming is that Python 3.10 is between 1.5 and 2.5 times SLOWER than Python 3.8 for rather generic scientific calculations such as Fourier analysis, ODE solving and plotting. The "test case" is a rather complex program that calculates the Wigner function of a quantum system; it takes 9 seconds when run with 3.8 and 23 seconds when run with 3.10. It is very easy to reproduce: clone this repository: https://github.com/tigran123/quantum-infodynamics, run "time bin/harmonic-oscillator-solve.sh" from the dynamics subdirectory, then edit initgauss.py and solve.py to point to python3.10 and run it again. Make sure your TMPDIR points somewhere fast. My machine is a very fast 6-core i7-6800K at 4.2GHz with 128GB RAM. The storage is also very fast NVMe, about 3GB/s.

After this, try a completely different program, which simulates a mathematical pendulum using a PyQt GUI: it gives 14-15 FPS when run with 3.8 and only 11-12 when run with 3.10. Again, it is easy to reproduce if you have cloned the above repository: go to the classical-mechanics/pendulum subdirectory and run psim.py (click the Play button in the control window and observe the FPS in the plot window). Then edit psim.py to point to Python 3.10 and run it again. You will need PyQt5, matplotlib, numpy, scipy and pyFFTW for these programs to work.

I realise that you would much prefer a small, specific test case, but I still hope that this report is "better than nothing". I really do want to help improve Python and will provide more information if requested. I use Python everywhere, even in Termux on Android, and am quite saddened by this degradation...

With Python 3.8 I used these package versions:

matplotlib 3.1.3
numpy 1.18.1
pyFFTW 0.12.0
PyQt5 5.13.2
scipy 1.4.1

With Python 3.10 I used these package versions:

matplotlib 3.5.0
numpy 1.21.4
pyFFTW 0.12.0
PyQt5 5.15.6
scipy 1.7.3

Both Python 3.8 and 3.10 were compiled and installed by myself with "./configure --enable-optimizations ; make ; sudo make install".

Kind regards,
Tigran
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/34FP7FEG36UVBFW3ZDTL7GOQRSRBSXTQ/
Code of Conduct: http://python.org/psf/codeofconduct/

David Mertz, Ph.D.

Dec 19, 2021, 1:47:22 PM
to aivazia...@gmail.com, Python-Dev
My guess is that this difference is predominantly due to different builds of NumPy. For example, the Intel-optimized builds are very good, and a speed difference of the magnitude shown in this note is typical. E.g. https://www.intel.com/content/www/us/en/developer/articles/technical/numpyscipy-with-intel-mkl.html
--
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.

Paul Bryan

Dec 19, 2021, 1:57:21 PM
to Tigran Aivazian, pytho...@python.org
"Exactly the same" between Python versions, or exactly the same as previously reported?

On Sun, 2021-12-19 at 18:48 +0000, Tigran Aivazian wrote:
To eliminate the possibility of being affected by the different versions of numpy, I have just now upgraded numpy in the Python 3.8 environment to the latest version, so both 3.8 and 3.10 are using numpy 1.21.4, and still the timing is exactly the same.

Tigran Aivazian

Dec 19, 2021, 1:57:53 PM
to pytho...@python.org
To eliminate the possibility of being affected by the different versions of numpy, I have just now upgraded numpy in the Python 3.8 environment to the latest version, so both 3.8 and 3.10 are using numpy 1.21.4, and still the timing is exactly the same.

Tigran Aivazian

Dec 19, 2021, 2:06:08 PM
to pytho...@python.org
Alas, it is exactly the same as previously reported, so the problem persists. If it were exactly the same between Python versions I would celebrate and shout for joy, since that would narrow the problem down to numpy.

I can carefully upgrade all the other packages in 3.8 to match those in 3.10. Since I can downgrade again (I will test that first), I should be able to restore my "superfast 3.8 environment" should this upgrade break it. I will report what I discover.

David Mertz, Ph.D.

Dec 19, 2021, 2:07:37 PM
to Tigran Aivazian, Python-Dev
Not the version, but the build. Did you compile NumPy from source using the same compiler with both Python versions? If not, that remains my strong hunch about the performance difference.

Given what your programs do, it sure seems like the large majority of the runtime is spent in the supporting numeric libraries, not in the Python interpreter itself.

Profiling is the way to find out.
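For example, a minimal cProfile recipe along these lines (a generic sketch, not a command taken from this thread; the script and output file names are placeholders) would show where the cumulative time goes under each interpreter:

# Run the script once under each interpreter and save the stats, e.g.:
#   python3.8  -m cProfile -o solve-3.8.prof  solve.py <args>
#   python3.10 -m cProfile -o solve-3.10.prof solve.py <args>
# Then compare the two profiles:
import pstats
stats = pstats.Stats('solve-3.10.prof')
stats.sort_stats('cumulative').print_stats(20)  # top 20 call paths by cumulative time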

Tigran Aivazian

Dec 19, 2021, 2:09:50 PM
to pytho...@python.org
In both cases I installed numpy using "sudo -H pip install numpy". And just now I upgraded numpy in 3.8 using "sudo -H pip3.8 install --upgrade numpy".

I will try to simplify the program by removing all the higher-level complexity and see what I find.

Sebastian Berg

Dec 19, 2021, 2:20:03 PM
to pytho...@python.org
On Sun, 2021-12-19 at 18:48 +0000, Tigran Aivazian wrote:
> To eliminate the possibility of being affected by the different
> versions of numpy, I have just now upgraded numpy in the Python 3.8
> environment to the latest version, so both 3.8 and 3.10 are using
> numpy 1.21.4, and still the timing is exactly the same.

NumPy is very unlikely to have gotten slower. Please, please time your
script before jumping to conclusions. For example, 2/3 of the time of
that pendulum plotter is spent in plotting, and most of that seems to
be spent in text rendering.
(Yeah, there is a little bit of time in NumPy's `arr.take()` also,
but I doubt that has anything to do with this.)

Now, I don't know what does the text rendering, but maybe that got
slower.

Cheers,

Sebastian

Tigran Aivazian

Dec 19, 2021, 2:26:18 PM
to pytho...@python.org
I think I have found something very interesting. Namely, I removed all multiprocessing (which is done in the shell script, not in Python) and so reduced the program to just a single thread of execution. And lo and behold, Python 3.10 now consistently beats 3.8 by about 5%. However, this is not the END! It is very important to find out why, when running multiple processes simultaneously, 3.8 still outperforms 3.10. The thing is, all these different threads write to completely unrelated data files (.npz and .npy). The only thing they all have in common is the initial data, which they all read from the same 'init.npz' and 'init_W.npy' files using:

with load(args.ifilename + '.npz', allow_pickle=True) as data:

and

Winit = memmap(iWfilename, dtype='float64', mode='r', shape=(Nt, Nx, Np))

So, could this be the problem?
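For reference, here is a minimal, self-contained sketch of that shared-input pattern (the file name, array shape and worker body are placeholders, not code from the repository). Several processes mapping the same file read-only go through the shared OS page cache and take no Python-level lock, so this pattern by itself should not serialize them:

import numpy as np
from multiprocessing import Pool

Nt, Nx, Npts = 10, 256, 256   # placeholder dimensions, not the real values
FNAME = 'init_W.dat'          # hypothetical raw data file, created below so the sketch runs

def worker(path):
    # each process maps the same file read-only and reads a slice
    W = np.memmap(path, dtype='float64', mode='r', shape=(Nt, Nx, Npts))
    return float(W[0].sum())

if __name__ == '__main__':
    init = np.memmap(FNAME, dtype='float64', mode='w+', shape=(Nt, Nx, Npts))
    init[:] = 1.0
    init.flush()
    with Pool(processes=4) as pool:
        print(pool.map(worker, [FNAME] * 4))   # four processes read the same mapping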

Tigran Aivazian

Dec 19, 2021, 2:31:10 PM
to pytho...@python.org
I have created four different sets of initial data, one for each thread of execution, and no, unfortunately, that does NOT solve the problem. Still, when four threads are executed in parallel, 3.8 outperforms 3.10 by a factor of 2.4. So there is some other point of contention between the threads, which I need to find...

Mats Wichmann

Dec 19, 2021, 3:05:39 PM
to pytho...@python.org
On 12/19/21 06:46, aivazia...@gmail.com wrote:
> Hello,
>
> Being a programmer myself, I realise that a report on performance degradation should ideally contain a small test program that clearly reproduces the problem. Unfortunately, I do not have the time at present to isolate the issue to a small test case. The good news (or bad news, I suppose) is that the problem appears to be reasonably general: it happens with two completely different programs.
>

Just FYI (if you didn't already know), there is long-term tracking of
performance benchmarks, which you can see reflected at
https://speed.python.org. The intent is that things not come as a
surprise, so if there indeed turns out to be a surprise underneath your
issue - and we all know benchmarking of complex workflows is quite
tricky - maybe there's a new benchmark that should be added there.



Tigran Aivazian

Dec 19, 2021, 3:08:20 PM
to pytho...@python.org
So far I have narrowed it down to a block of code in solve.py doing a lot of multi-threaded FFT (i.e. with fft(..., threads=6) of pyFFTW), as well as numpy exp() and other functions and pure Python heavy list manipulation (yes, lists, not numpy arrays). All of this together (or some part of it, yet to be discovered) is behaving as if there were some global lock taken behind the scenes (i.e. inside the Python interpreter), so that when multiple instances of the script (which I loosely called "threads" in previous posts, but here correct myself, as the word "threads" is used more appropriately in the context of FFT in this message) are executed in parallel, they slow each other down in 3.10, but not in 3.8.

So this is definitely a very interesting 3.10 degradation problem. I will try to investigate some more tomorrow...

Tigran Aivazian

Dec 19, 2021, 3:47:10 PM
to pytho...@python.org
I have got it narrowed down to the "threads=6" argument of the fft() and ifft() functions of pyFFTW! Namely, if I do NOT pass "threads=6" to fft()/ifft(), then the parallel execution of multiple instances of the scripts takes the same time in Python 3.8 and 3.10. But it is a bit slower than with "threads=6", of course (as my "multiprocessing" at the shell-script level is tied to the number of physical problems being solved simultaneously, and this number is small -- say 4, but I have 12 processors (6 physical cores) which could execute code in parallel).

So, this is where we are right now: pyFFTW 0.12.0 on Python 3.8 with threads=6 is 2.4 times faster than the same pyFFTW 0.12.0 on Python 3.10, when four scripts are executed in parallel. But removing "threads=6" makes 3.10 much faster, and 3.8 a bit slower. Though not too slow -- instead of 9 vs 23 seconds I get 11.2 (Python 3.8) vs 10.8 (Python 3.10) seconds, so Python 3.10 is even a little bit faster than 3.8, but still not as fast as with threads=6 on 3.8.

However, that pendulum PyQt GUI application does NOT do any Fourier transforms! So the problem with the FPS in the pendulum plotting is something different.
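For anyone who wants to poke at just this part in isolation, a minimal timing sketch along these lines could be run under both interpreters, alone and with several copies in parallel (the array size and repeat count are arbitrary placeholders, and pyfftw.interfaces.numpy_fft is only one of the ways pyFFTW exposes the threads= argument):

import time
import numpy as np
from pyfftw.interfaces.numpy_fft import fft

a = np.random.rand(1 << 20) + 1j * np.random.rand(1 << 20)   # arbitrary 1M-point complex input

for threads in (1, 6):
    t0 = time.perf_counter()
    for _ in range(50):
        fft(a, threads=threads)       # same call pattern, different thread count
    print(f'threads={threads}: {time.perf_counter() - t0:.3f} s')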

MRAB

Dec 19, 2021, 4:42:12 PM
to pytho...@python.org
On 2021-12-19 20:06, Tigran Aivazian wrote:
> So far I have narrowed it down to a block of code in solve.py doing a lot of multi-threaded FFT (i.e. with fft(..., threads=6) of pyFFTW), as well as numpy exp() and other functions and pure Python heavy list manipulation (yes, lists, not numpy arrays). All of this together (or some part of it, yet to be discovered) is behaving as if there were some global lock taken behind the scenes (i.e. inside the Python interpreter), so that when multiple instances of the script (which I loosely called "threads" in previous posts, but here correct myself, as the word "threads" is used more appropriately in the context of FFT in this message) are executed in parallel, they slow each other down in 3.10, but not in 3.8.
>
> So this is definitely a very interesting 3.10 degradation problem. I will try to investigate some more tomorrow...
>
"is behaving as if there was some global lock taken behind the scene
(i.e. inside Python interpreter)"?

The Python interpreter does have the GIL (Global Interpreter Lock). It
can't execute Python bytecodes in parallel, but timeshares between the
threads.

The GIL is released during I/O and by some extensions while they're
processing, but when they want to return, or if they want to use the
Python API, they need to acquire the GIL again.

The only way to get true parallelism in CPython is to use
multiprocessing, where it's running in multiple processes.
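As a generic illustration of that point (a toy CPU-bound function, nothing to do with the scripts in this thread), threads and processes behave very differently for pure-Python work:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # pure-Python loop: the GIL is held for the whole call
    total = 0
    for i in range(n):
        total += i
    return total

def measure(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(busy, [3_000_000] * 4))
    return time.perf_counter() - start

if __name__ == '__main__':
    print('threads:  ', measure(ThreadPoolExecutor))   # roughly serialized by the GIL
    print('processes:', measure(ProcessPoolExecutor))  # runs in parallel on separate cores

Note that the scripts in this thread are already separate processes launched from a shell script, so each has its own interpreter and its own GIL.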

Christopher Barker

Dec 19, 2021, 11:20:18 PM
to Python Dev
On Sun, Dec 19, 2021 at 1:46 PM MRAB <pyt...@mrabarnett.plus.com> wrote:
On 2021-12-19 20:06, Tigran Aivazian wrote:
> So far I have narrowed it down to a block of code in solve.py doing a lot of multi-threaded FFT (i.e. with fft(..., threads=6) of pyFFTW), as well as numpy exp() and other functions and pure Python heavy list manipulation (yes, lists, not numpy arrays).
 
The Python interpreter does have the GIL (Global Interpreter Lock). It
can't execute Python bytecodes in parallel, but timeshares between the
threads.

Sure. But what the OP seems to have discovered is that there is some difference in behavior between 3.8 and 3.10 -- and AFAIK, there are no intended major changes in the GIL between those two releases.

I *think* that all of the issues have involved numpy (pyFFTW depends on numpy as well, and certainly matplotlib does) -- but I think the OP has made sure that the numpy (and other libs) versions are all the same. It still remains to confirm that numpy (and the other libs) are built exactly the same way in the py3.8 and 3.10 versions -- this can be a very complicated stack!

But it seems either CPython itself, or numpy (or Cython?), is doing something different. What that is remains to be discovered.

Note to the OP: make sure that it's not as simple as a change to the default for the threads parameter.
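One quick way to check that (assuming the pyfftw.config module, which recent pyFFTW releases use for library-wide defaults, is available in the installed version):

>>> import pyfftw
>>> pyfftw.__version__
>>> pyfftw.config.NUM_THREADS      # default thread count used when threads= is not passed
>>> pyfftw.config.PLANNER_EFFORT   # default planner effort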

Note 2: even if this is a regression in CPython itself, I suspect the numpy list may be a better way to get it figured out.

-CHB


--
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

David Mertz, Ph.D.

Dec 19, 2021, 11:41:57 PM
to Tigran Aivazian, Python-Dev
These are binary wheel installs though, no? Which means the 3.8 version and the 3.10 version were compiled at different times, even for the same NumPy version. Also possibly for different platforms; I don't know which you are on.

I haven't checked what's on PyPI for each version. I think pyFFTW is largely using NumPy.

You can find details with something like 

>>> import numpy.distutils
>>> numpy.distutils.unixccompiler.sysconfig.get_config_vars()

I suspect that will indicate interesting compiler differences even for the "same version."
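A simpler check that is usually enough to spot a BLAS or compiler mismatch between the two builds (this is the standard NumPy build-info helper, not something specific to this thread):

>>> import numpy as np
>>> np.__version__
>>> np.show_config()   # prints which BLAS/LAPACK libraries and build settings this wheel was built with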

As Chris Barker mentions, you will probably find people more familiar with the issue on the NumPy mailing list.