[Python-Dev] Python 3.3 vs. Python 2.7 benchmark results (again, but this time more solid numbers)

Brett Cannon

Oct 26, 2012, 3:14:08 PM
to python-dev
I re-ran the unladen benchmarks on my work machine, this time without the -b option flipped on (i.e. more thorough benchmark numbers). I figured I would share them now instead of after my PyCon Argentina talk, in case people decide to dig into the results, find a pathological problem in CPython, and fix it before I give my presentation. If you have trouble running a benchmark, or it isn't available in the repo because it's one I hacked together, just ask and I can help you run it. I have colour-coded the benchmarks based on whether they are faster or slower in Python 3.3 (sorry to those of you who hate HTML email).

But the tl;dr message is that Python 3.3 looks good compared to Python 2.7 (the median benchmark score is 5% slower).

Worst benchmark is startup_nosite, best is telco. The benchmarks people might want to analyze (i.e. more than 20% slower in Python 3.3) are mako_v2, threaded_count, normal_startup, iterative_count, pathlib, formatted_logging, and simple_logging.

###########################################

Report on Linux 3.2.5-gg987 #1 SMP Fri Sep 14 02:36:36 PDT 2012 x86_64 x86_64
Total CPU cores: 12

### 2to3 ###
9.320000 -> 8.980000: 1.04x faster

### call_method ###
Min: 0.417756 -> 0.355247: 1.18x faster
Avg: 0.419688 -> 0.356382: 1.18x faster
Significant (t=92.85)
Stddev: 0.00604 -> 0.00577: 1.0479x smaller

### call_method_slots ###
Min: 0.417611 -> 0.358451: 1.17x faster
Avg: 0.420761 -> 0.359676: 1.17x faster
Significant (t=88.70)
Stddev: 0.00605 -> 0.00588: 1.0291x smaller

### call_method_unknown ###
Min: 0.459057 -> 0.359327: 1.28x faster
Avg: 0.462929 -> 0.360410: 1.28x faster
Significant (t=137.99)
Stddev: 0.00698 -> 0.00583: 1.1969x smaller

### call_simple ###
Min: 0.341689 -> 0.265289: 1.29x faster
Avg: 0.343003 -> 0.266503: 1.29x faster
Significant (t=124.20)
Stddev: 0.00555 -> 0.00511: 1.0859x smaller

### chameleon ###
Min: 0.072232 -> 0.062713: 1.15x faster
Avg: 0.074588 -> 0.064261: 1.16x faster
Significant (t=33.74)
Stddev: 0.00284 -> 0.00245: 1.1599x smaller

### chaos ###
Min: 0.313727 -> 0.367015: 1.17x slower
Avg: 0.317568 -> 0.371473: 1.17x slower
Significant (t=-26.72)
Stddev: 0.00962 -> 0.01053: 1.0942x larger

### django ###
Min: 0.798331 -> 0.855461: 1.07x slower
Avg: 0.801109 -> 0.860996: 1.07x slower
Significant (t=-87.43)
Stddev: 0.00336 -> 0.00348: 1.0356x larger

### fannkuch ###
Min: 1.364705 -> 1.327680: 1.03x faster
Avg: 1.380412 -> 1.337467: 1.03x faster
Significant (t=10.48)
Stddev: 0.02056 -> 0.02040: 1.0077x smaller

### fastpickle ###
Min: 0.763479 -> 0.805715: 1.06x slower
Avg: 0.770036 -> 0.810855: 1.05x slower
Significant (t=-12.73)
Stddev: 0.01618 -> 0.01589: 1.0180x smaller

### fastunpickle ###
Min: 0.588694 -> 0.663616: 1.13x slower
Avg: 0.596622 -> 0.672418: 1.13x slower
Significant (t=-23.22)
Stddev: 0.01503 -> 0.01752: 1.1656x larger

### float ###
Min: 0.363234 -> 0.344408: 1.05x faster
Avg: 0.376159 -> 0.354165: 1.06x faster
Significant (t=8.76)
Stddev: 0.01282 -> 0.01227: 1.0455x smaller

### formatted_logging ###
Min: 0.330988 -> 0.400309: 1.21x slower
Avg: 0.335522 -> 0.408920: 1.22x slower
Significant (t=-33.48)
Stddev: 0.00989 -> 0.01194: 1.2076x larger

### genshi ###
Min: 0.229140 -> 0.251766: 1.10x slower
Avg: 0.232124 -> 0.257252: 1.11x slower
Significant (t=-40.24)
Stddev: 0.00516 -> 0.00564: 1.0925x larger

### go ###
Min: 0.632778 -> 0.710382: 1.12x slower
Avg: 0.636143 -> 0.716748: 1.13x slower
Significant (t=-37.61)
Stddev: 0.00186 -> 0.01504: 8.0815x larger

### hexiom2 ###
Min: 150.982155 -> 154.702444: 1.02x slower
Avg: 151.194622 -> 154.780953: 1.02x slower
Significant (t=-15.83)
Stddev: 0.30047 -> 0.11103: 2.7063x smaller

### iterative_count ###
Min: 0.117036 -> 0.156752: 1.34x slower
Avg: 0.120802 -> 0.172218: 1.43x slower
Significant (t=-34.92)
Stddev: 0.00542 -> 0.00889: 1.6422x larger

### json_dump_v2 ###
Min: 3.449868 -> 3.522645: 1.02x slower
Avg: 3.467124 -> 3.541902: 1.02x slower
Significant (t=-13.20)
Stddev: 0.02701 -> 0.02960: 1.0959x larger

### json_load ###
Min: 0.981740 -> 0.567611: 1.73x faster
Avg: 0.986729 -> 0.572975: 1.72x faster
Significant (t=128.95)
Stddev: 0.01796 -> 0.01386: 1.2955x smaller

### mako_v2 ###
Min: 0.083660 -> 0.243323: 2.91x slower
Avg: 0.084634 -> 0.247875: 2.93x slower
Significant (t=-821.55)
Stddev: 0.00193 -> 0.00400: 2.0737x larger

### meteor_contest ###
Min: 0.257992 -> 0.232116: 1.11x faster
Avg: 0.262581 -> 0.236684: 1.11x faster
Significant (t=14.31)
Stddev: 0.00916 -> 0.00894: 1.0243x smaller

### nbody ###
Min: 0.375414 -> 0.293685: 1.28x faster
Avg: 0.379489 -> 0.299794: 1.27x faster
Significant (t=42.71)
Stddev: 0.00997 -> 0.00864: 1.1537x smaller

### normal_startup ###
Min: 0.360002 -> 0.593214: 1.65x slower
Avg: 0.386755 -> 0.600625: 1.55x slower
Significant (t=-134.28)
Stddev: 0.01055 -> 0.00395: 2.6704x smaller

### nqueens ###
Min: 0.300390 -> 0.363904: 1.21x slower
Avg: 0.304282 -> 0.368003: 1.21x slower
Significant (t=-37.41)
Stddev: 0.00813 -> 0.00888: 1.0920x larger

### pathlib ###
Min: 0.106088 -> 0.138693: 1.31x slower
Avg: 0.107279 -> 0.139885: 1.30x slower
Significant (t=-133.12)
Stddev: 0.00256 -> 0.00290: 1.1324x larger

### pidigits ###
Min: 0.351666 -> 0.341745: 1.03x faster
Avg: 0.354743 -> 0.344146: 1.03x faster
Significant (t=5.89)
Stddev: 0.00965 -> 0.00829: 1.1643x smaller

### raytrace ###
Min: 1.547054 -> 1.641147: 1.06x slower
Avg: 1.552614 -> 1.643716: 1.06x slower
Significant (t=-286.42)
Stddev: 0.00190 -> 0.00120: 1.5920x smaller

### regex_compile ###
Min: 0.494022 -> 0.537924: 1.09x slower
Avg: 0.497904 -> 0.541971: 1.09x slower
Significant (t=-18.23)
Stddev: 0.01177 -> 0.01239: 1.0523x larger

### regex_effbot ###
Min: 0.065431 -> 0.073393: 1.12x slower
Avg: 0.069753 -> 0.077338: 1.11x slower
Significant (t=-10.61)
Stddev: 0.00361 -> 0.00354: 1.0179x smaller

### regex_v8 ###
Min: 0.071053 -> 0.081441: 1.15x slower
Avg: 0.075075 -> 0.086167: 1.15x slower
Significant (t=-12.44)
Stddev: 0.00359 -> 0.00518: 1.4455x larger

### simple_logging ###
Min: 0.325386 -> 0.395093: 1.21x slower
Avg: 0.330235 -> 0.399825: 1.21x slower
Significant (t=-34.22)
Stddev: 0.00952 -> 0.01077: 1.1317x larger

### startup_nosite ###
Min: 0.082137 -> 0.453112: 5.52x slower
Avg: 0.129994 -> 0.459361: 3.53x slower
Significant (t=-276.85)
Stddev: 0.01114 -> 0.00419: 2.6585x smaller

### telco ###
Min: 0.810000 -> 0.010000: 81.00x faster
Avg: 0.823600 -> 0.015200: 54.18x faster
Significant (t=284.37)
Stddev: 0.01946 -> 0.00505: 3.8556x smaller

### threaded_count ###
Min: 0.140653 -> 0.173500: 1.23x slower
Avg: 0.152514 -> 0.270779: 1.78x slower
Significant (t=-49.87)
Stddev: 0.00605 -> 0.01564: 2.5837x larger

### unpack_sequence ###
Min: 0.000077 -> 0.000067: 1.15x faster
Avg: 0.000081 -> 0.000069: 1.18x faster
Significant (t=1163.57)
Stddev: 0.00000 -> 0.00000: 1.7412x larger

The following not significant results are hidden, use -v to show them:
html5lib, richards, silent_logging, spectral_norm.
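
For anyone reading the report above: the "Nx faster/slower" figures appear to be plain ratios of the two timings, rounded to two decimals. A minimal sketch of that arithmetic follows. The function name describe_change is made up for illustration, and this is not the benchmark suite's own code.

```python
def describe_change(before, after):
    """Return a label such as '1.18x faster' for two timings in seconds."""
    if after < before:
        return "%.2fx faster" % (before / after)
    if after > before:
        return "%.2fx slower" % (after / before)
    return "no change"


# Min timings copied from the report above.
print(describe_change(0.417756, 0.355247))  # call_method -> 1.18x faster
print(describe_change(0.083660, 0.243323))  # mako_v2     -> 2.91x slower
```

Note that a ratio like telco's 81.00x comes from dividing by a timing of 0.010000, so it is very sensitive to timer resolution.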

Armin Rigo

Oct 27, 2012, 5:35:16 AM
to Brett Cannon, python-dev
Hi Brett,

On Fri, Oct 26, 2012 at 9:14 PM, Brett Cannon <br...@python.org> wrote:
> Worst benchmark is nosite_startup, best is telco.

May I express doubts about telco? :-) It looks like the Python 3
version is simply not running:

> ### telco ###
> Min: 0.810000 -> 0.010000: 81.00x faster
> Avg: 0.823600 -> 0.015200: 54.18x faster


A bientôt,

Armin.
_______________________________________________
Python-Dev mailing list
Pytho...@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/dev-python%2Bgarchive-30976%40googlegroups.com

Antoine Pitrou

Oct 27, 2012, 5:53:52 AM
to pytho...@python.org
On Fri, 26 Oct 2012 15:14:08 -0400
Brett Cannon <br...@python.org> wrote:
>
> Worst benchmark is nosite_startup, best is telco. The benchmarks people
> might want to analyze (i.e. more than 20% slower in Python 3.3) are
> mako_v2, threaded_count, normal_startup, iterative_count, pathlib,
> formatted_logging, and simple_logging.

Well, did you check that mako_v2 wasn't subject to the Markupsafe
issue?

threaded_count and iterative_count are completely dumb.
Slower startup is due to the fact that Python 3 needs many more
modules to even start itself.

Regards

Antoine.

Maciej Fijalkowski

Oct 27, 2012, 6:12:28 AM
to Armin Rigo, python-dev

I think the original explanation was cDecimal vs decimal.

Stefan Krah

Oct 27, 2012, 8:33:41 AM
to pytho...@python.org
Maciej Fijalkowski <fij...@gmail.com> wrote:
> On Sat, Oct 27, 2012 at 11:35 AM, Armin Rigo <ar...@tunes.org> wrote:
> > May I express doubts about telco? :-) It looks like the Python 3
> > version is simply not running:
> >
> >> ### telco ###
> >> Min: 0.810000 -> 0.010000: 81.00x faster
> >> Avg: 0.823600 -> 0.015200: 54.18x faster
>
> I think the original explanation was cDecimal vs decimal.

Yes, the magnitude of the speedup looks correct. In an isolated benchmark
with the large input file [1] I'm getting 30x speedup for telco.


Stefan Krah

[1] http://www.bytereef.org/mpdecimal/quickstart.html#telco-benchmark - expon180-1e6b.zip

Brett Cannon

Oct 27, 2012, 9:20:36 AM
to Antoine Pitrou, pytho...@python.org

I did check that MarkupSafe was not installed. It might just be mako doing something silly.

The thread tests are very synthetic.

And yes, there are more modules at startup. When was the last time we looked at them to make sure we weren't doing needless imports?

Nick Coghlan

Oct 27, 2012, 11:22:20 AM
to Brett Cannon, Antoine Pitrou, pytho...@python.org
On Sat, Oct 27, 2012 at 11:20 PM, Brett Cannon <bca...@gmail.com> wrote:
> I did check that MarkupSafe was not installed. It might just be mako doing
> something silly.
>
> The thread tests are very synthetic.
>
> And yes, there are more modules at startup. When was the last time we looked
> at them to make sure we weren't doing needless imports?

It's been quite a while.

>>> py3k - py27
set(['reprlib', 'heapq', '_collections', 'functools', '_bisect',
'copyreg', 'io', 'operator', '_heapq', '_io', '_thread',
'encodings.latin_1', 'collections', '_frozen_importlib',
'collections.abc', 'builtins', '_sysconfigdata', '_functools',
'keyword', '_imp', 'bisect', 'weakref', 'itertools', 'marshal'])

>>> py27 - py3k
set(['exceptions', 'copy_reg', 'warnings', 'UserDict', 'traceback',
'encodings.codecs', '__builtin__', 'linecache', '_abcoll',
'encodings.__builtin__', 'encodings.encodings', 'types'])

To check how many of those dependencies stemmed from collections, I
checked against the 2.7 version:

>>> py3k - py27_with_collections
set(['_functools', 'reprlib', '_thread', '_io', '_imp',
'_frozen_importlib', 'functools', 'weakref', 'collections.abc',
'encodings.latin_1', 'io', 'copyreg', 'builtins', 'marshal',
'_sysconfigdata'])

>>> py27_with_collections - py3k
set(['exceptions', 'copy_reg', 'thread', 'warnings', 'UserDict',
'traceback', 'encodings.codecs', '__builtin__', 'linecache',
'_abcoll', 'encodings.__builtin__', 'encodings.encodings', 'types'])

Implicitly bringing in _thread is a bit of a worry. Apparently 3.2 had
the same problem, though:

>>> py3k - py32
{'_imp', '_frozen_importlib', '_warnings', 'collections.abc',
'marshal', '_sysconfigdata'}

>>> py32 - py3k
{'_locale', 'locale', 'traceback', 'linecache', 'token', '_abcoll', 'tokenize'}
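
For anyone wanting to reproduce sets like these, here is a minimal sketch of one way to collect them. It assumes each interpreter can be launched by path; the helper name startup_modules and the paths in the usage comment are made up for illustration.

```python
import subprocess
import sys

# Code run inside the interpreter under test; it also works on Python 2.7.
LIST_MODULES = "import sys; print('\\n'.join(sorted(sys.modules)))"


def startup_modules(python_exe):
    """Return the set of module names loaded by a bare interpreter startup."""
    out = subprocess.check_output([python_exe, "-c", LIST_MODULES],
                                  universal_newlines=True)
    return set(out.split())


# Hypothetical usage with two interpreters (paths are placeholders):
#   py3k = startup_modules("/path/to/python3.3")
#   py27 = startup_modules("/path/to/python2.7")
#   print(sorted(py3k - py27))
current = startup_modules(sys.executable)
print(len(current), "modules loaded at startup")
```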


Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Antoine Pitrou

Oct 27, 2012, 3:21:39 PM
to pytho...@python.org
On Sat, 27 Oct 2012 09:20:36 -0400
Brett Cannon <bca...@gmail.com> wrote:
> I did check that MarkupSafe was not installed. It might just be mako doing
> something silly.
>
> The thread tests are very synthetic.
>
> And yes, there are more modules at startup. When was the last time we
> looked at them to make sure we weren't doing needless imports?

The last time was between 3.2 and 3.3. It will be hard to lower the
number of imported modules, given the current semantics (io, importlib,
unicode, site.py, sysconfig...). Python 2's view of the world was much
simpler (naïve?) in comparison.

It would be interesting to know *where* the module import time gets
spent, on a lower level. My gut feeling is that execution of Python
module code is the main contributor.
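
One rough way to test that gut feeling at the Python level is to profile a single import with cProfile. The sketch below is only that: profile_import is a made-up helper, colorsys is an arbitrary small stdlib module, and removing a module from sys.modules to force a re-import is only reasonable for simple modules. It says nothing about time spent in C code or in system calls.

```python
import cProfile
import importlib
import io
import pstats
import sys


def profile_import(module_name, top=10):
    """Profile one fresh import and return the pstats report as text."""
    # Drop the cached module so the import really executes again.
    # Submodules and dependencies that are already loaded stay cached.
    sys.modules.pop(module_name, None)
    profiler = cProfile.Profile()
    profiler.enable()
    importlib.import_module(module_name)
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


print(profile_import("colorsys"))
```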

Regards

Antoine.

Mark Shannon

Oct 27, 2012, 4:40:26 PM
to pytho...@python.org
On 27/10/12 20:21, Antoine Pitrou wrote:
> On Sat, 27 Oct 2012 09:20:36 -0400
> Brett Cannon <bca...@gmail.com> wrote:
>> I did check that MarkupSafe was not installed. It might just be mako doing
>> something silly.
>>
>> The thread tests are very synthetic.
>>
>> And yes, there are more modules at startup. When was the last time we
>> looked at them to make sure we weren't doing needless imports?
>
> The last time was between 3.2 and 3.3. It will be hard to lower the
> number of imported modules, given the current semantics (io, importlib,
> unicode, site.py, sysconfig...). Python 2's view of the world was much
> simpler (naïve?) in comparison.
>
> It would be interesting to know *where* the module import time gets
> spent, on a lower level. My gut feeling is that execution of Python
> module code is the main contributor.

I suspect that stating and loading the .pyc files is responsible for
most of the overhead.
PyRun starts up quite a lot faster thanks to embedding all the modules
in the executable: http://www.egenix.com/products/python/PyRun/

Freezing all the core modules into the executable should reduce start up
time.

Cheers,
Mark

Tim Delaney

Oct 27, 2012, 4:53:42 PM
to pytho...@python.org
On 28 October 2012 07:40, Mark Shannon <ma...@hotpy.org> wrote:

> I suspect that stating and loading the .pyc files is responsible for most of the overhead.
> PyRun starts up quite a lot faster thanks to embedding all the modules in the executable: http://www.egenix.com/products/python/PyRun/
>
> Freezing all the core modules into the executable should reduce start up time.

That suggests a test to me that the Cython guys might be interested in (or may well have performed in the past). How much of the stdlib could be compiled with Cython and used during the startup process? How much of an effect would it have on startup times and these benchmarks if Cython-compiled extensions were used?

I'm thinking here of elimination of .pyc interpretation and execution (stat calls would be similar, probably slightly higher).

To be clear - I'm *not* suggesting Cython become part of the required build toolchain. But *if* the Cython-compiled extensions prove to be significantly faster I'm thinking maybe it could become a semi-supported option (e.g. a HOWTO with the caveat "it worked on this particular system").

Tim Delaney

mar...@v.loewis.de

Oct 27, 2012, 4:58:26 PM
to pytho...@python.org

Zitat von Tim Delaney <timothy....@gmail.com>:

> To be clear - I'm *not* suggesting Cython become part of the required build
> toolchain. But *if* the Cython-compiled extensions prove to be
> significantly faster I'm thinking maybe it could become a semi-supported
> option (e.g. a HOWTO with the caveat "it worked on this particular system").

This should compare to zipping the standard library, which has been a
supported configuration for a long time, and also avoids many stat calls.

Regards,
Martin

Antoine Pitrou

Oct 27, 2012, 4:59:32 PM
to pytho...@python.org
On Sat, 27 Oct 2012 21:40:26 +0100
Mark Shannon <ma...@hotpy.org> wrote:
> On 27/10/12 20:21, Antoine Pitrou wrote:
> >
> > It would be interesting to know *where* the module import time gets
> > spent, on a lower level. My gut feeling is that execution of Python
> > module code is the main contributor.
>
> I suspect that stating and loading the .pyc files is responsible for
> most of the overhead.
> PyRun starts up quite a lot faster thanks to embedding all the modules
> in the executable: http://www.egenix.com/products/python/PyRun/

Any numbers?

Regards

Antoine.

Paul Moore

Oct 27, 2012, 5:07:11 PM
to mar...@v.loewis.de, pytho...@python.org
On 27 October 2012 21:58, <mar...@v.loewis.de> wrote:
>
> Zitat von Tim Delaney <timothy....@gmail.com>:
>
>
>> To be clear - I'm *not* suggesting Cython become part of the required
>> build
>> toolchain. But *if* the Cython-compiled extensions prove to be
>> significantly faster I'm thinking maybe it could become a semi-supported
>> option (e.g. a HOWTO with the caveat "it worked on this particular
>> system").
>
>
> This should compare to zipping the standard library, which has been a
> supported
> configuration for a long time, and also avoids many stat calls.

Interestingly, I just did a quick test of this: This is on my Windows
7 PC, running under Powershell. D:\Apps\Python33 is a standard
installation, whereas D:\Dev\P33 has a zipped stdlib:

PS 22:02 D:\Data
>foreach ($i in 1..10) { measure-command { D:\Apps\Python33\python.exe -c "raise SystemExit" } | % { $_.TotalSeconds } }
0.0737877
0.1014695
0.0950326
0.0910734
0.0689548
0.084994
0.0772204
0.0958197
0.0696385
0.0806066
PS 22:03 D:\Data
>foreach ($i in 1..10) { measure-command { D:\Dev\P33\python.exe -c "raise SystemExit" } | % { $_.TotalSeconds } }
0.1922151
0.1879894
0.2455766
0.2842425
0.1937161
0.2168928
0.2441508
0.1860206
0.1866409
0.1897004

Looks like the normal configuration is over twice as fast as the zipped one...
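
(The ten runs above average roughly 0.084 s for the normal installation versus 0.213 s for the zipped one.) For anyone without PowerShell, a rough cross-platform sketch of the same measurement in Python follows. The helper name time_startup is made up, and the timings include process-creation overhead, so only comparisons between interpreters measured the same way are meaningful.

```python
import subprocess
import sys
import time


def time_startup(python_exe, runs=10):
    """Run `python_exe -c "raise SystemExit"` and return seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.call([python_exe, "-c", "raise SystemExit"])
        timings.append(time.perf_counter() - start)
    return timings


# Measures whichever interpreter runs this script; pass another path to compare.
timings = time_startup(sys.executable, runs=5)
print("min %.4fs, mean %.4fs" % (min(timings), sum(timings) / len(timings)))
```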

Paul.

Mark Shannon

Oct 27, 2012, 5:11:01 PM
to pytho...@python.org
On 27/10/12 21:59, Antoine Pitrou wrote:
> On Sat, 27 Oct 2012 21:40:26 +0100
> Mark Shannon <ma...@hotpy.org> wrote:
>> On 27/10/12 20:21, Antoine Pitrou wrote:
>>>
>>> It would be interesting to know *where* the module import time gets
>>> spent, on a lower level. My gut feeling is that execution of Python
>>> module code is the main contributor.
>>
>> I suspect that stating and loading the .pyc files is responsible for
>> most of the overhead.
>> PyRun starts up quite a lot faster thanks to embedding all the modules
>> in the executable: http://www.egenix.com/products/python/PyRun/
>
> Any numbers?

No numbers, but I did see this talk:
http://2012.pyconuk.net/Talks/PyRun
The abstract claims that PyRun "has a greatly improved startup time
compared to regular Python"

Cheers,
Mark

Antoine Pitrou

Oct 27, 2012, 5:25:34 PM
to pytho...@python.org
On Sat, 27 Oct 2012 22:11:01 +0100
Mark Shannon <ma...@hotpy.org> wrote:
> On 27/10/12 21:59, Antoine Pitrou wrote:
> > On Sat, 27 Oct 2012 21:40:26 +0100
> > Mark Shannon <ma...@hotpy.org> wrote:
> >> On 27/10/12 20:21, Antoine Pitrou wrote:
> >>>
> >>> It would be interesting to know *where* the module import time gets
> >>> spent, on a lower level. My gut feeling is that execution of Python
> >>> module code is the main contributor.
> >>
> >> I suspect that stating and loading the .pyc files is responsible for
> >> most of the overhead.
> >> PyRun starts up quite a lot faster thanks to embedding all the modules
> >> in the executable: http://www.egenix.com/products/python/PyRun/
> >
> > Any numbers?
>
> No numbers, but I did see this talk:
> http://2012.pyconuk.net/Talks/PyRun
> The abstract claims that PyRun "has a greatly improved startup time
> compared to regular Python"

Sounds great ;-)

cheers

Antoine.

Brett Cannon

Oct 27, 2012, 6:06:05 PM
to Mark Shannon, pytho...@python.org
On Sat, Oct 27, 2012 at 4:40 PM, Mark Shannon <ma...@hotpy.org> wrote:
> On 27/10/12 20:21, Antoine Pitrou wrote:
>> On Sat, 27 Oct 2012 09:20:36 -0400
>> Brett Cannon <bca...@gmail.com> wrote:
>>> I did check that MarkupSafe was not installed. It might just be mako doing
>>> something silly.
>>>
>>> The thread tests are very synthetic.
>>>
>>> And yes, there are more modules at startup. When was the last time we
>>> looked at them to make sure we weren't doing needless imports?
>>
>> The last time was between 3.2 and 3.3. It will be hard to lower the
>> number of imported modules, given the current semantics (io, importlib,
>> unicode, site.py, sysconfig...). Python 2's view of the world was much
>> simpler (naïve?) in comparison.
>>
>> It would be interesting to know *where* the module import time gets
>> spent, on a lower level. My gut feeling is that execution of Python
>> module code is the main contributor.
>
> I suspect that stating and loading the .pyc files is responsible for most of the overhead.

I really doubt that as the amount of stat calls is significantly reduced in Python 3.3 compared to Python 3.2 (startup benchmarks show Python 3.3 is roughly 1.66x faster than 3.2 thanks to caching filenames in a directory). More modules means more work (e.g. I/O, executing the module, etc.).

The only way to lower stat call overhead is to simply not check if a directory's contents changed during startup by assuming Python itself will not write any new module files. Without benchmarking I don't know if it would make that much of a difference, though.
 
> PyRun starts up quite a lot faster thanks to embedding all the modules in the executable: http://www.egenix.com/products/python/PyRun/
>
> Freezing all the core modules into the executable should reduce start up time.

Sure, but working with a frozen module is a pain so it is not something to take lightly.

Brett Cannon

Oct 27, 2012, 6:07:48 PM
to Paul Moore, mar...@v.loewis.de, pytho...@python.org
Are both debug builds (asking because of the path names)? CPython is now significantly slower in a debug build thanks to the overhead it adds to any Python code executing, which means importlib runs much slower. 

Serhiy Storchaka

Oct 27, 2012, 6:16:20 PM
to pytho...@python.org
On 28.10.12 00:07, Paul Moore wrote:
> Looks like the normal configuration is over twice as fast as the zipped one...

The normal configuration does 269 stats, but the zipped one does 12636
seeks.

Serhiy Storchaka

Oct 27, 2012, 6:39:42 PM
to pytho...@python.org
On 28.10.12 01:06, Brett Cannon wrote:
> I really doubt that as the amount of stat calls is significantly reduced
> in Python 3.3 compared to Python 3.2 (startup benchmarks show Python 3.3
> is roughly 1.66x faster than 3.2 thanks to caching filenames in a
> directory).

$ strace ./python -c '' 2>&1 | grep -c stat

Python 2.7 - 161 stats
Python 3.2 - 555 stats
Python 3.3 - 243 stats

Antoine Pitrou

Oct 27, 2012, 7:00:48 PM
to pytho...@python.org
On Sun, 28 Oct 2012 01:39:42 +0300
Serhiy Storchaka <stor...@gmail.com> wrote:

> On 28.10.12 01:06, Brett Cannon wrote:
> > I really doubt that as the amount of stat calls is significantly reduced
> > in Python 3.3 compared to Python 3.2 (startup benchmarks show Python 3.3
> > is roughly 1.66x faster than 3.2 thanks to caching filenames in a
> > directory).
>
> $ strace ./python -c '' 2>&1 | grep -c stat
>
> Python 2.7 - 161 stats
> Python 3.2 - 555 stats
> Python 3.3 - 243 stats

This will probably depend on the length of sys.path:

$ strace -e stat python2.7 -Sc "" 2>&1 | wc -l
35
$ strace -e stat python3.2 -Sc "" 2>&1 | wc -l
298
$ strace -e stat python3.3 -Sc "" 2>&1 | wc -l
106

$ strace -e stat python2.7 -c "" 2>&1 | wc -l
200
$ strace -e stat python3.2 -c "" 2>&1 | wc -l
726
$ strace -e stat python3.3 -c "" 2>&1 | wc -l
180
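
Part of the gap between the two sets of numbers in this thread is probably down to how they are counted: as far as I can tell, `grep -c stat` counts every line containing the string "stat" (fstat, lstat, or a path with "stat" in it), while `strace -e stat` only traces the stat call itself and `wc -l` then counts every remaining output line. A minimal sketch of counting by syscall name instead follows. The regular expression assumes the usual one-call-per-line strace layout, and the sample text is made up for illustration, not captured from a real run.

```python
import re
from collections import Counter

# Matches the syscall name at the start of a line, with an optional [pid N] prefix.
SYSCALL_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def count_syscalls(strace_output):
    """Count syscalls by name in plain strace output (one call per line)."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = SYSCALL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample in the style of strace output.
sample = '''\
stat("/usr/lib/python3.3", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=1234, ...}) = 0
open("/usr/lib/python3.3/os.py", O_RDONLY) = 3
stat("/usr/lib/python3.3/stat.py", {st_mode=S_IFREG|0644, ...}) = 0
+++ exited with 0 +++
'''
counts = count_syscalls(sample)
print(counts["stat"], counts["fstat"])  # 2 1
```

On this sample a plain `grep -c stat` would report 3 and `wc -l` would report 5, while there are only 2 real stat calls.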

Regards

Antoine.

Gregory P. Smith

Oct 27, 2012, 11:38:58 PM
to pytho...@python.org
One word: profile.

Looking at stat counts alone rather than measuring the total time spent in all types of system calls from strace and profiling is not really useful. ;)

Another thing to keep an eye out for within a startup profile: how often does the gc collect? Our default gc collection thresholds haven't been tuned in ages afaik [or am I forgetting something], and I know of pathological cases at work where simply calling gc.disable() before importing a bunch of modules (tons of generated protocol buffer code) and re-enabling it afterwards speeds up the application's startup way more significantly than seems healthy in 2.x. That could be related to the particulars of the protobuf module code, though.
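
A minimal sketch of that disable-around-imports trick, for illustration only: gc_paused is a made-up helper name, and json stands in for the real (much larger) set of generated modules. Whether it helps at all would need to be measured for each application.

```python
import contextlib
import gc
import importlib


@contextlib.contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Only re-enable if the collector was on when we started.
        if was_enabled:
            gc.enable()


with gc_paused():
    # Stand-in for "a bunch of modules"; json is used only as an example.
    json = importlib.import_module("json")
```

Reference counting still frees most objects while the collector is off; only cycle detection is paused.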

-gps

Stefan Behnel

Oct 28, 2012, 3:22:07 AM
to pytho...@python.org
Tim Delaney, 27.10.2012 22:53:
> On 28 October 2012 07:40, Mark Shannon wrote:
>> I suspect that stating and loading the .pyc files is responsible for most
>> of the overhead.
>> PyRun starts up quite a lot faster thanks to embedding all the modules in
>> the executable: http://www.egenix.com/products/python/PyRun/
>>
>> Freezing all the core modules into the executable should reduce start up
>> time.
>
> That suggests a test to me that the Cython guys might be interested in (or
> may well have performed in the past). How much of the stdlib could be
> compiled with Cython and used during the startup process?

We have a Jenkins job set up to run the CPython test suite with a compiled
stdlib:

https://sage.math.washington.edu:8091/hudson/job/cython-devel-tests-pyregr-stdlib/

Basically, we use pyximport as an import hook that tries to compile Python
modules on import and then imports the shared library if it worked or the
original Python module if it failed. A solution that explicitly runs over
the stdlib and compiles it would be substantially cleaner and more stable.

I don't have numbers for Py3.4 because we currently have a hard crash in
one of the tests on that platform when compiling recursively on import
(likely meaning that one of the stdlib modules and/or tests would have to
be excluded from compilation), but I get 434 automatically compiled stdlib
modules for the latest Py2.7 branch out of 744 (excluding the test suite).
And Py3.x code tends to pass at least as well through the compiler, often
better.

Note that quite a number of modules are excluded accidentally because they
are already imported as Python modules when Cython starts working.
Compiling them explicitly would remove that limitation, maybe adding
another (wild guess) 50 modules or so. Another few are not being compiled
because the test module that uses them fails to compile. So missing shared
libraries are not always due to failures to compile that particular Python
module.

I didn't pay much attention to this part of our integration tests so far -
a bit of debugging should get the Py3.4 build working.


> How much of an
> effect would it have on startup times and these benchmarks if
> Cython-compiled extensions were used?

Depends on what and how much code you use. If you compile everything into
one big module that "imports" all of the stdlib when it gets loaded, you'd
likely lose a lot of time because it would take a while to initialise all
that useless code on startup. If you keep it separate, it would likely be a
lot faster because you avoid the interpreter for most of the module startup.

Most Python code runs about 30% faster when compiled, some faster, some
slower. If you want better numbers, you can start optimising the code by
giving Cython static type hints. I did that for difflib a while ago, for
example. Changing two methods made it some 50% faster back then:

http://blog.behnel.de/index.php?p=155

That particular module should compile without changes these days, and you
can provide the type hints externally, i.e. without modifying the Python
code itself.


> I'm thinking here of elimination of .pyc interpretation and execution (stat
> calls would be similar, probably slightly higher).

CPython checks for .so files before looking for .py files and imports are
absolute by default in Py3, so there should be a slight reduction in stat
calls. The net result then obviously also depends on how fast your shared
library loader and linker is, etc., but I doubt that that path is any
slower than loading and running a .pyc file.

BTW, you'd still get nice stack traces for compiled modules as long as your
.py files lie right next to your .so files.


> To be clear - I'm *not* suggesting Cython become part of the required build
> toolchain. But *if* the Cython-compiled extensions prove to be
> significantly faster I'm thinking maybe it could become a semi-supported
> option (e.g. a HOWTO with the caveat "it worked on this particular system").

Sounds reasonable.

Stefan

Stefan Behnel

Oct 28, 2012, 3:37:19 AM
to pytho...@python.org
Stefan Behnel, 28.10.2012 08:22:
> Tim Delaney, 27.10.2012 22:53:
>> How much of an effect would it have on startup times and these benchmarks if
>> Cython-compiled extensions were used?
>
> Depends on what and how much code you use. If you compile everything into
> one big module that "imports" all of the stdlib when it gets loaded, you'd
> likely lose a lot of time because it would take a while to initialise all
> that useless code on startup. If you keep it separate, it would likely be a
> lot faster because you avoid the interpreter for most of the module startup.
>
> Most Python code runs about 30% faster when compiled, some faster, some
> slower.

Some more unoptimised pure-Python benchmarks, just in case:

2.7:

https://sage.math.washington.edu:8091/hudson/job/cython-devel-pybenchmarks-py27/lastSuccessfulBuild/artifact/bench_chart.html

3.3:

https://sage.math.washington.edu:8091/hudson/job/cython-devel-pybenchmarks-py3k/lastSuccessfulBuild/artifact/bench_chart.html

Note that the 3.3 benchmarks are not entirely up to date, the last
successful run was a month ago (likely due to the branch into 3.4 which we
use since then). Didn't have time to fix them yet.

Note also that the variations are pretty high from run to run as the
machine that executes them is not a dedicated benchmark server.

Antoine Pitrou

Oct 28, 2012, 7:11:10 AM
to pytho...@python.org
On Sat, 27 Oct 2012 20:38:58 -0700
"Gregory P. Smith" <gr...@krypto.org> wrote:
> One word: profile.
>
> Looking at stat counts alone rather than measuring the total time spent in
> all types of system calls from strace and profiling is not really useful. ;)

Agreed, but I can't seem to cope properly with gprof. Any suggestion?

> Another thing to keep an eye out for within a startup profile: how often
> does the gc collect? our default gc collection thresholds haven't been
> tuned in ages afaik [or am i forgetting something] and I know of
> pathological cases at work where simply doing a gc.disable() before
> importing a bunch of modules (tons of generated protocol buffer code) and
> re-enabling it afterwards speeds up this application's startup way more
> significantly than seems healthy in 2.x... that could be related to the
> particulars of the protobuf module code though.

That's a good suggestion indeed.

Thanks

Antoine.



Hynek Schlawack

Oct 28, 2012, 9:25:40 AM
to pytho...@python.org

Am 28.10.2012 um 12:11 schrieb Antoine Pitrou <soli...@pitrou.net>:

>> One word: profile.
>>
>> Looking at stat counts alone rather than measuring the total time spent in
>> all types of system calls from strace and profiling is not really useful. ;)
> Agreed, but I can't seem to cope properly with gprof. Any suggestion?

http://oprofile.sourceforge.net/news/
http://valgrind.org/docs/manual/cl-manual.html

Are both useful. gprof is virtually useless.

Antoine Pitrou

Oct 28, 2012, 1:38:22 PM
to pytho...@python.org
On Sat, 27 Oct 2012 20:38:58 -0700
"Gregory P. Smith" <gr...@krypto.org> wrote:
>
> Another thing to keep an eye out for within a startup profile: how often
> does the gc collect? our default gc collection thresholds haven't been
> tuned in ages afaik [or am i forgetting something] and I know of
> pathological cases at work where simply doing a gc.disable() before
> importing a bunch of modules (tons of generated protocol buffer code) and
> re-enabling it afterwards speeds up this application's startup way more
> significantly than seems healthy in 2.x... that could be related to the
> particulars of the protobuf module code though.

http://bugs.python.org/issue16351 shows us that the number of
collections at 3.4 startup is tiny:

$ ./python -Sc "import gc; print(gc.get_stats())"
[{'collections': 6, 'uncollectable': 0, 'collected': 0},
{'collections': 0, 'uncollectable': 0, 'collected': 0},
{'collections': 0, 'uncollectable': 0, 'collected': 0}]

$ ./python -c "import gc; print(gc.get_stats())"
[{'collected': 0, 'uncollectable': 0, 'collections': 12},
{'collected': 0, 'uncollectable': 0, 'collections': 1},
{'collected': 0, 'uncollectable': 0, 'collections': 0}]


Notably, there are no full collections.
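
The same check can be scripted. This is only a sketch: it relies on `gc.get_stats()` (available from Python 3.4) returning one dict per generation, youngest first, as in the output above; `collection_counts` is a name made up for the example:

```python
import gc


def collection_counts():
    """Return the number of collections run so far, one entry per generation."""
    return [generation["collections"] for generation in gc.get_stats()]


counts = collection_counts()
# The last entry is the oldest generation, i.e. the full collections.
print("collections per generation:", counts)
print("full collections so far:", counts[-1])
```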

Regards

Antoine.



Tim Delaney

Oct 28, 2012, 3:48:16 PM
to pytho...@python.org
On 28 October 2012 18:22, Stefan Behnel <stef...@behnel.de> wrote:
>> How much of an
>> effect would it have on startup times and these benchmarks if
>> Cython-compiled extensions were used?
>
> Depends on what and how much code you use. If you compile everything into one big module that "imports" all of the stdlib when it gets loaded, you'd likely lose a lot of time because it would take a while to initialise all that useless code on startup. If you keep it separate, it would likely be a lot faster because you avoid the interpreter for most of the module startup.

I was specifically thinking in terms of the tests Brett ran (that was the full set on speed.python.org, wasn't it?), and having each stdlib module be its own extension i.e. no big import module. A literal 1:1 replacement where possible.

>> I'm thinking here of elimination of .pyc interpretation and execution (stat
>> calls would be similar, probably slightly higher).
>
> CPython checks for .so files before looking for .py files and imports are absolute by default in Py3, so there should be a slight reduction in stat calls. The net result then obviously also depends on how fast your shared library loader and linker is, etc., but I doubt that that path is any slower than loading and running a .pyc file.

D'oh. I knew that and still got it backwards.

>> To be clear - I'm *not* suggesting Cython become part of the required build
>> toolchain. But *if* the Cython-compiled extensions prove to be
>> significantly faster I'm thinking maybe it could become a semi-supported
>> option (e.g. a HOWTO with the caveat "it worked on this particular system").
>
> Sounds reasonable.

I think a stdlib compile script + pre-packaged hints for the 3.3 release would likely help both 3.3 and Cython acceptance.

Putting aside my development interest and looking at it purely from the PoV of a Python *user*, I'd really like to see Cython on speed.python.org eventually (in two modes - one without hints as a baseline and one with hints). Of course the ideal situation would be to have every implementation of Python 3.3 that is capable of running on the hardware contributing numbers e.g. if/when Jython achieves 3.3 compatibility I'd love to see numbers for it.

Tim Delaney

Catalin Iacob

Oct 28, 2012, 4:53:33 PM
to pytho...@python.org
On Sat, Oct 27, 2012 at 11:07 PM, Paul Moore <p.f....@gmail.com> wrote:
> Interestingly, I just did a quick test of this: This is on my Windows
> 7 PC, running under Powershell.

snip

> Looks like the normal configuration is over twice as fast as the zipped one...

This result is influenced by zipimport fseek-ing for every file in the
imported zip and fseek flushing buffers in Microsoft's CRT
implementation. There's a patch which avoids the seek in
http://bugs.python.org/issue8745. Reviews welcome!

With that patch the import takes half as long as it does now, so
according to your test that would make the zipped and non-zipped
configurations roughly equally fast.

Brett Cannon

Oct 29, 2012, 9:01:38 AM
to Tim Delaney, pytho...@python.org
On Sun, Oct 28, 2012 at 3:48 PM, Tim Delaney <timothy....@gmail.com> wrote:
> On 28 October 2012 18:22, Stefan Behnel <stef...@behnel.de> wrote:
>>> How much of an
>>> effect would it have on startup times and these benchmarks if
>>> Cython-compiled extensions were used?
>>
>> Depends on what and how much code you use. If you compile everything into one big module that "imports" all of the stdlib when it gets loaded, you'd likely lose a lot of time because it would take a while to initialise all that useless code on startup. If you keep it separate, it would likely be a lot faster because you avoid the interpreter for most of the module startup.
>
> I was specifically thinking in terms of the tests Brett ran (that was the full set on speed.python.org, wasn't it?),

It's not the full set as not all of them can be run on Python 3, but it is as many as can be run.

 -Brett

Brett Cannon

Oct 29, 2012, 9:56:57 AM
to python-dev
To see if the bad iterative_count and threaded_count results were consistently bad, I ran the benchmark suite on my MacBook Pro to see how "reliable" the benchmarks were. The output is below.

Basically 6 benchmarks (regex_effbot, queens, startup_nosite, iterative_count, threaded_count, and telco) differed in performance by more than 15% between my 2 computers, although queens, iterative_count, and threaded_count were the only ones that swung between neutral/good and bad depending on the machine (the rest either went from bad to very bad, or very good to more very good).

And before Antoine asks, I added a ``sys.modules['markupsafe'] = None`` line to the mako_v2 benchmark locally. =) I still need to either explicitly block it or emit a warning in the code in the repo.
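
For anyone unfamiliar with the trick: a `None` entry in `sys.modules` makes any later import of that name fail, whether or not the package is installed. A small sketch, where `block_module` and `can_import` are illustrative helper names and not part of the benchmark suite:

```python
import importlib
import sys


def block_module(name):
    """Make later imports of `name` raise ImportError, even if it is installed."""
    sys.modules[name] = None


def can_import(name):
    """Report whether `name` can currently be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


block_module("markupsafe")
print(can_import("markupsafe"))  # False
```

Code that wraps the import in `try`/`except ImportError` then takes its pure-Python fallback path, which is presumably the point of blocking the module in the benchmark.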


#########################################


Report on Darwin Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64 i386
Total CPU cores: 8

### 2to3 ###
10.321463 -> 9.525119: 1.08x faster

### call_method ###
Min: 0.466812 -> 0.417812: 1.12x faster
Avg: 0.483324 -> 0.427158: 1.13x faster
Significant (t=28.77)
Stddev: 0.01876 -> 0.01483: 1.2644x smaller

### call_method_slots ###
Min: 0.484923 -> 0.409452: 1.18x faster
Avg: 0.487877 -> 0.413054: 1.18x faster
Significant (t=131.11)
Stddev: 0.00395 -> 0.00577: 1.4589x larger

### call_method_unknown ###
Min: 0.547050 -> 0.406866: 1.34x faster
Avg: 0.550721 -> 0.409359: 1.35x faster
Significant (t=328.32)
Stddev: 0.00415 -> 0.00325: 1.2795x smaller

### call_simple ###
Min: 0.391213 -> 0.332055: 1.18x faster
Avg: 0.393563 -> 0.335362: 1.17x faster
Significant (t=127.15)
Stddev: 0.00363 -> 0.00427: 1.1764x larger

### chameleon ###
Min: 0.078505 -> 0.070175: 1.12x faster
Avg: 0.083754 -> 0.071500: 1.17x faster
Significant (t=2.95)
Stddev: 0.05086 -> 0.00119: 42.8425x smaller

### chaos ###
Min: 0.353739 -> 0.423587: 1.20x slower
Avg: 0.356297 -> 0.428197: 1.20x slower
Significant (t=-108.44)
Stddev: 0.00200 -> 0.00424: 2.1147x larger

### django ###
Min: 0.824149 -> 0.862750: 1.05x slower
Avg: 0.831614 -> 0.869112: 1.05x slower
Significant (t=-21.47)
Stddev: 0.01020 -> 0.00697: 1.4634x smaller

### fannkuch ###
Min: 1.776913 -> 1.832973: 1.03x slower
Avg: 1.793116 -> 1.915348: 1.07x slower
Significant (t=-11.57)
Stddev: 0.01436 -> 0.07329: 5.1030x larger

### fastpickle ###
Min: 0.810968 -> 0.739322: 1.10x faster
Avg: 0.818099 -> 0.745148: 1.10x faster
Significant (t=58.02)
Stddev: 0.00577 -> 0.00677: 1.1731x larger

### fastunpickle ###
Min: 0.644198 -> 0.659345: 1.02x slower
Avg: 0.647976 -> 0.666154: 1.03x slower
Significant (t=-18.96)
Stddev: 0.00343 -> 0.00584: 1.7020x larger

### float ###
Min: 0.420888 -> 0.363410: 1.16x faster
Avg: 0.432285 -> 0.376179: 1.15x faster
Significant (t=38.14)
Stddev: 0.00762 -> 0.00708: 1.0766x smaller

### formatted_logging ###
Min: 0.325707 -> 0.413196: 1.27x slower
Avg: 0.329846 -> 0.418099: 1.27x slower
Significant (t=-119.89)
Stddev: 0.00397 -> 0.00337: 1.1787x smaller

### genshi ###
Min: 0.254604 -> 0.269696: 1.06x slower
Avg: 0.258585 -> 0.275615: 1.07x slower
Significant (t=-33.39)
Stddev: 0.00283 -> 0.00557: 1.9704x larger

### go ###
Min: 0.676453 -> 0.745504: 1.10x slower
Avg: 0.681833 -> 0.752170: 1.10x slower
Significant (t=-48.67)
Stddev: 0.00520 -> 0.00880: 1.6917x larger

### hexiom2 ###
Min: 186.378727 -> 172.939507: 1.08x faster
Avg: 186.679821 -> 173.103242: 1.08x faster
Significant (t=39.61)
Stddev: 0.42581 -> 0.23156: 1.8389x smaller

### html5lib ###
Min: 11.827770 -> 11.239556: 1.05x faster
Avg: 11.858253 -> 11.370960: 1.04x faster
Significant (t=6.93)
Stddev: 0.02825 -> 0.15466: 5.4746x larger

### iterative_count ###
Min: 0.168182 -> 0.154105: 1.09x faster
Avg: 0.169512 -> 0.155952: 1.09x faster
Significant (t=50.77)
Stddev: 0.00139 -> 0.00128: 1.0899x smaller

### json_dump_v2 ###
Min: 3.350528 -> 3.795307: 1.13x slower
Avg: 3.369661 -> 3.825400: 1.14x slower
Significant (t=-125.93)
Stddev: 0.01470 -> 0.02095: 1.4250x larger

### json_load ###
Min: 0.999717 -> 0.607549: 1.65x faster
Avg: 1.007319 -> 0.613016: 1.64x faster
Significant (t=289.24)
Stddev: 0.00673 -> 0.00690: 1.0240x larger

### mako_v2 ###
Min: 0.094817 -> 0.279593: 2.95x slower
Avg: 0.096962 -> 0.286479: 2.95x slower
Significant (t=-866.63)
Stddev: 0.00182 -> 0.00454: 2.4945x larger

### meteor_contest ###
Min: 0.276138 -> 0.243228: 1.14x faster
Avg: 0.279559 -> 0.246018: 1.14x faster
Significant (t=72.30)
Stddev: 0.00298 -> 0.00136: 2.1943x smaller

### nbody ###
Min: 0.421698 -> 0.320496: 1.32x faster
Avg: 0.425878 -> 0.323483: 1.32x faster
Significant (t=158.15)
Stddev: 0.00386 -> 0.00247: 1.5638x smaller

### normal_startup ###
Min: 0.612120 -> 0.876470: 1.43x slower
Avg: 0.618945 -> 0.885492: 1.43x slower
Significant (t=-280.36)
Stddev: 0.00422 -> 0.00523: 1.2397x larger

### nqueens ###
Min: 0.402125 -> 0.410580: 1.02x slower
Avg: 0.406403 -> 0.414676: 1.02x slower
Significant (t=-12.06)
Stddev: 0.00442 -> 0.00199: 2.2189x smaller

### pathlib ###
Min: 0.132423 -> 0.164525: 1.24x slower
Avg: 0.136298 -> 0.168843: 1.24x slower
Significant (t=-49.05)
Stddev: 0.00763 -> 0.00720: 1.0586x smaller

### pidigits ###
Min: 0.387690 -> 0.367871: 1.05x faster
Avg: 0.391308 -> 0.371194: 1.05x faster
Significant (t=32.69)
Stddev: 0.00369 -> 0.00230: 1.6066x smaller

### raytrace ###
Min: 1.650066 -> 1.808829: 1.10x slower
Avg: 1.660110 -> 1.832654: 1.10x slower
Significant (t=-25.26)
Stddev: 0.01165 -> 0.04687: 4.0224x larger

### regex_compile ###
Min: 0.559449 -> 0.571906: 1.02x slower
Avg: 0.563738 -> 0.580054: 1.03x slower
Significant (t=-8.38)
Stddev: 0.00434 -> 0.01306: 3.0087x larger

### regex_effbot ###
Min: 0.074999 -> 0.097456: 1.30x slower
Avg: 0.076343 -> 0.099435: 1.30x slower
Significant (t=-39.79)
Stddev: 0.00147 -> 0.00383: 2.5994x larger

### regex_v8 ###
Min: 0.087433 -> 0.104053: 1.19x slower
Avg: 0.088804 -> 0.105520: 1.19x slower
Significant (t=-39.48)
Stddev: 0.00115 -> 0.00277: 2.4122x larger

### richards ###
Min: 0.247208 -> 0.222483: 1.11x faster
Avg: 0.251661 -> 0.225276: 1.12x faster
Significant (t=44.04)
Stddev: 0.00392 -> 0.00161: 2.4275x smaller

### silent_logging ###
Min: 0.099170 -> 0.095099: 1.04x faster
Avg: 0.099713 -> 0.095892: 1.04x faster
Significant (t=33.32)
Stddev: 0.00045 -> 0.00068: 1.5062x larger

### simple_logging ###
Min: 0.316639 -> 0.392833: 1.24x slower
Avg: 0.320059 -> 0.396853: 1.24x slower
Significant (t=-120.31)
Stddev: 0.00224 -> 0.00392: 1.7450x larger

### spectral_norm ###
Min: 0.434691 -> 0.379294: 1.15x faster
Avg: 0.437958 -> 0.383761: 1.14x faster
Significant (t=67.75)
Stddev: 0.00410 -> 0.00390: 1.0502x smaller

### startup_nosite ###
Min: 0.209685 -> 0.660867: 3.15x slower
Avg: 0.218654 -> 0.673249: 3.08x slower
Significant (t=-458.50)
Stddev: 0.00646 -> 0.00752: 1.1645x larger

### telco ###
Min: 0.840453 -> 0.018312: 45.90x faster
Avg: 0.844250 -> 0.019255: 43.85x faster
Significant (t=1088.45)
Stddev: 0.00521 -> 0.00127: 4.0959x smaller

### threaded_count ###
Min: 0.197525 -> 0.151649: 1.30x faster
Avg: 0.213657 -> 0.153572: 1.39x faster
Significant (t=52.58)
Stddev: 0.00779 -> 0.00214: 3.6451x smaller

### unpack_sequence ###
Min: 0.000060 -> 0.000052: 1.16x faster
Avg: 0.000088 -> 0.000069: 1.29x faster
Significant (t=1118.61)
Stddev: 0.00000 -> 0.00000: 1.0022x larger
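
The "Nx faster/slower" figures in the report can be recomputed from the two timings on each line, which helps when comparing runs from different machines. A rough sketch, assuming the `Min:`/`Avg:` line layout shown above; `speed_ratio` is a made-up helper, not part of the benchmark suite:

```python
import re

# Matches lines such as "Min: 0.840453 -> 0.018312: 45.90x faster".
LINE = re.compile(r"^(Min|Avg): ([0-9.]+) -> ([0-9.]+):")


def speed_ratio(line):
    """Return (label, factor, direction) for one Min/Avg line of the report."""
    match = LINE.match(line)
    if match is None:
        raise ValueError("not a Min/Avg line: %r" % line)
    label = match.group(1)
    old, new = float(match.group(2)), float(match.group(3))
    if new <= old:
        # Lower time on the right-hand side means the second interpreter is faster.
        return label, old / new, "faster"
    return label, new / old, "slower"


label, factor, direction = speed_ratio("Min: 0.840453 -> 0.018312: 45.90x faster")
print("%s: %.2fx %s" % (label, factor, direction))
```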

Antoine Pitrou

Oct 29, 2012, 3:22:34 PM
to pytho...@python.org
On Mon, 29 Oct 2012 09:56:57 -0400
Brett Cannon <br...@python.org> wrote:

> To see if the bad iterative_count and threaded_count results were
> consistently bad, I ran the benchmark suite on my MacBook Pro to see how
> "reliable" the benchmarks were. The output is below.
>
> Basically 6 benchmarks (regex_effbot, queens, startup_nosite,
> iterative_count, threaded_count, and telco) had a variance of more than 15%
> performance between my 2 computers, although queens, iterative_count, and
> threaded_count were the only ones that swung between neutral/good to bad
> depending on the machine (the rest either went from bad to very bad, or
> very good to more very good).

This is using different compilers on the 2 computers, right?

Regards

Antoine.

Brett Cannon

Oct 29, 2012, 4:01:18 PM
to Antoine Pitrou, pytho...@python.org
On Mon, Oct 29, 2012 at 3:22 PM, Antoine Pitrou <soli...@pitrou.net> wrote:
> On Mon, 29 Oct 2012 09:56:57 -0400
> Brett Cannon <br...@python.org> wrote:
>
>> To see if the bad iterative_count and threaded_count results were
>> consistently bad, I ran the benchmark suite on my MacBook Pro to see how
>> "reliable" the benchmarks were. The output is below.
>>
>> Basically 6 benchmarks (regex_effbot, queens, startup_nosite,
>> iterative_count, threaded_count, and telco) had a variance of more than 15%
>> performance between my 2 computers, although queens, iterative_count, and
>> threaded_count were the only ones that swung between neutral/good to bad
>> depending on the machine (the rest either went from bad to very bad, or
>> very good to more very good).
>
> This is using different compilers on the 2 computers, right?

Yes: gcc 4.6.3 on Linux and Clang 3.1 on OS X.

Stefan Behnel

Oct 30, 2012, 2:47:19 AM
to pytho...@python.org
Tim Delaney, 28.10.2012 20:48:
> On 28 October 2012 18:22, Stefan Behnel wrote:
>>> How much of an effect would it have on startup times and these
>>> benchmarks if Cython-compiled extensions were used?
>>
>> Depends on what and how much code you use. If you compile everything into
>> one big module that "imports" all of the stdlib when it gets loaded, you'd
>> likely lose a lot of time because it would take a while to initialise all
>> that useless code on startup. If you keep it separate, it would likely be a
>> lot faster because you avoid the interpreter for most of the module startup.
>
> I was specifically thinking in terms of the tests Brett ran (that was the
> full set on speed.python.org, wasn't it?), and having each stdlib module be
> its own extension i.e. no big import module. A literal 1:1 replacement
> where possible.

There's also an intermediate solution of linking the top-N modules into the
interpreter core and leaving the rest outside, but I'd rather go for the
straightforward approach of having separate libs first.

Compiling all that can be compiled is easy enough. I fixed up a couple of
things in Cython (so you need the latest github master) and then ran this
setup.py script from the Lib directory with "build_ext -i":

"""
from distutils.core import setup
from Cython.Build import cythonize
from Cython.Compiler import Options

# improve Python compatibility by allowing some broken code
Options.error_on_unknown_names = False

import sys

setup(
name = 'stuff',
ext_modules = cythonize(
["**/*.py"],
exclude=['**/test/**/*.py', '**/tests/**/*.py',
'**/__init__.py',
'idlelib/MultiCall.py'],
exclude_failures=True,
language_level=sys.version_info[0],
compiler_directives=dict(auto_cpdef=True)
),
)
"""

Note that the extra compiler option above disables fatal compile errors on
unknown (usually mistyped) names, of which Cython hits a couple in the
stdlib. pylint should find them as well; they're worth fixing.

The directive at the end enables automatic module-internal C calls, which
usually give a major speed-up by allowing the C compiler to see what happens.

With the above setup, Cython compiles 612 out of 620 Python modules for me,
excluding test modules and __init__.py files. The rest fails to compile due
to either compiler bugs or statically detected bugs in the Python code.
I'll look through them when I find a bit of time.

One major problem I ran into is that the new importlib bootstrap module
crashes with a RuntimeError("maximum recursion depth exceeded while calling
a Python object") when it hits compiled modules with import cycles (e.g.
shutil and tarfile, or os and posixpath). I guess that's the kind of corner
case you get when working code gets rewritten. Worth giving Py3.2 a try in
comparison.


>>> To be clear - I'm *not* suggesting Cython become part of the required build
>>> toolchain. But *if* the Cython-compiled extensions prove to be
>>> significantly faster I'm thinking maybe it could become a semi-supported
>>> option (e.g. a HOWTO with the caveat "it worked on this particular
>>> system").
>>
>> Sounds reasonable.
>
> I think a stdlib compile script

... see above ...

> + pre-packaged hints for the 3.3 release
> would likely help both 3.3 and Cython acceptance.

That would certainly be a cool feature. This can often be as easy as
putting a .pxd file next to the .py file that overrides the declarations of
functions and classes with static types.


> Putting aside my development interest and looking at it purely from the PoV
> of a Python *user*, I'd really like to see Cython on
> speed.python.org eventually (in two modes - one without hints as a
> baseline and one with
> hints).

I think the above setup.py script, with appropriately adapted glob
patterns, should do that trick well enough for now. Certainly better and
simpler than my initial pyximport configuration. With the obvious caveat
that it takes a bit longer to compile everything, not just the modules that
are actually used. But that's only an install time issue.

Philip Jenvey

Nov 2, 2012, 2:16:25 PM
to Brett Cannon, python-dev

On Oct 26, 2012, at 12:14 PM, Brett Cannon wrote:

>
> Worst benchmark is nosite_startup, best is telco. The benchmarks people might want to analyze (i.e. more than 20% slower in Python 3.3) are mako_v2, threaded_count, normal_startup, iterative_count, pathlib, formatted_logging, and simple_logging.

>
> ### mako_v2 ###
> Min: 0.083660 -> 0.243323: 2.91x slower
> Avg: 0.084634 -> 0.247875: 2.93x slower
> Significant (t=-821.55)
> Stddev: 0.00193 -> 0.00400: 2.0737x larger
> Timeline: b'http://tinyurl.com/98n9fab'

So Mike Bayer and I narrowed down mako_v2's slowness to its use of an inline re.

This:

http://www.makotemplates.org/trac/changeset/c1468b12f115ac9e469150ce24ea042aeae5e270

brings it down to around:

### mako_v2 ###
Min: 0.087608 -> 0.066748: 1.31x faster
Avg: 0.091348 -> 0.071224: 1.28x faster
Significant (t=26.10)
Stddev: 0.00312 -> 0.00447: 1.4340x larger
Timeline: http://tinyurl.com/as2zedo

The culprit is the lru_cache on re._compile_typed. Notice functools' numbers from the profiler:

http://paste.ofcode.org/yZRKnJfTsHesFR8hMWfc7f

Mike also noticed that the mako fix above does nothing to 2.7's numbers.
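
To illustrate the general pattern (this is not the actual Mako change; the pattern, the function names and the sample text are invented for the example): passing a pattern string to the `re` module functions goes through the module's compile cache on every call, while a pattern compiled once at import time skips that lookup.

```python
import re
import timeit

TEXT = "hello <world> & friends"


def escape_inline(text):
    # Pattern given as a string: every call consults the re module's compile cache.
    return re.sub(r"[<>&]", "_", text)


# Compiled once at import time; calls use the pattern object directly.
_SPECIAL = re.compile(r"[<>&]")


def escape_precompiled(text):
    return _SPECIAL.sub("_", text)


if __name__ == "__main__":
    for func in (escape_inline, escape_precompiled):
        seconds = timeit.timeit(lambda: func(TEXT), number=100000)
        print("%-20s %.3fs" % (func.__name__, seconds))
```

The two functions return the same result; only the per-call overhead differs, and how large that difference is depends on the Python version being measured.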

--
Philip Jenvey

Brett Cannon

Nov 2, 2012, 2:42:55 PM
to Philip Jenvey, python-dev
Issue filed for the performance issue: http://bugs.python.org/issue16390

With that change, running the benchmark against the tip of Mako on my laptop now reports 1.25x slower, which is much better than it was. This performance issue might also explain why all of the regex compilation benchmarks are worse under Python 3.3 by a decent margin.

On Fri, Nov 2, 2012 at 2:16 PM, Philip Jenvey <pje...@underboss.org> wrote:
> lru_cache on re._compile_typed

Maciej Fijalkowski

Nov 3, 2012, 10:48:25 AM
to Brett Cannon, python-dev
I would like to warn you about modifying benchmarks like this (or
frameworks). Why is it relevant anyway?

Brett Cannon

Nov 3, 2012, 12:29:18 PM
to Maciej Fijalkowski, python-dev
I'm not modifying any benchmark or framework. At most I will replace Mako 0.7.2 with Mako 0.7.3 in the benchmark suite, since no one is historically recording the mako_v2 benchmark yet and it should be running with the newest version until we set it in stone.