Mercurial startup time is already 45.8x slower than Git, and that's with Mercurial running on Python 2.7.12. Now try to sell Python 3 to Mercurial developers, with a startup time 2x - 3x slower still...
So please continue the efforts to make Python startup even faster, to beat all other programming languages, and finally convince Mercurial to upgrade ;-)
https://www.mercurial-scm.org/wiki/Python3

Nevertheless, I can't really be annoyed or upset at them moving slowly to adopt Python 3, as Matt's objections were entirely legitimate.
If it could save a person’s life, could you find a way to save ten seconds off the boot time? If there were five million people using the Mac, and it took ten seconds extra to turn it on every day, that added up to three hundred million or so hours per year people would save, which was the equivalent of at least one hundred lifetimes saved per year.
Steve Jobs.
It really does depend on how/what users are using Python for. In general, Python has been moving more and more toward a "systems development language" from a "scripting language". Which may make us think "scripting" issues like startup time don't matter -- but, of course, they matter a lot to those use cases.

And about a fifth of the time they spent standing in lines waiting to
buy the latest unnecessary iGadget...
But seriously, that calculation is completely bogus. Not only is Steve
Jobs's arithmetic *completely* wrong, but the whole premise is nonsense.
Do the maths yourself: ten seconds per day is 3650 seconds in a year,
which is slightly over an hour (3600 seconds). Multiply by five million
users, that's about five million hours, not 300 million. So Jobs
exaggerates the time saved by a factor of sixty.
(Or maybe Jobs was warning that Macs crash sixty times a day...)
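For anyone who wants to replay the sums, a quick check in Python (these are just the figures from above, not new data):

# Back-of-envelope check of the figures quoted above.
seconds_per_user_per_year = 10 * 365       # 3650 s, slightly over an hour
users = 5_000_000
total_hours = users * seconds_per_user_per_year / 3600
print(total_hours)                         # ~5.07 million hours, not 300 million
print(300_000_000 / total_hours)           # Jobs' figure is ~59x too high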
But the premise is wrong too. Those hypothetical people don't turn their
Macs on in sequence, each person turning their computer on only after
the previous person's Mac had finished booting. They effectively boot
them up in parallel but offset, spread out over a 24 hour period, so
about 3472 people booting up at the same time each minute of the day.
Time savings for parallel processes don't add up the way Jobs adds them:
if we treat this as 1440 parallel processes (one per minute of the day),
we save 1440 hours a year.
But really, the only meaningful calculation is that each person saves 10
seconds per day. We can't even meaningfully say they save one hour a
year: it doesn't come nicely packaged up for you all at once, so you can
actually do something useful with it, nor can you save those ten seconds
from one day to the next. You only get one shot at using them. What can
you do with ten seconds per day? By the time you decide what to do with
the extra time, it's already gone.
There are good reasons for speeding up boot time, but this sort of
calculation is not one of them. I think it is in particularly bad taste
to exaggerate the significance of it by putting it in terms of saving
lives. You want to save real lives? How about fixing the conditions in
the sweatshops that make Apple phones? And installing suicide nets
around the building doesn't count.
--
Steve
2017-07-20 19:09 GMT+02:00 Cesare Di Mauro <cesare....@gmail.com>:
> I assume that Python loads compiled (.pyc and/or .pyo) from the stdlib. That's something that also influences the startup time (compiling source vs loading pre-compiled modules).
My benchmark was "python3 -m perf command -- python3 -c pass": I don't
explicitly remove .pyc files; I expect that Python uses the prebuilt
.pyc files from __pycache__.
Victor
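For anyone wanting to rule compilation out of such a benchmark entirely, the stdlib can pre-compile everything up front. A minimal sketch (the sysconfig lookup is just one way to locate the stdlib, and writing to it may need elevated permissions):

# Pre-populate __pycache__ so the benchmark measures .pyc loading,
# not source-to-bytecode compilation.
import compileall
import sysconfig

stdlib_dir = sysconfig.get_path("stdlib")
compileall.compile_dir(stdlib_dir, quiet=1)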
That is what Emacs does, and it causes them a lot of trouble. They're
trying to move away from it at the moment, but the direction is not yet
clear. The keyword is "unexec", and it wreaks havoc with malloc.
Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
»Time flies like an arrow, fruit flies like a Banana.«
Emacs has been unexec'ing for as long as I can remember (which is longer than
I can remember Python :). I know that it's been problematic and there have
been many efforts over the years to replace it, but I think it's been a fairly
successful technique in practice, at least on platforms that support it.
I believe the trend is due to languages like Python and Node.js, most of which aggressively discourage threading (more from the broader community than the core languages, but I see a lot of apps using these now), and also the higher reliability afforded by out-of-process tasks (that is, one crash doesn't kill the entire app, e.g. browser tabs).
Optimizing startup time is incredibly valuable, and having tried it a few times I believe that the import system (in essence, stat calls) is the biggest culprit. The tens of ms prior to the first user import can't really go anywhere.
Cheers,
Steve
Top-posted from my Windows phone
“Stat calls in the import system were optimized in importlib a while back”
Yes, I’m aware of that, which is why I don’t have any specific suggestions off-hand. But given the differences in file systems between Windows and other OSs, it wouldn’t surprise me if there were a better approach for NTFS that amortizes those calls. Perhaps not, but it is still the most expensive part of startup that we have any ability to change, so it’s worth investigating.
Cheers,
Steve
Top-posted from my Windows phone
Can you expand on it being "the most expensive part of startup that we
have any ability to change"?
For example, how do Nick's benchmarks above fare on Windows?
Regards
Antoine.
> Optimizing startup time is incredibly valuable,
I've been reading that from the beginning of this thread, but I've been
using Python since 2.4 and I never felt the burden of the startup time.
I'm guessing a lot of people are like me; they just don't express
themselves because "better startup time can't be bad, so let's not put a
barrier on this".
I'm not against it, but since the necessity of a faster Python in
general has been debated for years and is only finally catching up with
the work of Victor Stinner, can somebody explain to me the deal with
startup time?
I understand where it can improve your lives. I just don't get why it's
suddenly such an explosion of expectations and needs.
On 23/07/2017 at 19:36, Brett Cannon wrote:
> It's actually always been something we have tried to improve, it just
> comes in waves. For instance we occasionally re-examine what modules get
> pulled in during startup. Importlib was optimized to help with startup.
> This just happens to be the latest round of trying to improve the situation.
>
> As for why we care, every command-line app wants to at least appear
> faster if not be faster because just getting to the point of being able
> to e.g. print a version number is dominated by Python and app start-up.
Fair enough.
> And this is not guessing; I work with a team that puts out a command
> line app and one of the biggest complaints they get is the startup time.
This I don't get. When I run any command line utility in Python (grin,
ffind, pyped, django-admin.py...), they execute in a split second.
I can't even SEE the difference between:
python3 -c "import os; [print(x) for x in os.listdir('.')]"
and
ls .
I'm having a hard time understanding how the Python VM startup time can
be perceived as a barrier here. I can understand if you have an
application firing Python 1000 times a second, like a CGI service or
some kind of code exec service. But scripting?
Now I can imagine that a given Python program can be slow to start up,
because it imports a lot of things. But not the VM itself.
> Now I can imagine that a given Python program can be slow to start up,
> because it imports a lot of things. But not the VM itself.
That does remind me of a capability we haven't played with a lot recently:
$ python3 -m site
sys.path = [
'/home/ncoghlan',
'/usr/lib64/python36.zip',
'/usr/lib64/python3.6',
'/usr/lib64/python3.6/lib-dynload',
'/home/ncoghlan/.local/lib/python3.6/site-packages',
'/usr/lib64/python3.6/site-packages',
'/usr/lib/python3.6/site-packages',
]
USER_BASE: '/home/ncoghlan/.local' (exists)
USER_SITE: '/home/ncoghlan/.local/lib/python3.6/site-packages' (exists)
ENABLE_USER_SITE: True
The interpreter puts a zip file ahead of the regular unpacked standard
library on sys.path because at one point in time that was a useful
optimisation technique for reducing import costs on application
startup. It was a potentially big win with the old "multiple stat
calls" import implementation, but I'm not aware of any more recent
benchmarks relative to the current listdir-caching based import
implementation.
So I think some interesting experiments to try measuring might be:
- pushing the "always imported" modules into a dedicated zip archive
- having the interpreter pre-seed sys.modules with the contents of
that dedicated archive
- freezing those modules and building them into the interpreter that way
- compiling the standalone top-level modules with Cython, and loading
them as extension modules
- compiling in the Cython generated modules as builtins (not currently
an option for packages & submodules due to [1])
The nice thing about those kinds of approaches is that they're all
fairly general purpose, and relate primarily to how the Python
interpreter is put together, rather than how the individual modules
are written in the first place.
(I'm not volunteering to run those experiments, though - just pointing
out some of the technical options we have available to us that don't
involve adding more handcrafted C extension modules to CPython)
[1] https://bugs.python.org/issue1644818
Cheers,
Nick.
P.S. Checking the current list of source modules implicitly loaded at
startup, I get:
>>> import sys
>>> sorted(k for k, m in sys.modules.items() if m.__spec__ is not None and type(m.__spec__.loader).__name__ == "SourceFileLoader")
['_collections_abc', '_sitebuiltins', '_weakrefset', 'abc', 'codecs',
'encodings', 'encodings.aliases', 'encodings.latin_1',
'encodings.utf_8', 'genericpath', 'io', 'os', 'os.path', 'posixpath',
'rlcompleter', 'site', 'stat']
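As a rough sketch of the first two experiments above (bundling the always-imported modules into a dedicated zip archive), something like the following could serve as a starting point. The archive name and the choice to ship plain .py sources are assumptions, not a tested design:

# Bundle the "always imported" stdlib modules into one zip archive that
# zipimport can serve, instead of scattering them across the filesystem.
import importlib.util
import zipfile

ALWAYS_IMPORTED = [
    "_collections_abc", "_sitebuiltins", "_weakrefset", "abc", "codecs",
    "genericpath", "io", "os", "posixpath", "site", "stat",
]

with zipfile.ZipFile("startup_modules.zip", "w") as zf:
    for name in ALWAYS_IMPORTED:
        spec = importlib.util.find_spec(name)
        if spec and spec.origin and spec.origin.endswith(".py"):
            zf.write(spec.origin, name + ".py")

# A subsequent interpreter could then be launched with this archive at the
# front of the module search path, e.g.:
#   PYTHONPATH=startup_modules.zip python3 -c pass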
--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
I just now found this thread when searching the archive for
threads about startup time. And I was searching for threads about
startup time because Mercurial's startup time has been getting slower
over the past few months and this is causing substantial pain.
As I posted back in 2014 [1], CPython's startup overhead was >10% of the
total CPU time in Mercurial's test suite. And when you factor in the
time to import modules that get Mercurial to a point where it can run
commands, it was more like 30%!
Mercurial's full test suite currently runs `hg` ~25,000 times. Using
Victor's startup time numbers of 6.4ms for 2.7 and 14.5ms for
3.7/master, Python startup overhead contributes ~160s on 2.7 and ~360s
on 3.7/master. Even if you divide this by the number of available CPU
cores, we're talking dozens of seconds of wall time just waiting for
CPython to get to a place where Mercurial's first bytecode can execute.
And the problem is worse when you factor in the time it takes to import
Mercurial's own modules.
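(Replaying that estimate in code, with the numbers quoted above:)

# Startup overhead across the test suite, using Victor's timings.
invocations = 25_000
per_start_27 = 0.0064    # seconds per startup, Python 2.7
per_start_37 = 0.0145    # seconds per startup, Python 3.7/master
print(invocations * per_start_27)    # ~160 s of CPU time
print(invocations * per_start_37)    # ~362 s of CPU time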
As a concrete example, I recently landed a Mercurial patch [2] that
stubs out zope.interface to prevent the import of 9 modules on every
`hg` invocation. This "only" saved ~6.94ms for a typical `hg`
invocation. But this decreased the CPU time required to run the test
suite on my i7-6700K from ~4450s to ~3980s (~89.5% of original) - a
reduction of almost 8 minutes of CPU time (and over 1 minute of wall time)!
By the time CPython gets Mercurial to a point where we can run useful
code, we've already used up most of (or blown past) the time budget in
which humans perceive an action/command as instantaneous. If you ignore startup
overhead, Mercurial's performance compares quite well to Git's for many
operations. But the reality is that CPython startup overhead makes it
look like Mercurial is non-instantaneous before Mercurial even has the
opportunity to execute meaningful code!
Mercurial provides a `chg` program that essentially spins up a daemon
`hg` process running a "command server" so the `chg` program [written in
C - no startup overhead] can dispatch commands to an already-running
Python/`hg` process and avoid paying the startup overhead cost. When you
run Mercurial's test suite using `chg`, it completes *minutes* faster.
`chg` exists mainly as a workaround for slow startup overhead.
Changing gears, my day job is maintaining Firefox's build system. We use
Python heavily in the build system. And again, Python startup overhead
is problematic. I don't have numbers offhand, but we likely invoke a few
hundred Python processes as part of building Firefox. It should be
several thousand, but we've had to "hack" parts of the build system to
"batch" certain build actions into single process invocations in order to
avoid Python startup overhead. This undermines the ability of some build
tools to formulate a reasonable understanding of the DAG, causes a bit of
pain for build system developers, and makes it difficult to achieve
"no-op" and fast incremental builds, because we're always invoking
certain Python processes now that we've had to move DAG awareness out of
the build backend and into Python. At some point, we'll
likely replace Python code with Rust so the build system is more "pure"
and easier to maintain and reason about.
I've seen posts in this thread and elsewhere in the CPython development
universe that challenge whether milliseconds in startup time matter.
Speaking as a Mercurial and Firefox build system developer,
*milliseconds absolutely matter*. Going further, *fractions of
milliseconds matter*. For Mercurial's test suite with its ~25,000 Python
process invocations, 1ms translates to ~25s of CPU time. With 2.7,
Mercurial can dispatch commands in ~50ms. When you load common
extensions, it isn't uncommon to see process startup overhead of
100-150ms! A millisecond here. A millisecond there. Before you know it,
we're talking *minutes* of CPU (and potentially wall) time in order to
run Mercurial's test suite (or build Firefox, or ...).
From my perspective, Python process startup and module import overhead
is a severe problem for Python. I don't say this lightly, but in my mind
the problem causes me to question the viability of Python for popular
use cases, such as CLI applications. When choosing a programming
language, I want one that will scale as a project grows. Vanilla process
overhead has Python starting off significantly slower than compiled code
(or even Perl) and adding module import overhead into the mix makes
Python slower and slower as projects grow. As someone who has to deal
with this slowness on a daily basis, I can tell you that it is extremely
frustrating and it does matter. I hope that the importance of the
problem will be acknowledged (milliseconds *do* matter) and that
creative minds will band together to address it. Since I am
disproportionately impacted by this issue, if there's anything I can do
to help, let me know.
Gregory
[1] https://mail.python.org/pipermail/python-dev/2014-May/134528.html
[2] https://www.mercurial-scm.org/repo/hg/rev/856f381ad74b
What do you propose to make Python startup faster?
As I wrote in my previous emails, many Python core developers care about
the startup time, and we are working on making it faster.
INADA Naoki added -X importtime to identify slow imports and
understand where Python spent its startup time.
Recent example: Barry Warsaw identified that pkg_resources is slow and
added importlib.resources to Python 3.7:
https://docs.python.org/dev/library/importlib.html#module-importlib.resources
Brett Cannon has also been working on a standard solution for lazy
imports for many years:
https://pypi.org/project/modutil/
https://snarky.ca/lazy-importing-in-python-3-7/
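For reference, the stdlib already exposes a building block for this: importlib.util.LazyLoader. A minimal sketch, essentially the recipe from the importlib documentation:

import importlib.util
import sys

def lazy_import(name):
    # The module is registered immediately, but executing its code is
    # deferred until the first attribute access.
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

json = lazy_import("json")    # cheap now; the real import runs on first use
# json.dumps({}) would trigger the actual module execution here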
Nick Coghlan is working on the C API to configure Python startup: PEP
432. Once it's ready, maybe Mercurial could use a custom Python
optimized for its use case.
Does Mercurial need all directories of sys.path?
What's the status of the "system python" project? :-)
I also would prefer Python without the site module. Can we rewrite
this module in C maybe? Until recently, the site module was needed on
Python to create the "mbcs" encoding alias. Fortunately, that feature
has now been moved into Lib/encodings/__init__.py (a new private
_alias_mbcs() function).
Correct me if I'm wrong, but aren't there downsides with regard to C extension compatibility when there is no shared libpython? Or does all the packaging tooling "just work" without a libpython? (It's possible I have my wires crossed up with something else regarding a statically linked Python.)
FWIW, Google has a patched glibc that implements dlopen_with_offset().
It allows you to do things like memory map the current binary and then
dlopen() a shared library embedded in an ELF section.
I've seen the code in the branch at
https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/google/grte/v4-2.19/master.
It likely exists elsewhere. An attempt to upstream it occurred at
https://sourceware.org/bugzilla/show_bug.cgi?id=11767. It is probably
well worth someone's time to pick up the torch and get this landed in
glibc so everyone can be a massive step closer to self-contained, single
binary applications. Of course, it will take years before you can rely
on a glibc version with this API being deployed universally. But the
sooner this lands...
>
> I’ll plug shiv and importlib.resources (and the standalone importlib_resources) again here. :)
>
>> If you go this route, please don't require the use of zlib for file compression, as zlib is painfully slow compared to alternatives like lz4 and zstandard.
>
> shiv works in a similar manner to pex, although it’s a completely new implementation that doesn’t suffer from huge sys.paths or the use of pkg_resources. shiv + importlib.resources saves us 25-50% of warm cache startup time. That makes things better but still not ideal. Ultimately though that means we don’t suffer from the slowness of zlib since we don’t count cold cache times (i.e. before the initial pyz unpacking operation).
>
> Cheers,
> -Barry
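(As an aside, the stdlib's own zipapp module builds a self-contained .pyz in the same spirit as shiv/pex; a minimal sketch, where the "myapp" directory is hypothetical:)

import zipapp

# Bundle a directory containing __main__.py into one runnable archive.
zipapp.create_archive(
    "myapp",                             # hypothetical application directory
    target="myapp.pyz",
    interpreter="/usr/bin/env python3",  # writes a shebang line
)
# Run it with: python3 myapp.pyz (or ./myapp.pyz once marked executable)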
On Wed, May 2, 2018, at 09:42, Gregory Szorc wrote:
> The direction Mercurial is going in is that `hg` will likely become a Rust
> binary (instead of a #!python script) that will use an embedded Python
> interpreter. So we will have low-level control over the interpreter via the
> C API. I'd also like to see us distribute a copy of Python in our official
> builds. This will allow us to take various shortcuts, such as not having to
> probe various sys.path entries since certain packages can only exist in one
> place. I'd love to get to the state Google is at where they have
> self-contained binaries with ELF sections containing Python modules. But
> that requires a bit of very low-level hacking. We'll likely have a Rust
> binary (that possibly static links libpython) and a separate JAR/zip-like
> file containing resources.
I'm curious about the rust binary. I can see that would give you startup time benefits similar to the ones you could get hacking the interpreter directly; e.g., you can use a zipfile for everything and not have site.py. But it seems like the Python-side wins would stop there. Is this all a prelude to incrementally rewriting hg in rust? (Mercuric oxide?)
Nobody in the project is seriously talking about a complete rewrite in Rust. Contributors to the project have varying opinions on how aggressively Rust should be utilized. People who contribute to the C code, low-level primitives (like storage, deltas, etc), and those who care about performance tend to want more Rust. One thing we almost universally agree on is that we want to rewrite all of Mercurial's C code in Rust. I anticipate that figuring out the balance between Rust and Python in Mercurial will be an ongoing conversation/process for the next few years.
--
Ryan (ライアン)
Yoko Shimomura, ryo (supercell/EGOIST), Hiroyuki Sawano >> everyone else
https://refi64.com/
> ----------
> _______________________________________________
> Python-Dev mailing list
> Pytho...@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/rymg19%40gmail.com
On 5/2/2018 8:56 PM, Gregory Szorc wrote:
> Nobody in the project is seriously talking about a complete rewrite in
> Rust. [...]
Have you considered simply rewriting CPython in Rust?
And yes, the 4th word in that question was intended to produce peals of shocked laughter. But why Rust? Why not Go?
On 3 May 2018 at 15:56, Glenn Linderman <v+py...@g.nevcal.com> wrote:
> Have you considered simply rewriting CPython in Rust?
FWIW, I'd actually like to see Rust approved as a language for writing
stdlib extension modules, but actually ever making that change in policy
would require a concrete motivating use case.
> And yes, the 4th word in that question was intended to produce peals of
> shocked laughter. But why Rust? Why not Go?
Trying to get two different garbage collection engines to play nice with
each other is a recipe for significant pain, since you can easily end up
with uncollectable cycles that neither GC system has complete visibility
into (all it needs is a loop from PyObject A -> Go Object B -> back to
PyObject A). Combining Python and Rust can still get into that kind of
trouble when using reference counting on the Rust side, but it's a lot
easier to avoid than it is in runtimes with mandatory GC.
Recently, I reported how stdlib slows down `import requests`.
https://github.com/requests/requests/issues/4315#issuecomment-385584974
* Add a faster and simpler http.parser (maybe based on h11 [1]) and avoid
using the email module in the http module.
A few of us spent some time at last year’s core Python dev talking about other things we could do to improve Python’s start up time, not just with the interpreter itself, but within the larger context of the Python ecosystem. Many ideas seem promising until you dive into the details, so it’s definitely hard to imagine maintaining all of Python’s dynamic semantics and still making it an order of magnitude faster to start up. But that’s not an excuse to give up, and I’m hoping we can continue to attack the problem, both in the micro and the macro, for 3.8 and beyond, because the alternative is that Python becomes less popular as an implementation language for CLIs. That would be sad, and definitely has a long term impact on Python’s popularity.
Cheers,
-Barry
> On May 2, 2018, at 8:57 PM, INADA Naoki <songof...@gmail.com> wrote:
>
> Recently, I reported how stdlib slows down `import requests`.
> https://github.com/requests/requests/issues/4315#issuecomment-385584974
>
> For Python 3.8, my ideas for faster startup time are:
>
> * Add lazy compiling API or flag in `re` module. The pattern is compiled
> when first used.
How about going the other way and allowing compilation at Python *compile* time? That would actually make things faster instead of just moving the time spent around.
I do see value in being less eager in Python, but I think the real wins are hiding behind ahead-of-time compilation.
- Ł
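(INADA's lazy `re` idea can be approximated in user code today; a hypothetical sketch, not an existing API:)

import re

class LazyPattern:
    """Defer re.compile() until the pattern is first actually used."""

    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None

    def __getattr__(self, name):
        # Only called for attributes not set in __init__ (match, search, ...).
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return getattr(self._compiled, name)

DATE = LazyPattern(r"\d{4}-\d{2}-\d{2}")    # no compile cost at import time
print(DATE.match("2018-05-02"))             # first use triggers the compile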
Could one make a little startup utility that, when invoked the first
time, starts up a raw Python interpreter, keeps it running somewhere,
and then forks it to run the actual Python code?
Then every invocation after that would make a new fork. I presume
forking is a LOT faster than re-invoking the entire startup.
I suspect that many of the cases where startup time really matters is
when a command line utility is likely to be invoked many times — often
in the same shell instance.
So having a “pre-built” warm interpreter ready to go could really help.
This is way past my technical expertise to know if it’s possible, or
to try to prototype it, but I’m sure many of you would know.
-CHB
Sent from my iPhone
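(This is essentially what Mercurial's chg and tools like uprocd do. A minimal POSIX-only sketch of the idea, with a hypothetical socket path and stand-in imports:)

import os
import socket

SOCK_PATH = "/tmp/py-warm.sock"    # hypothetical rendezvous point

def serve():
    import json, argparse, re      # stand-ins for an app's expensive imports

    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)
    server.listen()
    while True:
        conn, _ = server.accept()
        if os.fork() == 0:         # child inherits the warm interpreter
            request = conn.recv(4096).decode()
            # ... dispatch `request` using the already-imported modules ...
            conn.sendall(b"done")
            os._exit(0)
        conn.close()               # parent keeps accepting new clients

if __name__ == "__main__":
    serve()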
On May 11, 2018 9:39:28 AM Chris Barker - NOAA Federal via Python-Dev
<pytho...@python.org> wrote:
<plug> https://refi64.com/uprocd/ </plug>
It will break hash randomization.
See also: https://www.cvedetails.com/cve/CVE-2017-11499/
This discussion subthread is not about having a memory image dumped on
disk, but a daemon utility that preloads a new Python process when you
first start up your CLI application. Each time a new process is
preloaded, it will by construction use a new hash seed.
(by contrast, the Node.js CVE issue you linked to is about having the
same hash seed accross a Node.js version; that's disastrous)
Also, you can add a reuse limit to ensure that the hash seed is rotated
(e.g. every 100 invocations).
Regards
Antoine.
On 14/05/2018 at 19:12, INADA Naoki wrote:
> I'm sorry, the word *will* may be stronger than I thought.
>
> I meant that if a memory image dumped on disk is used casually,
> it may make it easier to open a security hole.
>
> For example, if the `hg` memory image is reused, and it can be leaked
> in some way, hg serve will be vulnerable to hash-DoS.
We were debugging abysmally slow execution of Mercurial's test harness
on macOS and we discovered a new wrinkle to the startup time problem.
It appears that APFS acquires some shared locks/mutexes in the kernel
when executing readdir() and other filesystem system calls. When you
have several Python processes all starting at the same time, I/O
attached to module importing (import.c:case_ok() by the looks of it for
Python 2.7) becomes a stress test of sorts for this lock acquisition. On
my 6+6 core MacBook Pro, ~75% of overall system CPU is spent in the
kernel when executing the test harness with 12 parallel tests.
If we run the test harness with the persistent `chg` command server
(which eliminates Python process startup overhead), wall execution time
drops from ~37:43s to ~9:06s.
This problem of shared locks on read-only operations appears to be
similar to that of AUFS, which I've blogged about [1].
It is pretty common for non-compiled languages (like Python, Ruby, PHP,
Perl, etc) to stat() the world as part of looking for modules to load.
Typically, the filesystem's stat cache will save you and the overhead
from hundreds or thousands of lookups is trivial (after first load). But
it appears APFS is quite sensitive to it. Any work to reduce the number
of filesystem API calls during Python startup will likely have a
profound impact on APFS when multiple Python processes are starting. A
"frozen" application where modules are in a shared container file is
probably ideal.
Python 3.7 doesn't exhibit as much of a problem. But it is still there.
A brief audit of the importer code and call stacks confirms it is the
same problem - just less prevalent. Wall time execution of the test
harness from Python 2.7 to Python 3.7 drops from ~37:43s to ~20:39.
Overall kernel CPU time drops from ~75% to ~19%. And that wall time
improvement is despite Python 3's slower process startup. So locking in
the kernel is really a killer on Python 2.7.
While we're here, CPython might want to look into getdirentriesattr() as
a replacement for readdir(). We switched to it in Mercurial several
years ago to make `hg status` operations significantly faster [2]. I'm
not sure if it will yield a speedup on APFS though. But it's worth a
try. (If it does, you could probably make
os.listdir()/os.scandir()/os.walk() significantly faster on macOS.)
I hope someone finds this information useful to further improving
[startup] performance. (And given that Python 3.7 is substantially
faster by avoiding excessive readdir(), I wouldn't be surprised if this
problem is already known!)
[1] https://gregoryszorc.com/blog/2017/12/08/good-riddance-to-aufs/
[2] https://www.mercurial-scm.org/repo/hg/rev/05ccfe6763f1
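(Relatedly, on the stat-reduction front, os.scandir() already reuses the data a directory read returns instead of issuing a separate stat() per entry; a small illustration:)

import os

def count_py_files(path="."):
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            # is_file() usually answers from the DirEntry's cached data,
            # avoiding an extra stat() call per entry.
            if entry.is_file() and entry.name.endswith(".py"):
                total += 1
    return total

print(count_py_files())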