Overcoming python performance penalty for multicore CPU

John Nagle

unread,

Feb 2, 2010, 6:02:49 PM2/2/10

to

I know there's a performance penalty for running Python on a
multicore CPU, but how bad is it? I've read the key paper
("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
if the GIL just limited Python to running on one CPU at a time,
but it's worse than that; there's excessive overhead due to
a lame locking implementation. Running CPU-bound multithreaded
code on a dual-core CPU runs HALF AS FAST as on a single-core
CPU, according to Beasley.

My main server application, which runs "sitetruth.com"
has both multiple processes and multiple threads in each process.
The system rates web sites, which involves reading and parsing
up to 20 pages from each domain. Analysis of each domain is
performed in a separate process, but each process uses multiple
threads to read process several web pages simultaneously.

Some of the threads go compute-bound for a second or two at a time as
they parse web pages. Sometimes two threads (but never more than three)
in the same process may be parsing web pages at the same time, so
they're contending for CPU time.

So this is nearly the worst case for the lame GIL lock logic.
Has anyone tried using "affinity" ("http://pypi.python.org/pypi/affinity")
to lock each Python process to a single CPU? Does that help?

John Nagle

exa...@twistedmatrix.com

unread,

Feb 2, 2010, 6:24:59 PM2/2/10

to John Nagle, pytho...@python.org

On 11:02 pm, na...@animats.com wrote:
> I know there's a performance penalty for running Python on a
>multicore CPU, but how bad is it? I've read the key paper
>("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
>if the GIL just limited Python to running on one CPU at a time,
>but it's worse than that; there's excessive overhead due to
>a lame locking implementation. Running CPU-bound multithreaded
>code on a dual-core CPU runs HALF AS FAST as on a single-core
>CPU, according to Beasley.

It's not clear that Beasley's performance numbers apply to any platform
except OS X, which has a particularly poor implementation of the
threading primitives CPython uses to implement the GIL.

You should check to see if it actually applies to your deployment
environment.

The GIL has been re-implemented recently. Python 3.2, I think, will
include the new implementation, which should bring OS X performance up
to the level of other platforms. It may also improve certain other
aspects of thread switching.

Jean-Paul

alex23

unread,

Feb 2, 2010, 9:02:09 PM2/2/10

to

On Feb 3, 9:02 am, John Nagle <na...@animats.com> wrote:
> I know there's a performance penalty for running Python on a
> multicore CPU, but how bad is it? I've read the key paper
> ("www.dabeaz.com/python/GIL.pdf"), of course.

It's a shame that Python 3.x is dead to you, otherwise you'd be able
to enjoy the new GIL implementation in 3.2: http://www.dabeaz.com/python/NewGIL.pdf

Actually, it looks like you probably still can:
+ patch for 2.5.4: http://thread.gmane.org/gmane.comp.python.devel/109929
+ patch for 2.7? http://bugs.python.org/issue7753

(Can't comment on affinity, though, sorry)

Terry Reedy

unread,

Feb 3, 2010, 1:33:20 AM2/3/10

to pytho...@python.org

The patch was rejected for 2.7 (and earlier) because it could break code
as explained in the discussion. One would have to apply and compile
their own binary.

Paul Rubin

unread,

Feb 3, 2010, 8:51:36 PM2/3/10

to

John Nagle <na...@animats.com> writes:
> Analysis of each domain is
> performed in a separate process, but each process uses multiple
> threads to read process several web pages simultaneously.
>
> Some of the threads go compute-bound for a second or two at a time as
> they parse web pages.

You're probably better off using separate processes for the different
pages. If I remember, you were using BeautifulSoup, which while very
cool, is pretty doggone slow for use on large volumes of pages. I don't
know if there's much that can be done about that without going off on a
fairly messy C or C++ coding adventure. Maybe someday someone will do
that.

John Nagle

unread,

Feb 3, 2010, 10:46:43 PM2/3/10

to

I already use separate processes for different domains. I could
live with Python's GIL as long as moving to a multicore server
doesn't make performance worse. That's why I asked about CPU dedication
for each process, to avoid thrashing at the GIL.

There's enough intercommunication between the threads working on
a single site that it's a pain to do them as subprocesses. And I
definitely don't want to launch subprocesses for each page; the
Python load time would be worse than the actual work. The
subprocess module assumes you're willing to launch a subprocess
for each transaction.

The current program organization is that there's a scheduler
process which gets requests, prioritizes them, and runs the requested
domains through the site evaluation mill. The scheduler maintains a
pool of worker processes which get work request via their input pipe, in Pickle
format, and return results, again in Pickle format. When not in
use, the worker processes sit there dormant, so there's no Python
launch cost for each transaction. If a worker process crashes, the
scheduler replaces it with a fresh one, and every few hundred uses,
each worker process is replaced with a fresh copy, in case Python
has a memory leak. It's a lot like the way
FCGI works.

Scheduling is managed using an in-memory
table in MySQL, so the load can be spread over a cluster if desired,
with a scheduler process on each machine.

So I already have a scalable architecture. The only problem
is excess overhead on multicore CPUs.

John Nagle

Steve Holden

unread,

Feb 3, 2010, 10:50:19 PM2/3/10

to pytho...@python.org

John Nagle wrote:
> Paul Rubin wrote:
>> John Nagle <na...@animats.com> writes:
>>> Analysis of each domain is
>>> performed in a separate process, but each process uses multiple
>>> threads to read process several web pages simultaneously.
>>>
>>> Some of the threads go compute-bound for a second or two at a time as
>>> they parse web pages.
>>
>> You're probably better off using separate processes for the different
>> pages. If I remember, you were using BeautifulSoup, which while very
>> cool, is pretty doggone slow for use on large volumes of pages. I don't
>> know if there's much that can be done about that without going off on a
>> fairly messy C or C++ coding adventure. Maybe someday someone will do
>> that.
>
> I already use separate processes for different domains. I could
> live with Python's GIL as long as moving to a multicore server
> doesn't make performance worse. That's why I asked about CPU dedication
> for each process, to avoid thrashing at the GIL.
>

I believe it's already been said that the GIL thrashing is mostly MacOS
specific. You might also find something in the affinity module

http://pypi.python.org/pypi/affinity/0.1.0

to ensure that each process in your pool runs on only one processor.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

John Nagle

unread,

Feb 4, 2010, 1:11:56 AM2/4/10

to

Steve Holden wrote:
> John Nagle wrote:
>> Paul Rubin wrote:
>>> John Nagle <na...@animats.com> writes:
>>>> Analysis of each domain is
>>>> performed in a separate process, but each process uses multiple
>>>> threads to read process several web pages simultaneously.
>>>>
>>>> Some of the threads go compute-bound for a second or two at a time as
>>>> they parse web pages.
>>> You're probably better off using separate processes for the different
>>> pages. If I remember, you were using BeautifulSoup, which while very
>>> cool, is pretty doggone slow for use on large volumes of pages. I don't
>>> know if there's much that can be done about that without going off on a
>>> fairly messy C or C++ coding adventure. Maybe someday someone will do
>>> that.
>> I already use separate processes for different domains. I could
>> live with Python's GIL as long as moving to a multicore server
>> doesn't make performance worse. That's why I asked about CPU dedication
>> for each process, to avoid thrashing at the GIL.
>>
> I believe it's already been said that the GIL thrashing is mostly MacOS
> specific. You might also find something in the affinity module

No, the original analysis was MacOS oriented, but the same mechanism
applies for fighting over the GIL on all platforms. There was some
pontification that it might be a MacOS-only issue, but no facts
were presented. It might be cheaper on C implementations with mutexes
that don't make system calls for the non-blocking cases.

John Nagle

Paul Rubin

unread,

Feb 4, 2010, 1:57:40 AM2/4/10

to

John Nagle <na...@animats.com> writes:
> There's enough intercommunication between the threads working on
> a single site that it's a pain to do them as subprocesses. And I
> definitely don't want to launch subprocesses for each page; the
> Python load time would be worse than the actual work. The
> subprocess module assumes you're willing to launch a subprocess
> for each transaction.

Why not just use socketserver and have something like a fastcgi?

Anh Hai Trinh

unread,

Feb 4, 2010, 6:13:58 AM2/4/10

to na...@animats.com

On Feb 4, 10:46 am, John Nagle <na...@animats.com> wrote:
>
> There's enough intercommunication between the threads working on
> a single site that it's a pain to do them as subprocesses. And I
> definitely don't want to launch subprocesses for each page; the
> Python load time would be worse than the actual work. The
> subprocess module assumes you're willing to launch a subprocess
> for each transaction.

You could perhaps use a process pool inside each domain worker to work
on the pages? There is multiprocessing.Pool and other
implementations.

For examples, in this library, you can s/ThreadPool/ProcessPool/g and
this example would work: <http://www.onideas.ws/stream.py/#retrieving-
web-pages-concurrently>.

If you want to DIY, with multiprocessing.Lock/Pipe/Queue, I don't
understand why it would be more of a pain to write your threads as
processes.

// aht
http://blog.onideas.ws

Stefan Behnel

unread,

Feb 8, 2010, 3:59:42 AM2/8/10

to

Paul Rubin, 04.02.2010 02:51:

Well, if multi-core performance is so important here, then there's a pretty
simple thing the OP can do: switch to lxml.

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan

Paul Rubin

unread,

Feb 8, 2010, 4:10:07 AM2/8/10

to

Stefan Behnel <stef...@behnel.de> writes:
> Well, if multi-core performance is so important here, then there's a pretty
> simple thing the OP can do: switch to lxml.
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Well, lxml is uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML. The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors. Beautiful Soup is slower because it's
full of special cases and hacks for that reason, and it is written in
Python. Writing something that complex in C to handle so much
potentially malicious input would be quite a lot of work to write at
all, and very difficult to ensure was really safe. Look at the many
browser vulnerabilities we've seen over the years due to that sort of
problem, for example. But, for web crawling, you really do need to
handle the messy and wrong HTML properly.

Antoine Pitrou

unread,

Feb 8, 2010, 9:28:41 AM2/8/10

to pytho...@python.org

Le Tue, 02 Feb 2010 15:02:49 -0800, John Nagle a écrit :
> I know there's a performance penalty for running Python on a multicore
> CPU, but how bad is it? I've read the key paper
> ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate if
> the GIL just limited Python to running on one CPU at a time, but it's
> worse than that; there's excessive overhead due to a lame locking
> implementation. Running CPU-bound multithreaded code on a dual-core CPU
> runs HALF AS FAST as on a single-core CPU, according to Beasley.

That's on certain types of workloads, and perhaps on certain OSes, so you
should try benching your own workload to see whether it applies.

Two closing remarks:
- this should (hopefully) be fixed in 3.2, as exarkun noticed
- instead of spawning one thread per Web page, you could use Twisted or
another event loop mechanism in order to process pages serially, in the
order of arrival

Regards

Antoine.

J Kenneth King

unread,

Feb 8, 2010, 11:21:07 AM2/8/10

to

Paul Rubin <no.e...@nospam.invalid> writes:

If the difference is great enough, you might get a benefit from
analyzing all pages with lxml and throwing invalid pages into a bucket
for later processing with BeautifulSoup.

John Krukoff

unread,

Feb 8, 2010, 11:43:10 AM2/8/10

to python-list

Actually, lxml has an HTML parser which does pretty well with the
standard level of broken one finds most often on the web. And, when it
falls down, it's easy to integrate BeautifulSoup as a slow backup for
when things go really wrong (as J Kenneth King mentioned earlier):

http://codespeak.net/lxml/lxmlhtml.html#parsing-html

At least in my experience, I haven't actually had to parse anything that
lxml couldn't handle yet, however.
--
John Krukoff <jkru...@ltgc.com>
Land Title Guarantee Company

Martin v. Loewis

unread,

Feb 21, 2010, 4:22:28 PM2/21/10

to John Nagle

John Nagle wrote:
> I know there's a performance penalty for running Python on a
> multicore CPU, but how bad is it? I've read the key paper
> ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
> if the GIL just limited Python to running on one CPU at a time,
> but it's worse than that; there's excessive overhead due to
> a lame locking implementation. Running CPU-bound multithreaded
> code on a dual-core CPU runs HALF AS FAST as on a single-core
> CPU, according to Beasley.

I couldn't reproduce these results on Linux. Not sure what "HALF AS
FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I
couldn't reproduce.

If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz)
machine, I get 30s for the sequential execution, 40s for the
multi-threaded case, and 32s for the multi-threaded case when pinning
the Python process to a single CPU (using taskset(1)).

So it's 6% overhead for threading, and 25% penalty for multicore CPUs -
far from the 100% you seem to expect.

Regards,
Martin

Ryan Kelly

unread,

Feb 21, 2010, 4:45:50 PM2/21/10

to pytho...@python.org

It's far from scientific, but I've seen behaviour that's close to a 100%
performance penalty on a dual-core linux system:

http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2

Short story: a particular test suite of mine used to run in around 25
seconds, but a bit of ctypes magic to set thread affinity dropped the
running time to under 13 seconds.

Cheers,

Ryan

--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ry...@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details

signature.asc

Martin v. Loewis

unread,

Feb 21, 2010, 5:05:44 PM2/21/10

to Ryan Kelly, pytho...@python.org

> It's far from scientific, but I've seen behaviour that's close to a 100%
> performance penalty on a dual-core linux system:
>
> http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2
>
> Short story: a particular test suite of mine used to run in around 25
> seconds, but a bit of ctypes magic to set thread affinity dropped the
> running time to under 13 seconds.

Indeed, it's not scientific - but with a few more details, you could
improve it quite a lot: what specific Linux distribution (the posting
doesn't even say it's Linux), what specific Python version had you been
using? (less important) what CPUs? If you can: what specific test suite?

A lot of science is about repeatability. Making a systematic study is
(IMO) over-valued - anecdotal reports are useful, too, as long as they
allow for repeatable experiments.

Regards,
Martin

Martin v. Loewis

unread,

Feb 21, 2010, 5:05:44 PM2/21/10

to Ryan Kelly, pytho...@python.org

> It's far from scientific, but I've seen behaviour that's close to a 100%
> performance penalty on a dual-core linux system:
>
> http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2
>
> Short story: a particular test suite of mine used to run in around 25
> seconds, but a bit of ctypes magic to set thread affinity dropped the
> running time to under 13 seconds.

Indeed, it's not scientific - but with a few more details, you could

Ryan Kelly

unread,

Feb 21, 2010, 5:39:27 PM2/21/10

to pytho...@python.org

On Sun, 2010-02-21 at 23:05 +0100, Martin v. Loewis wrote:
> > It's far from scientific, but I've seen behaviour that's close to a 100%
> > performance penalty on a dual-core linux system:
> >
> > http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2
> >
> > Short story: a particular test suite of mine used to run in around 25
> > seconds, but a bit of ctypes magic to set thread affinity dropped the
> > running time to under 13 seconds.
>
> Indeed, it's not scientific - but with a few more details, you could
> improve it quite a lot: what specific Linux distribution (the posting
> doesn't even say it's Linux), what specific Python version had you been
> using? (less important) what CPUs? If you can: what specific test suite?

I'm on Ubuntu Karmic, Python 2.6.4, an AMD Athlon 7750 dual core.

Unfortunately the test suite is for a proprietary application. I've
been able to reproduce similar behaviour with an open-source test suite,
using the current trunk of the "pyfilesystem" project:

http://code.google.com/p/pyfilesystem/

In this project "OSFS" is an object-oriented interface to the local
filesystem. The test case "TestOSFS.test_cases_in_separate_dirs" runs
three theads, each doing a bunch of IO in a different directory.

Running the tests normally:

rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 9.787s

That's the best result from five runs - I saw it go as high as 12
seconds. Watching it in top, I see CPU usage at around 150%.

Now using threading2 to set the process cpu affinity at the start of the
test run:

rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 3.792s

Again, best of five. The variability in times here is much lower - I
never saw it go above 4 seconds. CPU usage is consistently 100%.

signature.asc