
Embedding Python, threading and scalability


Wenning Qiu

Jul 8, 2003, 5:54:22 PM
I am researching issues related to embedding Python in C++ for a
project.

My project will be running on an SMP box and requires scalability.
However, my test shows that Python threading has very poor performance
in terms of scaling. In fact it doesn't scale at all.

I wrote a simple test program to complete a given number of iterations
of a simple loop. The total number of iterations can be divided evenly
among a number of threads. My test shows that as the number of threads
grows, the CPU usage grows and the response time gets longer. For
example, to complete the same amount of work, one thread takes 10
seconds, 2 threads take 20 seconds and 3 threads take 30 seconds.
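
The test program itself was not posted; a minimal pure-Python sketch of that kind of benchmark (hypothetical names, modern Python syntax, not the original C++-embedded version) looks like this:

```python
# Hypothetical reconstruction of the benchmark described above: split a
# fixed number of loop iterations evenly across N threads. Because the
# GIL serializes pure-Python bytecode, wall-clock time does not improve
# as N grows.
import threading
import time

TOTAL_ITERATIONS = 2_000_000

def spin(n, results, index):
    # A simple CPU-bound loop; holds the GIL for essentially its whole run.
    count = 0
    for _ in range(n):
        count += 1
    results[index] = count

def run(num_threads):
    per_thread = TOTAL_ITERATIONS // num_threads
    results = [0] * num_threads
    threads = [threading.Thread(target=spin, args=(per_thread, results, i))
               for i in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return sum(results), elapsed

if __name__ == "__main__":
    for n in (1, 2, 4):
        done, elapsed = run(n)
        print(f"{n} thread(s): {done} iterations in {elapsed:.2f}s")
```

On a multiprocessor box this typically shows flat (or worsening) elapsed time as threads are added, which is the behavior reported above.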

The fundamental reason for lacking scalability is that Python uses a
global interpreter lock for thread safety. That global lock must be
held by a thread before it can safely access Python objects.

I thought I might be able to make embedded Python scalable by
embedding multiple interpreters and have them run independently in
different threads. However "Python/C API Reference Manual" chapter 8
says that "The global interpreter lock is also shared by all threads,
regardless of to which interpreter they belong". Therefore with
current implementation, even multiple interpreters do not provide
scalability.

Has anyone on this list run into the same problem that I have, or does
anyone know of any plan of totally insulating multiple embedded Python
interpreters?

Thanks,
Wenning Qiu

Andrew Dalke

Jul 8, 2003, 6:17:04 PM
Wenning Qiu:

> I am researching issues related to embedding Python in C++ for a
> project.

> Has anyone on this list run into the same problem that I have, or does
> anyone know of any plan of totally insulating multiple embedded Python
> interpreters?

Ahh, the Global Interpreter Lock (GIL).

Years ago, Greg Stein had a version of Python 1.4 running with no GIL.

http://www.python.org/ftp/python/contrib-09-Dec-1999/System/threading.README
Search for "free threading" to get more hits on this topic.

As I recall, it slowed down the performance on
single-processor/single-threaded machines, so the general take was to
keep the GIL. In addition, see
http://groups.google.com/groups?selm=mailman.1008992607.2279.python-list%40python.org&oe=UTF-8&output=gplain
Tim Peters:
] The prospects for another version of that grow dimmer. Everyone (incl.
] Greg) has noticed that CPython internals, over time, increase their reliance
] on the thread-safety guarantees of the global interpreter lock.

The only solutions I know of are explicit multi-process solutions:
- a generic system, like XML-RPC/SOAP/PVM/MPI/CORBA, on
which you build your own messaging system
- use systems like Pyro or Twisted, which understand Python objects
and implement 'transparent' proxying via network communications
- use POSH, which does the proxying through shared memory (but
this uses Intel-specific assembly)

Andrew
da...@dalkescientific.com


Afanasiy

Jul 9, 2003, 9:14:06 PM
On 8 Jul 2003 14:54:22 -0700, wenni...@csgsystems.com (Wenning Qiu)
wrote:

>I am researching issues related to embedding Python in C++ for a
>project.
>
>My project will be running on an SMP box and requires scalability.
>However, my test shows that Python threading has very poor performance
>in terms of scaling. In fact it doesn't scale at all.
>
>I wrote a simple test program to complete a given number of iterations
>of a simple loop. The total number of iterations can be divided evenly
>among a number of threads. My test shows that as the number of threads
>grows, the CPU usage grows and the response time gets longer. For
>example, to complete the same amount of work, one thread takes 10
>seconds, 2 threads take 20 seconds and 3 threads take 30 seconds.
>
>The fundamental reason for lacking scalability is that Python uses a
>global interpreter lock for thread safety. That global lock must be
>held by a thread before it can safely access Python objects.

I asked once and was told it was best fixed by removing the documentation
which mentioned it. Others also stated it was unlikely to be fixed.

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=53u1evk5jmdcgma5e8eupbe3tn45js302i%404ax.com&rnum=1&prev=/groups%3Fhl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26selm%3D53u1evk5jmdcgma5e8eupbe3tn45js302i%25404ax.com

However, I believe Lua solved this as of 4-work4, just before Lua 5.
Unfortunately, Lua is not Python.

Another thing to consider if you care about SMP, is your C/C++ memory
management, assuming you aren't using something custom already, maybe a
shared heap. I have worked wonders with libhoard (and SmartHeap,
commercially). Some applications will run slower on SMP than if you
removed one of the processors.

www.hoard.org
www.microquill.com

mmm, graphs

-AB

Jeff Epler

Jul 9, 2003, 11:49:05 PM
On Thu, Jul 10, 2003 at 11:14:47AM +0800, Simon Wittber (Maptek) wrote:
> Seriously though, this is an issue, which is a major hurdle Python *has*
> to cross if it is ever going to be seriously considered for use on large
> projects, on large SMP hardware.

Has anybody proposed "how to get there from here" for this problem
(useful multithreading of non-blocking pure Python code)? I'm not
bright enough to see how, that's for sure. Especially if you are
talking about an incremental approach, not the mythical "Python 3000".
What I mean is that you have to work this magic *and* somehow let
existing modules written in C work, with at worst a recompile. (as
someone who doesn't *need* threads, but works on a project with piles of
Python modules written in C, that's my bias anyway)

Someone nearby mentioned lua, but I don't know what it did for threading.
Perl and tcl both seem to have taken the approach of having each thread
be a distinct interpreter with nothing shared. While this means you
never have to worry about locking against a reader or modifier in another
thread, it means you might as well be using the processes that the angry
Unix Gods gave us in the first place. <1/3 overstatement> I'm pretty sure
that this approach has been explicitly ruled out by that other angry God,
Guido van Rossum, anyway.

I've written Python programs that use threads in a way that was expedient
to get a user interface running without long seizures, while I've never
written a thread in tcl or perl. OTOH if I'd had to treat everything
as explicit message passing (a la tcl threading), I'd have just found
another way (like "after idle" and "update" in tcl)

Jython will give you Java's thread model today, for Python code, won't
it? Back when I benchmarked it, Jython code and Python code ran at
fairly similar speeds (in pybench), if only you had a machine that could
keep the Jython interpreter from swapping...

Jeff

Simon Wittber (Maptek)

Jul 9, 2003, 11:14:47 PM
> I asked once and was told it was best fixed by removing the documentation
> which mentioned it. Others also stated it was unlikely to be fixed.

This can be fixed. Give Guido an SMP machine. I'm sure the threading
issue would shortly be resolved.

Seriously though, this is an issue, which is a major hurdle Python *has*
to cross if it is ever going to be seriously considered for use on large
projects, on large SMP hardware.

SimonW.

Donn Cave

Jul 10, 2003, 1:26:14 AM
Quoth Jeff Epler <jep...@unpythonic.net>:

| On Thu, Jul 10, 2003 at 11:14:47AM +0800, Simon Wittber (Maptek) wrote:
|> Seriously though, this is an issue, which is a major hurdle Python *has*
|> to cross if it is ever going to be seriously considered for use on large
|> projects, on large SMP hardware.
|
| Has anybody proposed "how to get there from here" for this problem
| (useful multithreading of non-blocking pure Python code)? I'm not
| bright enough to see how, that's for sure. Especially if you are
| talking about an incremental approach, not the mythical "Python 3000".
| What I mean is that you have to work this magic *and* somehow let
| existing modules written in C work, with at worst a recompile. (as
| someone who doesn't *need* threads, but works on a project with piles of
| Python modules written in C, that's my bias anyway)

Ha - I knew I'd find an answer for this if I searched for "free threading" -
and among other hits, I found this very thread! So in Andrew Dalke's words:

> Ahh, the Global Interpreter Lock (GIL).
>
> Years ago, Greg Stein had a version of Python 1.4 running with no GIL.
>
> http://www.python.org/ftp/python/contrib-09-Dec-1999/System/threading.README
>
> Search for "free threading" to get more hits on this topic.

At any rate, it sure isn't because Guido can't scare up an SMP machine.

| I've written Python programs that use threads in a way that was expedient
| to get a user interface running without long seizures, while I've never
| written a thread in tcl or perl. OTOH if I'd had to treat everything
| as explicit message passing (a la tcl threading), I'd have just found
| another way (like "after idle" and "update" in tcl)

Hm, not sure what you're alluding to here. I actually kind of like an
I/O thread dispatch model myself, but I don't think Python cares either
way nor would it if the GIL were abolished.

| Jython will give you Java's thread model today, for Python code, won't
| it? Back when I benchmarked it, Jython code and Python code ran at
| fairly similar speeds (in pybench), if only you had a machine that could
| keep the Jython interpreter from swapping...

Don't know, but I noticed in another thread among the hits that Ype Kingma
asserts that Jython does indeed support fine grained multithreading.

Donn Cave, do...@drizzle.com

Afanasiy

Jul 10, 2003, 3:13:47 PM
On Thu, 10 Jul 2003 05:26:14 -0000, "Donn Cave" <do...@drizzle.com> wrote:

>Quoth Jeff Epler <jep...@unpythonic.net>:

>At any rate, it sure isn't because Guido can't scare up an SMP machine.

Is this the absolute truth or just something people like to say?

If it is true I am very surprised... I think getting Guido an SMP
machine will not be very difficult, and could well be very cheap.

Peter Hansen

Jul 10, 2003, 3:36:04 PM

Note the double negative in the phrase "isn't because Guido can't".

That means in effect "Guido can". Nobody's disagreeing with that.

-Peter

Aahz

Jul 10, 2003, 3:54:14 PM
In article <ec23a1ae.0307...@posting.google.com>,
Wenning Qiu <wenni...@csgsystems.com> wrote:
>
>My project will be running on an SMP box and requires scalability.
>However, my test shows that Python threading has very poor performance
>in terms of scaling. In fact it doesn't scale at all.

That's true for pure Python code.

>The fundamental reason for lacking scalability is that Python uses a
>global interpreter lock for thread safety. That global lock must be
>held by a thread before it can safely access Python objects.

Correct. The problem is that the GIL makes Python more efficient in
many ways, because there's no need for fine-grained locking. You're
using Python inside-out for this purpose -- the way to scale Python in a
threaded environment is to call out to a C extension that releases the
GIL.
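
In C, "releases the GIL" means wrapping the compute region in the Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros. Pure-Python code already benefits wherever CPython's own C implementation does this. A sketch using hashlib, whose hashing loop releases the GIL for large buffers in modern CPython (an illustration of the idea, not anything from this thread; the threshold and behavior vary by version):

```python
# Sketch: C-implemented code (here, hashlib's digest loop) can release
# the GIL, so threads doing this kind of work may overlap on an SMP
# machine even though pure-Python bytecode cannot.
import hashlib
import threading

def digest(data, out, i):
    out[i] = hashlib.sha256(data).hexdigest()

data = b"x" * 10_000_000          # large buffer => GIL released while hashing
out = [None, None]
threads = [threading.Thread(target=digest, args=(data, out, i))
           for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Both threads hashed the same buffer and must agree.
assert out[0] == out[1] == hashlib.sha256(data).hexdigest()
```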

>Has anyone on this list run into the same problem that I have, or does
>anyone know of any plan of totally insulating multiple embedded Python
>interpreters?

Sure! Use multiple processes.

Other people have mentioned Perl and Tcl in this thread. I wonder how
they deal with the problem of loading DLLs with static data.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"Not everything in life has a clue in front of it...." --JMS

Afanasiy

Jul 10, 2003, 4:59:03 PM
On Thu, 10 Jul 2003 15:36:04 -0400, Peter Hansen <pe...@engcorp.com>
wrote:

Yes, I read a previous post which said "Give Guido a SMP machine."
and read this one as the same, my mistake.

Jeff Epler

Jul 10, 2003, 6:54:28 PM
On Thu, Jul 10, 2003 at 03:54:14PM -0400, Aahz wrote:
> Other people have mentioned Perl and Tcl in this thread. I wonder how
> they deal with the problem of loading DLLs with static data.

As far as I know, tcl enforces a one interpreter to one thread requirement.
An extension should have only thread-local data, using a Tcl-supplied API.

Jeff

Aahz

Jul 10, 2003, 7:48:57 PM
In article <mailman.1057877718...@python.org>,

What happens when Tcl wants to interact with some 3rd-party DLL that is
*not* thread-safe?

Simon Wittber (Maptek)

Jul 10, 2003, 8:46:48 PM
[snip]
> ...the way to scale Python in a threaded environment is to call out to
> a C extension that releases the GIL.
[snip]

To write scalable applications in Python, one must write the
'scalability-required' parts in C.

Does anyone else see this as a problem?

Sw.


Jimmy Retzlaff

Jul 10, 2003, 8:36:10 PM
Aahz (aa...@pythoncraft.com) wrote:

>Wenning Qiu <wenni...@csgsystems.com> wrote:
>>
>>Has anyone on this list run into the same problem that I have, or does
>>anyone know of any plan of totally insulating multiple embedded Python
>>interpreters?
>
>Sure! Use multiple processes.

This can have further scalability benefits. The first time I ran up
against the GIL, I grumbled my way through tweaking my code to use
multiple processes and an IPC mechanism (which was surprisingly easy).
After I finished, I realized that nothing about the IPC mechanism I
happened to choose limited my solution to one machine and suddenly my
app was running on 7 CPUs instead of the 2 CPUs I had originally hoped
for. On just 2 CPUs it was probably a little slower than it might have
been because of the IPC overhead, but 7 CPUs put my app in a totally
different league.
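
The same multi-process pattern is easy to sketch with today's multiprocessing module (which postdates this thread; in 2003 one would have used fork or spawn plus sockets/pipes for the IPC). Each worker is a separate interpreter with its own GIL, so CPU-bound work scales across processors, and the same message-passing structure extends across machines:

```python
# Modern illustration of the multiple-process approach: divide a
# CPU-bound computation among worker processes. Each process has its
# own interpreter and its own GIL, so the work runs in parallel.
from multiprocessing import Pool

def sum_squares(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    chunks = [(0, 250_000), (250_000, 500_000),
              (500_000, 750_000), (750_000, 1_000_000)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(sum_squares, chunks))
    # Cross-check against a single-process computation.
    assert total == sum(i * i for i in range(1_000_000))
    print(total)
```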

Jimmy

Donn Cave

Jul 10, 2003, 11:42:18 PM
Quoth "Simon Wittber (Maptek)" <Simon....@perth.maptek.com.au>:

Is it a problem?

I don't know all the reasons why an application might want to
compute in parallel on multiple CPUs, but I am guessing that
ordinarily it's about processor intensive computations. That
isn't really Python's strongest point - you want to write the
heavy duty computing in C if you want speed.

If it's any consolation, I believe the Objective Caml (OCaml)
thread implementation has the same kind of global lock, even
though it does compile to efficient native code and would
otherwise be an attractive candidate for compute intensive
applications. Don't take my word for it, but that's how it
looks to me. Regardless of how grave the problem may be,
if there's no practical fix, we live with it.

Donn Cave, do...@drizzle.com

Alia Khouri

Jul 11, 2003, 2:40:23 AM
[Simon Wittber]

> To write scalable applications in Python, one must write the
> 'scalability-required' parts in C.

I wonder if Python could benefit from working closely with a language
like Cilk, "a language for multithreaded parallel programming based on
ANSI C."

Has anybody had any experience with this?

Alia

Alan Kennedy

Jul 11, 2003, 6:27:49 AM
[Aahz]

>>...the way to scale Python in a threaded environment is to call out
>> to a C extension that releases the GIL.

Simon Wittber (Maptek) wrote:

> To write scalable applications in Python, one must write the
> 'scalability-required' parts in C.

Or Pyrex? Or Boost?

Do either of these products release the GIL when control is
transferred to the C/C++ extensions?

> Does anyone else see this as a problem?

Comparatively, perhaps. But certainly less of a problem than writing
your entire program in C++, for example.

Perhaps easiest to write the whole thing in jython, if the jython free
threading claim is true? Ype? Aahz? I'd really like to know the truth
of the assertion that jython will utilise all processors in a
multi-processor box.

regards,

--
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/mailto/alan

Aahz

Jul 11, 2003, 10:23:58 AM
In article <3F0E9125...@hotmail.com>,
Alan Kennedy <ala...@hotmail.com> wrote:
>Simon Wittber (Maptek) wrote:
>>
>> To write scalable applications in Python, one must write the
>> 'scalability-required' parts in C.
>
>Or Pyrex? Or Boost?
>
>Do either of these products release the GIL when control is
>transferred to the C/C++ extensions?

Not automatically, that's for certain; dunno if either provides any
features to make the job easier.

Aahz

Jul 11, 2003, 10:33:47 AM
In article <mailman.105788455...@python.org>,
Simon Wittber (Maptek) <Simon....@perth.maptek.com.au> wrote:
>
>To write scalable applications in Python, one must write the
>'scalability-required' parts in C.
>
>Does anyone else see this as a problem?

Not particularly. Most threading at the application level is done for
one or more of three purposes:

* Allowing background work and fast response in a GUI application

* Scalable I/O

* Autonomous sections of code for algorithmic simplicity (e.g.
simulations)

Python does quite well at all three out of the box (the second because
all standard Python I/O modules release the GIL, as do most 3rd-party
extensions that deal with I/O (e.g. mxODBC)). The only thing Python
doesn't do is computational threading, and Python's overhead makes it a
poor choice for that purpose. Finally, there are so many distributed
computing solutions that multiple processes are a viable technique for
managing computational threading.
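
The "scalable I/O" point can be seen directly from Python: blocking calls implemented in C (socket reads, file I/O, time.sleep) release the GIL while blocked, so I/O-bound threads genuinely overlap. A sketch, using sleep as a stand-in for a blocking request:

```python
# Ten simulated blocking "requests" of 0.2s each. Because the GIL is
# released while each thread blocks, they overlap and finish in roughly
# 0.2s of wall time, not 2s.
import threading
import time

def fake_request():
    time.sleep(0.2)   # releases the GIL while blocked, like real I/O

start = time.perf_counter()
threads = [threading.Thread(target=fake_request) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
assert elapsed < 1.0   # far less than the 2s a serial run would take
print(f"10 overlapping waits finished in {elapsed:.2f}s")
```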

Paul Rubin

Jul 11, 2003, 11:52:14 AM
aa...@pythoncraft.com (Aahz) writes:
> Not particularly. Most threading at the application level is done for
> one or more of three purposes:
>
> * Allowing background work and fast response in a GUI application
>
> * Scalable I/O
>
> * Autonomous sections of code for algorithmic simplicity (e.g.
> simulations)

Um, concurrent access by multiple clients for server applications?

> Python does quite well at all three out of the box (the second because
> all standard Python I/O modules release the GIL, as do most 3rd-party
> extensions that deal with I/O (e.g. mxODBC)). The only thing Python
> doesn't do is computational threading, and Python's overhead makes it a
> poor choice for that purpose. Finally, there are so many distributed
> computing solutions that multiple processes are a viable technique for
> managing computational threading.

Yeah, but now you need complicated and possibly slow mechanisms for
sharing global state between processes.

Irmen de Jong

Jul 11, 2003, 2:01:47 PM
Paul Rubin wrote:
> aa...@pythoncraft.com (Aahz) writes:
>
>>Not particularly. Most threading at the application level is done for
>>one or more of three purposes:
>>
>>* Allowing background work and fast response in a GUI application
>>
>>* Scalable I/O
>>
>>* Autonomous sections of code for algorithmic simplicity (e.g.
>>simulations)
>
>
> Um, concurrent access by multiple clients for server applications?

Exactly, that is what I use threads for in Pyro. Without threads,
each of the client's remote method invocations has to wait in line
before being accepted and processed (there can be many clients for
one server object).


--Irmen

Aahz

Jul 11, 2003, 2:07:20 PM
In article <7xvfu98...@ruckus.brouhaha.com>,
Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>aa...@pythoncraft.com (Aahz) writes:
>>
>> Not particularly. Most threading at the application level is done for
>> one or more of three purposes:
>>
>> * Allowing background work and fast response in a GUI application
>>
>> * Scalable I/O
>>
>> * Autonomous sections of code for algorithmic simplicity (e.g.
>> simulations)
>
>Um, concurrent access by multiple clients for server applications?

That's functionally an I/O thing, combined possibly with background
work.

Jeff Epler

Jul 11, 2003, 12:44:27 PM
On Thu, Jul 10, 2003 at 07:48:57PM -0400, Aahz wrote:
> In article <mailman.1057877718...@python.org>,
> Jeff Epler <jep...@unpythonic.net> wrote:
> >On Thu, Jul 10, 2003 at 03:54:14PM -0400, Aahz wrote:
> >>
> >> Other people have mentioned Perl and Tcl in this thread. I wonder how
> >> they deal with the problem of loading DLLs with static data.
> >
> >As far as I know, tcl enforces a one interpreter to one thread
> >requirement. An extension should have only thread-local data, using a
> >Tcl-supplied API.
>
> What happens when Tcl wants to interact with some 3rd-party DLL that is
> *not* thread-safe?

I guess you'd have to do your own locking. Tcl has standard C APIs for
Conditions, Mutexes, and thread-specific data, see the Thread(3) manpage.
You'd have to surround all non-reentrant calls with Tcl_MutexLock(m)
... Tcl_MutexUnlock(m). If two extensions wanted to use the same
non-thread-safe library, they'd have to cooperate in some way to use
the same 'm' to Tcl_Mutex*(). I don't know if there's a standard way to
do this, but I think that having the mutex defined in a shared lib they
both link might work.
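
The shared-mutex pattern described here is language-independent. A Python analogue of the Tcl_MutexLock(&m) ... Tcl_MutexUnlock(&m) idiom, with a plain counter standing in for the non-reentrant library state (hypothetical; a real non-thread-safe DLL call would go where unsafe_increment is called):

```python
# All callers of a non-thread-safe library agree on one shared lock and
# hold it around every call into the library, so only one thread is
# ever inside it at a time.
import threading

class NotThreadSafe:
    """Stands in for a library with unprotected global state."""
    def __init__(self):
        self.value = 0

    def unsafe_increment(self):
        v = self.value          # read-modify-write: a classic race
        v += 1
        self.value = v

lib = NotThreadSafe()
lib_lock = threading.Lock()     # the shared 'm' every user must agree on

def worker(n):
    for _ in range(n):
        with lib_lock:          # Tcl_MutexLock(&m) ... Tcl_MutexUnlock(&m)
            lib.unsafe_increment()

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert lib.value == 40_000      # no lost updates
print(lib.value)
```

(Under CPython's GIL the race rarely bites in pure Python anyway; the lock matters when the unsafe calls are in C code that runs with the GIL released.)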

Jeff

Aahz

Jul 11, 2003, 4:22:39 PM
In article <mailman.1057941919...@python.org>,

Yup. And that's exactly why there has been little movement to remove
the GIL from Python. One of Python's core strengths is the ease with
which random DLLs can be used from Python.
