
2.6, 3.0, and truly independent interpreters


Andy

Oct 22, 2008, 12:32:23 PM
to
Dear Python dev community,

I'm CTO at a small software company that makes music visualization
software (you can check us out at www.soundspectrum.com). About two
years ago we made the decision to use embedded python in a couple of
our new products, given all the great things about python. We were
close to using lua but for various reasons we decided to go with
python. However, over the last two years, there's been one area of
grief that sometimes makes me think twice about our decision to go
with python...

Some background first... Our software is used for entertainment and
centers around real time, high-performance graphics, so python's
performance, embedded flexibility, and stability are the most
important issues for us. Our software targets a large cross section
of hardware and we currently ship products for Win32, OS X, and the
iPhone and since our customers are end users, our products have to be
robust, have a tidy install footprint, and be foolproof. Basically,
we use embedded python and use it to wrap our high performance C++
class set which wraps OpenGL, DirectX and our own software renderer.
In addition to wrapping our C++ frameworks, we use python to perform
various "worker" tasks on worker thread (e.g. image loading and
processing). However, we require *true* thread/interpreter
independence so python 2 has been frustrating at time, to say the
least. Please don't start with "but really, python supports multiple
interpreters" because I've been there many many times with people.
And, yes, I'm aware of the multiprocessing module added in 2.6, but
that stuff isn't lightweight and isn't suitable at all for many
environments (including ours). The bottom line is that if you want to
perform independent processing (in python) on different threads, using
the machine's multiple cores to the fullest, then you're out of luck
under python 2.

Sadly, the only way we could get truly independent interpreters was to
put python in a dynamic library, have our installer make a *duplicate*
copy of it during the installation process (e.g. python.dll/.bundle ->
python2.dll/.bundle) and load each one explicitly in our app, so we
can get truly independent interpreters. In other words, we load a
fresh dynamic lib for each thread-independent interpreter (you can't
reuse the same dynamic library because the OS will just reference the
already-loaded one).
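
Just to make that concrete, here's roughly what the trick looks like
on Win32 (a sketch only: the DLL names are whatever your installer
chose, and error handling is omitted):

    /* Loading a *differently named* copy of the python DLL forces the
       OS to map a fresh image, with its own static interpreter state. */
    #include <windows.h>

    typedef void (*Py_Initialize_t)(void);

    static void start_independent_interpreter(const char *dll_name)
    {
        HMODULE dll = LoadLibraryA(dll_name);
        Py_Initialize_t init =
            (Py_Initialize_t)GetProcAddress(dll, "Py_Initialize");
        if (init)
            init();   /* this copy's interpreter, fully independent */
    }

    /* one call per thread-independent interpreter:       */
    /*   start_independent_interpreter("python25.dll");   */
    /*   start_independent_interpreter("python25_2.dll"); */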

From what I gather from the python community, the basis for not
offering "real" muti-threaded support is that it'd add to much
internal overhead--and I couldn't agree more. As a high performance C
and C++ guy, I fully agree that thread safety should be at the high
level, not at the low level. BUT, the lack of truly independent
interpreters is what ultimately prevents using python in cool,
powerful ways. This shortcoming alone has caused game developers--
both large and small--to choose other embedded interpreters over
python (e.g. Blizzard chose lua over python). For example, Apple's
QuickTime API is powerful in that high-level instance objects can
leverage performance gains associated with multi-threaded processing.
Meanwhile, the QuickTime API simply lists the responsibilities of the
caller regarding thread safety and that's all it needs to do. In
other words, CPython doesn't need to step in and provide a threadsafe
environment; it just needs to establish the rules and make sure that
its own implementation supports those rules.

More than once, I had actually considered expending company resources
to develop a high performance, truly independent interpreter
implementation of the python core language and modules but in the end
estimated that the size of that project would just be too much, given
our company's current resources. Should such an implementation ever
be developed, it would be very attractive for companies to support,
fund, and/or license. The truth is, we just love python as a
language, but its lack of true interpreter independence (in an
interpreter as well as in a thread sense) remains a *huge* liability.

So, my question becomes: is python 3 ready for true multithreaded
support?? Can we finally abandon our Frankenstein approach of loading
multiple identical dynamic libs to achieve truly independent
interpreters?? I've reviewed all the new python 3 C API module stuff,
and all I have to say is: whew--better late than never!! So, although
that solves modules offering truly independent interpreter support,
the following questions remain:

- In python 3, the C module API now supports true interpreter
independence, but have all the modules in the python codebase been
converted over? Are they all now truly compliant? It will only take
a single static/global state variable in a module to potentially cause
no end of pain in a multiple interpreter environment! Yikes!

- How close is python 3 really to true multithreaded use? The
assumption here is that the caller ensures safety (e.g. ensuring that
neither interpreter is in use when serializing data from one to
another).

I believe that true python independent thread/interpreter support is
paramount and should become the top priority because this is the key
consideration used by developers when they're deciding which
interpreter to embed in their app. Until there's a hello world that
demonstrates running independent python interpreters on multiple app
threads, lua will remain the clear choice over python. Python 3 needs
true interpreter independence and multi-threaded support!


Thanks,
Andy O'Meara


Thomas Heller

Oct 22, 2008, 1:45:58 PM
to
Andy wrote:
> Dear Python dev community,
>
> [...] Basically,

> we use embedded python and use it to wrap our high performance C++
> class set which wraps OpenGL, DirectX and our own software renderer.
> In addition to wrapping our C++ frameworks, we use python to perform
> various "worker" tasks on worker thread (e.g. image loading and
> processing). However, we require *true* thread/interpreter
> independence so python 2 has been frustrating at time, to say the
> least.
[...]

>
> Sadly, the only way we could get truly independent interpreters was to
> put python in a dynamic library, have our installer make a *duplicate*
> copy of it during the installation process (e.g. python.dll/.bundle ->
> python2.dll/.bundle) and load each one explicitly in our app, so we
> can get truly independent interpreters. In other words, we load a
> fresh dynamic lib for each thread-independent interpreter (you can't
> reuse the same dynamic library because the OS will just reference the
> already-loaded one).

Interesting questions you ask.

A random note: py2exe also does something similar for executables built
with the 'bundle = 1' option. The python.dll and .pyd extension modules
in this case are not loaded into the process in the 'normal' way (with
some kind of windows LoadLibrary() call); instead they are loaded by
code in py2exe that /emulates/ LoadLibrary - the code segments are
loaded into memory, fixups are made for imported functions, and the
segments are marked executable.

The result is that separate COM objects implemented as Python modules and
converted into separate dlls by py2exe do not share their interpreters even
if they are running in the same process. Of course this only works on windows.
In effect this is similar to using /statically/ linked python interpreters
in separate dlls. Can't you do something like that?

> So, my question becomes: is python 3 ready for true multithreaded
> support?? Can we finally abandon our Frankenstein approach of loading
> multiple identical dynamic libs to achieve truly independent
> interpreters?? I've reviewed all the new python 3 C API module stuff,
> and all I have to say is: whew--better late than never!!  So, although
> that solves modules offering truly independent interpreter support,
> the following questions remain:
>
> - In python 3, the C module API now supports true interpreter
> independence, but have all the modules in the python codebase been
> converted over? Are they all now truly compliant? It will only take
> a single static/global state variable in a module to potentially cause
> no end of pain in a multiple interpreter environment! Yikes!

I don't think this is the case (currently). But you could submit patches
to Python so that at least the 'official' modules (builtin and extensions)
would behave correctly in the case of multiple interpreters. At least
this is a much lighter task than writing your own GIL-less interpreter.
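
To give a flavor of what such a patch looks like, here is a hedged
sketch (the module and its names are invented, not from any real stdlib
module): a static global becomes per-module state, which the PEP 3121
machinery allocates once per interpreter:

    #include <Python.h>

    typedef struct {
        int call_count;   /* was: static int call_count; */
    } module_state;

    static PyObject *
    bump(PyObject *module, PyObject *unused)
    {
        module_state *st = (module_state *)PyModule_GetState(module);
        st->call_count++;
        return PyLong_FromLong(st->call_count);
    }

    static PyMethodDef methods[] = {
        {"bump", bump, METH_NOARGS, "increment a per-interpreter counter"},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef examplemodule = {
        PyModuleDef_HEAD_INIT, "example", NULL,
        sizeof(module_state),   /* interpreter-local, not process-global */
        methods
    };

    PyMODINIT_FUNC
    PyInit_example(void)
    {
        return PyModule_Create(&examplemodule);
    }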

My 2 cents,

Thomas

"Martin v. Löwis"

Oct 22, 2008, 2:14:09 PM
to
> - In python 3, the C module API now supports true interpreter
> independence, but have all the modules in the python codebase been
> converted over?

No, none of them.

> Are they all now truly compliant? It will only take
> a single static/global state variable in a module to potentially cause
> no end of pain in a multiple interpreter environment! Yikes!

So you will have to suffer pain.

> - How close is python 3 really to true multithreaded use?

Python is as thread-safe as ever (i.e. completely thread-safe).

> I believe that true python independent thread/interpreter support is
> paramount and should become the top priority because this is the key
> consideration used by developers when they're deciding which
> interpreter to embed in their app. Until there's a hello world that
> demonstrates running independent python interpreters on multiple app
> threads, lua will remain the clear choice over python. Python 3 needs
> true interpreter independence and multi-threaded support!

So what patches to achieve that goal have you contributed so far?

In open source, pleas have nearly zero effect; code contributions are
what have effect.

I don't think any of the current committers has a significant interest
in supporting multiple interpreters (and I say that as the one who wrote
and implemented PEP 3121). To make a significant change, you need to
start with a PEP, offer to implement it once accepted, and offer to
maintain the feature for five years.

Regards,
Martin

Andy

Oct 22, 2008, 2:45:47 PM
to

Hi Thomas -

I appreciate your thoughts and time on this subject.

>
> The result is that separate COM objects implemented as Python modules and
> converted into separate dlls by py2exe do not share their interpreters even
> if they are running in the same process.  Of course this only works on windows.
> In effect this is similar to using /statically/ linked python interpreters
> in separate dlls.  Can't you do something like that?

You're definitely correct that homebrew loading and linking would do
the trick. However, because our python code makes callbacks into our
C/C++ code, that complicates the linking process (if I understand you
correctly). Also, then there's the problem of OS X.


> > - In python 3, the C module API now supports true interpreter
> > independence, but have all the modules in the python codebase been
> > converted over?  Are they all now truly compliant?  It will only take
> > a single static/global state variable in a module to potentially cause
> > no end of pain in a multiple interpreter environment!  Yikes!
>
> I don't think this is the case (currently).  But you could submit patches
> to Python so that at least the 'official' modules (builtin and extensions)
> would behave correctly in the case of multiple interpreters.  At least
> this is a much lighter task than writing your own GIL-less interpreter.
>

I agree -- and I've been considering that (or rather, having our
company hire/pay part of the python dev community to do the work). To
consider that, the question becomes: how many modules are we talking
about, do you think? 10? 100? I confess that I'm not familiar enough
with the full C python suite to have a good idea of how much work
we're talking about here.

Regards,
Andy


Andy

Oct 22, 2008, 3:26:44 PM
to

> > - In python 3, the C module API now supports true interpreter
> > independence, but have all the modules in the python codebase been
> > converted over?
>
> No, none of them.

:^)

>
> > - How close is python 3 really to true multithreaded use?
>
> Python is as thread-safe as ever (i.e. completely thread-safe).
>

If you're referring to the fact that the GIL does that, then you're
certainly correct. But if you've got multiple CPUs/cores and actually
want to use them, that GIL means you might as well forget about them.
So please take my use of "true multithreaded" to mean "turning off"
the GIL and pushing the responsibility for object safety to the
client/API level (such as in my QuickTime API example).


> > I believe that true python independent thread/interpreter support is
> > paramount and should become the top priority because this is the key
> > consideration used by developers when they're deciding which
> > interpreter to embed in their app. Until there's a hello world that
> > demonstrates running independent python interpreters on multiple app
> > threads, lua will remain the clear choice over python. Python 3 needs
> > true interpreter independence and multi-threaded support!
>
> So what patches to achieve that goal have you contributed so far?
>
> In open source, pleas have nearly zero effect; code contributions is
> what has effect.
>

This is just my second email, please be a little patient. :^) But
more seriously, I do represent a company ready, able, and willing to
fund the development of features that we're looking for, so please
understand that I'm definitely not coming to the table empty-handed
here.


> I don't think any of the current committers has a significant interest
> in supporting multiple interpreters (and I say that as the one who wrote
> and implemented PEP 3121). To make a significant change, you need to
> start with a PEP, offer to implement it once accepted, and offer to
> maintain the feature for five years.
>

Nice to meet you! :^) Seriously though, thank you for all your work on
3121 and for taking the initiative with it! It's definitely a first
step toward what attracts companies like ours to embedding an
interpreted language. Specifically: unrestricted interpreter and thread-
independent use.

I would *love* for our company to be 10 times larger and be able to
add another zero to what we'd be able to hire/offer the python dev
community for work that we're looking for, but we unfortunately have
limits at the moment. And I would love to see python become the
leading choice when companies look to use an embedded interpreter, and
I offer my comments here to paint a picture of what can make python
more appealing to commercial software developers. Hopefully, the
python dev community doesn't underestimate the dev funding that could
potentially come in from companies if python grew in certain ways!

So, that said, I represent a company willing to fund the development
of features that move python towards thread-independent operation. No
software engineer can deny that we're entering a new era of
multithreaded processing where support frameworks (such as python)
need to be open-minded about how they're used in a multi-threaded
environment--that's all I'm saying here.

Anyway, I can definitely tell you and anyone else interested that
we're willing to put our money where our wish-list is. As I mentioned
in my previous post to Thomas, the next step is to get an
understanding of the options available that will satisfy our needs.
We have a budget for this, but it's not astronomical (it's driven by
the cost associated with dropping python and going with lua--or,
making our own pared-down interpreter implementation). Please let me
be clear--I love python (as a language) and I don't want to switch.
BUT, we have to be able to run interpreters in different threads (and
get unhindered/full CPU core performance--ie. no GIL).

Thoughts? Also, please feel free to email me off-list if you prefer.

Oh, while I'm at it, if anyone in the python dev community (or anyone
that has put real work into python) is interested in our software,
email me and I'll hook you up with a complimentary copy of the
products that use python (music visuals for iTunes and WMP).

Regards,
Andy


"Martin v. Löwis"

Oct 22, 2008, 3:55:58 PM
to
> I would *love* for our company to be 10 times larger and be able to
> add another zero to what we'd be able to hire/offer the python dev
> community for work that we're looking for, but we unfortunately have
> limits at the moment.

There is another thing about open source that you need to consider:
you don't have to do it all on your own.

It needs somebody to take the lead, start a project, define a plan,
and lay out small steps to approach it. If it's really something that the
community desperately needs, and if you make it clear that you will
just lead, but get nowhere without contributions, then the
contributions will come in.

If there won't be any contributions, then the itch in the community
isn't strong enough to need scratching.

Regards,
Martin

Terry Reedy

Oct 22, 2008, 5:15:21 PM
to pytho...@python.org
Andy wrote:

> I agree -- and I've been considering that (or rather, having our
> company hire/pay part of the python dev community to do the work). To
> consider that, the question becomes, how many modules are we talking
> about do you think? 10? 100?

In your Python directory, everything in Lib is Python, I believe.
Everything in DLLs is compiled C extensions. I see about 15 for the
Windows build of 3.0. These reflect two separate directories in the
source tree. Builtin classes are part of pythonxx.dll in the main
directory. I have no idea whether things such as lists (from
listobject.c), for instance, are a potential problem for you.

You could start with the module of most interest to you, or perhaps a
small one, and see if it needs patching (from your viewpoint) and how
much effort it would take to meet your needs.

Terry Jan Reedy

Jesse Noller

Oct 22, 2008, 5:21:29 PM
to Andy, pytho...@python.org
On Wed, Oct 22, 2008 at 12:32 PM, Andy <and...@gmail.com> wrote:
> And, yes, I'm aware of the multiprocessing module added in 2.6, but
> that stuff isn't lightweight and isn't suitable at all for many
> environments (including ours). The bottom line is that if you want to
> perform independent processing (in python) on different threads, using
> the machine's multiple cores to the fullest, then you're out of luck
> under python 2.

So, as the guy-on-the-hook for multiprocessing, I'd like to know what
you might suggest to make it more apt for your - and other -
environments.

Additionally, have you looked at:
https://launchpad.net/python-safethread
http://code.google.com/p/python-safethread/w/list
(by Adam Olsen)

-jesse

Terry Reedy

Oct 22, 2008, 5:34:17 PM
to pytho...@python.org
Andy wrote:

> This is just my second email, please be a little patient. :^)

As a 10-year veteran, I welcome new contributors with new viewpoints and
information.

> more appealing to commercial software developers. Hopefully, the
> python dev community doesn't underestimate the dev funding that could
> potentially come in from companies if python grew in certain ways!

This seems to be something of a chicken-and-egg problem.

> So, that said, I represent a company willing to fund the development
> of features that move python towards thread-independent operation.

Perhaps you know of and can persuade other companies to contribute to
such focused effort.

> No
> software engineer can deny that we're entering a new era of
> multithreaded processing where support frameworks (such as python)
> need to be open minded with how they're used in a multi-threaded
> environment--that's all I'm saying here.

The *current* developers seem to be more interested in exploiting
multiple processors with multiprocessing. Note that Google choose that
route for Chrome (as I understood their comic introduction). 2.6 and 3.0
come with a new multiprocessing module that mimics the threading module
api fairly closely. It is now being backported to run with 2.5 and 2.4.

Advances in multithreading will probably require new ideas and
development energy.

Terry Jan Reedy

Jesse Noller

Oct 22, 2008, 5:49:32 PM
to Terry Reedy, pytho...@python.org
On Wed, Oct 22, 2008 at 5:34 PM, Terry Reedy <tjr...@udel.edu> wrote:
> The *current* developers seem to be more interested in exploiting multiple
> processors with multiprocessing. Note that Google chose that route for
> Chrome (as I understood their comic introduction). 2.6 and 3.0 come with a
> new multiprocessing module that mimics the threading module api fairly
> closely. It is now being backported to run with 2.5 and 2.4.

That's not exactly correct. Multiprocessing was added to 2.6 and 3.0
as an *additional* method for parallel/concurrent programming that
allows you to use multiple cores - however, as I noted in the PEP:

" In the future, the package might not be as relevant should the
CPython interpreter enable "true" threading, however for some
applications, forking an OS process may sometimes be more
desirable than using lightweight threads, especially on those
platforms where process creation is fast and optimized."

Multiprocessing is not a replacement for a "free threading" future
(ergo my mentioning Adam Olsen's work) - it is a tool in the
"batteries included" box. I don't want my cheerleading and driving of
this to somehow imply that the rest of Python-Dev thinks this is
the "silver bullet" or final answer in concurrency.

However, a free-threaded python has a lot of implications, and if we
were to do it, it requires we not only "drop" the GIL - it also
requires we consider the ramifications of enabling true threading a la
Java et al - just having "true threads" lying around only helps if
you've spent a ton of time learning locking, avoiding shared data/etc,
stepping through and cursing poor debugger support for multiple
threads, etc.

This is why I've been a fan of Adam's approach - enabling free
threading via GIL removal is actually secondary to the project's
stated goal: Enable Safe Threading.

In any case, I've jumped the rails - let's just say there's room in
python for multiprocessing, threading, and possibly a concurrent
package ala java.util.concurrent - but it really does have to be
thought out and done right.

Speaking of which: If you wanted "real" threads, you could use a
combination of JCC (http://pypi.python.org/pypi/JCC/) and Jython. :)

-jesse

Rhamphoryncus

Oct 22, 2008, 6:06:10 PM
to
On Oct 22, 10:32 am, Andy <and...@gmail.com> wrote:
> Dear Python dev community,
>
> I'm CTO at a small software company that makes music visualization
> software (you can check us out at www.soundspectrum.com).  About two

> years ago we made the decision to use embedded python in a couple of
> our new products, given all the great things about python.  We were
> close to using lua but for various reasons we decided to go with
> python.  However, over the last two years, there's been one area of
> grief that sometimes makes me think twice about our decision to go
> with python...
>
> Some background first...   Our software is used for entertainment and
> centers around real time, high-performance graphics, so python's
> performance, embedded flexibility, and stability are the most
> important issues for us.  Our software targets a large cross section
> of hardware and we currently ship products for Win32, OS X, and the
> iPhone and since our customers are end users, our products have to be
> robust, have a tidy install footprint, and be foolproof.  Basically,
> we use embedded python and use it to wrap our high performance C++
> class set which wraps OpenGL, DirectX and our own software renderer.
> In addition to wrapping our C++ frameworks, we use python to perform
> various "worker" tasks on worker thread (e.g. image loading andprocessing).  However, we require *true* thread/interpreter

> independence so python 2 has been frustrating at time, to say the
> least.  Please don't start with "but really, python supports multiple
> interpreters" because I've been there many many times with people.
> And, yes, I'm aware of the multiprocessing module added in 2.6, but
> that stuff isn't lightweight and isn't suitable at all for many
> environments (including ours).  The bottom line is that if you want to
> perform independent processing (in python) on different threads, using

> the machine's multiple cores to the fullest, then you're out of luck
> under python 2.
>
> Sadly, the only way we could get truly independent interpreters was to
> put python in a dynamic library, have our installer make a *duplicate*
> copy of it during the installation process (e.g. python.dll/.bundle ->

What you describe, truly independent interpreters, is not threading at
all: it is processes, emulated at the application level, with all the
memory cost and none of the OS protections. True threading would
involve sharing most objects.

Your solution depends on what you need:
* Killable "threads" -> OS processes
* multicore usage (GIL removal) -> OS processes or alternative Python
implementations (PyPy/Jython/IronPython)
* Sane shared objects -> safethread

Andy

Oct 22, 2008, 9:04:30 PM
to

>
> What you describe, truly independent interpreters, is not threading at
> all: it is processes, emulated at the application level, with all the
> memory cost and none of the OS protections.  True threading would
> involve sharing most objects.
>
> Your solution depends on what you need:
> * Killable "threads" -> OS processes
> * multicore usage (GIL removal) -> OS processes or alternative Python
> implementations (PyPy/Jython/IronPython)
> * Sane shared objects -> safethread


I realize what you're saying, but it's better to say there are two
issues at hand:

1) Independent interpreters (this is the easier one--and solved, in
principle anyway, by PEP 3121, by Martin v. Löwis, but is FAR from
being carried through in modules as he pointed out). As you point
out, this doesn't directly relate to multi-threading BUT it is
intimately tied to the issue because if, in principle, every module
used instance data (rather than static data), then python would be
WELL on its way to "free threading" (as Jesse Noller calls it), or as
I was calling it "true multi-threading".

2) Barriers to "free threading". As Jesse describes, this is simply
just the GIL being in place, but of course it's there for a reason.
It's there because (1) doesn't hold and there was never any specs/
guidance put forward about what should and shouldn't be done in multi-
threaded apps (see my QuickTime API example). Perhaps if we could go
back in time, we would not put the GIL in place, strict guidelines
regarding multithreaded use would have been established, and PEP 3121
would have been mandatory for C modules. Then again--screw that, if I
could go back in time, I'd just go for the lottery tickets!! :^)

Anyway, I've been at this issue for quite a while now (we're
approaching our 3rd release cycle), so I'm pretty comfortable with the
principles at hand. I'd say the theme of your comments shares the
theme of others here, so perhaps consider where end-user software
houses (like us) are coming from. Specifically, developing commercial
software for end users imposes some restrictions that open source
development communities aren't often as sensitive to, namely:

- Performance -- emulation is a no-go (e.g. Jython)
- Maturity and Licensing -- experimental/academic projects are no-go
(PyPy)
- Cross platform support -- love it or hate it, Win32 and OS X are all
that matter when you're talking about selling (and supporting)
software to the masses. I'm just the messenger here (ie. this is NOT
flamebait). We publish for OS X, so IronPython is therefore out.

Basically, our company is at a crossroads where we really need light,
clean "free threading" as Jesse calls it (e.g. on the iPhone, using
our python drawing wrapper to do primary drawing while running python
jobs on another thread doing image decoding and processing). In our
current iPhone app, we achieve this by using two python bundles
(dynamic libs) in the way I described in my initial post. Sure, this
solves our problem, but it's pretty messy, sucks up resources, and has
been a pain to maintain.

Moving forward, please understand my posts here are also intended to
give the CPython dev community a glimpse of the issues that may not be
as visible to you guys (as they are for dev houses like us). For
example, it'd be pretty cool if Blizzard went with python instead of
lua, wouldn't you think? But some of the issues I've raised here no
doubt factor in to why end-user dev houses ultimately may have to pass
up python in favor of another interpreted language.

Bottom line: why give prospective devs any reason to turn down python--
there's just so many great things about python!

Regards,
Andy


Andy

Oct 22, 2008, 9:47:33 PM
to
Jesse, Terry, Martin -

First off, thanks again for your time and interest in this matter.
It's definitely encouraging to know that time and real effort is being
put into the matter, and I hope my posts on this subject serve as
an informative data point for everyone here.

Thanks for that link to Adam Olsen's work, Jesse--I'll definitely look
more closely at it. As I mentioned in my previous post, end-user devs
like me are programmed to get nervous around new mods, but at first
glance it definitely seems interesting. My initial reaction,
as interesting as the project is, goes back to my previous post about
putting all the object safety responsibility on the shoulders of the
API client. That way, one gets the best of both worlds: free
threading and no unnecessary object locking/blocking (ie. the API
client manages the synchronization req'd to move objects
from one interpreter to another). I could have it wrong, but it seems
like safethread inserts some thread-safety features but they come at
the cost of performance. I know I keep mentioning it, but I think the
QuickTime API (and its documentation) is a great model for how any API
should approach threading. Check out their docs to see how they
address it; conceptually speaking, there's not a single line of thread
safety in QuickTime:

http://developer.apple.com/technotes/tn/tn2125.html

In short: multithreading is tricky; it's the responsibility of the
API client to not do hazardous things.

And for the record: the multiprocessing module is a totally great answer
for python-level MP stuff--very nice work, Jesse!

I'd like to post and discuss more, but I'll pick it up tomorrow...
All this stuff is fun and interesting to talk about, but I have to get
to some other things and it unfortunately comes down to cost
analysis. Sadly, I look at it as: I can allocate 2-3 man-months (~
$40k) to build our own basic python interpreter implementation that
solves our need for free threading and increased performance (we've
built various internal interpreters over the years, so we have good
experience in house, our tools are high performance, and we only use a
pretty small subset of python). Or, there's the more attractive
approach of working with the python dev community and putting that dev
expenditure into a form everyone can benefit from.


Regards,
Andy

> Additionally, have you looked at:
> https://launchpad.net/python-safethread
> http://code.google.com/p/python-safethread/w/list
> (by Adam Olsen)
>
> -jesse

Rhamphoryncus

Oct 22, 2008, 10:06:18 PM
to
On Oct 22, 7:04 pm, Andy <and...@gmail.com> wrote:
> > What you describe, truly independent interpreters, is not threading at
> > all: it is processes, emulated at the application level, with all the
> > memory cost and none of the OS protections.  True threading would
> > involve sharing most objects.
>
> > Your solution depends on what you need:
> > * Killable "threads" -> OS processes
> > * multicore usage (GIL removal) -> OS processes or alternative Python
> > implementations (PyPy/Jython/IronPython)
> > * Sane shared objects -> safethread
>
> I realize what you're saying, but it's better to say there are two
> issues at hand:
>
> 1) Independent interpreters (this is the easier one--and solved, in
> principle anyway, by PEP 3121, by Martin v. Löwis, but is FAR from
> being carried through in modules as he pointed out).  As you point
> out, this doesn't directly relate to multi-threading BUT it is
> intimately tied to the issue because if, in principle, every module
> used instance data (rather than static data), then python would be
> WELL on its way to "free threading" (as Jesse Noller calls it), or as
> I was calling it "true multi-threading".

If you want processes, use *real* processes. Your arguments fail to
gain traction because you don't provide a good, justified reason why
they don't and can't work.

Although isolated interpreters would be convenient for you, they're a
specialized use case, and bad language design. There are far more use
cases that aren't isolated (actual threading), so why exclude them?


> 2) Barriers to "free threading".  As Jesse describes, this is simply
> just the GIL being in place, but of course it's there for a reason.
> It's there because (1) doesn't hold and there was never any specs/
> guidance put forward about what should and shouldn't be done in multi-
> threaded apps (see my QuickTime API example).  Perhaps if we could go
> back in time, we would not put the GIL in place, strict guidelines
> regarding multithreaded use would have been established, and PEP 3121
> would have been mandatory for C modules.  Then again--screw that, if I
> could go back in time, I'd just go for the lottery tickets!! :^)

You seem confused. PEP 3121 is for isolated interpreters (ie emulated
processes), not threading.

Getting threading right would have been a massive investment even back
then, and we probably wouldn't have as mature a python as we do
today. Make no mistake, the GIL has substantial benefits. It may be
old and tired, surrounded by young bucks, but it's still winning most
of the races.


> Anyway, I've been at this issue for quite a while now (we're
> approaching our 3rd release cycle), so I'm pretty comfortable with the
> principles at hand.  I'd say the theme of your comments share the
> theme of others here, so perhaps consider where end-user software
> houses (like us) are coming from.  Specifically, developing commercial
> software for end users imposes some restrictions that open source
> development communities aren't often as sensitive to, namely:
>
> - Performance -- emulation is a no-go (e.g. Jython)

Got some real benchmarks to back that up? How about testing it on a
16 core (or more) box and seeing how it scales?


> - Maturity and Licensing -- experimental/academic projects are no-go
> (PyPy)
> - Cross platform support -- love it or hate it, Win32 and OS X are all
> that matter when you're talking about selling (and supporting)
> software to the masses.  I'm just the messenger here (ie. this is NOT
> flamebait).  We publish for OS X, so IronPython is therefore out.

You might be able to use Java on one, IronPython on another, and PyPy
in between. Regardless, my point is that CPython will *never* remove
the GIL. It cannot be done in an effective, highly scalable fashion
without a total rewrite.


> Basically, our company is at a crossroads where we really need light,
> clean "free threading" as Jesse calls it (e.g. on the iPhone, using
> our python drawing wrapper to do primary drawing while running python
> jobs on another thread doing image decoding and processing).  In our
> current iPhone app, we achieve this by using two python bundles
> (dynamic libs) in the way I described in my initial post.  Sure, thus
> solves our problem, but it's pretty messy, sucks up resources, and has
> been a pain to maintain.

Is the iPhone multicore, or is it an issue of fairness (ie a soft
realtime app)?


> Moving forward, please understand my posts here are also intended to
> give the CPython dev community a glimpse of the issues that may not be
> as visible to you guys (as they are for dev houses like us).  For
> example, it'd be pretty cool if Blizzard went with python instead of
> lua, wouldn't you think?  But some of the issues I've raised here no
> doubt factor in to why end-user dev houses ultimately may have to pass
> up python in favor of another interpreted language.
>
> Bottom line: why give prospective devs any reason to turn down python--
> there's just so many great things about python!

I'd like to see python used more, but fixing these things properly is
not as easy as believed. Those in the user community see only their
immediate problem (threads don't use multicore). People like me see
much bigger problems. We need consensus on the problems, and how to
solve them, and a commitment to invest what's required.

Andy

Oct 23, 2008, 12:31:33 AM
to

> You seem confused.  PEP 3121 is for isolated interpreters (ie emulated
> processes), not threading.

Please reread my points--inherently isolated interpreters (ie. the top
level object) are indirectly linked to thread independence. I don't
want to argue, but you seem hell-bent on not hearing what I'm trying
to say here.

>
> Got some real benchmarks to back that up?  How about testing it on a
> 16 core (or more) box and seeing how it scales?
>

I don't care to argue with you, and you'll have to take it on faith
that I'm not spouting hot air. But just to put this to rest, I'll
make it clear in this Jython case:

You can't sell software to end users and expect them to have a recent,
working java distro. Look around you: no real commercial software
title that sells to soccer moms and gamers uses java. There's a method
to commercial software production, so please don't presume that you
know my job, product line, and customers better than me, ok?

Just to put things in perspective, I already have exposed my company
to more support and design liability than I knew I was getting into by
going with python (as a result of all this thread safety and
interpreter independence business). I'd love to go into that one, but
it's frankly just not a good use of my time right now. Please just
accept that when someone says an option is a deal breaker, then it's a
deal breaker. This isn't some dude's masters thesis project here--we
pay our RENT and put our KIDS through school because we sell and ship
software that works and is meant to entertain people.

>
> I'd like to see python used more, but fixing these things properly is
> not as easy as believed.  Those in the user community see only their
> immediate problem (threads don't use multicore).  People like me see
> much bigger problems.  We need consensus on the problems, and how to
> solve it, and a commitment to invest what's required.

Well, you seem to come down pretty hard on people that show up at your
doorstep saying they're WILLING and INTERESTED in supporting python
development. And, you're exactly right: users see only their
immediate problem--but that's the definition of being a user. If
users saw the whole picture from the dev side, then they'd be
developers, not users.

Please consider that you're representing the python dev community
here; I'm your friend here, not your enemy.

Andy


Rhamphoryncus

Oct 23, 2008, 2:51:51 AM
to
On Oct 22, 10:31 pm, Andy <and...@gmail.com> wrote:
> > You seem confused.  PEP 3121 is for isolated interpreters (ie emulated
> > processes), not threading.
>
> Please reread my points--inherently isolated interpreters (ie. the top
> level object) are indirectly linked to thread independence.  I don't
> want to argue, but you seem hell-bent on not hearing what I'm trying
> to say here.

I think the confusion is a matter of context. Your app, written in C
or some other non-python language, shares data between the threads and
thus treats them as real threads. However, from python's perspective
nothing is shared, and thus it is processes.

Although this contradiction is fine for embedding purposes, python is
a general purpose language, and needs to be capable of directly
sharing objects. Imagine you wanted to rewrite the bulk of your app
in python, with only a relatively small portion left in a C extension
module.


> > Got some real benchmarks to back that up?  How about testing it on a
> > 16 core (or more) box and seeing how it scales?
>
> I don't care to argue with you, and you'll have to take it on faith
> that I'm not spouting hot air.  But just to put this to rest, I'll
> make it clear in this Jython case:
>
> You can't sell software to end users and expect them to have a recent,
> working java distro.  Look around you: no real commercial software
> title that sells to soccer moms and gamers uses java.  There's a method
> to commercial software production, so please don't presume that you
> know my job, product line, and customers better than me, ok?
>
> Just to put things in perspective, I already have exposed my company
> to more support and design liability than I knew I was getting into by
> going with python (as a result of all this thread safety and
> interpreter independence business).  I'd love to go into that one, but
> it's frankly just not a good use of my time right now.  Please just
> accept that when someone says an option is a deal breaker, then it's a
> deal breaker.  This isn't some dude's masters thesis project here--we
> pay our RENT and put our KIDS through school because we sell and ship
> software that works and is meant to entertain people.

Consider it accepted. I understand that PyPy/Jython/IronPython don't
fit your needs. Likewise though, CPython cannot fit my needs. What
we both need simply does not exist today.


> > I'd like to see python used more, but fixing these things properly is
> > not as easy as believed.  Those in the user community see only their
> > immediate problem (threads don't use multicore).  People like me see
> > much bigger problems.  We need consensus on the problems, and how to
> > solve it, and a commitment to invest what's required.
>
> Well, you seem to come down pretty hard on people that show up at your
> doorstep saying they're WILLING and INTERESTED in supporting python
> development.  And, you're exactly right:  users see only their
> immediate problem--but that's the definition of being a user.  If
> users saw the whole picture from the dev side, then they'd be
> developers, not users.
>
> Please consider that you're representing the python dev community
> here; I'm your friend here, not your enemy.

I'm sorry if I came across harshly. My intent was merely to push you
towards supporting long-term solutions, rather than short-term ones.

Christian Heimes

Oct 23, 2008, 3:24:28 AM
to pytho...@python.org
Andy wrote:
> 2) Barriers to "free threading". As Jesse describes, this is simply
> just the GIL being in place, but of course it's there for a reason.
> It's there because (1) doesn't hold and there was never any specs/
> guidance put forward about what should and shouldn't be done in multi-
> threaded apps (see my QuickTime API example). Perhaps if we could go
> back in time, we would not put the GIL in place, strict guidelines
> regarding multithreaded use would have been established, and PEP 3121
> would have been mandatory for C modules. Then again--screw that, if I
> could go back in time, I'd just go for the lottery tickets!! :^)

I'm very - not absolutely, but very - sure that Guido and the initial
designers of Python would have added the GIL anyway. The GIL makes
Python faster on single core machines and more stable on multi core
machines. Other language designers think the same way. Ruby recently got
a GIL. The article
http://www.infoq.com/news/2007/05/ruby-threading-futures explains the
rationales for a GIL in Ruby. The article also holds a quote from Guido
about threading in general.

Several people inside and outside the Python community think that
threads are dangerous and don't scale. The paper
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf sums it
up nicely. It explains why modern processors are going to cause more and
more trouble with the Java approach to threads, too.

Python *must* gain means of concurrent execution of CPU bound code
eventually to survive on the market. But it must get the right means or
we are going to suffer the consequences.

Christian


Message has been deleted

Rhamphoryncus

Oct 23, 2008, 5:24:50 PM
to
On Oct 23, 11:30 am, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> On approximately 10/23/2008 12:24 AM, came the following characters from
> the keyboard of Christian Heimes:

>
> > Andy wrote:
> >> 2) Barriers to "free threading".  As Jesse describes, this is simply
> >> just the GIL being in place, but of course it's there for a reason.
> >> It's there because (1) doesn't hold and there was never any specs/
> >> guidance put forward about what should and shouldn't be done in multi-
> >> threaded apps (see my QuickTime API example).  Perhaps if we could go
> >> back in time, we would not put the GIL in place, strict guidelines
> >> regarding multithreaded use would have been established, and PEP 3121
> >> would have been mandatory for C modules.  Then again--screw that, if I
> >> could go back in time, I'd just go for the lottery tickets!! :^)
>
> I've been following this discussion with interest, as it certainly seems
> that multi-core/multi-CPU machines are the coming thing, and many
> applications will need to figure out how to use them effectively.

>
> > I'm very - not absolute, but very - sure that Guido and the initial
> > designers of Python would have added the GIL anyway. The GIL makes
> > Python faster on single core machines and more stable on multi core
> > machines. Other language designers think the same way. Ruby recently
> > got a GIL. The article
> >http://www.infoq.com/news/2007/05/ruby-threading-futures explains the

> > rationales for a GIL in Ruby. The article also holds a quote from
> > Guido about threading in general.
>
> > Several people inside and outside the Python community think that
> > threads are dangerous and don't scale. The paper
> >http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf sums
> > it up nicely, It explains why modern processors are going to cause
> > more and more trouble with the Java approach to threads, too.
>
> Reading this PDF paper is extremely interesting (albeit somewhat
> dependent on understanding abstract theories of computation; I have
> enough math background to follow it, sort of, and most of the text can
> be read even without fully understanding the theoretical abstractions).
>
> I have already heard people talking about "Java applications are
> buggy".  I don't believe that general sequential programs written in
> Java are any buggier than programs written in other languages... so I
> had interpreted that to mean (based on some inquiry) that complex,
> multi-threaded Java applications are buggy.  And while I also don't
> believe that complex, multi-threaded programs written in Java are any
> buggier than complex, multi-threaded programs written in other
> languages, it does seem to be true that Java is one of the currently
> popular languages in which to write complex, multi-threaded programs,
> because of its language support for threads and concurrency primitives.  
> These reports were from people that are not programmers, but are field
> IT people, that have bought and/or support software and/or hardware with
> drivers, that are written in Java, and seem to have non-ideal behavior,
> (apparently only) curable by stopping/restarting the application or
> driver, or sometimes requiring a reboot.
>
> The paper explains many traps that lead to complex, multi-threaded
> programs being buggy, and being hard to test.  I have worked with
> parallel machines, applications, and databases for 25 years, and can
> appreciate the succinct expression of the problems explained within the
> paper, and can, from experience, agree with its premises and
> conclusions.  Parallel applications only have been commercial successes
> when the parallelism is tightly constrained to well-controlled patterns
> that could be easily understood.  Threads, especially in "cooperation"
> with languages that use memory pointers, have the potential to get out
> of control, in inexplicable ways.

Although the paper is correct in many ways, I find it fails to
distinguish the core of the problem from the chaff surrounding it, and
thus is used to justify poor language designs.

For example, the amount of interaction may be seen as a spectrum: at
one end is C or Java threads, with complicated memory models, and a
tendency to just barely control things using locks. At the other end
would be completely isolated processes with no form of IPC.  The former
is considered the worst possible, while the latter is the best
possible (purely sequential).

However, the latter is too weak for many uses. At a minimum we'd like
some pipes to communicate.  That helps, but it's still too weak.  What if
you have a large amount of data to share, created at startup but
otherwise not modified? So we add some read only types and ways to
define your own read only types. A couple of those types need a
process associated with them, so we make sure process handles are
proper objects too.

What have we got now? It's more on the thread end of the spectrum
than the process end, but it's definitely not a C or Java thread, and
it's definitely not an OS process. What is it? Does it have the
problems in the paper? Only some? Which?

Another peeve I have is his characterization of the observer pattern.
The generalized form of the problem exists both in single-threaded
sequential programs, in the form of unexpected reentrancy, and in
message passing, in the form of infinite CPU usage or an infinite
number of pending messages.

Perhaps threading makes it much worse; I've heard many anecdotes that
would support that. Or perhaps it's the lack of automatic deadlock
detection, giving a clear and diagnosable error for you to fix.
Certainly, the mystery and extremeness of a deadlock could explain how
much it scares people.  Either way the paper says nothing.


> > Python *must* gain means of concurrent execution of CPU bound code
> > eventually to survive on the market. But it must get the right means
> > or we are going to suffer the consequences.
>

> This statement, after reading the paper, seems somewhat in line with the
> author's premise that language acceptability requires that a language be
> self-contained/monolithic, and potentially sufficient to implement
> itself.  That seems to also be one of the reasons that Java is used
> today for threaded applications.  It does seem to be true, given current
> hardware trends, that _some mechanism_ must be provided to obtain the
> benefit of multiple cores/CPUs to a single application, and that Python
> must either implement or interface to that mechanism to continue to be a
> viable language for large scale application development.
>
> Andy seems to want an implementation of independent Python processes
> implemented as threads within a single address space, that can be
> coordinated by an outer application.  This actually corresponds to the
> model promulgated in the paper as being most likely to succeed.  Of
> course, it maps nicely into a model using separate processes,
> coordinated by an outer process, also.  The differences seem to be:
>
> 1) Most applications are historically perceived as corresponding to
> single processes.  Language features for multi-processing are rare, and
> such languages are not in common use.
>
> 2) A single address space can be convenient for the coordinating outer
> application.  It does seem simpler and more efficient to simply "copy"
> data from one memory location to another, rather than send it in a
> message, especially if the data are large.  On the other hand,
> coordination of memory access between multiple cores/CPUs effectively
> causes memory copies from one cache to the other, and if memory is
> accessed from multiple cores/CPUs regularly, the underlying hardware
> implements additional synchronization and copying of data, potentially
> each time the memory is accessed.  Being forced to do message passing of
> data between processes can actually be more efficient than access to
> shared memory at times.  I should note that in my 25 years of parallel
> development, all the systems created used a message passing paradigm,
> partly because the multiple CPUs often didn't share the same memory
> chips, much less the same address space, and that a key feature of all
> the successful systems of that nature was an efficient inter-CPU message
> passing mechanism.  I should also note that Herb Sutter has a recent
> series of columns in Dr Dobbs regarding multi-core/multi-CPU parallelism
> and a variety of implementation pitfalls, that I found to be very
> interesting reading.

Try looking at it on another level: when your CPU wants to read from a
bit of memory controlled by another CPU it sends them a message
requesting they get it for us. They send back a message containing
that memory. They also note we have it, in case they want to modify
it later. We also note where we got it, in case we want to modify it
(and not wait for them to do modifications for us).

Message passing vs shared memory isn't really a yes/no question. It's
about ratios, usage patterns, and tradeoffs. *All* programs will
share data, but in what way? If it's just the code itself you can
move the cache validation into software and simplify the CPU, making
it faster. If the shared data is a lot more than that, and you use it
to coordinate accesses, then it'll be faster to have it in hardware.

It's quite possible they'll come up with something that seems quite
different, but in reality is the same sort of rearrangement. Add
hardware support for transactions, move the caching partly into
software, etc.

>
> I have noted the multiprocessing module that is new to Python 2.6/3.0
> being feverishly backported to Python 2.5, 2.4, etc... indicating that
> people truly find the model/module useful... seems that this is one way,
> in Python rather than outside of it, to implement the model Andy is
> looking for, although I haven't delved into the details of that module
> yet, myself.  I suspect that a non-Python application could load one
> embedded Python interpreter, and then indirectly use the multiprocessing
> module to control other Python interpreters in other processors.  I
> don't know that multithreading primitives such as described in the paper
> are available in the multiprocessing module, but perhaps they can be
> implemented in some manner using the tools that are provided; in any
> case, some interprocess communication primitives are provided via this
> new Python module.
>
> There could be opportunity to enhance Python with process creation and
> process coordination operations, rather than have it depend on
> easy-to-implement-incorrectly coordination patterns or
> easy-to-use-improperly libraries/modules of multiprocessing primitives
> (this is not a slam of the new multiprocessing module, which appears to
> be filling a present need in rather conventional ways, but just to point
> out that ideas promulgated by the paper, which I suspect 2 years later
> are still research topics, may be a better abstraction than the
> conventional mechanisms).
>
> One thing Andy hasn't yet explained (or I missed) is why any of his
> application is coded in a language other than Python.  I can think of a
> number of possibilities:
>
> A) (Historical) It existed, then the desire for extensions was seen, and
> Python was seen as a good extension language.
>
> B) Python is inappropriate (performance?) for some of the algorithms
> (but should they be coded instead as Python extensions, with the core
> application being in Python?)
>
> C) Unavailability of Python wrappers for particularly useful 3rd-party
> libraries
>
> D) Other?

"It already existed" is definitely the original reason, but now it
includes single-threaded performance and multi-threaded scalability.
Although the idea of "just write an extension that releases the GIL"
is a common suggestion, the extension needs to be fairly coarse to be
effective, and ensure that little of the CPU time is left in python.
If the app spreads its CPU time around, it is likely impossible to use
python effectively.
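
The pattern itself is simple enough; what's hard is keeping the work
coarse. A minimal sketch (decode_image() is a made-up stand-in for
the long-running C work):

    #include <Python.h>

    void decode_image(const char *path);    /* hypothetical C worker */

    static PyObject *
    decode(PyObject *self, PyObject *args)
    {
        const char *path;
        if (!PyArg_ParseTuple(args, "s", &path))
            return NULL;

        Py_BEGIN_ALLOW_THREADS      /* GIL released: other threads run  */
        decode_image(path);         /* coarse C work, no python objects */
        Py_END_ALLOW_THREADS        /* GIL re-acquired                  */

        Py_RETURN_NONE;
    }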

greg

Oct 24, 2008, 2:12:43 AM
to
Andy wrote:

> 1) Independent interpreters (this is the easier one--and solved, in
> principle anyway, by PEP 3121, by Martin v. Löwis

Something like that is necessary for independent interpreters,
but not sufficient. There are also all the built-in constants
and type objects to consider. Most of these are statically
allocated at the moment.

> 2) Barriers to "free threading". As Jesse describes, this is simply
> just the GIL being in place, but of course it's there for a reason.
> It's there because (1) doesn't hold and there was never any specs/
> guidance put forward about what should and shouldn't be done in multi-
> threaded apps

No, it's there because it's necessary for acceptable performance
when multiple threads are running in one interpreter. Independent
interpreters wouldn't mean the absence of a GIL; it would only
mean each interpreter having its own GIL.

--
Greg

"Martin v. Löwis"

unread,
Oct 24, 2008, 3:07:09 AM10/24/08
to Rhamphoryncus
> You seem confused. PEP 3121 is for isolated interpreters (ie emulated
> processes), not threading.

Just a small remark: this wasn't the primary objective of the PEP.
The primary objective was to support module cleanup in a reliable
manner, to eventually allow modules to be garbage-collected properly.
However, I also kept the isolated interpreters feature in mind there.

Regards,
Martin

sturlamolden

unread,
Oct 24, 2008, 9:35:39 AM10/24/08
to

Instead of "appdomains" (one interpreter per thread), or free
threading, you could use multiple processes. Take a look at the new
multiprocessing module in Python 2.6. It has roughly the same
interface as Python's threading and queue modules, but uses processes
instead of threads. Processes are scheduled independently by the
operating system. The objects in the multiprocessing module also tend
to have much better performance than their threading and queue
counterparts. If you have a problem with threads due to the GIL, the
multiprocessing module will most likely take care of it.

There is a fundamental problem with using homebrew loading of multiple
(but renamed) copies of PythonXX.dll that is easily overlooked. That
is, extension modules (.pyd) are DLLs as well. Even if required by two
interpreters, they will only be loaded into the process image once.
Thus you have to rename all of them as well, or you will get havoc
with refcounts. Not to speak of what will happen if a Windows HANDLE
is closed by one interpreter while still needed by another. It is
almost guaranteed to bite you, sooner or later.

There are other options as well:

- Use IronPython. It does not have a GIL.

- Use Jython. It does not have a GIL.

- Use pywin32 to create isolated outproc COM servers in Python. (I'm
not sure what the effect of inproc servers would be.)

- Use os.fork() if your platform supports it (Linux, Unix, Apple,
Cygwin, Windows Vista SUA). This is the standard posix way of doing
multiprocessing. It is almost unbeatable if you have a fast copy-on-
write implementation of fork (that is, all platforms except Cygwin).


Andy O'Meara

unread,
Oct 24, 2008, 9:58:06 AM10/24/08
to
On Oct 24, 9:35 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> Instead of "appdomains" (one interpreter per thread), or free
> threading, you could use multiple processes. Take a look at the new
> multiprocessing module in Python 2.6.

That's mentioned earlier in the thread.

>
> There is a fundamental problem with using homebrew loading of multiple
> (but renamed) copies of PythonXX.dll that is easily overlooked. That
> is, extension modules (.pyd) are DLLs as well.

Tell me about it--there's all kinds of problems and maintenance
liabilities with our approach. That's why I'm here talking about this
stuff.

> There are other options as well:
>
> - Use IronPython. It does not have a GIL.
>
> - Use Jython. It does not have a GIL.
>
> - Use pywin32 to create isolated outproc COM servers in Python. (I'm
> not sure what the effect of inproc servers would be.)
>
> - Use os.fork() if your platform supports it (Linux, Unix, Apple,
> Cygwin, Windows Vista SUA). This is the standard posix way of doing
> multiprocessing. It is almost unbeatable if you have a fast copy-on-
> write implementation of fork (that is, all platforms except Cygwin).

This is discussed earlier in the thread--they're unfortunately all
out.

Stefan Behnel

unread,
Oct 24, 2008, 10:19:36 AM10/24/08
to
Terry Reedy wrote:
> Everything in DLLs is compiled C extensions. I see about 15 for Windows
> 3.0.

Ah, weren't those wonderful times back in the days of Win3.0, when DLL-hell was
inhabited by only 15 libraries? *sigh*

... although ... wait, didn't Win3.0 have more than that already? Maybe you
meant Windows 1.0?

SCNR-ly,

Stefan

sturlamolden

unread,
Oct 24, 2008, 10:32:46 AM10/24/08
to
On Oct 24, 3:58 pm, "Andy O'Meara" <and...@gmail.com> wrote:

> This is discussed earlier in the thread--they're unfortunately all
> out.

It occurs to me that tcl is doing what you want. Have you ever thought
of not using Python?

That aside, the fundamental problem is what I perceive a fundamental
design flaw in Python's C API. In Java JNI, each function takes a
JNIEnv* pointer as its first argument. There is nothing that
prevents you from embedding several JVMs in a process. Python can
create embedded subinterpreters, but it works differently. It swaps
subinterpreters like a finite state machine: only one is concurrently
active, and the GIL is shared. The approach is fine, except it kills
free threading of subinterpreters. The argument seems to be that
Apache's mod_python somehow depends on it (for reasons I don't
understand).

Andy O'Meara

unread,
Oct 24, 2008, 10:40:52 AM10/24/08
to
On Oct 24, 2:12 am, greg <g...@cosc.canterbury.ac.nz> wrote:
> Andy wrote:
> > 1) Independent interpreters (this is the easier one--and solved, in
> > principle anyway, by PEP 3121, by Martin v. Löwis
>
> Something like that is necessary for independent interpreters,
> but not sufficient. There are also all the built-in constants
> and type objects to consider. Most of these are statically
> allocated at the moment.
>

Agreed--I was just trying to speak generally. Or, put another way,
there's no hope for independent interpreters without the likes of PEP
3121. Also, as Martin pointed out, there's the issue of module
cleanup some guys here may underestimate (and I'm glad Martin pointed
out the importance of it). Without the module cleanup, every time a
dynamic library using python loads and unloads you've got leaks. This
issue is a real problem for us since our software is loaded and
unloaded many many times in a host app (iTunes, WMP, etc). I hadn't
raised it here yet (and I don't want to turn the discussion to this),
but lack of multiple load and unload support has been another painful
issue that we didn't expect to encounter when we went with python.


> > 2) Barriers to "free threading".  As Jesse describes, this is simply
> > just the GIL being in place, but of course it's there for a reason.
> > It's there because (1) doesn't hold and there was never any specs/
> > guidance put forward about what should and shouldn't be done in multi-
> > threaded apps
>
> No, it's there because it's necessary for acceptable performance
> when multiple threads are running in one interpreter. Independent
> interpreters wouldn't mean the absence of a GIL; it would only
> mean each interpreter having its own GIL.
>

I see what you're saying, but let's note that what you're talking
about at this point is an interpreter containing protection against
client-level code violating the (supposed) guidance put forth in
python's multithreading guidelines. Glenn Linderman's post really gets at
what's at hand here. It's really important to consider that it's not
a given that python (or any framework) has to be designed against
hazardous use. Again, I refer you to the diagrams and guidelines in
the QuickTime API:

http://developer.apple.com/technotes/tn/tn2125.html

They tell you point-blank what you can and can't do, and it's that
simple. Their engineers can then simply create the implementation
around those specs and not weigh any of the implementation down with
sync mechanisms. I'm in the camp that simplicity and convention wins
the day when it comes to an API. It's safe to say that software
engineers expect and assume that a thread that doesn't have contact
with other threads (except for explicit, controlled message/object
passing) will run unhindered and safely, so I raise an eyebrow at the
GIL (or any internal "helper" sync stuff) holding up a thread's
performance when the app is designed to not need lower-level global
locks.

Anyway, let's talk about solutions. My company is looking to support
a python dev community endeavor that allows the following:

- an app makes N worker threads (using the OS)

- each worker thread makes its own interpreter, pops scripts off a
work queue, and manages exporting (and then importing) result data to
other parts of the app. Generally, we're talking about CPU-bound work
here.

- each interpreter has the essentials (e.g. math support, string
support, re support, and so on -- I realize this is open-ended, but
work with me here).

Let's guesstimate about what kind of work we're talking about here and
if this is even in the realm of possibility. If we find that it *is*
possible, let's figure out what level of work we're talking about.
From there, I can get serious about writing up a PEP/spec, paid
support, and so on.

Regards,
Andy

Andy O'Meara

unread,
Oct 24, 2008, 10:58:52 AM10/24/08
to

> That aside, the fundamental problem is what I perceive a fundamental
> design flaw in Python's C API. In Java JNI, each function takes a
> JNIEnv* pointer as its first argument. There is nothing that
> prevents you from embedding several JVMs in a process. Python can
> create embedded subinterpreters, but it works differently. It swaps
> subinterpreters like a finite state machine: only one is concurrently
> active, and the GIL is shared.

Bingo, it seems that you've hit it right on the head there. Sadly,
that's why I regard this thread as largely futile (but I'm an optimist
when it comes to cool software communities so here I am). I've been
afraid to say it for fear of getting mauled by everyone here, but I
would definitely agree if there was a context (i.e. environment)
object passed around then perhaps we'd have the best of all worlds.
*winces*


>
> > This is discussed earlier in the thread--they're unfortunately all
> > out.
>
> It occurs to me that tcl is doing what you want. Have you ever thought
> of not using Python?

Bingo again. Our research says that the options are tcl, perl
(although it's generally untested and not recommended by the
community--definitely dealbreakers for a commercial user like us), and
lua. Also, I'd rather saw off my own right arm than adopt perl, so
that's out. :^)

As I mentioned, we're looking to either (1) support a python dev
community effort, (2) make our own high-performance python interpreter
(that uses an env object as you described), or (3) drop python and go
to lua. I'm favoring them in the order I list them, but the more I
discuss the issue with folks here, the more people seem to be
unfortunately very divided on (1).

Andy

Patrick Stinson

unread,
Oct 24, 2008, 11:26:09 AM10/24/08
to pytho...@python.org
I'm not finished reading the whole thread yet, but I've got some
things below to respond to this post with.

On Thu, Oct 23, 2008 at 9:30 AM, Glenn Linderman <v+py...@g.nevcal.com> wrote:
> On approximately 10/23/2008 12:24 AM, came the following characters from the
> keyboard of Christian Heimes:
>>
>> Andy wrote:
>>>

>>> 2) Barriers to "free threading". As Jesse describes, this is simply
>>> just the GIL being in place, but of course it's there for a reason.
>>> It's there because (1) doesn't hold and there was never any specs/
>>> guidance put forward about what should and shouldn't be done in multi-
>>> threaded apps (see my QuickTime API example). Perhaps if we could go
>>> back in time, we would not put the GIL in place, strict guidelines
>>> regarding multithreaded use would have been established, and PEP 3121
>>> would have been mandatory for C modules. Then again--screw that, if I
>>> could go back in time, I'd just go for the lottery tickets!! :^)
>
>

> I've been following this discussion with interest, as it certainly seems
> that multi-core/multi-CPU machines are the coming thing, and many
> applications will need to figure out how to use them effectively.
>
>> I'm very - not absolute, but very - sure that Guido and the initial
>> designers of Python would have added the GIL anyway. The GIL makes Python
>> faster on single core machines and more stable on multi core machines. Other
>> language designers think the same way. Ruby recently got a GIL. The article

>> http://www.infoq.com/news/2007/05/ruby-threading-futures explains the


>> rationales for a GIL in Ruby. The article also holds a quote from Guido
>> about threading in general.
>>
>> Several people inside and outside the Python community think that threads
>> are dangerous and don't scale. The paper

>> http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf sums it up

We develop virtual instrument plugins for music production using
AudioUnit, VST, and RTAS on Windows and OS X. While our dsp engine's
code has to be written in C/C++ for performance reasons, the gui could
have been written in python. But, we didn't because:

1) Our project lead didn't know python, and the project began with
little time for him to learn it.
2) All of our third-party libs (for dsp, plugin-wrappers, etc) are
written in C++, so it would be far easier to write and debug our app if
written in the same language. Could I do it now? yes. Could we do it
then? No.

** Additionally **, we would have run into this problem, which is very
appropriate to this thread:

3) Adding python as an audio scripting language in the audio thread
would have caused concurrency issues if our GUI had been written in
python, since audio threads are not allowed to make blocking calls
(f.ex. acquiring the GIL).

OK, I'll continue reading the thread now :)


Andy O'Meara

unread,
Oct 24, 2008, 11:42:58 AM10/24/08
to

Glenn, great post and points!

>
> Andy seems to want an implementation of independent Python processes
> implemented as threads within a single address space, that can be
> coordinated by an outer application.  This actually corresponds to the
> model promulgated in the paper as being most likely to succeed.

Yeah, that's the idea--let the highest levels run and coordinate the
show.

>
> It does seem simpler and more efficient to simply "copy"
> data from one memory location to another, rather than send it in a
> message, especially if the data are large.

That's the rub... In our case, we're doing image and video
manipulation--stuff not good to be messaging from address space to
address space. The same argument holds for numerical processing with
large data sets. The workers handing back huge data sets via
messaging isn't very attractive.

> One thing Andy hasn't yet explained (or I missed) is why any of his
> application is coded in a language other than Python.  

Our software runs in real time (so performance is paramount),
interacts with other static libraries, depends on worker threads to
perform real-time image manipulation, and leverages Windows and Mac OS
API concepts and features. Python's performance hits have generally
been a huge challenge with our animators because they often have to go
back and massage their python code to improve execution performance.
So, in short, there are many reasons why we use python as a part
rather than a whole.

The other area of pain that I mentioned in one of my other posts is
that what we ship, above all, can't be flaky. The lack of module
cleanup (intended to be addressed by PEP 3121), using a duplicate copy
of the python dynamic lib, and namespace black magic to achieve
independent interpreters are all examples that have made using python
for us much more challenging and time-consuming than we ever
anticipated.

Again, if it turns out nothing can be done about our needs (which
appears to be more and more like the case), I think it's important for
everyone here to consider the points raised here in the last week.
Moreover, realize that the python dev community really stands to gain
from making python usable as a tool (rather than a monolith). This
fact alone has caused lua to *rapidly* rise in popularity with
software companies looking to embed a powerful, lightweight
interpreter in their software.

As a python language fan and enthusiast, don't let lua win! (I say
this endearingly of course--I have the utmost respect for both
communities and I only want to see CPython be an attractive pick when
a company is looking to embed a language that won't intrude upon their
app's design).


Andy

Patrick Stinson

unread,
Oct 24, 2008, 12:01:33 PM10/24/08
to Andy O'Meara, pytho...@python.org
We are in the same position as Andy here.

I think that something that would help people like us produce
something in code form is a collection of information outlining the
problem and suggested solutions, appropriate parts of the CPython's
current threading API, and pros and cons of the many various proposed
solutions to the different levels of the problem. The most valuable
information I've found is contained in the many (lengthy!) discussions
like this one, a few related PEP's, and the CPython docs, but has
anyone condensed the state of the problem into a wiki or something
similar? Maybe we should start one?

For example, Guido's post here
http://www.artima.com/weblogs/viewpost.jsp?thread=214235 describes some
possible solutions to the problem, like interpreter-specific locks, or
fine-grained object locks, and he also mentions the primary
requirement of not harming from the performance of single-threaded
apps. As I understand it, that requirement does not rule out new build
configurations that provide some level of concurrency, as long as you
can still compile python so as to perform as well on single-threaded
apps.

To add to the heap of use cases, the most important thing to us is to
simply have the python language and the sip/PyQt modules available to
us. All we wanted to do was embed the interpreter and language core as
a local scripting engine, so had we patched python to provide
concurrent execution, we wouldn't have cared about all of the other
unsupported extension modules since our scripts are quite
application-specific.

It seems to me that the very simplest move would be to remove global
static data so the app could provide all thread-related data, which
Andy suggests through references to the QuickTime API. This would
suggest compiling python without thread support so as to leave it up
to the application.

Anyway, I'm having fun reading all of these papers and news postings,
but it's true that code talks, and it could be a little easier if the
state of the problems was condensed. This could be an intense and fun
project, but frankly it's a little tough to keep it all in my head. Is
there a wiki or something out there or should we start one, or do I
just need to read more code?


Patrick Stinson

unread,
Oct 24, 2008, 12:08:37 PM10/24/08
to Andy O'Meara, pytho...@python.org
As a side note to the performance question, we are executing python
code in an audio thread that is used in all of the top-end music
production environments. We have found the language to perform
extremely well when executed at control-rate frequency, meaning we
aren't doing DSP computations, just responding to less-frequent events
like user input and MIDI messages.

So we are sitting on this music platform with unimaginable possibilities
in the music world (in which python plays no role), but those
little CPU spikes caused by the GIL at low latencies won't let us have
it. AFAIK, there is no music scripting language out there that would
come close, and yet we are sooooo close! This is a big deal.


Terry Reedy

unread,
Oct 24, 2008, 12:09:21 PM10/24/08
to pytho...@python.org

Is that the equivalent of a smiley? Or did you really not understand
what I wrote?

Jesse Noller

unread,
Oct 24, 2008, 12:30:18 PM10/24/08
to Andy O'Meara, pytho...@python.org

Point of order! Just for my own sanity if anything :) I think some
minor clarifications are in order.

What are "threads" within Python:

Python has built in support for POSIX light weight threads. This is
what most people are talking about when they see, hear and say
"threads" - they mean Posix Pthreads
(http://en.wikipedia.org/wiki/POSIX_Threads) this is not what you
(Andy) seem to be asking for. PThreads are attractive due to the fact
they exist within a single interpreter, can share memory all "willy
nilly", etc.

Python does in fact, use OS-Level pthreads when you request multiple threads.

The Global Interpreter Lock is fundamentally designed to make the
interpreter easier to maintain and safer: Developers do not need to
worry about other code stepping on their namespace. This makes things
thread-safe, inasmuch as having multiple PThreads within the same
interpreter space modifying global state and variables at once is,
well, bad. A c-level module, on the other hand, can sidestep/release
the GIL at will, and go on its merry way and process away.

POSIX Threads/pthreads/threads as we get from Java, allow unsafe
programming styles. These programming styles are of the "shared
everything deadlock lol" kind. The GIL *partially* protects against
some of the pitfalls. You do not seem to be asking for pthreads :)

http://www.python.org/doc/faq/library/#can-t-we-get-rid-of-the-global-interpreter-lock
http://en.wikipedia.org/wiki/Multi-threading

However, then there are processes.

The difference between threads and processes is that they do *not
share memory* but they can share state via shared queues/pipes/message
passing - what you seem to be asking for - is the ability to
completely fork independent Python interpreters, with their own
namespace and coordinate work via a shared queue accessed with pipes
or some other communications mechanism. Correct?

Multiprocessing, as it exists within python 2.6 today actually forks
(see trunk/Lib/multiprocessing/forking.py) a completely independent
interpreter per process created and then construct pipes to
inter-communicate, and queue to do work coordination. I am not
suggesting this is good for you - I'm trying to get to exactly what
you're asking for.

Fundamentally, allowing total free-threading with Posix threads, using
the same Java-Model for control is a recipe for pain - we're just
repeating mistakes instead of solving a problem, ergo - Adam Olsen's
work. Monitors, Actors, etc have all been discussed, proposed and are
being worked on.

So, just to clarify - Andy, do you want one interpreter, $N threads
(e.g. PThreads) or the ability to fork multiple "heavyweight"
processes?

Other bits for reading:
http://www.boddie.org.uk/python/pprocess.html (as an alternative the
multiprocessing)
http://smparkes.net/tag/dramatis/
http://osl.cs.uiuc.edu/parley/
http://candygram.sourceforge.net/

Jesse Noller

unread,
Oct 24, 2008, 12:32:33 PM10/24/08
to Andy O'Meara, pytho...@python.org
On Fri, Oct 24, 2008 at 12:30 PM, Jesse Noller <jno...@gmail.com> wrote:
> On Fri, Oct 24, 2008 at 10:40 AM, Andy O'Meara <and...@gmail.com> wrote:

I almost forgot:

http://www.kamaelia.org/Home


Andy O'Meara

unread,
Oct 24, 2008, 3:17:21 PM10/24/08
to

>
> The Global Interpreter Lock is fundamentally designed to make the
> interpreter easier to maintain and safer: Developers do not need to
> worry about other code stepping on their namespace. This makes things
> thread-safe, inasmuch as having multiple PThreads within the same
> interpreter space modifying global state and variables at once is,
> well, bad. A c-level module, on the other hand, can sidestep/release
> the GIL at will, and go on its merry way and process away.

...Unless part of the C module execution involves the need to do CPU-
bound work on another thread through a different python interpreter,
right? (even if the interpreter is 100% independent, yikes). For
example, have a python C module designed to programmatically generate
images (and video frames) in RAM for immediate and subsequent use in
animation. Meanwhile, we'd like to have a pthread with its own
interpreter with an instance of this module and have it dequeue jobs
as they come in (in fact, there'd be one of these threads for each
excess core present on the machine). As far as I can tell, it seems
CPython's current state can't support CPU-bound parallelization in the
same address space (basically, it seems that we're talking about the
"embarrassingly parallel" scenario raised in that paper). Why does it
have to be in the same address space? Convenience and simplicity--the
same reasons that most APIs let you hang yourself if the app does dumb
things with threads. Also, when the data sets that you need to send
to and from each process are large, using the same address space makes
more and more sense.


> So, just to clarify - Andy, do you want one interpreter, $N threads
> (e.g. PThreads) or the ability to fork multiple "heavyweight"
> processes?

Sorry if I haven't been clear, but we're talking about the app starting a
pthread, making a fresh/clean/independent interpreter, and then being
responsible for its safety at the highest level (with the payoff of
each of these threads executing without hindrance). No different
than if you used most APIs out there where step 1 is always to make
and init a context object and the final step is always to destroy/take-
down that context object.

I'm a lousy writer sometimes, but I feel bad if you took the time to
describe threads vs processes. The only reason I raised IPC with my
"messaging isn't very attractive" comment was to respond to Glenn
Linderman's points regarding tradeoffs of shared memory vs no.


Andy

Jesse Noller

unread,
Oct 24, 2008, 3:48:57 PM10/24/08
to Andy O'Meara, pytho...@python.org
On Fri, Oct 24, 2008 at 3:17 PM, Andy O'Meara <and...@gmail.com> wrote:

> I'm a lousy writer sometimes, but I feel bad if you took the time to
> describe threads vs processes. The only reason I raised IPC with my
> "messaging isn't very attractive" comment was to respond to Glenn
> Linderman's points regarding tradeoffs of shared memory vs no.
>

I actually took the time to bring anyone listening in up to speed, and
to clarify so I could better understand your use case. Don't feel bad,
things in the thread are moving fast and I just wanted to clear it up.

Ideally, we all want to improve the language, and the interpreter.
However trying to push it towards a particular use case is dangerous
given the idea of "general use".

-jesse

Rhamphoryncus

unread,
Oct 24, 2008, 4:09:46 PM10/24/08
to
On Oct 24, 1:02 pm, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> On approximately 10/24/2008 8:42 AM, came the following characters from
> the keyboard of Andy O'Meara:

>
> > Glenn, great post and points!
>
> Thanks. I need to admit here that while I've got a fair bit of
> professional programming experience, I'm quite new to Python -- I've not
> learned its internals, nor even the full extent of its rich library. So
> I have some questions that are partly about the goals of the
> applications being discussed, partly about how Python is constructed,
> and partly about how the library is constructed. I'm hoping to get a
> better understanding of all of these; perhaps once a better
> understanding is achieved, limitations will be understood, and maybe
> solutions be achievable.
>
> Let me define some speculative Python interpreters; I think the first is
> today's Python:
>
> PyA: Has a GIL. PyA threads can run within a process; but are
> effectively serialized to the places where the GIL is obtained/released.
> Needs the GIL because that solves lots of problems with non-reentrant
> code (an example of non-reentrant code, is code that uses global (C
> global, or C static) variables – note that I'm not talking about Python
> vars declared global... they are only module global). In this model,
> non-reentrant code could include pieces of the interpreter, and/or
> extension modules.
>
> PyB: No GIL. PyB threads acquire/release a lock around each reference to
> a global variable (like "with" feature). Requires massive recoding of
> all code that contains global variables. Reduces performance
> significantly by the increased cost of obtaining and releasing locks.
>
> PyC: No locks. Instead, recoding is done to eliminate global variables
> (interpreter requires a state structure to be passed in). Extension
> modules that use globals are prohibited... this eliminates large
> portions of the library, or requires massive recoding. PyC threads do
> not share data between threads except by explicit interfaces.
>
> PyD: (A hybrid of PyA & PyC). The interpreter is recoded to eliminate
> global variables, and each interpreter instance is provided a state
> structure. There is still a GIL, however, because globals are
> potentially still used by some modules. Code is added to detect use of
> global variables by a module, or some contract is written whereby a
> module can be declared to be reentrant and global-free. PyA threads will
> obtain the GIL as they would today. PyC threads would be available to be
> created. PyC instances refuse to call non-reentrant modules, but also
> need not obtain the GIL... PyC threads would have limited module support
> initially, but over time, most modules can be migrated to be reentrant
> and global-free, so they can be used by PyC instances. Most 3rd-party
> libraries today are starting to care about reentrancy anyway, because of
> the popularity of threads.

PyE: objects are reclassified as shareable or non-shareable, many
types are now only allowed to be shareable. A module and its classes
become shareable with the use of a __future__ import, and their
shareddict uses a read-write lock for scalability. Most other
shareable objects are immutable. Each thread is run in its own
private monitor, and thus protected from the normal threading memory
model nasties. Alas, this gives you all the semantics, but you still
need scalable garbage collection.. and CPython's refcounting needs the
GIL.


> > Our software runs in real time (so performance is paramount),
> > interacts with other static libraries, depends on worker threads to
> > perform real-time image manipulation, and leverages Windows and Mac OS
> > API concepts and features.  Python's performance hits have generally
> > been a huge challenge with our animators because they often have to go
> > back and massage their python code to improve execution performance.
> > So, in short, there are many reasons why we use python as a part
> > rather than a whole.

[...]


> > As a python language fan an enthusiast, don't let lua win!  (I say
> > this endearingly of course--I have the utmost respect for both
> > communities and I only want to see CPython be an attractive pick when
> > a company is looking to embed a language that won't intrude upon their
> > app's design).

I agree with the problem, and desire to make python fill all niches,
but let's just say I'm more ambitious with my solution. ;)

Andy O'Meara

unread,
Oct 24, 2008, 4:51:10 PM10/24/08
to

Another great post, Glenn!! Very well laid-out and posed!! Thanks for
taking the time to lay all that out.

>
> Questions for Andy: is the type of work you want to do in independent
> threads mostly pure Python? Or with libraries that you can control to
> some extent? Are those libraries reentrant? Could they be made
> reentrant? How much of the Python standard library would need to be
> available in reentrant mode to provide useful functionality for those
> threads? I think you want PyC
>

I think you've defined everything perfectly, and you're of course
correct about my love for the PyC model. :^)

Like any software that's meant to be used without restrictions, our
code and frameworks always use a context object pattern so that
there's never any non-const global/shared data. I would go as far as to
say that this is the case with more performance-oriented software than
you may think since it's usually a given for us to have to be parallel
friendly in as many ways as possible. Perhaps Patrick can back me up
there.

As to what modules are "essential"... As you point out, once
reentrant module implementations caught on in a PyC or hybrid world, I
think we'd start to see real effort to whip them into compliance--
there's just so much to be gained imho. But to answer the question,
there's the obvious ones (operator, math, etc), string/buffer
processing (string, re), C bridge stuff (struct, array), and OS basics
(time, file system, etc). Nice-to-haves would be buffer and image
decompression (zlib, libpng, etc), crypto modules, and xml. As far as
I can imagine, I have to believe all of these modules already contain
little, if any, global data, so I have to believe they'd be super easy
to make "PyC happy". Patrick, what would you see you guys using?


> > That's the rub...  In our case, we're doing image and video
> > manipulation--stuff not good to be messaging from address space to
> > address space.  The same argument holds for numerical processing with
> > large data sets.  The workers handing back huge data sets via
> > messaging isn't very attractive.
>

> In the module multiprocessing environment could you not use shared
> memory, then, for the large shared data items?
>

As I understand things, multiprocessing puts stuff in a child
process (i.e. a separate address space), so the only way to get stuff to/
from it is via IPC, which can include a shared/mapped memory region.
Unfortunately, a shared address region doesn't work when you have
large and opaque objects (e.g. a rendered CoreVideo movie in the
QuickTime API or 300 megs of audio data that just went through a
DSP). Then you've got the hit of serialization if you've got
intricate data structures (that would normally need to be
serialized, such as a hashtable or something). Also, if I may speak
for commercial developers out there who are just looking to get the
job done without new code, it's usually preferable to just use a
single high-level sync object (for when the job is complete) than to
start a child processes and use IPC. The former is just WAY less
code, plain and simple.


Andy


Glenn Linderman

unread,
Oct 24, 2008, 4:59:26 PM10/24/08
to Rhamphoryncus, pytho...@python.org
On approximately 10/24/2008 1:09 PM, came the following characters from
the keyboard of Rhamphoryncus:

Hmm. So I think your PyE is an attempt to be more
explicit about what I said above in PyC: PyC threads do not share data
between threads except by explicit interfaces. I consider your
definitions of shared data types somewhat orthogonal to the types of
threads, in that both PyA and PyC threads could use these new shared
data items.

I think/hope that you meant that "many types are now only allowed to be
non-shareable"? At least, I think that should be the default; they
should be within the context of a single, independent interpreter
instance, so other interpreters don't even know they exist, much less
how to share them. If so, then I understand most of the rest of your
paragraph, and it could be a way of providing shared objects, perhaps.

I don't understand the comment that CPython's refcounting needs the
GIL... yes, it needs the GIL if multiple threads see the object, but not
for private objects... only one thread uses the private objects... so
today's refcounting should suffice... with each interpreter doing its
own refcounting and collecting its own garbage.

Shared objects would have to do refcounting in a protected way, under
some lock. One "easy" solution would be to have just two types of
objects; non-shared private objects in a thread, and global shared
objects; access to global shared objects would require grabbing the GIL,
and then accessing the object, and releasing the GIL. An interface
could allow for grabbing/releasing the GIL around a block of accesses to
shared objects (with GIL:) This could reduce the number of GIL
acquires. Then the reference counting for those objects would also be
done under the GIL, and the garbage collecting? By another PyA thread,
perhaps, that grabs the GIL by default? Or a PyC one that explicitly
grabs the GIL and does a step of global garbage collection?

A more complex, more parallel solution would allow for independent
groups of shared objects. Of course, once there is more than one lock
involved, there is more potential for deadlock, but it also provides for
more parallelism. So a shared object might inherit from a "concurrency
group" which would have a lock that could be acquired (with conc_group:)
for access to those data items. Again, the reference counting would be
done under that lock for that group of objects, and garbage collecting
those objects would potentially require that lock as well...

The solution with multiple concurrency groups allows for such groups to
contain a single shared object, or many (probably related) shared
objects. So the application gets a choice of the granularity of sharing
and locking, and can choose the number of locks to optimize performance
and achieve correctness. This sort of shared data among threads,
though, suffers in the limit from all the problems described in the
Berkeley paper. More reliable programs might be achieved by using
straight PyC threads, and some very limited "data ports" that can be
combined using a higher-order flow control concept, as outlined in the
paper.

While Python might be extended with these flow control concepts, they
could be added gradually over time, and in the embedded case, could be
implemented in some other language.


--
Glenn
------------------------------------------------------------------------

. _|_|_| _|
. _| _| _|_| _|_|_| _|_|_|
. _| _|_| _| _|_|_|_| _| _| _| _|
. _| _| _| _| _| _| _| _|
. _|_|_| _| _|_|_| _| _| _| _|

------------------------------------------------------------------------
Obstacles are those frightful things you see when you take your eyes off
of the goal. --Henry Ford


Jesse Noller

unread,
Oct 24, 2008, 5:02:15 PM10/24/08
to Andy O'Meara, pytho...@python.org
On Fri, Oct 24, 2008 at 4:51 PM, Andy O'Meara <and...@gmail.com> wrote:

>> In the module multiprocessing environment could you not use shared
>> memory, then, for the large shared data items?
>>
>
> As I understand things, the multiprocessing puts stuff in a child
> process (i.e. a separate address space), so the only to get stuff to/
> from it is via IPC, which can include a shared/mapped memory region.
> Unfortunately, a shared address region doesn't work when you have
> large and opaque objects (e.g. a rendered CoreVideo movie in the
> QuickTime API or 300 megs of audio data that just went through a
> DSP). Then you've got the hit of serialization if you're got
> intricate data structures (that would normally would need to be
> serialized, such as a hashtable or something). Also, if I may speak
> for commercial developers out there who are just looking to get the
> job done without new code, it's usually always preferable to just a
> single high level sync object (for when the job is complete) than to
> start a child processes and use IPC. The former is just WAY less
> code, plain and simple.
>

Are you familiar with the API at all? Multiprocessing was designed to
mimic threading in about every way possible, the only restriction on
shared data is that it must be serializable, but even then you can
override or customize the behavior.

Also, inter process communication is done via pipes. It can also be
done with messages if you want to tweak the manager(s).

-jesse


Rhamphoryncus

unread,
Oct 24, 2008, 5:15:01 PM10/24/08
to
On Oct 24, 2:59 pm, Glenn Linderman <gl...@nevcal.com> wrote:
> On approximately 10/24/2008 1:09 PM, came the following characters from
> the keyboard of Rhamphoryncus:
> > PyE: objects are reclassified as shareable or non-shareable, many
> > types are now only allowed to be shareable.  A module and its classes
> > become shareable with the use of a __future__ import, and their
> > shareddict uses a read-write lock for scalability.  Most other
> > shareable objects are immutable.  Each thread is run in its own
> > private monitor, and thus protected from the normal threading memory
> > model nasties.  Alas, this gives you all the semantics, but you still
> > need scalable garbage collection.. and CPython's refcounting needs the
> > GIL.
>
> Hmm.  So I think your PyE is an attempt to be more
> explicit about what I said above in PyC: PyC threads do not share data
> between threads except by explicit interfaces.  I consider your
> definitions of shared data types somewhat orthogonal to the types of
> threads, in that both PyA and PyC threads could use these new shared
> data items.

Unlike PyC, there's a *lot* shared by default (classes, modules,
functions), but it requires only minimal recoding. It's as close to
"have your cake and eat it too" as you're gonna get.


> I think/hope that you meant that "many types are now only allowed to be
> non-shareable"?  At least, I think that should be the default; they
> should be within the context of a single, independent interpreter
> instance, so other interpreters don't even know they exist, much less
> how to share them.  If so, then I understand most of the rest of your
> paragraph, and it could be a way of providing shared objects, perhaps.

There aren't multiple interpreters under my model. You only need
one. Instead, you create a monitor, and run a thread on it. A list
is not shareable, so it can only be used within the monitor it's
created within, but the list type object is shareable.

I've no interest in *requiring* a C/C++ extension to communicate
between isolated interpreters. Without that they're really no better
than processes.

Rhamphoryncus

unread,
Oct 24, 2008, 5:16:50 PM10/24/08
to
On Oct 24, 3:02 pm, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> On approximately 10/23/2008 2:24 PM, came the following characters from the
> keyboard of Rhamphoryncus:
>>
>> On Oct 23, 11:30 am, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
>>
>>>
>>> On approximately 10/23/2008 12:24 AM, came the following characters from
>>> the keyboard of Christian Heimes

>>>>
>>>> Andy wrote:
>>>> I'm very - not absolute, but very - sure that Guido and the initial
>>>> designers of Python would have added the GIL anyway. The GIL makes
>>>> Python faster on single core machines and more stable on multi core
>>>> machines.
>
> Actually, the GIL doesn't make Python faster; it is a design decision that
> reduces the overhead of lock acquisition, while still allowing use of global
> variables.
>
> Using finer-grained locks has higher run-time cost; eliminating the use of
> global variables has a higher programmer-time cost, but would actually run
> faster and more concurrently than using a GIL. Especially on a
> multi-core/multi-CPU machine.

Those "globals" include classes, modules, and functions. You can't
have *any* objects shared. Your interpreters are entirely isolated,
much like processes (and we all start wondering why you don't use
processes in the first place.)

Or use safethread. It imposes safe semantics on shared objects, so
you can keep your global classes, modules, and functions. Still need
garbage collection though, and on CPython that means refcounting and
the GIL.


>> Another peeve I have is his characterization of the observer pattern.
>> The generalized form of the problem exists in both single-threaded
>> sequential programs, in the form of unexpected reentrancy, and message
>> passing, with infinite CPU usage or infinite number of pending
>> messages.
>>
>
> So how do you get reentrancy in a single-threaded sequential program? I
> think only via recursion? Which isn't a serious issue for the observer
> pattern. If you add interrupts, then your program is no longer sequential.

Sorry, I meant recursion. Why isn't it a serious issue for
single-threaded programs? Just the fact that it's much easier to
handle when it does happen?


>> Try looking at it on another level: when your CPU wants to read from a
>> bit of memory controlled by another CPU it sends them a message
>> requesting they get it for us. They send back a message containing
>> that memory. They also note we have it, in case they want to modify
>> it later. We also note where we got it, in case we want to modify it
>> (and not wait for them to do modifications for us).
>>
>
> I understand that level... one of my degrees is in EE, and I started college
> wanting to design computers (at about the time the first microprocessor chip
> came along, and they, of course, have now taken over). But I was side-lined
> by the malleability of software, and have mostly practiced software during
> my career.
>
> Anyway, that is the level that Herb Sutter was describing in the Dr Dobbs
> articles I mentioned. And the overhead of doing that at the level of a cache
> line is high, if there is lots of contention for particular memory locations
> between threads running on different cores/CPUs. So to achieve concurrency,
> you must not only limit explicit software locks, but must also avoid memory
> layouts where data needed by different cores/CPUs are in the same cache
> line.

I suspect they'll end up redesigning the caching to use a size and
alignment of 64 bits (or smaller). Same cache line size, but with
masking.

You still need to minimize contention of course, but that should at
least be more predictable. Having two unrelated mallocs contend could
suck.


>> Message passing vs shared memory isn't really a yes/no question. It's
>> about ratios, usage patterns, and tradeoffs. *All* programs will
>> share data, but in what way? If it's just the code itself you can
>> move the cache validation into software and simplify the CPU, making
>> it faster. If the shared data is a lot more than that, and you use it
>> to coordinate accesses, then it'll be faster to have it in hardware.
>>
>
> I agree there are tradeoffs... unfortunately, the hardware architectures
> vary, and the languages don't generally understand the hardware. So then it
> becomes an OS API, which adds the overhead of an OS API call to the cost of
> the synchronization... It could instead be (and in clever applications is) a
> non-portable assembly level function that wraps on OS locking or waiting
> API.

In practice I highly doubt we'll see anything that doesn't extend
traditional threading (posix threads, whatever MS has, etc).


> Nonetheless, while putting the shared data accesses in hardware might be
> more efficient per unit operation, there are still tradeoffs: A software
> solution can group multiple accesses under a single lock acquisition; the
> hardware probably doesn't have enough smarts to do that. So it may well
> require many more hardware unit operations for the same overall concurrently
> executed function, and the resulting performance may not be any better.

Speculative ll/sc? ;)


> Sidestepping the whole issue, by minimizing shared data in the application
> design, avoiding not only software lock calls, and hardware cache
> contention, is going to provide the best performance... it isn't the things
> you do efficiently that make software fast — it is the things you don't do
> at all.

Minimizing contention, certainly. Minimizing the shared data itself
is iffier though.


Andy O'Meara

unread,
Oct 24, 2008, 7:50:26 PM10/24/08
to

> Are you familiar with the API at all? Multiprocessing was designed to
> mimic threading in about every way possible, the only restriction on
> shared data is that it must be serializable, but even then you can
> override or customize the behavior.
>
> Also, inter process communication is done via pipes. It can also be
> done with messages if you want to tweak the manager(s).
>

I apologize in advance if I don't understand something correctly, but
as I understand them, everything has to be serialized in order to go
through IPC. So when you're talking about thousands of objects,
buffers, and/or large OS opaque objects (e.g. memory-resident video
and images), that seems like a pretty rough hit of run-time resources.

Please don't misunderstand my comments to suggest that multiprocessing
isn't great stuff. On the contrary, it's very impressive and it
singlehandedly catapults python *way* closer to efficient CPU bound
processing than it ever was before. All I mean to say is that in the
case where using a shared address space with a worker pthread per
spare core to do CPU bound work, it's a really big win not to have to
serialize stuff. And in the case of hundreds of megs of data and/or
thousands of data structure instances, it's a deal breaker to
serialize and unserialize everything just so that it can be sent
though IPC. It's a deal breaker for most performance-centric apps
because of the unnecessary runtime resource hit and because now all
those data structures being passed around have to have accompanying
serialization code written (and maintained) for them. That's
actually what I meant when I made the comment that a high level sync
object in a shared address space is "better" then sending it all
through IPC (when the data sets are wild and crazy). From a C/C++
point of view, I would venture to say that it's always a huge win to
just stick those "embarrassingly easy" parallelization cases into a
thread with a sync object rather than forking and using IPC and having to
write all the serialization code. And in the case of huge data types--
such as video or image rendering--it makes me nervous to think of
serializing it all just so it can go through IPC when it could just be
passed using a pointer change and a single sync object.

So, if I'm missing something and there's a way to pass data structures
without serialization, then I'd definitely like to learn more (sorry
in advance if I missed something there). When I took a look at
multiprocessing my concerns where:
- serialization (discussed above)
- maturity (are we ready to bet the farm that mp is going to work
properly on the platforms we need it to?)

Again, I'm psyched that multiprocessing appeared in 2.6 and it's a
huge huge step in getting everyone to unlock the power of python!
But, then some of the tidbits described above are additional data
points for you and others to chew on. I can tell you they're pretty
important points for any performance-centric software provider (us,
game developers--from EA to Ambrosia, and A/V production app
developers like Patrick).

Andy


Adam Olsen

unread,
Oct 24, 2008, 8:59:52 PM10/24/08
to Glenn Linderman, pytho...@python.org
On Fri, Oct 24, 2008 at 4:48 PM, Glenn Linderman <v+py...@g.nevcal.com> wrote:
> On approximately 10/24/2008 2:15 PM, came the following characters from the

> keyboard of Rhamphoryncus:
>>
>> On Oct 24, 2:59 pm, Glenn Linderman <gl...@nevcal.com> wrote:
>>
>>>
>>> On approximately 10/24/2008 1:09 PM, came the following characters from
>>> the keyboard of Rhamphoryncus:
>>>
>>>>
>>>> PyE: objects are reclassified as shareable or non-shareable, many
>>>> types are now only allowed to be shareable. A module and its classes
>>>> become shareable with the use of a __future__ import, and their
>>>> shareddict uses a read-write lock for scalability. Most other
>>>> shareable objects are immutable. Each thread is run in its own
>>>> private monitor, and thus protected from the normal threading memory
>>>> model nasties. Alas, this gives you all the semantics, but you still
>>>> need scalable garbage collection.. and CPython's refcounting needs the
>>>> GIL.
>>>>
>>>
>>> Hmm. So I think your PyE is an instance is an attempt to be more
>>> explicit about what I said above in PyC: PyC threads do not share data
>>> between threads except by explicit interfaces. I consider your
>>> definitions of shared data types somewhat orthogonal to the types of
>>> threads, in that both PyA and PyC threads could use these new shared
>>> data items.
>>>
>>
>> Unlike PyC, there's a *lot* shared by default (classes, modules,
>> functions), but it requires only minimal recoding. It's as close to
>> "have your cake and eat it too" as you're gonna get.
>>
>
> Yes, but I like my cake frosted with performance; Guido's non-acceptance of
> granular locks in the blog entry someone referenced was due to the slowdown
> incurred with granular locking and shared objects. Your PyE model, with
> highly granular sharing, will likely suffer the same fate.

No, my approach includes scalable performance. Typical paths will
involve *no* contention (i.e. no locking). Classes and modules use
shareddict, which is based on a read-write lock built into the
interpreter, so it's uncontended for read-only usage patterns. Pretty
much everything else is immutable.

Of course that doesn't include the cost of garbage collection.
CPython's refcounting can't scale.


> The independent threads model, with only slight locking for a few explicitly
> shared objects, has a much better chance of getting better performance
> overall. With one thread running, it would be the same as today; with
> multiple threads, it should scale at the same rate as the system... minus
> any locking done at the higher level.

So use processes with a little IPC for these expensive-yet-"shared"
objects. multiprocessing does it already.


>>> I think/hope that you meant that "many types are now only allowed to be
>>> non-shareable"? At least, I think that should be the default; they
>>> should be within the context of a single, independent interpreter
>>> instance, so other interpreters don't even know they exist, much less
>>> how to share them. If so, then I understand most of the rest of your
>>> paragraph, and it could be a way of providing shared objects, perhaps.
>>>
>>
>> There aren't multiple interpreters under my model. You only need
>> one. Instead, you create a monitor, and run a thread on it. A list
>> is not shareable, so it can only be used within the monitor it's
>> created within, but the list type object is shareable.
>>
>

> The python interpreter code should be sharable, having been written in C,
> and being/becoming reentrant. So in that sense, there is only one
> interpreter. Similarly, any other reentrant C extensions would be that way.
> On the other hand, each thread of execution requires its own interpreter
> context, so that would have to be independent for the threads to be
> independent. It is the combination of code+context that I call an
> interpreter, and there would be one per thread for PyC threads. Bytecode
> for loaded modules could potentially be shared, if it is also immutable.
> However, that could be in my mental "phase 2", as it would require an extra
> level of complexity in the interpreter as it creates shared bytecode...
> there would be a memory savings from avoiding multiple copies of shared
> bytecode, likely, and maybe also a compilation performance savings. So it
> sounds like a win, but it is a win that can deferred for initial simplicity,
> to prove the concept is or is not workable.
>
> A monitor allows a single thread to run at a time; that is the same
> situation as the present GIL. I guess I don't fully understand your model.

To use your terminology, each monitor is a context. Each thread
operates in a different monitor. As you say, most C functions are
already thread-safe (reentrant). All I need to do is avoid letting
multiple threads modify a single mutable object (such as a list) at a
time, which I do by containing it within a single monitor (context).


--
Adam Olsen, aka Rhamphoryncus

Adam Olsen

unread,
Oct 24, 2008, 9:07:04 PM10/24/08
to Glenn Linderman, pytho...@python.org
On Fri, Oct 24, 2008 at 5:38 PM, Glenn Linderman <v+py...@g.nevcal.com> wrote:
> On approximately 10/24/2008 2:16 PM, came the following characters from the
> Indeed; isolated, independent interpreters are one of the goals. It is,
> indeed, much like processes, but in a single address space. It allows the
> master process (Python or C for the embedded case) to be coded using memory
> references and copies and pointer swaps instead of using semaphores, and
> potentially multi-megabyte message transfers.
>
> It is not clear to me that with the use of shared memory between processes,
> that the application couldn't use processes, and achieve many of the same
> goals. On the other hand, the code to create and manipulate processes and
> shared memory blocks is harder to write and has more overhead than the code
> to create and manipulate threads, which can, when told, access any memory
> block in the process. This allows the shared memory to be resized more
> easily, or more blocks of shared memory created more easily. On the other
> hand, the creation of shared memory blocks shouldn't be a high-use operation
> in a program that has sufficient number crunching to do to be able to
> consume multiple cores/CPUs.

>
>> Or use safethread. It imposes safe semantics on shared objects, so
>> you can keep your global classes, modules, and functions. Still need
>> garbage collection though, and on CPython that means refcounting and
>> the GIL.
>>
>
> Sounds like safethread has 35-40% overhead. Sounds like too much, to me.

The specific implementation of safethread, which attempts to remove
the GIL from CPython, has significant overhead and had very limited
success at being scalable.

The monitor design proposed by safethread has no inherent overhead and
is completely scalable.

"Martin v. Löwis"

unread,
Oct 24, 2008, 9:52:58 PM10/24/08
to
>> A c-level module, on the other hand, can sidestep/release
>> the GIL at will, and go on its merry way and process away.
>
> ...Unless part of the C module execution involves the need to do CPU-
> bound work on another thread through a different python interpreter,
> right?

Wrong.

> (even if the interpreter is 100% independent, yikes).

Again, wrong.

> For
> example, have a python C module designed to programmatically generate
> images (and video frames) in RAM for immediate and subsequent use in
> animation. Meanwhile, we'd like to have a pthread with its own
> interpreter with an instance of this module and have it dequeue jobs
> as they come in (in fact, there'd be one of these threads for each
> excess core present on the machine).

I don't understand how this example involves multiple threads. You
mention a single thread (running the module), and you mention designing
a module. Where is the second thread?

Let's assume there is another thread producing jobs, and then
a thread that generates the images. The structure would be this

while 1:
    job = queue.get()
    processing_module.process(job)

and in process:

char *job_data;
char *buf;
PyObject *result;

PyArg_ParseTuple(args, "s", &job_data);
result = PyString_FromStringAndSize(NULL, bufsize);  /* uninitialized buffer */
buf = PyString_AsString(result);
Py_BEGIN_ALLOW_THREADS
compute_frame(job_data, buf);
Py_END_ALLOW_THREADS
return result;

All these compute_frames could happily run in parallel.

> As far as I can tell, it seems
> CPython's current state can't do CPU-bound parallelization in the same
> address space.

That's not true.

Regards,
Martin

"Martin v. Löwis"

unread,
Oct 24, 2008, 9:40:13 PM10/24/08
to
> It seems to me that the very simplest move would be to remove global
> static data so the app could provide all thread-related data, which
> Andy suggests through references to the QuickTime API. This would
> suggest compiling python without thread support so as to leave it up
> to the application.

I'm not sure whether you realize that this is not simple at all.
Consider this fragment

if (string == Py_None || index >= state->lastmark ||
        !state->mark[index] || !state->mark[index+1]) {
    if (empty)
        /* want empty string */
        i = j = 0;
    else {
        Py_INCREF(Py_None);
        return Py_None;

Py_None here is a global variable. How would you replace it?
It's used in thousands of places.

For another example, consider

PyErr_SetString(PyExc_ValueError,
"Empty module name");
or

dp = PyObject_New(dbmobject, &Dbmtype);

There are tons of different variables denoting exceptions and
other types which all somehow need to be rewritten (likely with
undesirable effects on readability).

So I don't think that this is a simple solution. It's the right
one, but it will take five or ten years to implement.

Regards,
Martin


Terry Reedy

unread,
Oct 24, 2008, 11:39:16 PM10/24/08
to pytho...@python.org
Glenn Linderman wrote:

> For example, Python presently has a rather stupid algorithm for string
> concatenation.

Python the language has syntax and semantics. Python implementations
have algorithms that fulfill the defined semantics.

> It allocates only the exactly necessary space for the
> concatenated string. This is a brilliant move, when you realize that
> strings are immutable, and once allocated can never change, but the
> operation
>
> for line in mylistofstrings:
>     string = string + line
>
> is basically O(N-squared) as a result. The better algorithm would
> double the size of memory allocated for string each time there is not
> enough room to add the next line, and that reduces the cost of the
> algorithm to O(N).

If there is more than one reference to a guaranteed immutable object,
such as a string, the 'stupid' algorithm seem necessary to me. In-place
modification of a shared immutable would violate semantics.

However, if you do

string = ''
for line in strings:
    string += line

so that there is only one reference and you tell the interpreter that
you don't mind the old value being updated, then I believe in 2.6, if
not before, CPython does overallocation and in-place extension. (I am
not sure about s=s+l.) But this is just ref-counted CPython.
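
The portable way to get linear behaviour, independent of any
implementation-specific optimisation, is to accumulate the pieces and join
once:

    parts = []
    for line in strings:
        parts.append(line)
    string = ''.join(parts)    # one allocation for the result, O(N) overall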

Terry Jan Reedy


greg

unread,
Oct 25, 2008, 1:26:18 AM10/25/08
to
Andy O'Meara wrote:

> I would definitely agree if there was a context (i.e. environment)
> object passed around then perhaps we'd have the best of all worlds.

Moreover, I think this is probably the *only* way that
totally independent interpreters could be realized.

Converting the whole C API to use this strategy would be
a very big project. Also, on the face of it, it seems like
it would render all existing C extension code obsolete,
although it might be possible to do something clever with
macros to create a compatibility layer.

Another thing to consider is that passing all these extra
pointers around everywhere is bound to have some effect
on performance. The idea mightn't go down too well if it
slows things significantly in the case where you're only
using one interpreter.

--
Greg

greg

unread,
Oct 25, 2008, 1:54:04 AM10/25/08
to
Andy O'Meara wrote:

> - each worker thread makes its own interpreter, pops scripts off a
> work queue, and manages exporting (and then importing) result data to
> other parts of the app.

I hope you realize that starting up one of these interpreters
is going to be fairly expensive. It will have to create its
own versions of all the builtin constants and type objects,
and import its own copy of all the modules it uses.

One wonders if it wouldn't be cheaper just to fork the
process. Shared memory can be used to transfer large lumps
of data if needed.

--
Greg

greg

unread,
Oct 25, 2008, 2:16:59 AM10/25/08
to
Glenn Linderman wrote:

> If Py_None corresponds to None in Python syntax ... then
> it is a fixed constant and could be left global, probably.

No, it couldn't, because it's a reference-counted object
like any other Python object, and therefore needs to be
protected against simultaneous refcount manipulation by
different threads. So each interpreter would need its own
instance of Py_None.

The same goes for all the other built-in constants and
type objects -- there are dozens of these.

> The cost is one more push on every function call,

Which sounds like it could be a rather high cost! If
(just a wild guess) each function has an average of 2
parameters, then this is increasing the amount of
argument pushing going on by 50%...

> On many platforms, there is the concept of TLS, or thread-local storage.

That's another possibility, although doing it that
way would require you to have a separate thread for
each interpreter, which you mightn't always want.

--
Greg

greg

unread,
Oct 25, 2008, 2:19:19 AM10/25/08
to
Andy O'Meara wrote:

> In our case, we're doing image and video
> manipulation--stuff not good to be messaging from address space to
> address space.

Have you considered using shared memory?

Using mmap or equivalent, you can arrange for a block of
memory to be shared between processes. Then you can dump
the big lump of data to be transferred in there, and send
a short message through a pipe to the other process to
let it know it's there.
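
A minimal Unix-flavoured sketch of that arrangement (file name and sizes
are arbitrary):

    import mmap, os

    SIZE = 16 * 1024 * 1024                     # 16 MB scratch region
    fd = os.open('/tmp/frames.bin', os.O_CREAT | os.O_RDWR)
    os.ftruncate(fd, SIZE)
    buf = mmap.mmap(fd, SIZE, mmap.MAP_SHARED)  # shared with any mapper

    r, w = os.pipe()
    if os.fork() == 0:                          # child: produce a frame
        buf[0:5] = 'hello'
        os.write(w, 'x')                        # short "it's there" message
        os._exit(0)
    os.read(r, 1)                               # parent: wait for the nudge
    print buf[0:5]                              # 'hello'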

--
Greg

greg

unread,
Oct 25, 2008, 2:29:52 AM10/25/08
to
Rhamphoryncus wrote:
> A list
> is not shareable, so it can only be used within the monitor it's
> created within, but the list type object is shareable.

Type objects contain dicts, which allow arbitrary values
to be stored in them. What happens if one thread puts
a private object in there? It becomes visible to other
threads using the same type object. If it's not safe
for sharing, bad things happen.

Python's data model is not conducive to making a clear
distinction between "private" and "shared" objects,
except at the level of an entire interpreter.

--
Greg

"Martin v. Löwis"

unread,
Oct 25, 2008, 3:01:29 AM10/25/08
to pytho...@python.org
> If Py_None corresponds to None in Python syntax (sorry I'm not familiar
> with Python internals yet; glad you are commenting, since you are), then

> it is a fixed constant and could be left global, probably.

If None remains global, then type(None) also remains global, and
type(None).__bases__[0]. Then type(None).__bases__[0].__subclasses__()
will yield "interesting" results. This is essentially the status quo.

> But if we
> want a separate None for each interpreter, or if we just use Py_None as
> an example global variable to use to answer the question then here goes

There are a number of problems with that approach. The biggest one is
that it is theoretical. Of course I'm aware of thread-local variables,
and the abstract possibility of collecting all global variables in
a single data structure (in fact, there is already an interpreter
structure and per-interpreter state in Python). I wasn't claiming that
it was impossible to solve that problem - just that it is not simple.
If you want to find out what all the problems are, please try
implementing it for real.

Regards,
Martin

Michael Sparks

unread,
Oct 25, 2008, 7:35:58 AM10/25/08
to
Hi Andy,


Andy wrote:

> However, we require true thread/interpreter
> independence so python 2 has been frustrating at time, to say the
> least.  Please don't start with "but really, python supports multiple
> interpreters" because I've been there many many times with people.
> And, yes, I'm aware of the multiprocessing module added in 2.6, but
> that stuff isn't lightweight and isn't suitable at all for many
> environments (including ours).

This is a somewhat conflicting set of statements: whilst you appear to be
extremely clear on what you want here, and on why multiprocessing and
associated techniques are not appropriate, the combination still reads as
contradictory. I'm guessing I'm not the only person who finds this a
little odd.

Based on the size of the thread, having read it all, I'm guessing also
that you're not going to have an immediate solution but a work around.
However, also based on reading it, I think it's a usecase that would be
generally useful in embedding python.

So, I'll give it a stab as to what I think you're after.

The scenario as I understand it is this:
* You have an application written in C,C++ or similar.
* You've been providing users the ability to script it or customise it
in some fashion using scripts.

Based on the conversation:
* This worked well, and you really liked the results, but...
* You only had one interpreter embedded in the system
* You were allowing users to use multiple scripts

Suddenly you go from: Single script, single memory space.
To multiple scripts, unconstrained shared memory space.

That then causes pain for you and your users. So as a result, you decided to
look for this scenario:
* A mechanism that allows each script to think it's the only script
running on the python interpreter.
* But to still have only one embedded instance of the interpreter.
* With the primary motivation to eliminate the unconstrained shared
memory causing breakage to your software.

So, whilst the multiprocessing module gives you this:
* With the primary motivation to eliminate the unconstrained shared
memory causing breakage to your software.

It's (for whatever reason) too heavyweight for you, due to the multiprocess
usage. At a guess the reason for this is because you allow the user to run
lots of these little scripts.

Essentially what this means is that you want "green processes".

One workaround of achieving that may be to find a way to force threads in
python to ONLY be allowed access to (and only update) thread local values,
rather than default to shared values.

The reason I say that, is because the closest you get to green processes in
python at the moment is /inside/ a python generator. It's nowhere near the
level you want, but it's what made me think of the idea of green processes.

Specifically if you have the canonical example of a python generator:

def fib():
    a, b = 1, 1
    while 1:
        a, b = b, a + b
        yield a

Then no matter how many times I run that, the values are local, and can't
impact each other. Now clearly this isn't what you want, but on some level
it's *similar*.
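
For instance, two instances advance completely independently:

    f1, f2 = fib(), fib()
    f1.next(); f1.next()     # f1 has moved on...
    print f2.next()          # ...but f2 still starts from 1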

You want to be able to do:
run(this_script)

and then when (this_script) is running only use a local environment.

Now, if you could change the threading API, such that there was a means of
forcing all value lookups to look in thread local store before looking
outside the thread local store [1], then this would give you a much greater
level of safety.

[1] I don't know if there is or isn't I've not been sufficiently interested
to look...

I suspect that this would also be a very nice easy win for many
multi-threaded applications as well, reducing accidental data sharing.

Indeed, reversing things such that rather than doing this:
myLocal = threading.local()
myLocal.X = 5

Allowing a thread to force the default to be the other way round:
systemGlobals = threading.globals()
systemGlobals.X = 5

Would make a big difference. Furthermore, it would also raise the question
of what the following should bind into:
import MyModule
from MyOtherModule import whizzy_thing

I don't know if such a change would be sufficient to stop the python
interpreter going bang for extension modules though :-)
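
Purely to make the lookup-order idea concrete - this is a toy namespace
object, not a real or proposed threading API - "thread local first" might
behave like this:

    import threading

    class GreenNamespace(object):
        # Toy lookup order: this thread's values first, then shared ones.
        def __init__(self):
            object.__setattr__(self, '_globals', {})
            object.__setattr__(self, '_local', threading.local())
        def __setattr__(self, name, value):
            self._local.__dict__[name] = value   # writes stay thread-local
        def __getattr__(self, name):
            local = self._local.__dict__
            if name in local:
                return local[name]
            return self._globals[name]           # fall back to shared state

    ns = GreenNamespace()
    ns._globals['mode'] = 'shared-default'  # seed the shared fallback
    ns.mode = 'mine'     # stored thread-locally; other threads still
    print ns.mode        # see 'shared-default', this one prints 'mine'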

I suspect also that this change, whilst potentially fraught with
difficulties, would be incredibly useful in python implementations
that are GIL-free (such as Jython or IronPython)

Now, this for me is entirely theoretical because I don't know much about
python's threading implementation (because I've never needed to), but it
does seem to me to be the easier win than looking for truly independent
interpreters...

It would also be more generally useful, since it would make accidental
sharing of data (which is where threads really hurt people most) much
harder.

Since it was raised in the thread, I'd like to say "use Kamaelia", but your
usecase is slightly different as I understand it. You want to take existing
stuff that won't be written in any particular way, to encourage it to be
safely reusable in a shared environment. We do do that to an extent, but I'm
guessing not quite as unconstrained as you. (We specifically require usage
of things in a lightly constrained manner)

I suspect though that this hypothetical ability to switch a thread to search
thread locals (or only have thread locals) first would itself be incredibly
useful as time goes on.

Kamaelia implements the kind of model that this paper referenced in the
thread advocates:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf

As you'll see from this recent Pycon UK presentation:
http://tinyurl.com/KamaeliaPyconUK

It goes a stage further though by actively providing metaphors based around
components built using inboxes/outboxes designed *specifically* to encourage
safe concurrency. (heritage wise, kamaelia owes more to occam & CSP than
anything else)

After all, we've found that concurrency using generators is good most of
the time - it's probably the most fundamental unit of
concurrency you can get, followed by true coroutines (greenlets). Next
up is threads (you can put generators into threads, but not vice versa).
Next up is processes (you can put threads in processes, but not vice
versa).

Finishing on a random note:

The interesting thing from my perspective is you essentially want something
half way between threads and processes, which I called green processes for
want of a decent phrase. Now that's akin to sandboxing, but I suspect leaky
sandboxing might be sufficient for you. (ie a sandbox where you have to try
hard to break out of the box as opposed to it being trivial) I'd be pretty
certain that something like green processes, or "thread local only" would
be useful in the future.

After all, that along with decent sandboxing would be the sort of thing
necessary to allow python to be embedded in a browser. (If flash used
multiple processes, it'd kill most people's systems after all, and if they
don't have something like green processes, flash would make web pages even
worse...)

Indeed, thread local only and globals accessed via STM [1] would be
incredibly handy. (I say that because generator globals and globals accessed
via a CAT (which is a kamaelia-specific thing, but similar conceptually),
work extremely well)

[1] even something as lightweight as http://www.kamaelia.org/STM

If a "search thread local" approach or "thread local only" approach
sounds reasonable, then it may be a "leaky sandbox" approach is perhaps
worth investigating. After all, a leaky sandbox may be doable.

Tuppence-worthy-ly-yours,


Michael.
--
http://www.kamaelia.org/GetKamaelia

Michael Sparks

unread,
Oct 25, 2008, 7:50:27 AM10/25/08
to
Andy O'Meara wrote:

> Yeah, that's the idea--let the highest levels run and coordinate the
> show.

Yes, this works really well in python and it's lots of fun. We've found so
far that you need, at minimum, the following parts for a little
co-ordination language:

Pipeline
Graphline
Carousel
Seq
OneShot
PureTransformer
TPipe
Filter
Backplane
PublishTo
SubscribeTo

The interesting thing to me about this is that in most systems these would be
patterns of behaviour in activities, whereas in python/kamaelia these are
concrete things you can drop things into. As you'd expect this all becomes
highly declarative.

In practice the world is slightly messier than a theoretical document would
like to suggest, primarily because if you consider things like pygame,
sometimes you can only have a resource instantiated once in a single
process. So you do need a mechanism for advertising services inside a
process and looking those up. (The Backplane idea though helps with
wrapping those up a lot I admit, for certain sorts of service :)

And sometimes you do need to just share data, and when you do that's when
STM is useful.

But concurrent python systems are fun to build :-)


Michael.
--
http://www.kamaelia.org/GetKamaelia

Michael Sparks

unread,
Oct 25, 2008, 7:53:18 AM10/25/08
to
Glenn Linderman wrote:

> In the module multiprocessing environment could you not use shared
> memory, then, for the large shared data items?

If the poshmodule had a bit of TLC, it would be extremely useful for this,
since it does (surprisingly) still work with python 2.5, but does need a
bit of TLC to make it usable.

http://poshmodule.sourceforge.net/


Michael
--
http://www.kamaelia.org/GetKamaelia

Michael Sparks

unread,
Oct 25, 2008, 8:14:41 AM10/25/08
to
Andy O'Meara wrote:

> basically, it seems that we're talking about the
> "embarrassingly parallel" scenario raised in that paper

We build applications in Kamaelia and then discover afterwards that they're
embarrassingly parallel and just work. (we have an introspector that can
look inside running systems and show us the structure that's going on -
very useful for debugging)

My current favourite example of this is a tool created to teaching small
children to read and write:
http://www.kamaelia.org/SpeakAndWrite

Uses gesture recognition and speech synthesis, has a top level view of
around 15 concurrent components, with significant numbers of nested ones.

(OK, that's not embarrassingly parallel since it's only around 50 things, but
the whiteboard with around 200 concurrent things, is)

The trick is to stop viewing concurrency as the problem, but to find a way
to use it as a tool for making it easier to write code. That program was a
10 hour or so hack. You end up focussing on the problem you want to solve,
and naturally gain a concurrent friendly system.

Everything else (GIL's, shared memory etc) then "just" becomes an
optimisation problem - something only to be done if you need it.

My previous favourite examples were based around digital TV, or user
generated content transcode pipelines.

My reason for preferring the speak and write at the moment is because it's a
problem you wouldn't normally think of as benefitting from concurrency,
when in this case it benefitted by being made easier to write in the first
place.

Regards,

Michael
--
http://www.kamaelia.org/GetKamaelia

Michael Sparks

unread,
Oct 25, 2008, 8:33:25 AM10/25/08
to
Jesse Noller wrote:

> http://www.kamaelia.org/Home

Thanks for the mention :)

I don't think it's a good fit for the original poster's question, but a
solution to the original poster's question would be generally useful IMO,
_especially_ on python implementations without a GIL (where threads are the
more natural approach to using multiple processes & multiple processors).

The approach I think would be useful would perhaps by allowing python to
have some concept of "green processes" - that is threads that can only see
thread local values or they search/update thread local space before
checking globals, ie flipping

X = threading.local()
X.foo = "bar"

To something like:
X = greenprocesses.shared()
X.foo = "bar"

Or even just changing the search for values from:
* Search local context
* Search global context

To:
* Search thread local context
* Search local context
* Search global context

Would probably be quite handy, and eliminate whole classes of bugs for
people using threads. (Probably introduce all sorts of new ones of course,
but perhaps easier to isolate ones)

However, I suspect this is also *a lot* easier to say than to implement :-)

(that said, I did hack on the python internals once (cf pep 318) so it might
be quite pleasant to try)

It's also independent of any discussions regarding the GIL of course since
it would just make life generally safer for people.

BTW, regarding something you said on your blog about Kamaelia - whilst
the components list on /Components looks like a large amount of extra stuff
you have to comprehend to use, you don't. (The interdependency between
components is actually very low.)

The core that someone needs to understand is the contents of this:
http://www.kamaelia.org/MiniAxon/

Which is sufficient to get someone started. (based on testing with a couple
of dozen novice developers now :)

If someone doesn't want to rewrite their app to be kamaelia based, they can
cherry pick stuff, by running kamaelia's scheduler in the background and
using components in a file-handle like fashion:
* http://www.kamaelia.org/AxonHandle

The reason /Components contains all those things isn't because we're trying
to make it into a swiss army knife, it's because it's been useful in
domains that have generated those components which are generally
reusable :-)

Michael.
--
http://www.kamaelia.org/GetKamaelia

M.-A. Lemburg

unread,
Oct 25, 2008, 9:46:15 AM10/25/08
to Glenn Linderman, Andy O'Meara, pytho...@python.org
These discussions pop up every year or so and I think that most of them
are not really all that necessary, since the GIL isn't all that bad.

Some pointers into the past:

* http://effbot.org/pyfaq/can-t-we-get-rid-of-the-global-interpreter-lock.htm
Fredrik on the GIL

* http://mail.python.org/pipermail/python-dev/2000-April/003605.html
Greg Stein's proposal to move forward on free threading

* http://www.sauria.com/~twl/conferences/pycon2005/20050325/Python%20at%20Google.notes
(scroll down to the Q&A section)
Greg Stein on whether the GIL really does matter that much

Furthermore, there are lots of ways to tune the CPython VM to make
it more or less responsive to thread switches via the various sys.set*()
functions in the sys module.

Most compute- or I/O-intensive C extensions, built-in modules and object
implementations already release the GIL for you, so it usually doesn't
get in the way all that often.

So you have the option of using a single process with multiple
threads, allowing efficient sharing of data. Or you use multiple
processes and OS mechanisms to share data (shared memory, memory
mapped files, message passing, pipes, shared file descriptors, etc.).

Both have their pros and cons.

There's no general answer to the
problem of how to make best use of multi-core processors, multiple
linked processors or any of the more advanced parallel processing
mechanisms (http://en.wikipedia.org/wiki/Parallel_computing).
The answers will always have to be application specific.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 25 2008)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611

Terry Reedy

unread,
Oct 25, 2008, 12:54:43 PM10/25/08
to pytho...@python.org
Glenn Linderman wrote:
> On approximately 10/24/2008 8:39 PM, came the following characters from
> the keyboard of Terry Reedy:

>> Glenn Linderman wrote:
>>
>>> For example, Python presently has a rather stupid algorithm for
>>> string concatenation.

Yes, CPython2.x, x<=5 did.

>> Python the language has syntax and semantics. Python implementations
>> have algorithms that fulfill the defined semantics.
>

> I can buy that, but when Python is not qualified, CPython should be
> assumed, as it predominates.

People do that, and it sometimes leads to unnecessary confusion. As to
the present discussion, is it about
* changing Python, the language
* changing all Python implementations
* changing CPython, the leading implementation
* branching CPython with a compiler switch, much as there was one for
including Unicode or not.
* forking CPython
* modifying an existing module
* adding a new module
* making better use of the existing facilities
* some combination of the above

> Of course, the latest official release
> should probably also be assumed, but that is so recent,

People do that, and it sometimes leads to unnecessary confusion. People
routinely post version-specific problems and questions without
specifying the version (or platform when relevant). In a month or so,
there will be *2* latest official releases. There will be more
confusion without qualification.

> few have likely
> upgraded as yet... I should have qualified the statement.

* Is the target of this discussion 2.7 or 3.1 (some changes would be 3.1
only).

[diversion to the side topic]

>> If there is more than one reference to a guaranteed immutable object,
>> such as a string, the 'stupid' algorithm seem necessary to me.
>> In-place modification of a shared immutable would violate semantics.
>

> Absolutely. But after the first iteration, there is only one reference
> to string.

Which is to say, 'string' is the only reference to the object it refers
to. You are right, so I presume that the optimization described would
then kick in. But I have not read the code, and CPython optimizations
are not part of the *language* reference.

[back to the main topic]

There is some discussion/debate/confusion about how much of the stdlib
is 'standard Python library' versus 'standard CPython library'. [And
there is some feeling that standard Python modules should have a default
Python implementation that any implementation can use until it
optionally replaces it with a faster compiled version.] Hence my
question about the target of this discussion and the first three options
listed above.

Terry Jan Reedy

Philip Semanchuk

unread,
Oct 25, 2008, 1:55:03 PM10/25/08
to pytho...@python.org

On Oct 25, 2008, at 7:53 AM, Michael Sparks wrote:

> Glenn Linderman wrote:
>
>> In the module multiprocessing environment could you not use shared
>> memory, then, for the large shared data items?
>
> If the poshmodule had a bit of TLC, it would be extremely useful for
> this,
> since it does (surprisingly) still work with python 2.5, but does
> need a
> bit of TLC to make it usable.
>
> http://poshmodule.sourceforge.net/

Last time I checked that was Windows-only. Has that changed?

The only IPC modules for Unix that I'm aware of are one which I
adopted (for System V semaphores & shared memory) and one which I
wrote (for POSIX semaphores & shared memory).

http://NikitaTheSpider.com/python/shm/
http://semanchuk.com/philip/posix_ipc/


If anyone wants to wrap POSH cleverness around them, go for it! If
not, maybe I'll make the time someday.
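
For reference, a hedged sketch of how the posix_ipc flavour is typically
used (class and constant names as per its documentation; treat the details
as assumptions rather than gospel):

    import mmap
    import posix_ipc

    shm = posix_ipc.SharedMemory('/frames', posix_ipc.O_CREAT,
                                 size=1024 * 1024)
    sem = posix_ipc.Semaphore('/frames_ready', posix_ipc.O_CREAT)

    buf = mmap.mmap(shm.fd, shm.size)   # map the segment like a file
    shm.close_fd()                      # the mapping itself stays valid

    buf[0:5] = 'hello'
    sem.release()                       # signal a consumer: data is ready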

Cheers
Philip


"Martin v. Löwis"

unread,
Oct 25, 2008, 3:42:40 PM10/25/08
to pytho...@python.org
>> There are a number of problems with that approach. The biggest one is
>> that it is theoretical.
>
> Not theoretical. Used successfully in Perl.

Perhaps it is indeed what Perl does, I know nothing about that.
However, it *is* theoretical for Python. Please trust me that
there are many many many many pitfalls in it, each needing a
separate solution, most likely with no equivalent in Perl.

If you had a working patch, *then* it would be practical.

> Granted Perl is quite a
> different language than Python, but then there are some basic
> similarities in the concepts.

Yes - just as much as both are implemented in C :-(

> Perhaps you should list the problems, instead of vaguely claiming that
> there are a number of them. Hard to respond to such a vague claim.

As I said: go implement it, and you will find out. Unless you are
really going at an implementation, I don't want to spend my time
explaining it to you.

> But the approach is sound; nearly any monolithic
> program can be turned into a multithreaded program containing one
> monolith per thread using such a technique.

I'm not debating that. I just claim that it is far from simple.

Regards,
Martin


Andy O'Meara

unread,
Oct 25, 2008, 4:01:04 PM10/25/08
to
On Oct 24, 9:52 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> >> A c-level module, on the other hand, can sidestep/release
> >> the GIL at will, and go on its merry way and process away.
>
> > ...Unless part of the C module execution involves the need to do CPU-
> > bound work on another thread through a different python interpreter,
> > right?
>
> Wrong.
>
> > (even if the interpreter is 100% independent, yikes).
>
> Again, wrong.
>
> > For
> > example, have a python C module designed to programmatically generate
> > images (and video frames) in RAM for immediate and subsequent use in
> > animation.  Meanwhile, we'd like to have a pthread with its own
> > interpreter with an instance of this module and have it dequeue jobs
> > as they come in (in fact, there'd be one of these threads for each
> > excess core present on the machine).
>
> I don't understand how this example involves multiple threads. You
> mention a single thread (running the module), and you mention designing
> a  module. Where is the second thread?

Glenn seems to be following me here... The point is to have as many
threads as the app wants, each in its own world, running without
restriction (performance wise). Maybe the app wants to run a thread
for each extra core on the machine.

Perhaps the disconnect here is that when I've been saying "start a
thread", I mean the app starts an OS thread (e.g. pthread) with the
understanding that any contact with other threads is managed at the app level
(as opposed to starting threads through python). So, as far as python
knows, there's zero mention or use of threading in any way,
*anywhere*.


> > As far as I can tell, it seems
> > CPython's current state can't do CPU-bound parallelization in the same
> > address space.
>
> That's not true.
>

Um... So let's say you have an opaque object ref from the OS that
represents hundreds of megs of data (e.g. memory-resident video). How
do you get that back to the parent process without serialization and
IPC? What should really happen is just use the same address space so
just a pointer changes hands. THAT's why I'm saying that a separate
address space is generally a deal breaker when you have large or
intricate data sets (ie. when performance matters).

Andy


Andy O'Meara

unread,
Oct 25, 2008, 4:23:31 PM10/25/08
to
On Oct 24, 9:40 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > It seems to me that the very simplest move would be to remove global
> > static data so the app could provide all thread-related data, which
> > Andy suggests through references to the QuickTime API. This would
> > suggest compiling python without thread support so as to leave it up
> > to the application.
>
> I'm not sure whether you realize that this is not simple at all.
> Consider this fragment
>
>     if (string == Py_None || index >= state->lastmark ||
> !state->mark[index] || !state->mark[index+1]) {
>         if (empty)
>             /* want empty string */
>             i = j = 0;
>         else {
>             Py_INCREF(Py_None);
>             return Py_None;
>


The way to think about it is that, ideally in PyC, there are never any
global variables. Instead, all "globals" are now part of a context
(ie. an interpreter) and it would presumably be illegal to ever use
them in a different context. I'd say this is already the expectation
and convention for any modern, industry-grade software package
marketed as extension for apps. Industry app developers just want to
drop in a 3rd party package, make as many contexts as they want (in as
many threads as they want), and expect to use each context without
restriction (since they're ensuring contexts never interact with each
other). For example, if I use zlib, libpng, or libjpg, I can make as
many contexts as I want and put them in whatever threads I want. In
the app, the only thing I'm on the hook for is to: (a) never use
objects from one context in another context, and (b) ensure that I
never make any calls into a module from more than one thread at the
same time. Both of these requirements are trivial to follow in the
"embarrassingly easy" parallelization scenarios, and that's why I
started this thread in the first place. :^)

Andy

Andy O'Meara

unread,
Oct 25, 2008, 4:43:58 PM10/25/08
to
On Oct 24, 10:24 pm, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
>
> > And in the case of hundreds of megs of data
>
> ... and I would be surprised at someone that would embed hundreds of
> megs of data into an object such that it had to be serialized... seems
> like the proper design is to point at the data, or a subset of it, in a
> big buffer.  Then data transfers would just transfer the offset/length
> and the reference to the buffer.
>
> > and/or thousands of data structure instances,
>
> ... and this is another surprise!  You have thousands of objects (data
> structure instances) to move from one thread to another?

Heh, no, we're actually in agreement here. I'm saying that in the
case where the data sets are large and/or intricate, a single top-
level pointer changing hands is *always* the way to go rather than
serialization. For example, suppose you had some nifty python code
and C procs that were doing lots of image analysis, outputting tons of
intricate and rich data structures. Once the thread is done with that
job, all that output is trivially transferred back to the appropriate
thread by a pointer changing hands.

>
> Of course, I know that data get large, but typical multimedia streams
> are large, binary blobs.  I was under the impression that processing
> them usually proceeds along the lines of keeping offsets into the blobs,
> and interpreting, etc.  Editing is usually done by making a copy of a
> blob, transforming it or a subset in some manner during the copy
> process, resulting in a new, possibly different-sized blob.

No, you're definitely right-on, with the the additional point that the
representation of multimedia usually employs intricate and diverse
data structures (imagine the data structure representation of a movie
encoded in modern codec, such as H.264, complete with paths, regions,
pixel flow, geometry, transformations, and textures). As we both
agree, that's something that you *definitely* want to move around via
a single pointer (and not in a serialized form). Hence, my position
that apps that use python can't be forced to go through IPC or else:
(a) there's a performance/resource waste to serialize and unserialize
large or intricate data sets, and (b) they're required to write and
maintain serialization code that otherwise doesn't serve any other
purpose.

Andy

Andy O'Meara

unread,
Oct 25, 2008, 5:26:18 PM10/25/08
to

> Andy O'Meara wrote:
> > I would definitely agree if there was a context (i.e. environment)
> > object passed around then perhaps we'd have the best of all worlds.
>
> Moreover, I think this is probably the *only* way that
> totally independent interpreters could be realized.
>
> Converting the whole C API to use this strategy would be
> a very big project. Also, on the face of it, it seems like
> it would render all existing C extension code obsolete,
> although it might be possible to do something clever with
> macros to create a compatibility layer.
>
> Another thing to consider is that passing all these extra
> pointers around everywhere is bound to have some effect
> on performance.


Good points--I would agree with you on all counts there. On the
"passing a context everywhere" performance hit, perhaps one idea is
that all objects could have an additional field that would point back
to their parent context (ie. their interpreter). So the only
prototypes that would have to be modified to contain the context ptr
would be the ones that inherently don't take any objects. This would
conveniently and generally correspond to procs associated with
interpreter control (e.g. importing modules, shutting down modules,
etc).


> Andy O'Meara wrote:
> > - each worker thread makes its own interpreter, pops scripts off a
> > work queue, and manages exporting (and then importing) result data to
> > other parts of the app.
>
> I hope you realize that starting up one of these interpreters
> is going to be fairly expensive.

Absolutely. I had just left that issue out in an effort to keep the
discussion pointed, but it's a great point to raise. My response is
that, like any 3rd party industry package, I'd say this is the
expectation (that context startup and shutdown is non-trivial and to
should be minimized for performance reasons). For simplicity, my
examples didn't talk about this issue but in practice, it'd be typical
for apps to have their "worker" interpreters persist as they chew
through jobs.


Andy


Rhamphoryncus

unread,
Oct 25, 2008, 6:22:49 PM10/25/08
to

Shareable type objects (enabled by a __future__ import) use a
shareddict, which requires all keys and values to themselves be
shareable objects.

Although it's a significant semantic change, in many cases it's easy
to deal with: replace mutable (unshareable) global constants with
immutable ones (ie list -> tuple, set -> frozenset). If you've got
some global state you move it into a monitor (which doesn't scale, but
that's your design). The only time this really fails is when you're
deliberately storing arbitrary mutable objects from any thread, and
later inspecting them from any other thread (such as our new ABC
system's cache). If you want to store an object, but only to give it
back to the original thread, I've got a way to do that.
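
Concretely, the mechanical part of that change looks like this (the
shareability semantics and the __future__ switch are safethread's
proposal, not current CPython; the names are made up for illustration):

    # Before: mutable module-level "constants" -- unshareable in the model
    SUPPORTED_CODECS = ['h264', 'theora']
    KNOWN_TAGS = set(['keyframe', 'delta'])

    # After: immutable equivalents -- freely shareable between threads
    SUPPORTED_CODECS = ('h264', 'theora')
    KNOWN_TAGS = frozenset(['keyframe', 'delta'])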

greg

unread,
Oct 25, 2008, 7:58:44 PM10/25/08
to
Glenn Linderman wrote:
> On approximately 10/25/2008 12:01 AM, came the following characters from
> the keyboard of Martin v. Löwis:

>
>> If None remains global, then type(None) also remains global, and
>> type(None).__bases__[0]. Then type(None).__bases__[0].__subclasses__()
>> will yield "interesting" results. This is essentially the status quo.
>
> I certainly don't grok the implications of what you say above,
> as I barely grok the semantics of it.

Not only is there a link from a class to its base classes, there
is a link to all its subclasses as well.

Since every class is ultimately a subclass of 'object', this means
that starting from *any* object, you can work your way up the
__bases__ chain until you get to 'object', then walk the subclass
hierarchy and find every class in the system.

This means that if any object at all is shared, then all class
objects, and any object reachable from them, are shared as well.
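
That reachability is easy to demonstrate: starting from None alone you can
enumerate every (new-style) class in the process:

    def all_classes(root=object):
        seen, stack = set(), [root]
        while stack:
            cls = stack.pop()
            if cls not in seen:
                seen.add(cls)
                stack.extend(cls.__subclasses__())  # walk down the hierarchy
        return seen

    root = type(None).__bases__[0]   # this is <type 'object'>
    print len(all_classes(root))     # every class currently in the process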

--
Greg

"Martin v. Löwis"

unread,
Oct 26, 2008, 5:57:07 AM10/26/08
to
>>> As far as I can tell, it seems
>>> CPython's current state can't do CPU-bound parallelization in the same
>>> address space.
>> That's not true.
>>
>
> Um... So let's say you have a opaque object ref from the OS that
> represents hundreds of megs of data (e.g. memory-resident video). How
> do you get that back to the parent process without serialization and
> IPC?

What parent process? I thought you were talking about multi-threading?

> What should really happen is just use the same address space so
> just a pointer changes hands. THAT's why I'm saying that a separate
> address space is generally a deal breaker when you have large or
> intricate data sets (ie. when performance matters).

Right. So use a single address space, multiple threads, and perform the
heavy computations in C code. I don't see how Python is in the way at
all. Many people do that, and it works just fine. That's what
Jesse (probably) meant with his remark

>> A c-level module, on the other hand, can sidestep/release
>> the GIL at will, and go on its merry way and process away.

Please reconsider this; it might be a solution to your problem.

Regards,
Martin

Andy O'Meara

unread,
Oct 26, 2008, 8:57:02 PM10/26/08
to

Grrr... I posted a ton of lengthy replies to you and other recent
posts here using Google and none of them made it, argh. Poof. There's
nothing that fires me up more than lost work, so I'll have to
revert to short and simple answers for the time being. Argh, damn.


On Oct 25, 1:26 am, greg <g...@cosc.canterbury.ac.nz> wrote:
> Andy O'Meara wrote:
> > I would definitely agree if there was a context (i.e. environment)
> > object passed around then perhaps we'd have the best of all worlds.
>
> Moreover, I think this is probably the *only* way that
> totally independent interpreters could be realized.
>
> Converting the whole C API to use this strategy would be
> a very big project. Also, on the face of it, it seems like
> it would render all existing C extension code obsolete,
> although it might be possible to do something clever with
> macros to create a compatibility layer.
>
> Another thing to consider is that passing all these extra
> pointers around everywhere is bound to have some effect
> on performance.


I'm with you on all counts, so no disagreement there. On the "passing
a ptr everywhere" issue, perhaps one idea is that all objects could
have an additional field that would point back to their parent context
(ie. their interpreter). So the only prototypes that would have to be
modified to contain the context ptr would be the ones that don't
inherently operate on objects (e.g. importing a module).


On Oct 25, 1:54 am, greg <g...@cosc.canterbury.ac.nz> wrote:
> Andy O'Meara wrote:
> > - each worker thread makes its own interpreter, pops scripts off a
> > work queue, and manages exporting (and then importing) result data to
> > other parts of the app.
>
> I hope you realize that starting up one of these interpreters
> is going to be fairly expensive. It will have to create its
> own versions of all the builtin constants and type objects,
> and import its own copy of all the modules it uses.
>

Yeah, for sure. And I'd say that's a pretty well established
convention already out there for any industry package. The pattern
I'd expect to see is where the app starts worker threads, starts
interpreters in one or more of each, and throws jobs to different ones
(and the interpreter would persist to move on to subsequent jobs).

> One wonders if it wouldn't be cheaper just to fork the
> process. Shared memory can be used to transfer large lumps
> of data if needed.
>

As I mentioned, when you're talking about intricate data structures, OS
opaque objects (ie. that have their own internal allocators), or huge
data sets, even a shared memory region unfortunately can't fit the
bill.


Andy

Andy O'Meara

unread,
Oct 26, 2008, 9:33:30 PM10/26/08
to and...@gmail.com
On Oct 24, 9:52 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> >> A c-level module, on the other hand, can sidestep/release
> >> the GIL at will, and go on its merry way and process away.
>
> > ...Unless part of the C module execution involves the need to do CPU-
> > bound work on another thread through a different python interpreter,
> > right?
>
> Wrong.


Let's take a step back and remind ourselves of the big picture. The
goal is to have independent interpreters running in pthreads that the
app starts and controls. At no point is any interpreter doing
any thread-related stuff in any way. For example, each script job
just does meat and potatoes CPU work, using callbacks that, say,
programmatically use OS APIs to edit and transform frame data.

So I think the disconnect here is that maybe you're envisioning
threads being created *in* python. To be clear, we're talking about
making threads at the app level and making it a given for the app to
take its safety in its own hands.

>
> > As far as I can tell, it seems
> > CPython's current state can't do CPU-bound parallelization in the same
> > address space.
>
> That's not true.
>

Well, when you're talking about large, intricate data structures
(which include opaque OS object refs that use process-associated
allocators), even a shared memory region between the child process and
the parent can't do the job. Otherwise, please describe in detail how
I'd get an opaque OS object (e.g. an OS ref that refers to memory-
resident video) from the child process back to the parent process.

Again, the big picture that I'm trying to plant here is that there
really is a serious need for truly independent interpreters/contexts
in a shared address space. Consider stuff like libpng, zlib, libjpg,
or whatever, the use pattern is always the same: make a context
object, do your work in the context, and take it down. For most
industry-caliber packages, the expectation and convention (unless
documented otherwise) is that the app can make as many contexts as it
wants in whatever threads it wants because the convention is that the
app must (a) never use one context's objects in another context,
and (b) never use a context at the same time from more than one
thread. That's all I'm really trying to look at here.


Andy


Andy O'Meara

unread,
Oct 26, 2008, 10:03:31 PM10/26/08
to and...@gmail.com

> > And in the case of hundreds of megs of data
>
> ... and I would be surprised at someone that would embed hundreds of
> megs of data into an object such that it had to be serialized... seems
> like the proper design is to point at the data, or a subset of it, in a
> big buffer.  Then data transfers would just transfer the offset/length
> and the reference to the buffer.
>
> > and/or thousands of data structure instances,
>
> ... and this is another surprise!  You have thousands of objects (data
> structure instances) to move from one thread to another?
>

I think we miscommunicated there--I'm actually agreeing with you. I
was trying to make the same point you were: that intricate and/or
large structures are meant to be passed around by a top-level pointer,
not using and serialization/messaging. This is what I've been trying
to explain to others here; that IPC and shared memory unfortunately
aren't viable options, leaving app threads (rather than child
processes) as the solution.


> Of course, I know that data get large, but typical multimedia streams
> are large, binary blobs.  I was under the impression that processing
> them usually proceeds along the lines of keeping offsets into the blobs,
> and interpreting, etc.  Editing is usually done by making a copy of a
> blob, transforming it or a subset in some manner during the copy
> process, resulting in a new, possibly different-sized blob.


Your instincts are right. I'd only add on that when you're talking
about data structures associated with an intricate video format, the
complexity and depth of the data structures is insane -- the LAST
thing you want to burn cycles on is serializing and unserializing that
stuff (so IPC is out)--again, we're already on the same page here.

I think at one point you made the comment that shared memory is a
solution to handle large data sets between a child process and the
parent. Although this is certainty true in principle, it doesn't hold
up in practice since complex data structures often contain 3rd party
and OS API objects that have their own allocators. For example, in
video encoding, there's TONS of objects that comprise memory-resident
video from all kinds of APIs, so the idea of having them allocated
from shared/mapped memory block isn't even possible. Again, I only
raise this to offer evidence that doing real-world work in a child
process is a deal breaker--a shared address space is just way too much
to give up.


Andy

James Mills

unread,
Oct 26, 2008, 10:11:45 PM10/26/08
to Andy O'Meara, pytho...@python.org
On Mon, Oct 27, 2008 at 12:03 PM, Andy O'Meara <and...@gmail.com> wrote:
> I think we miscommunicated there--I'm actually agreeing with you. I
> was trying to make the same point you were: that intricate and/or
> large structures are meant to be passed around by a top-level pointer,
> not using and serialization/messaging. This is what I've been trying
> to explain to others here; that IPC and shared memory unfortunately
> aren't viable options, leaving app threads (rather than child
> processes) as the solution.

Andy,

Why don't you just use a temporary file
system (ram disk) to store the data that
your app is manipulating? All you need to
pass around then is a file descriptor.
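
A sketch of that approach on Linux, using /dev/shm as the ram disk
(descriptors are inherited across fork, so "passing the fd" is free in the
forked case; the file name is arbitrary):

    import os

    fd = os.open('/dev/shm/job-0001', os.O_CREAT | os.O_RDWR)
    os.write(fd, 'frame payload goes here')

    if os.fork() == 0:                  # child inherits fd; nothing copied
        os.lseek(fd, 0, os.SEEK_SET)
        print os.read(fd, 64)
        os._exit(0)
    os.wait()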

--JamesMills

--
"Problems are solved by method"

"Martin v. Löwis"

unread,
Oct 27, 2008, 4:05:27 AM10/27/08
to
Andy O'Meara wrote:
> On Oct 24, 9:52 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>>>> A c-level module, on the other hand, can sidestep/release
>>>> the GIL at will, and go on its merry way and process away.
>>> ...Unless part of the C module execution involves the need to do CPU-
>>> bound work on another thread through a different python interpreter,
>>> right?
>> Wrong.
[...]

>
> So I think the disconnect here is that maybe you're envisioning
> threads being created *in* python. To be clear, we're talking out
> making threads at the app level and making it a given for the app to
> take its safety in its own hands.

No. Whether or not threads are created by Python or the application
does not matter for my "Wrong" evaluation: in either case, C module
execution can easily side-step/release the GIL.

>>> As far as I can tell, it seems
>>> CPython's current state can't do CPU-bound parallelization in the same
>>> address space.
>> That's not true.
>>
>
> Well, when you're talking about large, intricate data structures
> (which include opaque OS object refs that use process-associated
> allocators), even a shared memory region between the child process and
> the parent can't do the job. Otherwise, please describe in detail how
> I'd get an opaque OS object (e.g. an OS ref that refers to memory-
> resident video) from the child process back to the parent process.

WHAT PARENT PROCESS? "In the same address space", to me, means
"a single process only, not multiple processes, and no parent process
anywhere". If you have just multiple threads, the notion of passing
data from a "child process" back to the "parent process" is
meaningless.

> Again, the big picture that I'm trying to plant here is that there
> really is a serious need for truly independent interpreters/contexts
> in a shared address space.

I understand that this is your mission in this thread. However, why
is that your problem? Why can't you just use the existing (limited)
multiple-interpreters machinery, and solve your problems with that?

> For most
> industry-caliber packages, the expectation and convention (unless
> documented otherwise) is that the app can make as many contexts as its
> wants in whatever threads it wants because the convention is that the
> app is must (a) never use one context's objects in another context,
> and (b) never use a context at the same time from more than one
> thread. That's all I'm really trying to look at here.

And that's indeed the case for Python, too. The app can make as many
subinterpreters as it wants to, and it must not pass objects from one
subinterpreter to another one, nor should it use a single interpreter
from more than one thread (although that is actually supported by
Python - but it surely won't hurt if you restrict yourself to a single
thread per interpreter).

Regards,
Martin


Rhamphoryncus

unread,
Oct 28, 2008, 5:05:59 AM10/28/08
to
On Oct 26, 6:57 pm, "Andy O'Meara" <and...@gmail.com> wrote:
> Grrr... I posted a ton of lengthy replies to you and other recent
> posts here using Google and none of them made it, argh. Poof. There's
> nothing that fires me up more than lost work, so I'll have to
> revert to short and simple answers for the time being. Argh, damn.
>
> On Oct 25, 1:26 am, greg <g...@cosc.canterbury.ac.nz> wrote:
>
>
>
> > Andy O'Meara wrote:
> > > I would definitely agree if there was a context (i.e. environment)
> > > object passed around then perhaps we'd have the best of all worlds.
>
> > Moreover, I think this is probably the *only* way that
> > totally independent interpreters could be realized.
>
> > Converting the whole C API to use this strategy would be
> > a very big project. Also, on the face of it, it seems like
> > it would render all existing C extension code obsolete,
> > although it might be possible to do something clever with
> > macros to create a compatibility layer.
>
> > Another thing to consider is that passing all these extra
> > pointers around everywhere is bound to have some effect
> > on performance.
>
> I'm with you on all counts, so no disagreement there.  On the "passing
> a ptr everywhere" issue, perhaps one idea is that all objects could
> have an additional field that would point back to their parent context

> (ie. their interpreter).  So the only prototypes that would have to be
> modified to contain the context ptr would be the ones that don't
> inherently operate on objects (e.g. importing a module).

Trying to directly share objects like this is going to create
contention. The refcounting becomes the sequential portion of
Amdahl's Law. This is why safethread doesn't scale very well: it shares
a massive number of objects.

An alternative, actually simpler, is to create proxies to your real
object. The proxy object has a pointer to the real object and the
context containing it. When you call a method it serializes the
arguments, acquires the target context's GIL (while releasing yours),
and deserializes in the target context. Once the method returns it
reverses the process.
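
A toy, threads-and-queues version of that proxy pattern (real
inter-interpreter proxies would serialize arguments and swap GILs; here a
queue stands in for "entering the target context"):

    import threading, Queue

    class Proxy(object):
        # Forwards method calls to the one thread owning the real object.
        def __init__(self, target):
            self._target = target
            self._calls = Queue.Queue()
            server = threading.Thread(target=self._serve)
            server.setDaemon(True)
            server.start()
        def _serve(self):
            while True:
                name, args, reply = self._calls.get()
                reply.put(getattr(self._target, name)(*args))  # owner thread
        def call(self, name, *args):
            reply = Queue.Queue(1)
            self._calls.put((name, args, reply))
            return reply.get()      # block until the owning thread answers

    frames = Proxy([])
    frames.call('append', 'frame-1')
    print frames.call('__len__')    # 1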

There are two reasons why this may perform well for you: First,
operations done purely in C may cheat (if so designed). A copy from
one memory buffer to another memory buffer may be given two proxies as
arguments, but then operate directly on the target objects (ie without
serialization).

Second, if a target context is idle you can enter it (acquiring its
GIL) without any context switch.

Of course that scenario is full of "maybes", which is why I have
little interest in it.

An even better scenario is if your memory buffer's methods are in pure
C and it's a simple object (no pointers). You can stick the memory
buffer in shared memory and have multiple processes manipulate it from
C. More "maybes".

An evil trick if you need pointers, but control the allocation, is to
take advantage of the fork model. Have a master process create a
bunch of blank files (temp files if linux doesn't allow /dev/zero),
mmap them all using MAP_SHARED, then fork and utilize. The addresses
will be inherited from the master process, so any pointers within them
will be usable across all processes. If you ever want to return
memory to the system you can close that file, then have all processes
use MAP_SHARED|MAP_FIXED to overwrite it. Evil, but should be
disturbingly effective, and still doesn't require modifying CPython.
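
From Python, the skeleton of that trick might look like this (POSIX
only; the size and payload are arbitrary, and the MAP_FIXED
reclamation step is left out):

import mmap
import os
import tempfile

SIZE = 1 << 20                       # 1 MiB region, arbitrary

f = tempfile.TemporaryFile()         # the "blank file" backing store
f.truncate(SIZE)
buf = mmap.mmap(f.fileno(), SIZE, flags=mmap.MAP_SHARED)

pid = os.fork()
if pid == 0:
    # The child inherits the mapping at the same address, so offsets
    # (and raw pointers, if this were C) stay valid across processes.
    buf[0:5] = b"hello"
    os._exit(0)

os.waitpid(pid, 0)
assert buf[0:5] == b"hello"          # written by the child, seen by the parent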

Michael Sparks

unread,
Oct 28, 2008, 5:34:31 AM10/28/08
to
Glenn Linderman wrote:

> so a 3rd party library might be called to decompress the stream into a
> set of independently allocated chunks, each containing one frame (each
> possibly consisting of several allocations of memory for associated
> metadata) that is independent of other frames

We use a combination of a dictionary + RGB data for this purpose. Using a
dictionary works out pretty nicely for the metadata, and obviously one
attribute holds the frame data as a binary blob.
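
Concretely, a frame in that scheme is just something like this (the
key names below are illustrative rather than our exact schema):

rgb_data = b"\x00" * (720 * 576 * 3)   # stand-in for one decoded frame

frame = {
    "size": (720, 576),
    "pixformat": "RGB_interleaved",
    "rgb": rgb_data,                   # the binary blob
}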

http://www.kamaelia.org/Components/pydoc/Kamaelia.Codec.YUV4MPEG gives some
idea of the structure and usage. The example given there is this:

Pipeline( RateControlledFileReader("video.dirac",readmode="bytes", ...),
DiracDecoder(),
FrameToYUV4MPEG(),
SimpleFileWriter("output.yuv4mpeg")
).run()

Now all of those components are generator components.

That's useful since:
a) we can structure the code to show what it does more clearly, and it
still runs efficiently inside a single process
b) We can change this over to using multiple processes trivially:

ProcessPipeline(
RateControlledFileReader("video.dirac",readmode="bytes", ...),
DiracDecoder(),
FrameToYUV4MPEG(),
SimpleFileWriter("output.yuv4mpeg")
).run()

This version uses multiple processes (under the hood it uses Paul Boddie's
pprocess library, since this support predates the multiprocessing module
support in python).

The big issue with *this* version, however, is that due to pprocess (and
friends) pickling data to be sent across OS pipes, the data throughput
would be lousy. Specifically in this example, if we could change it
such that the high level API was this:

ProcessPipeline(
RateControlledFileReader("video.dirac",readmode="bytes", ...),
DiracDecoder(),
FrameToYUV4MPEG(),
SimpleFileWriter("output.yuv4mpeg"),
use_shared_memory_IPC = True,
).run()

That would be pretty useful, for some hopefully obvious reasons. I suppose
ideally we'd just use shared_memory_IPC for everything and just go back to
this:

ProcessPipeline(
RateControlledFileReader("video.dirac",readmode="bytes", ...),
DiracDecoder(),
FrameToYUV4MPEG(),
SimpleFileWriter("output.yuv4mpeg")
).run()

But essentially for us, this is an optimisation problem, not a "how do I
even begin to use this" problem. Since it is an optimisation problem, it
also strikes me as reasonable to special-case and specialise such links
until you get an approach that's workable for general purpose data.
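
As a very rough sketch of what such a shared-memory link might do
under the hood - using the stdlib multiprocessing module directly
rather than Kamaelia, and with a made-up frame size - the hand-off
avoids pickling entirely:

import ctypes
import multiprocessing

FRAME_SIZE = 720 * 576 * 3            # one raw RGB frame, size made up

def producer(buf, ready):
    # Write the decoded frame straight into shared memory -- no
    # pickling, no copy through an OS pipe.
    buf[0:FRAME_SIZE] = b"\x00" * FRAME_SIZE   # stand-in for a real decode
    ready.set()

if __name__ == "__main__":
    buf = multiprocessing.Array(ctypes.c_char, FRAME_SIZE, lock=False)
    ready = multiprocessing.Event()
    p = multiprocessing.Process(target=producer, args=(buf, ready))
    p.start()
    ready.wait()                      # the frame is visible, unserialised
    first_pixels = buf[0:12]          # consumer reads shared memory directly
    p.join()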

In theory, poshmodule.sourceforge.net, with a bit of TLC, would be a good
candidate, or at least a good starting point, for that optimisation work
(since it does work in Linux, contrary to a reply in the thread - I've not
tested it under windows :).

If someone's interested in building that, then someone redoing our MiniAxon
tutorial using processes & shared memory IPC rather than generators would
be a relatively gentle/structured approach to dealing with this:

* http://www.kamaelia.org/MiniAxon/

The reason I suggest that is because any time we think about fiddling and
creating a new optimisation approach or concurrency approach, we tend to
build a MiniAxon prototype to flesh out the various issues involved.


Michael
--
http://www.kamaelia.org/Home

Michael Sparks

unread,
Oct 28, 2008, 6:30:54 AM10/28/08
to
Philip Semanchuk wrote:
> On Oct 25, 2008, at 7:53 AM, Michael Sparks wrote:
>> Glenn Linderman wrote:
>>> In the module multiprocessing environment could you not use shared
>>> memory, then, for the large shared data items?
>>
>> If the poshmodule had a bit of TLC, it would be extremely useful for
>> this,... http://poshmodule.sourceforge.net/

>
> Last time I checked that was Windows-only. Has that changed?

I've only tested it under Linux where it worked, but does clearly need a bit
of work :)

> The only IPC modules for Unix that I'm aware of are one which I
> adopted (for System V semaphores & shared memory) and one which I
> wrote (for POSIX semaphores & shared memory).
>
> http://NikitaTheSpider.com/python/shm/
> http://semanchuk.com/philip/posix_ipc/

I'll take a look at those - poshmodule does need a bit of TLC and doesn't
appear to be maintained.

> If anyone wants to wrap POSH cleverness around them, go for it! If
> not, maybe I'll make the time someday.

I personally don't have the time to do this, but I'd be very interested in
hearing about someone building an up-to-date version. (Indeed, something like
this would be extremely useful for everyone to have in the standard library
now that multiprocessing is there.)
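
For what it's worth, basic usage of Philip's posix_ipc looks
pleasantly small - this is just a sketch from its documentation, and
the API details may have shifted between versions:

import mmap
import posix_ipc

shm = posix_ipc.SharedMemory("/demo", posix_ipc.O_CREX, size=4096)
buf = mmap.mmap(shm.fd, shm.size)    # other processes map "/demo" too
shm.close_fd()                       # the mapping keeps the segment alive

buf[0:5] = b"hello"                  # visible to every process mapping it

buf.close()
posix_ipc.unlink_shared_memory("/demo")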


Michael.
--
http://www.kamaelia.org/Home

Andy O'Meara

unread,
Oct 28, 2008, 10:23:44 AM10/28/08
to
On Oct 26, 10:11 pm, "James Mills" <prolo...@shortcircuit.net.au>
wrote:

Unfortunately, it's the penalty of serialization and deserialization.
When you're talking about stuff like memory-resident images and video
(complete with their intricate and complex codecs), the only option is
to pass around a couple of pointers rather than take the hit of
serialization (which is huge for video, for example). I've gone into
more detail in some other posts but I could have missed something.
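
To put a rough number on it, the round trip for even one modest frame
is easy to measure (the frame size here is made up):

import pickle
import time

frame = b"\x00" * (1920 * 1080 * 4)  # one RGBA HD frame, ~8 MB

t = time.time()
blob = pickle.dumps(frame, pickle.HIGHEST_PROTOCOL)
frame2 = pickle.loads(blob)
print("round trip: %.1f ms" % ((time.time() - t) * 1000.0))

Multiply that by 30+ frames a second, per stream, and serialization
eats the whole budget.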


Andy

Andy O'Meara

unread,
Oct 28, 2008, 11:00:47 AM10/28/08
to and...@gmail.com
On Oct 27, 4:05 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> Andy O'Meara wrote:


>
> > Well, when you're talking about large, intricate data structures
> > (which include opaque OS object refs that use process-associated
> > allocators), even a shared memory region between the child process and
> > the parent can't do the job.  Otherwise, please describe in detail how
> > I'd get an opaque OS object (e.g. an OS ref that refers to memory-
> > resident video) from the child process back to the parent process.
>
> WHAT PARENT PROCESS? "In the same address space", to me, means
> "a single process only, not multiple processes, and no parent process
> anywhere". If you have just multiple threads, the notion of passing
> data from a "child process" back to the "parent process" is
> meaningless.

I know... I was just responding to you and the others here who keep
beating the "fork" drum. I'm just trying to make it clear that a shared
address space is the only way to go. Ok, good, so we're in agreement
that threads are the only way to deal with the "intricate and complex"
data set issue in a performance-centric application.

>
> > Again, the big picture that I'm trying to plant here is that there
> > really is a serious need for truly independent interpreters/contexts
> > in a shared address space.
>
> I understand that this is your mission in this thread. However, why
> is that your problem? Why can't you just use the existing (limited)
> multiple-interpreters machinery, and solve your problems with that?

Because then we're back to the GIL not permitting efficient core use
for CPU-bound scripts running on other threads (when they otherwise
could). Just so we're on the same page, "when they otherwise could" is
relevant here because that's the important given: that each interpreter
("context") truly never has any contact with the others.

An example would be python scripts that generate video programmatically
using an initial set of params and use an in-house C module to
construct frames (which in turn makes and modifies python C objects
that wrap intricate codec-related data structures). Suppose you wanted
to render 3 of these at the same time, one on each thread (3
threads). With the GIL in place, these threads can't run anywhere
close to their potential. Your response thus far is that the C module
should release the GIL before it commences its heavy lifting. Well,
the problem arises if, during its heavy lifting, it needs to call back
into its interpreter. It turns out that this isn't an exotic case at
all: there's a *ton* of utility gained by making calls back into the
interpreter. The best example is that since code is more easily
maintained in python than in C, a lot of the module "utility" code is
likely to be in python. Unsurprisingly, this is the situation myself
and many others are in: we want to subsequently use the interpreter
within the C module (so, as I understand it, the proposal to have the
C module release the GIL unfortunately doesn't work as a general
solution).
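
A toy way to see the effect I'm describing, with a pure-Python
stand-in for the per-frame work (exact timings will vary by machine):

import threading
import time

def render(n=2000000):               # stand-in for one frame's computation
    acc = 0
    for i in range(n):
        acc += i * i

t = time.time()
threads = [threading.Thread(target=render) for _ in range(3)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print("3 threads:  %.2fs" % (time.time() - t))

t = time.time()
for _ in range(3):
    render()
print("sequential: %.2fs" % (time.time() - t))

On a multi-core box the threaded run comes out no faster than the
sequential one, because the GIL serializes the bytecode execution.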

>
> > For most
> > industry-caliber packages, the expectation and convention (unless
> > documented otherwise) is that the app can make as many contexts as its
> > wants in whatever threads it wants because the convention is that the
> > app must (a) never use one context's objects in another context,
> > and (b) never use a context at the same time from more than one
> > thread.  That's all I'm really trying to look at here.
>
> And that's indeed the case for Python, too. The app can make as many
> subinterpreters as it wants to, and it must not pass objects from one
> subinterpreter to another one, nor should it use a single interpreter
> from more than one thread (although that is actually supported by
> Python - but it surely won't hurt if you restrict yourself to a single
> thread per interpreter).
>

I'm not following you there... I thought we were all in agreement that
the existing C modules are FAR from being reentrant, regularly making
use of static/global objects. The point I had made before is that
other industry-caliber packages specifically don't have restrictions
in *any* way.

I appreciate your argument that a PyC concept is a lot of work needing
some careful design, but let's not kill the discussion just
because of that. The fact remains that the video encoding scenario
described above is a pretty reasonable situation, and as more people
are commenting in this thread, there's an increasing need to offer
apps more flexibility when it comes to multi-threaded use.


Andy


Greg Ewing

unread,
Oct 25, 2008, 7:29:38 PM10/25/08
to pytho...@python.org
Glenn Linderman wrote:

> So your 50% number is just a scare tactic, it would seem, based on wild
> guesses. Was there really any benefit to the comment?

All I was really trying to say is that it would be a
mistake to assume that the overhead will be negligible,
as that would be just as much a wild guess as 50%.

--
Greg

Andy O'Meara

unread,
Oct 28, 2008, 11:30:50 AM10/28/08
to and...@gmail.com
On Oct 25, 9:46 am, "M.-A. Lemburg" <m...@egenix.com> wrote:
> These discussion pop up every year or so and I think that most of them
> are not really all that necessary, since the GIL isn't all that bad.
>

Thing is, if the topic keeps coming up, then that may be an indicator
that change is truly needed. Someone much wiser than me once shared
that a measure of the usefulness and quality of a package (or API) is
how easily it can be added to an application--of any flavor--without
the application needing to change.

So in the rising world of idle cores and worker threads, I do see an
increasing concern over the GIL. Although I recognize that the debate
is lengthy, heated, and has strong arguments on both sides, my reading
on the issue makes me feel like there's a bias for the pro-GIL side
because of the volume of design and coding work associated with
considering various alternatives (such as Glenn's "Py*" concepts).
And I DO respect and appreciate where the pro-GIL people come from:
who the heck wants to do all that work and recoding so that a tiny
percent of developers can benefit? And my best response is that as
unfortunate as it is, python needs to be more multi-threaded app-
friendly if we hope to attract the next generation of app developers
that want to just drop python into their app (and not have to change
their app around python). For example, Lua has that property, as
evidenced by its rapidly growing presence in commercial software
(Blizzard uses it heavily, for example).

>
> Furthermore, there are lots of ways to tune the CPython VM to make
> it more or less responsive to thread switches via the various sys.set*()
> functions in the sys module.
>
> Most computing or I/O intense C extensions, built-in modules and object
> implementations already release the GIL for you, so it usually doesn't
> get in the way all that often.
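
For readers following along, I believe the tuning being referred to
is the Python 2.x sys hooks, e.g.:

import sys

# Consider a thread switch every 1000 bytecode instructions instead of
# the default 100 -- less switching overhead, higher hand-off latency.
# sys.setcheckinterval() is the main knob in Python 2.x.
sys.setcheckinterval(1000)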


The main issue I take there is that it's often highly useful for C
modules to make subsequent calls back into the interpreter. I suppose
the response to that is to reacquire the GIL before reentry, but that just
seems to be more code and responsibility in scenarios where it's not
necessary. Although that code and protocol may come easily to veteran
CPython developers, let's not forget that an important goal is to
attract new developers and companies to the scene, where they get
their thread-independent code up and running using python without any
unexpected reengineering. Again, why are companies choosing Lua over
Python when it comes to an easy and flexible drop-in interpreter? And
please take my points here to be exploratory, and not hostile or
accusatory, in nature.


Andy


Andy O'Meara

unread,
Oct 28, 2008, 12:14:34 PM10/28/08
to and...@gmail.com, v+py...@g.nevcal.com
On Oct 27, 10:55 pm, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:


> And I think we still are miscommunicating!  Or maybe communicating anyway!
>
> So when you said "object", I actually don't know whether you meant
> Python object or something else.  I assumed Python object, which may not
> have been correct... but read on, I think the stuff below clears it up.
>
>
> Then when you mentioned thousands of objects, I imagined thousands of
> Python objects, and somehow transforming the blob into same... and back
> again.  

My apologies to you and others here on my use of "objects" -- I use
the term generically and mean it to *not* refer to python objects (for
all the reasons discussed here). Python only makes up a small part of
our app, hence my habit of using "objects" to refer to other APIs'
allocated and opaque objects (including our own and OS APIs). For all
the reasons we've discussed, in our world, python objects don't travel
around outside of our python C modules -- when python objects need to
be passed to other parts of the app, they're converted into their non-
python (portable) equivalents (ints, floats, buffers, etc--but most of
the time, the objects are PyCObjects, so they can enter and leave a
python context with negligible overhead). I venture to say this is
pretty standard when any industry app uses a package (such as python),
for various reasons:
   - Portability/Future (e.g. if we do decide to drop Python and go
with Lua, the changes are limited to only one region of code).
- Sanity (having any API's objects show up in places "far away"
goes against easy-to-follow code).
   - MT flexibility (because we never use static/global
storage, we have all kinds of options when it comes to
multithreading). For example, recall that by throwing python in
multiple dynamic libs, we were able to achieve the GIL-less
interpreter independence that we want (albeit ghetto and a pain).

Andy

Rhamphoryncus

unread,
Oct 28, 2008, 4:03:56 PM10/28/08
to

Okay, here's the bottom line:
* This is not about the GIL. This is about *completely* isolated
interpreters; most of the time when we want to remove the GIL we want
a single interpreter with lots of shared data.
* Your use case, although not common, is not extraordinarily rare
either. It'd be nice to support.
* If CPython had supported it all along we would continue to maintain
it.
* However, since it's not supported today, it's not worth the time
invested, API incompatibility, and general breakage it would imply.
* Although it's far more work than just solving your problem, if I
were to remove the GIL I'd go all the way and allow shared objects.

So there are really only two options here:
* get a short-term bodge that works, like hacking the 3rd party
library to use your shared-memory allocator. Should be far less work
than hacking all of CPython.
* invest yourself in solving the *entire* problem (GIL removal with
shared python objects).

"Martin v. Löwis"

unread,
Oct 28, 2008, 6:11:28 PM10/28/08
to
> Because then we're back to the GIL not permitting efficient core use
> for CPU-bound scripts running on other threads (when they otherwise
> could).

Why do you think so? For C code that is carefully written, the GIL
works *very well* for CPU-bound scripts running on other threads.
(please do get back to Jesse's original remark in case you have lost
the thread :-)

> An example would be python scripts that generate video programmatically
> using an initial set of params and use an in-house C module to
> construct frames (which in turn makes and modifies python C objects
> that wrap intricate codec-related data structures). Suppose you wanted
> to render 3 of these at the same time, one on each thread (3
> threads). With the GIL in place, these threads can't run anywhere
> close to their potential. Your response thus far is that the C module
> should release the GIL before it commences its heavy lifting. Well,
> the problem arises if, during its heavy lifting, it needs to call back
> into its interpreter.

So it should reacquire the GIL then. Assuming the other threads
all do their heavy lifting, it should immediately get the GIL,
fetch some data, release the GIL, and continue to do heavy lifting.
If it's truly CPU-bound, I hope it doesn't spend most of its time
in Python API, but in true computation.

> It turns out that this isn't an exotic case at
> all: there's a *ton* of utility gained by making calls back into the
> interpreter. The best example is that since code is more easily
> maintained in python than in C, a lot of the module "utility" code is
> likely to be in python.

You should really reconsider writing performance-critical code in
Python. Regardless of the issue under discussion, a lot of performance
can be gained by using "flattened" data structures, fewer pointers,
less reference counting, fewer objects, and so on - in the inner loops
of the computation. You didn't reveal what *specific* computation you
perform, so it's difficult to give specific advice.

> Unsurprisingly, this is the situation myself
> and many others are in: we want to subsequently use the
> interpreter within the C module (so, as I understand it, the proposal
> to have the C module release the GIL unfortunately doesn't work as a
> general solution).

Not if you do the actual computation in Python, no. However, this
subthread started with Jesse's remark that you *can* release the GIL
in C code.

Again, if you do heavy lifting in Python, you should consider rewriting
the performance-critical parts in C. You may find that the need for
multiple CPUs even goes away.

> I appreciate your argument that a PyC concept is a lot of work needing
> some careful design, but let's not kill the discussion just
> because of that.

Any discussion in this newsgroup is futile, except when it either
a) leads to a solution that is already possible, and the OP didn't
envision, or
b) is followed up by code contributions from one of the participants.

If neither is likely to result, killing the discussion is the most
productive thing we can do.

Regards,
Martin

Patrick Stinson

unread,
Oct 29, 2008, 3:03:20 AM10/29/08
to Andy O'Meara, pytho...@python.org
Close, I work currently for EastWest :)

Well, I actually like almost everything else about CPython,
considering my audio work, the only major problem I've had is with the
GIL. I like the purist community, and I like the code, since
integrating it on both platforms has been relatively clean, and
required *zero* support. Frankly, with the exception of some windows
deployment issues relating to static linking of libpython and some
extensions, it's been a dream lib to use.

Further, I really appreciate the discussions that happen in these
lists, and I think that this particular problem is a wonderful example
of a situation that requires tons of miscellaneous opinions and input
from all angles - especially at this stage. I think that this problem
has lots of standing discussion and lots of potential solutions and/or
workarounds, and it would be cool for someone to aggregate and
paraphrase that stuff into a page to assist those thinking about doing
some patching. That's probably something that the coder would do
themselves though.

On Fri, Oct 24, 2008 at 10:25 AM, Andy O'Meara <and...@gmail.com> wrote:
>>
>> So we are sitting on this music platform with unimaginable possibilities
>> in the music world (of which python does not play a role), but those
>> little CPU spikes caused by the GIL at low latencies won't let us have
>> it. AFAIK, there is no music scripting language out there that would
>> come close, and yet we are sooooo close! This is a big deal.
>
>
> Perfectly said, Patrick. It pains me to know how widespread python
> *could* be in commercial software!
>
> Also, good points about people being longwinded and that "code talks".
>
> Sadly, the time alone I've spent in the last couple days on this
> thread is scary, but I'm committed now, I guess. :^( I look at the
> length of the posts of some of these guys and I have to wonder what
> the heck they do for a living!
>
> As I mentioned, however, I'm close to just blowing the whistle on this
> crap and start making CPythonES (as I call it, in the spirit of the
> "ES" in "OpenGLES"). Like you, we just want the core features of
> python in a clean, tidy, *reliable* fashion--something that we can
> ship and not lose sleep (or support hours) over. Basically, I imagine
> developing an interpreter designed for dev houses like yours and mine
> (you're Ableton or Propellerhead, right?)--a python version of lua, if
> you will. The nice thing about it is that it could start fresh and
> small, but I have a feeling it would really catch on because every
> commercial dev house would choose it over CPython any day of the week
> and it would be completely disjoint from CPython.
>
> Andy
>

Patrick Stinson

unread,
Oct 29, 2008, 3:27:01 AM10/29/08
to Glenn Linderman, Andy O'Meara, pytho...@python.org
Wow, man. Excellent post. You want a job?

The gui could use PyA threads for sure, and the audio thread could use
PyC threads. It would not be a problem to limit the audio thread to
only reentrant libraries.

This kind of thought is what I had in mind about finding a compromise,
especially in the way that PyD would not break old code assuming that
it could eventually be ported.

On Fri, Oct 24, 2008 at 11:02 AM, Glenn Linderman <v+py...@g.nevcal.com> wrote:
> On approximately 10/24/2008 8:42 AM, came the following characters from the
> keyboard of Andy O'Meara:
>>
>> Glenn, great post and points!
>>
>
> Thanks. I need to admit here that while I've got a fair bit of professional
> programming experience, I'm quite new to Python -- I've not learned its
> internals, nor even the full extent of its rich library. So I have some
> questions that are partly about the goals of the applications being
> discussed, partly about how Python is constructed, and partly about how the
> library is constructed. I'm hoping to get a better understanding of all of
> these; perhaps once a better understanding is achieved, limitations will be
> understood, and maybe solutions be achievable.
>
> Let me define some speculative Python interpreters; I think the first is
> today's Python:
>
> PyA: Has a GIL. PyA threads can run within a process; but are effectively
> serialized to the places where the GIL is obtained/released. Needs the GIL
> because that solves lots of problems with non-reentrant code (an example of
> non-reentrant code, is code that uses global (C global, or C static)
> variables – note that I'm not talking about Python vars declared global...
> they are only module global). In this model, non-reentrant code could
> include pieces of the interpreter, and/or extension modules.
>
> PyB: No GIL. PyB threads acquire/release a lock around each reference to a
> global variable (like "with" feature). Requires massive recoding of all code
> that contains global variables. Reduces performance significantly by the
> increased cost of obtaining and releasing locks.
>
> PyC: No locks. Instead, recoding is done to eliminate global variables
> (interpreter requires a state structure to be passed in). Extension modules
> that use globals are prohibited... this eliminates large portions of the
> library, or requires massive recoding. PyC threads do not share data between
> threads except by explicit interfaces.
>
> PyD: (A hybrid of PyA & PyC). The interpreter is recoded to eliminate global
> variables, and each interpreter instance is provided a state structure.
> There is still a GIL, however, because globals are potentially still used by
> some modules. Code is added to detect use of global variables by a module,
> or some contract is written whereby a module can be declared to be reentrant
> and global-free. PyA threads will obtain the GIL as they would today. PyC
> threads would be available to be created. PyC instances refuse to call
> non-reentrant modules, but also need not obtain the GIL... PyC threads would
> have limited module support initially, but over time, most modules can be
> migrated to be reentrant and global-free, so they can be used by PyC
> instances. Most 3rd-party libraries today are starting to care about
> reentrancy anyway, because of the popularity of threads.
>
> The assumptions here are that:
>
> Data-1) A Python interpreter doesn't provide any mechanism to share normal
> data among threads, they are independent... but message passing works.
> Data-2) A Python interpreter could be extended to provide mechanisms to
> share special data, and the data would come with an implicit lock.
> Data-3) A Python interpreter could be extended to provide unlocked access to
> special data, requiring the application to handle the synchronization
> between threads. Data of type 2 could be used to control access to data of
> type 3. This type of data could be large, or frequently referenced data, but
> only by a single thread at a time, with major handoffs to a different thread
> synchronized by the application in whatever way it chooses.
>
> Context-1) A Python interpreter would know about threads it spawns, and
> could pass in a block of context (in addition to the state structure) as a
> parameter to a new thread. That block of context would belong to the thread
> as long as it exists, and return to the spawner when the thread completes.
> An embedded interpreter would also be given a block of context (in addition
> to the state structure). This would allow application context to be created
> and passed around. Pointers to shared memory structures, might be typical
> context in the embedded case.
>
> Context-2) Embedded Python interpreters could be spawned either as PyA
> threads or PyC threads. PyC threads would be limited to modules that are
> reentrant.
>
>
> I think that PyB and PyC are the visions that people see, which argue
> against implementing independent interpreters. PyB isn't truly independent,
> because data are shared, recoding is required, and performance suffers. Ick.
> PyC requires "recoding the whole library" potentially, if it is the only
> solution. PyD allows access to the whole standard library of modules,
> exactly like today, but the existing limitations still obtain for PyA
> threads using that model – very limited concurrency. But PyC threads would
> execute in their own little environments, and not need locking. Pure Python
> code would be immediately happy there. Properly coded (reentrant,
> global-free) extensions would be happy there. Lots of work could be done
> there, to use up multi-core/multi-CPU horsepower (shared-memory
> architecture).
>
> Questions for people that know the Python internals: Is PyD possible? How
> hard? Is a PyC thread an effective way of implementing a Python sandbox? If
> it is, and if it would attract the attention of Brett Cannon, who at least
> once wanted to do a thesis on Python sandboxes, he could be a helpful
> supporter.
>
> Questions for Andy: is the type of work you want to do in independent
> threads mostly pure Python? Or with libraries that you can control to some
> extent? Are those libraries reentrant? Could they be made reentrant? How
> much of the Python standard library would need to be available in reentrant
> mode to provide useful functionality for those threads? I think you want PyC.
>
> Questions for Patrick: So if you had a Python GUI using the whole standard
> library -- would it likely run fine in PyA threads, and still be able to
> use PyC threads for the audio scripting language? Would it be a problem for
> those threads to have limited library support (only reentrant modules)?
>
>> That's the rub... In our case, we're doing image and video
>> manipulation--stuff not good to be messaging from address space to
>> address space. The same argument holds for numerical processing with
>> large data sets. The workers handing back huge data sets via
>> messaging isn't very attractive.


>>
>
> In the module multiprocessing environment could you not use shared memory,
> then, for the large shared data items?
>

>> Our software runs in real time (so performance is paramount),
>> interacts with other static libraries, depends on worker threads to
>> perform real-time image manipulation, and leverages Windows and Mac OS
>> API concepts and features. Python's performance hits have generally
>> been a huge challenge with our animators because they often have to go
>> back and massage their python code to improve execution performance.
>> So, in short, there are many reasons why we use python as a part
>> rather than a whole.
>>
>> The other area of pain that I mentioned in one of my other posts is
>> that what we ship, above all, can't be flaky. The lack of module
>> cleanup (intended to be addressed by PEP 3121), using a duplicate copy
>> of the python dynamic lib, and namespace black magic to achieve
>> independent interpreters are all examples that have made using python
>> for us much more challenging and time-consuming then we ever
>> anticipated.
>>
>> Again, if it turns out nothing can be done about our needs (which
>> appears to be more and more like the case), I think it's important for
>> everyone here to consider the points raised here in the last week.
>> Moreover, realize that the python dev community really stands to gain
>> from making python usable as a tool (rather than a monolith). This
>> fact alone has caused lua to *rapidly* rise in popularity with
>> software companies looking to embed a powerful, lightweight
>> interpreter in their software.
>>
>> As a python language fan and enthusiast, don't let lua win!  (I say
>> this endearingly of course--I have the utmost respect for both
>> communities and I only want to see CPython be an attractive pick when
>> a company is looking to embed a language that won't intrude upon their
>> app's design).
>>
>
> Thanks for the further explanations.
>
> --
> Glenn -- http://nevcal.com/
> ===========================
> A protocol is complete when there is nothing left to remove.
> -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
