Exploiting Dual Cores with Py_NewInterpreter's separate GIL?


robert

unread,
Nov 2, 2006, 1:32:54 PM11/2/06
to
I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.

Interprocess communication is tedious and out of the question, so I thought about simply using additional Python interpreter instances (Py_NewInterpreter), each with its own GIL, in the same process.
I expect to be able to push Python object trees around directly between the 2 (or more) interpreters by doing some careful locking.

Any hope of getting through with this? If possible, what are the main dangers? Is there an example / guideline around for that task - using ctypes or so?

Or is there even a ready-made Python module which makes it easy to set up and deal with extra interpreter instances?
If not, would it be an idea to create such a thing in the Python std libs to make Python multi-processor-ready? I guess Python will always have a GIL - otherwise it would lose lots of comfort in threaded programming.
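
(For reference, a minimal embedding sketch of what Py_NewInterpreter already gives you - Python 2.x C API assumed: two interpreter instances in one process, each with its own modules and __main__, but still sharing the one global interpreter lock:)

    #include <Python.h>

    int main(void)
    {
        PyThreadState *maints, *second;

        Py_Initialize();
        maints = PyThreadState_Get();

        second = Py_NewInterpreter();            /* new interpreter, becomes current */
        PyRun_SimpleString("import sys; print 'second:', id(sys.modules)");

        PyThreadState_Swap(maints);              /* switch back to the first one */
        PyRun_SimpleString("import sys; print 'first :', id(sys.modules)");

        PyThreadState_Swap(second);
        Py_EndInterpreter(second);               /* must be the current thread state */

        PyThreadState_Swap(maints);
        Py_Finalize();
        return 0;
    }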


robert

Jean-Paul Calderone

unread,
Nov 2, 2006, 2:15:58 PM11/2/06
to pytho...@python.org
On Thu, 02 Nov 2006 19:32:54 +0100, robert <no-...@no-spam-no-spam.com> wrote:
>I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.

NumPy releases the GIL in quite a few places. I haven't used scipy much,
but I would expect it does too. Are you certain you need this complexity?
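
(For context, a sketch of the mechanism referred to here - a C extension function dropping the GIL around a long pure-C computation so other Python threads can run on other cores; heavy_c_loop and the "demo" module are made-up stand-ins, not NumPy code:)

    #include <Python.h>

    static double heavy_c_loop(void)             /* stand-in for the numeric kernel */
    {
        double s = 0.0;
        long i;
        for (i = 0; i < 100000000L; i++)
            s += (double)i * 1e-9;
        return s;
    }

    static PyObject *
    crunch(PyObject *self, PyObject *args)
    {
        double result;
        Py_BEGIN_ALLOW_THREADS          /* release the GIL ...                 */
        result = heavy_c_loop();        /* ... pure C, no PyObject access here */
        Py_END_ALLOW_THREADS            /* ... and re-acquire it               */
        return Py_BuildValue("d", result);
    }

    static PyMethodDef methods[] = {
        {"crunch", crunch, METH_NOARGS, "long C computation, GIL released"},
        {NULL, NULL, 0, NULL}
    };

    PyMODINIT_FUNC initdemo(void)       /* Python 2.x module init */
    {
        Py_InitModule("demo", methods);
    }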

Jean-Paul

Jean-Paul Calderone

unread,
Nov 2, 2006, 2:41:25 PM11/2/06
to pytho...@python.org
On Thu, 2 Nov 2006 14:15:58 -0500, Jean-Paul Calderone <exa...@divmod.com> wrote:
>On Thu, 02 Nov 2006 19:32:54 +0100, robert <no-...@no-spam-no-spam.com> wrote:
>>I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.
>
>NumPy releases the GIL in quite a few places.

Eh, no it doesn't. Sorry for the misinformation. I'll double check
my memory first next time.

>Jean-Paul

Daniel Dittmar

unread,
Nov 3, 2006, 4:27:05 AM11/3/06
to
robert wrote:
> I'd like to use multiple CPU cores for selected time consuming Python
> computations (incl. numpy/scipy) in a frictionless manner.
>
> Interprocess communication is tedious and out of question, so I thought
> about simply using a more Python interpreter instances
> (Py_NewInterpreter) with extra GIL in the same process.

If I understand Python/ceval.c, the GIL is really global, not specific
to an interpreter instance:
static PyThread_type_lock interpreter_lock = 0; /* This is the GIL */

Daniel


robert

unread,
Nov 3, 2006, 5:30:17 AM11/3/06
to

Only in linear algebra and a few rare locations - but that doesn't make up my typical CPU load.
Parallel execution of arbitrary, varying Python code (and also Pyrex) without constantly thinking about the C level is definitely my real aim. (Otherwise I'd code most algorithms directly in C with fewer worries.)

robert

robert

unread,
Nov 3, 2006, 5:48:54 AM11/3/06
to
Feature Request: Py_NewInterpreter to create separate GIL (branch)

That's the showstopper as of now.
There are only a handful of functions in ceval.c that use that very global lock directly; the rest goes through those functions and the thread states.

Would it be a possibility in a future Python to have the lock separate for each interpreter instance?
Thus: have the *interpreter_lock separately in each PyThreadState instance, so that only threads of the same interpreter share the same GIL?
Separation between interpreters seems to be enough. The interpreter runs mainly on the stack. Possibly only very few global C-level resources would require individual extra locks.

Sooner or later Python will have to answer the multi-processor question.
A per-interpreter GIL and a nice module for tunneling Python objects directly between interpreters inside one process might be the answer at the right border-line? The existing extension code base would remain compatible, as far as there is already decent locking on module globals, which is the usual case.
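
(A purely hypothetical sketch of that direction - not current CPython code, where interpreter_lock is one static global in ceval.c; interp_state, acquire_gil and release_gil are made-up names:)

    #include <Python.h>
    #include "pythread.h"

    struct interp_state {                 /* stand-in for PyInterpreterState   */
        PyThread_type_lock gil;           /* one GIL per interpreter instance  */
        /* ... the rest of the interpreter state ... */
    };

    static void acquire_gil(struct interp_state *interp)
    {
        PyThread_acquire_lock(interp->gil, WAIT_LOCK);
    }

    static void release_gil(struct interp_state *interp)
    {
        PyThread_release_lock(interp->gil);
    }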

Robert

Filip Wasilewski

unread,
Nov 3, 2006, 6:38:43 AM11/3/06
to
robert wrote:
> I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.
>
> Interprocess communication is tedious and out of question, so I thought about simply using a more Python interpreter instances (Py_NewInterpreter) with extra GIL in the same process.
> I expect to be able to directly push around Python Object-Trees between the 2 (or more) interpreters by doing some careful locking.

I don't want to discourage you, but what about reference counting / memory
management for shared objects? Doesn't seem like fun to me.


Take a look at IPython1 and its parallel computing capabilities [1,
2]. It is designed to run on multiple systems or on a single system with
multiple CPUs / multiple cores. Its worker interpreters (engines) are loosely
coupled and can utilize several MPI modules, so there is no low-level
messing with the GIL. Although it is work in progress, it already looks
quite awesome.

[1] http://ipython.scipy.org/moin/Parallel_Computing
[2] http://ipython.scipy.org/moin/Parallel_Computing/Tutorial

fw

robert

unread,
Nov 3, 2006, 8:50:24 AM11/3/06
to
Filip Wasilewski wrote:
> robert wrote:
>> I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.
>>
>> Interprocess communication is tedious and out of question, so I thought about simply using a more Python interpreter instances (Py_NewInterpreter) with extra GIL in the same process.
>> I expect to be able to directly push around Python Object-Trees between the 2 (or more) interpreters by doing some careful locking.
>
> I don't want to discourage you but what about reference counting/memory
> management for shared objects? Doesn't seem fun for me.

In combination with some simple locking (necessary anyway) I don't see a problem with ref-counting.

As long as at least one interpreter branch has a pointer to the (root) object in question, the ref-count is > 0.


----
Side question: do concurrent INC/DEC machine instructions execute atomically on multi-core CPUs, as they do for threads on a single core?

Example:

obj=Obj()

In a read-only phase (e.g. after computations), without locking, 2 interpreters would for example both access obj (changing the refcount back and forth, but no data).
The CPU will then execute 2 [INC/DEC @refcount] OP-codes on different cores concurrently. Is it guaranteed that the counts sum up correctly?
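
(A small C demonstration of the question - x86, gcc and pthreads assumed: two threads bump one shared counter 10 million times each, once with a plain INC and once with a LOCK-prefixed INC:)

    #include <pthread.h>
    #include <stdio.h>

    static volatile int counter = 0;

    static void *plain_inc(void *arg)
    {
        int i;
        for (i = 0; i < 10000000; i++)
            counter++;                              /* non-atomic read-modify-write */
        return NULL;
    }

    static void *locked_inc(void *arg)
    {
        int i;
        for (i = 0; i < 10000000; i++)
            __asm__ __volatile__("lock incl %0" : "+m"(counter));   /* atomic */
        return NULL;
    }

    static int run(void *(*fn)(void *))
    {
        pthread_t a, b;
        counter = 0;
        pthread_create(&a, NULL, fn, NULL);
        pthread_create(&b, NULL, fn, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return counter;
    }

    int main(void)
    {
        printf("plain INC: %d\n", run(plain_inc));    /* usually < 20000000 on multi-core */
        printf("LOCK INC : %d\n", run(locked_inc));   /* exactly 20000000                 */
        return 0;
    }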


> Take a look at IPython1 and it's parallel computing capabilities [1,
> 2]. It is designed to run on multiple systems or a single system with
> multiple CPU/multi-core. It's worker interpreters (engines) are loosely
> coupled and can utilize several MPI modules, so there is no low-level
> messing with GIL. Although it is work in progress it already looks
> quite awesome.
>
> [1] http://ipython.scipy.org/moin/Parallel_Computing
> [2] http://ipython.scipy.org/moin/Parallel_Computing/Tutorial

There are some MPI methods around. (This IPython approach seems to be only at the level of an artefact of the interactive terminal connections.)

Yet with its burden of expensive data synchronization that's far away from my requirements. That's for massively parallel computing and for scientific areas.

I do selected things with interprocess shared-memory techniques already. That's moderately efficient.

Multiple interpreters inside one process seem to be most promising for seamless multi-core programming, as all Python objects share the same malloc space - that's the key requirement in order to get the main effect.
As soon as we have object pickling in between, it's well away from this very discussion.


robert

Daniel Dittmar

unread,
Nov 3, 2006, 10:16:47 AM11/3/06
to
robert wrote:
> ---- Question Besides: do concurrent INC/DEC machine OP-commands
> execute atomically on Multi-Cores as they do in Single-Core threads?

Not at the level at which Python reference counting is implemented. CPUs
often have special assembler ops for these operations. I think it's
even more complicated, as some machines require additional read and
write barriers around these operations.

>> Take a look at IPython1 and it's parallel computing capabilities [1,

[...]


> I do selected things with interprocess shared memory techniques already.
> Thats medium efficent.
> Multiple interpreters inside one process seem to be most promising for
> seemless multi-core programming.

Both IPython and Jython use multiple CPUs without the GIL, so you don't
need to add the complication of multiple interpreters.

> As all Python objects share the same
> malloc space - thats the key requirement in order to get the main
> effect. As soon as with have object pickling in between its well away
> from this very discussion.

There's the complication that CPython has garbage collection in addition
to reference counting. I'm not sure what happens when the garbage
collector looks at objects rooted in another interpreter.

Daniel

robert

unread,
Nov 3, 2006, 12:13:51 PM11/3/06
to
Daniel Dittmar wrote:
> robert wrote:
>> ---- Question Besides: do concurrent INC/DEC machine OP-commands
>> execute atomically on Multi-Cores as they do in Single-Core threads?
>
> No on the level that that Python reference counting is implemented. The
> CPUs have often special assembler ops for these operations. I think that
> it's even more complicated as some machines require additional read and
> write barriers around these operations.
>
>>> Take a look at IPython1 and it's parallel computing capabilities [1,
> [...]
>> I do selected things with interprocess shared memory techniques
>> already. Thats medium efficent.
>> Multiple interpreters inside one process seem to be most promising for
>> seemless multi-core programming.
>
> Both IPython and Jython use multiple CPUs without the GIL, so you don't
> need to add the complication of multiple interpreters.

(IPython is only a special "python network terminal" as already said.)

I'm not familiar with Jython. Yet I think going to Jython will at first cost more performance than you can win - at least with only the dual cores we have now. In 5 years, with >8-way MPU systems, this may be different ...
Does Jython really eliminate the GIL? What happens when different threads alter/read a dict concurrently - the basic operation in Python, which is usually not protected by locks? Does Jython use a dict data type (and other collection types) which locks during _each_ access? If so, that may be very expensive.

> > As all Python objects share the same
> > malloc space - thats the key requirement in order to get the main
> > effect. As soon as with have object pickling in between its well away
> > from this very discussion.
>
> There's the complication that CPython has garbage collection in addition
> to reference counting. I'm not sure what happens when the garbage
> collector looks at objects rooted in another interpreter.

Garbage is collected at the earliest when the refcount goes to 0. Once it has gone to 0, no one will ever use such an object again. Thus GC should not be affected at all.


robert

Daniel Dittmar

unread,
Nov 3, 2006, 12:34:57 PM11/3/06
to
robert wrote:
> (IPython is only a special "python network terminal" as already said.)

Sorry, I thought of IronPython, the .NET variant.

> Does Jython really eliminate the GIL? What happens when different

Yes.

> threads alter/read a dict concurrently - the basic operation in python,
> which is usually not protected by locks. Does Jython use a dict data
> type (and other collection types) which lock during _each_ access? in
> case - that may be very expensive.

True, though Java synchronization has gotten better and better. Still,
Sun introduced non-locking hash tables, so it evidently was a problem. It'd be
interesting to know which kind of dict is used in Jython for attribute
access.


>
> garbage is collected earliest, when the refcount went to 0. If it ever
> went to 0, no one will ever use such object again. Thus GC should not be
> different at all.

Since Python 2.0, there's a mark-and-sweep style garbage collector in
addition to the reference counting scheme. This was put into Python to
be able to reclaim object cycles.


Daniel

robert

unread,
Nov 3, 2006, 5:31:53 PM11/3/06
to
Daniel Dittmar wrote:
> robert wrote:
>> (IPython is only a special "python network terminal" as already said.)
>
> Sorry, I thought of IronPython, the .NET variant.
>
>> Does Jython really eliminate the GIL? What happens when different
>
> Yes.
>
>> threads alter/read a dict concurrently - the basic operation in
>> python, which is usually not protected by locks. Does Jython use a
>> dict data type (and other collection types) which lock during _each_
>> access? in case - that may be very expensive.
>
> True, though Java synchronization has gotten better and better. Still,
> Sun introduced non locking hash tables, so it was a problem. It'd be
> interesting to know which kind of dict is used in Jython for attribute
> access.

I hate Java :-), but to come back to the concern of this thread, the speed:
I found not much about it.
http://mail.zope.org/pipermail/zpt/2002-March/002884.html
says Jython is at least 9x slower. Is this still the gap (and is constant locking a big factor in this)?

IronPython seems to be surprisingly faster than CPython (sometimes): http://www.python.org/pycon/dc2004/papers/9/
But I cannot imagine that this includes GIL-free locking dicts and all ... !?

Is numpy/scipy available for IronPython/Jython?


>> garbage is collected earliest, when the refcount went to 0. If it ever
>> went to 0, no one will ever use such object again. Thus GC should not
>> be different at all.
>
> Since Python 2.?, there's a mark-and-sweep garbage collection in
> addition to the reference counting scheme. This was put into Python to
> be able to reclaim object cycles.

Still, that only walks objects which already have a zero refcount. I cannot imagine any additional problems with the GIL.

robert

"Martin v. Löwis"

unread,
Nov 3, 2006, 6:40:27 PM11/3/06
to
robert schrieb:

> in combination with some simple locking (anyway necessary) I don't see a
> problem in ref-counting.

In the current implementation, simple locking isn't necessary.
The refcounter can be modified freely since the code modifying
it will always hold the GIL.

> ---- Question Besides: do concurrent INC/DEC machine OP-commands
> execute atomically on Multi-Cores as they do in Single-Core threads?

Not necessarily, no. On x86, you need to prefix the instruction
with the LOCK prefix for it to become atomic. Otherwise, the
two necessary read/write cycles to main memory may interleave
with the memory operations of another processor.

On other processors, INC <mem> might not exist at all as an
instruction; when you only have register add, then implementing
an atomic increment is a challenge. That's why the Windows API
provides you with an atomic increment function as part of the
operating system API.
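
(A sketch of the two usual routes to such an atomic increment - the Win32 API call on Windows, a compiler builtin elsewhere (gcc >= 4.1 assumed); atomic_incr is a made-up name:)

    #ifdef _WIN32
    #include <windows.h>
    static long atomic_incr(volatile long *p) { return InterlockedIncrement(p); }
    #else
    static long atomic_incr(volatile long *p) { return __sync_add_and_fetch(p, 1); }
    #endif

An atomic Py_INCREF would presumably have to go through something like this rather than a plain ob_refcnt++.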

Regards,
Martin

Paul Rubin

unread,
Nov 3, 2006, 7:29:16 PM11/3/06
to
robert <no-...@no-spam-no-spam.invalid> writes:
> > I don't want to discourage you but what about reference
> > counting/memory
> > management for shared objects? Doesn't seem fun for me.
>
> in combination with some simple locking (anyway necessary) I don't
> see a problem in ref-counting.
> If at least any interpreter branch has a pointer to the (root)
> object in question the ref-count is >0. ----
> Question Besides: do concurrent INC/DEC machine OP-commands execute
> atomically on Multi-Cores as they do in Single-Core threads?

Generally speaking, no, the inc/dec instructions are not atomic. You
can do an atomic increment on the x86 using LOCK XCHG (or maybe LOCK
INC is possible). The thing is that the locking protocol that
guarantees atomicity is very expensive, like 100x as expensive as an
unlocked instruction on a big multiprocessor. So yes, of course you
could accomplish reference counting through locks around the ref
counts, but performance suffers terribly. The solution is to get rid
of the ref counts and manage the entire heap using garbage collection.

For stuff like dictionary access, there are protocols (again based on
LOCK XCHG) that don't require locking for lookups. Only updates
require locking. Simon Peyton-Jones has a good paper about how it's
done in Concurrent Haskell:

http://research.microsoft.com/~simonpj/papers/stm/stm.pdf

This is really cool stuff and has found its way into Perl 6. I'd like
to see Python get something like it.

Steve Holden

unread,
Nov 3, 2006, 10:50:11 PM11/3/06
to pytho...@python.org
robert wrote:
> Daniel Dittmar wrote:
>
>>robert wrote:
[...]

>
>>>garbage is collected earliest, when the refcount went to 0. If it ever
>>>went to 0, no one will ever use such object again. Thus GC should not
>>>be different at all.
>>
>>Since Python 2.?, there's a mark-and-sweep garbage collection in
>>addition to the reference counting scheme. This was put into Python to
>>be able to reclaim object cycles.
>
>
> still that only walks on objects which have already zero refcount. Cannot imagine any additional problems with the GIL.
>
If I understand you correctly, then you are suffering a misapprehension.
Any object whose reference count goes to zero will immediately be
reclaimed. The mark-sweep garbage collector is used to detect objects
that are taking part in cycles - sets of objects that only refer to each
other without being bound to a name in any current namespace or to any
container object bound to such a name.

In other words, it detects (and reclaims) objects with non-zero
reference counts which nevertheless can be reclaimed without ill effect
on the program.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

robert

unread,
Nov 4, 2006, 7:44:09 AM11/4/06
to
Martin v. Löwis wrote:
> robert schrieb:
>> in combination with some simple locking (anyway necessary) I don't see a
>> problem in ref-counting.
>
> In the current implementation, simple locking isn't necessary.
> The refcounter can be modified freely since the code modifying
> it will always hold the GIL.

(I meant: a lock to prevent multiple interpreters from concurrently accessing the hot shared/tunneled objects.)

>> ---- Question Besides: do concurrent INC/DEC machine OP-commands
>> execute atomically on Multi-Cores as they do in Single-Core threads?
>
> Not necessarily, no. On x86, you need to prefix the instruction
> with the LOCK prefix for it to become atomic. Otherwise, the
> two necessary read/write cycles to main memory may interleave
> with the memory operations of another processor.
>
> On other processors, INC <mem> might not exist at all as an
> instruction; when you only have register add, then implementing
> an atomic increment is a challenge. That's why the Windows API
> provides you with an atomic increment function as part of the
> operating system API.

Thanks for that info. That is interesting.
Thus even on x86 this LOCK is currently not used (Py_INCREF is just (op)->ob_refcnt++).

Reading this I got pinched:
In win32ui there are in fact Py_INC/DECREFs outside of the GIL!
And I have a severe crash problem with threaded apps - the problem shows up only on dual cores!
That pointer will probably end a long search...
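
(For reference, a sketch of the usual safe pattern since Python 2.3's PyGILState API: a callback running on a foreign, non-Python thread re-acquires the GIL before touching any refcount; on_event is a made-up callback name:)

    #include <Python.h>

    static void on_event(void *arg)
    {
        PyObject *obj = (PyObject *)arg;
        PyGILState_STATE gstate = PyGILState_Ensure();   /* grab the GIL   */
        Py_INCREF(obj);                                   /* now safe       */
        /* ... use obj, call back into Python ... */
        Py_DECREF(obj);
        PyGILState_Release(gstate);                       /* drop the GIL   */
    }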


robert


PS: Besides: what are speed costs of LOCK INC <mem> ?

GHUM

unread,
Nov 4, 2006, 8:07:28 AM11/4/06
to
robert,

> Interprocess communication is tedious and out of question

[...]


> I expect to be able to directly push around Python Object-Trees between the 2 (or more) interpreters by doing some careful locking.

Please do yourself a favour and have a look at Pyro. Pyro makes
inter-computer and inter-process communication a snap. And it allows you
to push around Python objects not only between processes, but between
computers.

Maybe it solves your problem much more easily - and solves even more than you
want to solve now, by being able to use more than one computer.

Harald

robert

unread,
Nov 4, 2006, 8:42:17 AM11/4/06
to

That's really interesting. Do you expect this to eventually remove the GIL from Python? As dict accesses (which also have to be atomic here) make up a major share of Python's CPU load, the 100x-costing instructions would probably put a >30% burden on Python's overall speed.

A lock-protected possibility to use multiple well-separated interpreters by tunneling objects will probably still be the most effective solution without speed costs.
The problem of the singleton objects' refcounts (None, "", 1, 2, 3, ...), which MvL mentioned, is the main concern as far as I've understood.
The locking of Python core global resources (files etc.) can be done with little effort.
The globals of extension modules are not really critical, as in the typical applications the tunnel method is mainly used for limited number crunching, and as a programmer you are well aware of multi-interpreter-unfit modules (until they finally become mature). There could also be a special import method to duplicate extension data as a workaround.

The singletons' refcounts could be fix-reset to MAXINT/2 at creation time (or at GC time, or ...) or so, to freeze them quiet forever.

(Mutable (why?)) exception types could be duplicated in the same style as normal Python modules, or they could be rendered read-only. Thrown exceptions will not cross the border. (I think the fact that "Exception.x=5" is possible is more an artefact than intended.)


robert

robert

unread,
Nov 4, 2006, 8:53:44 AM11/4/06
to

Another MPI - pickles and worse over the network :-( Heavy networked batch jobs were never a problem with Python. That's what I want to get away from. Read the rest of this thread: the discussion is about an in-place object tunnel for fine-grained high-speed SMP multi-threading ("memory bus").

"Martin v. Löwis"

unread,
Nov 4, 2006, 9:41:55 AM11/4/06
to
robert schrieb:

> PS: Besides: what are speed costs of LOCK INC <mem> ?

That very much depends on the implementation. In

http://gcc.gnu.org/ml/java/2001-03/msg00132.html

Hans Boehm claims it's 15 cycles. The LOCK prefix
itself asserts the lock# bus signal for the entire
operation, meaning that the other processors
can't perform memory operations during that time.
On the P6, if the data is cacheable (for some Intel
definition of this word), the lock# signal will
not be asserted, just the cache gets locked.
The LOCK prefix also causes any pending writes
to be performed before the operation starts.

So in the worst case, a LOCK INC will have to
wait for pending writes, then will assert the
lock# signal, then perform a read and a write
memory cycle, with the increment processor
cycle in between. Assuming a 400MHz memory bus
and a 4GHz processor, LOCK INC will take around
20 cycles, whereas a plain INC might get done
in a single cycle or less (assuming pipelining
and caching).
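
(A rough way to measure this on one's own machine - a micro-benchmark sketch, x86 and gcc assumed; the numbers are only indicative, since pipelining, out-of-order execution and frequency scaling blur per-instruction costs:)

    #include <stdio.h>
    #include <stdint.h>

    static volatile int counter = 0;

    static uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        const int n = 10000000;
        uint64_t t0, t1;
        int i;

        t0 = rdtsc();
        for (i = 0; i < n; i++)
            counter++;                                            /* plain INC */
        t1 = rdtsc();
        printf("plain INC: %.1f cycles/op\n", (double)(t1 - t0) / n);

        t0 = rdtsc();
        for (i = 0; i < n; i++)
            __asm__ __volatile__("lock incl %0" : "+m"(counter)); /* LOCK INC  */
        t1 = rdtsc();
        printf("LOCK INC : %.1f cycles/op\n", (double)(t1 - t0) / n);
        return 0;
    }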

Regards,
Martin

Paul Rubin

unread,
Nov 4, 2006, 12:12:06 PM11/4/06
to
"Martin v. Löwis" <mar...@v.loewis.de> writes:
> > PS: Besides: what are speed costs of LOCK INC <mem> ?
>
> That very much depends on the implementation. In
> http://gcc.gnu.org/ml/java/2001-03/msg00132.html
> Hans Boehm claims it's 15 cycles.

I think that has to be on a single processor, or at most a dual core
processor with shared cache on die. With multiple CPU chips I don't
think you can get the signals around that fast.

"Martin v. Löwis"

unread,
Nov 4, 2006, 10:20:29 PM11/4/06
to
Paul Rubin schrieb:

Can you explain what you mean? The lock# signal takes *immediate*
effect, within the CPU cycle in which it is asserted. There is no
delay whatsoever (except for speed-of-light issues, of course).

Regards,
Martin

Paul Rubin

unread,
Nov 5, 2006, 3:30:18 AM11/5/06
to

I dunno about x86 hardware signals but these instructions do
read-modify-write operations. That means there has to be enough
interlocking to prevent two CPUs from updating the same memory
location simultaneously, which means the CPUs have to communicate.
See <http://en.wikipedia.org/wiki/MESI_protocol> (I think this is
how the current x86s do it):

A cache that holds a line in the Modified state must snoop
(intercept) all attempted reads (from all of the other CPUs in the
system) of the corresponding main memory location and insert the
data that it holds. This is typically done by forcing the read to
back off (i.e. to abort the memory bus transaction), then writing
the data to main memory and changing the cache line to the Shared
state.

A cache that holds a line in the Shared state must also snoop all
invalidate broadcasts from other CPUs, and discard the line (by
moving it into Invalid state) on a match.

It is quite difficult to get signals off of a chip at GHz speeds, and
so this signalling is much slower than the CPU cycle.

Ross Ridge

unread,
Nov 5, 2006, 6:06:33 AM11/5/06
to
Paul Rubin wrote:
> I dunno about x86 hardware signals but these instructions do
> read-modify-write operaitons. That means there has to be enough
> interlocking to prevent two cpu's from updating the same memory
> location simultaneously, which means the CPU's have to communicate.
> See <http://en.wikipedia.org/wiki/MESI_protocol> (I think this is
> how the current x86's do it):

The MESI protocol is used to ensure all read and write operations are
cache coherent, not just locked read-modify-write operations. Whether
the LOCK prefix is used or not in an instruction normally isn't
externally visible. From IA-32 Intel® Architecture Software
Developer's Manual Volume 3A: System Programming Guide, Part 1:

For the Pentium 4, Intel Xeon, and P6 family processors,
if the area of memory being locked during a LOCK operation
is cached in the processor that is performing the LOCK operation
as write-back memory and is completely contained in a cache line,
the processor may not assert the LOCK# signal on the bus. Instead,
it will modify the memory location internally and allow it's
cache coherency mechanism to insure that the operation is carried
out atomically. This operation is called "cache locking."
The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from
simultaneously modifying data in that area.

The cost of maintaining cache coherency for a locked increment
instruction should be no different from that of an unlocked increment
instruction.

Ross Ridge

"Martin v. Löwis"

unread,
Nov 5, 2006, 6:08:40 AM11/5/06
to Paul Rubin
Paul Rubin schrieb:

> I dunno about x86 hardware signals but these instructions do
> read-modify-write operaitons. That means there has to be enough
> interlocking to prevent two cpu's from updating the same memory
> location simultaneously, which means the CPU's have to communicate.
> See <http://en.wikipedia.org/wiki/MESI_protocol> (I think this is
> how the current x86's do it):

Ah, but in the case where the lock# signal is used, it's known that
the data is not in the cache of the CPU performing the lock operation;
I believe it is also known that the data is not in the cache of any
other CPU. So the CPU performing the LOCK INC sequence just has
to perform two memory cycles. No cache coherency protocol runs
in that case.

But even when caching is involved, I don't see why there should be
more than three memory cycles. The MESI "protocol" really consists
only of two pins: HIT# and HITM#; the actual implementation is just
in keeping the state for each cache line, and in snooping. The
CPUs don't really "communicate". Instead, if one processor tries
to fill a cache line, the others snoop the read, and assert either
HIT# (when they have not modified their cache lines) or HITM#
(when they have modified their cache lines). Assertion of
these signals is also immediate.

The worst case would be that one processor performs a LOCK INC,
and another processor has the modified value in its cache line.
So it needs to first flush the cache line, before the other
processor can modify the memory. If the memory is not cached
yet in another processor, the MESI protocol does not involve
additional penalties.

Regards,
Martin

Paul Rubin

unread,
Nov 7, 2006, 3:19:49 AM11/7/06
to
"Martin v. Löwis" <mar...@v.loewis.de> writes:
> Ah, but in the case where the lock# signal is used, it's known that
> the data is not in the cache of the CPU performing the lock operation;
> I believe it is also known that the data is not in the cache of any
> other CPU. So the CPU performing the LOCK INC sequence just has
> to perform two memory cycles. No cache coherency protocol runs
> in that case.

How can any CPU know in advance that the data is not in the cache of
some other CPU?

> But even when caching is involved, I don't see why there should be
> more than three memory cycles. The MESI "protocol" really consists
> only of two pins: HIT# and HITM#; the actual implementation is just
> in keeping the state for each cache line, and in snooping. There
> CPU's don't really "communicate". Instead, if one processor tries
> to fill a cache line, the others snoop the read, and assert either
> HIT# (when they have not modified their cache lines) or HITM#
> (when they do have modified their cache lines). Assertions of
> these signals is also immediate.

OK, this is logical, but it already implies a cache miss, which costs
many dozen (100?) cycles. But this case may be uncommon, since one
hopes that cache misses are relatively rare.

> The worst case would be that one processor performs a LOCK INC,
> and another processor has the modified value in its cache line.
> So it needs to first flush the cache line, before the other
> processor can modify the memory. If the memory is not cached
> yet in another processor, the MESI protocol does not involve
> additional penalties.

I think for Python refcounts this case must occur quite frequently
since there are many Python objects (small integers, None, etc.)
whose refcounts get modified extremely often.

IIRC, the SPJ paper that I linked claims that lock-free protocols
outperform traditional lock-based ones even with just two processors.
But maybe things are better with a dual core processor (shared cache)
than with two separate packages.

Ross Ridge

unread,
Nov 7, 2006, 7:03:03 AM11/7/06
to
"Martin v. Löwis" <mar...@v.loewis.de> writes:
> Ah, but in the case where the lock# signal is used, it's known that
> the data is not in the cache of the CPU performing the lock operation;
> I believe it is also known that the data is not in the cache of any
> other CPU. So the CPU performing the LOCK INC sequence just has
> to perform two memory cycles. No cache coherency protocol runs
> in that case.

Paul Rubin wrote:
> How can any CPU know in advance that the data is not in the cache of
> some other CPU?

In the case where the LOCK# signal is asserted, the area of memory
accessed is marked as being uncacheable. In an SMP system all CPUs must
have the same mapping of cached and uncached memory or things like this
break. In the case where the LOCK# signal isn't used, the MESI
protocol informs the CPU of which of its cache lines might also be in
the cache of another CPU.

> OK, this is logical, but it already implies a cache miss, which costs
> many dozen (100?) cycles. But this case may be uncommon, since one
> hops that cache misses are relatively rare.

The cost of the cache miss is the same whether the increment
instruction is locked or not.

Ross Ridge

Joe Seigh

unread,
Nov 7, 2006, 9:06:59 AM11/7/06
to
Paul Rubin wrote:
> robert <no-...@no-spam-no-spam.invalid> writes:
>
>>>I don't want to discourage you but what about reference
>>>counting/memory
>>>management for shared objects? Doesn't seem fun for me.
>>
>>in combination with some simple locking (anyway necessary) I don't
>>see a problem in ref-counting.
>>If at least any interpreter branch has a pointer to the (root)
>>object in question the ref-count is >0. ----
>>Question Besides: do concurrent INC/DEC machine OP-commands execute
>>atomically on Multi-Cores as they do in Single-Core threads?
>
>
> Generally speaking, no, the inc/dec instructions are not atomic. You
> can do an atomic increment on the x86 using LOCK XCHG (or maybe LOCK
> INC is possible). The thing is that the locking protocol that
> guarantees atomicity is very expensive, like 100x as expensive as an
> unlocked instruction on a big multiprocessor. So yes, of course you
> could accomplish reference counting through locks around the ref
> counts, but performance suffers terribly. The solution is to get rid
> of the ref counts and manage the entire heap using garbage collection.

Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe. It's what Boost::shared_ptr
uses to make itself internally thread-safe but it's not safe to copy
a shared_ptr without "owning" it. Not like Java where you can safely
copy a reference (Java pointer) no matter what.

Basically there's a race condition where an object containing the
refcount can be deleted between the time you load a pointer to
the object and the time you increment what used to be a refcount
and is possibly something else but definitely undefined.

I have an experimental refcounting implementation at
http://atomic-ptr-plus.sourceforge.net/


>
> For stuff like dictionary access, there are protocols (again based on
> LOCK XCHG) that don't require locking for lookups. Only updates
> require locking. Simon Peyton-Jones has a good paper about how it's
> done in Concurrent Haskell:
>
> http://research.microsoft.com/~simonpj/papers/stm/stm.pdf
>
> This is really cool stuff and has found its way into Perl 6. I'd like
> to see Python get something like it.

That seems to be about STM (Software Transactional Memory). What you're
describing seems to be read lock-free using what I call PDR
(PCOW (Partial Copy On Write) Deferred Reclamation). Examples of PDR
are RCU (used in Linux kernel), Maged Michael's SMR hazard pointers,
and thread-safe GC (used in Java's concurrent collections java.util.concurrent).

You can also use PDR to manage safe refcounting, e.g. RCU-based refcounting (rcuref)
in the Linux kernel.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

"Martin v. Löwis"

unread,
Nov 7, 2006, 9:44:59 AM11/7/06
to Paul Rubin
Paul Rubin schrieb:

> "Martin v. Löwis" <mar...@v.loewis.de> writes:
>> Ah, but in the case where the lock# signal is used, it's known that
>> the data is not in the cache of the CPU performing the lock operation;
>> I believe it is also known that the data is not in the cache of any
>> other CPU. So the CPU performing the LOCK INC sequence just has
>> to perform two memory cycles. No cache coherency protocol runs
>> in that case.
>
> How can any CPU know in advance that the data is not in the cache of
> some other CPU?

AFAIU, the lock# line, in P6, is only used for memory regions that
are marked non-cacheable. The CPU determines a memory region to be
non-cacheable if either the memory-type register (MTRR) says it's
non-cacheable, or if the not-cached bit in the page table is set.

If a certain location is known to be modified a lot from different
CPUs (e.g. a spin lock), it might be best if the operating system
sets the page where this location lives as non-cacheable - it might
be more performant to always modify it through the main memory,
than having cache coherence deal with it.

> OK, this is logical, but it already implies a cache miss, which costs
> many dozen (100?) cycles. But this case may be uncommon, since one
> hops that cache misses are relatively rare.

Right - I'm completely uncertain as to what the cost of a cache miss
is, in terms of internal cycles. E.g. how many memory cycles does
the CPU have to perform to fill a cache line? And what is the ratio
between memory cycles and CPU cycles?

> IIRC, the SPJ paper that I linked claims that lock-free protocols
> outperform traditional lock-based ones even with just two processors.
> But maybe things are better with a dual core processor (shared cache)
> than with two separate packages.

Likely, yes. Although different dual-core designs are out there, some
of them not using a shared cache, but two individual ones.

Regards,
Martin

Ross Ridge

unread,
Nov 7, 2006, 10:38:02 AM11/7/06
to
Joe Seigh wrote:
> Basically there's a race condition where an object containing the
> refcount can be deleted between the time you load a pointer to
> the object and the time you increment what used to be a refcount
> and is possibly something else but definitely undefined.

That doesn't really make sense. The object can't be deleted because
the thread should already have a reference (directly or indirectly) to
the object, otherwise any access to it can cause the race condition you
describe.

Ross Ridge

Joe Seigh

unread,
Nov 7, 2006, 12:04:03 PM11/7/06
to

True but if the thread didn't already have a reference, how would you get
that initial reference to a shared object without a lock?

Ross Ridge

unread,
Nov 7, 2006, 12:57:11 PM11/7/06
to
Ross Ridge wrote:
> That doesn't really make sense. The object can't be deleted because
> the thread should already have a reference (directly or indirectly) to
> the object, otherwise any access to it can cause the race condition you
> describe.

Joe Seigh wrote:
> True but if the thread didn't already have a reference, how would you get
> that initial reference to a shared object without a lock?

The thread that shares it increments the reference count before passing
its address directly to another thread or indirectly through a shared
container.

Ross Ridge

"Martin v. Löwis"

unread,
Nov 7, 2006, 1:11:47 PM11/7/06
to
Ross Ridge schrieb:

> The thread that shares it increments the reference count before passing
> its address to directly another thread or indirectly through a shared
> container.

To make a specific example, consider this fragment from
Objects/fileobject.c:

static PyObject *
file_repr(PyFileObject *f)
{
        if (PyUnicode_Check(f->f_name)) {
                ...

Now, assume there wasn't a GIL protecting this all, and also
assume f_name was a mutable member (which it currently isn't).

Then, this access wouldn't be thread-safe: This code roughly
translates to

reg_x = f->f_name
push reg_x
call PyUnicode_Check (assuming this was a function and
not a macro)

Meanwhile, another process might perform

old = f->f_name;
f->f_name = new;
Py_DECREF(old);

i.e. change the file name. Now, it might be that they
interleave this way:

reg_x = f->f_name
old = f->f_name
f->f_name = new
Py_DECREF_old
push reg_x
call PyUnicode_Check

which would now operate on a deallocated object.

To fix this, one might think that we need

        Py_INCREF(f->f_name);
        if (PyUnicode_Check(f->f_name))
                ...
        Py_DECREF(f->f_name);

However, this would not help, because the
first incref translates to

reg_x = f->f_name
LOCK INC f->ob_refcnt

which again leaves a race condition where the
INCREF operation comes "too late".

How would you propose to fix file_repr to prevent such
a race condition?

Regards,
Martin

Ross Ridge

unread,
Nov 7, 2006, 2:49:30 PM11/7/06
to

Martin v. Löwis wrote:
> How would you propose to fix file_repr to prevent such
> a race condition?

The race condition you describe is different from the one Joe Seigh
described. It's caused because, without the GIL, access to the file
object is no longer thread-safe, not because reference counting isn't
thread-safe.

Ross Ridge

"Martin v. Löwis"

unread,
Nov 7, 2006, 3:55:10 PM11/7/06
to Ross Ridge
Ross Ridge schrieb:

Joe's claim (quoting him) was: "Atomic increment and decrement
instructions are not by themselves sufficient to make reference
counting safe."

Rephrasing this in Python terms is "if the GIL is dropped and
incref/decref are made atomic, the reference counting is not
automatically thread-safe". This is what my example demonstrates:
Drop the GIL, and assume an atomic inc/dec, and it won't be
thread-safe.

It's not that the file object becomes thread-unsafe. Instead,
access to the f_name property won't work anymore (as would
access to any other PyObject* field).

You still didn't say what you would suggest to make it thread-safe
again; most likely, your proposal would be to add locking. If I
understand Joe's approach correctly, he has a solution that does
not involve locking (although I don't understand how it works).

Regards,
Martin

Joe Seigh

unread,
Nov 7, 2006, 4:13:33 PM11/7/06
to
Martin v. Löwis wrote:
> You still didn't say what you would suggest to make it thread-safe
> again; most likely, you proposal would be to add locking. If I
> understand Joe's approach correctly, he has a solution that does
> not involve locking (although I don't understand how it works).
>
Sun had applied for a patent on it. You can go to the
uspto search page here http://www.uspto.gov/patft/index.html
and look for

20060218561 Code preparation technique employing lock-free pointer operations
20060037026 Lightweight reference counting using single-target synchronization

Click on the images link on the patent application, where the illustrations
probably show the concepts better than the text.

The first one above is actually a continuation patent on three different
techniques. One using double wide compare and swap, one using ROP (Repeat
Offender Problem), a form of PDR, and one using DCAS (compare and swap
of two separate locations) which only exists on MC68020 and MC68030
processors.

Chaz Ginger

unread,
Nov 7, 2006, 4:25:36 PM11/7/06
to
Check out the work in the '80s from the NYU Ultra project. They did a
great deal of work on using atomic incr/decr for all sorts of algorithms
to get around locking on parallel processors.

Chaz.

Shane Hathaway

unread,
Nov 7, 2006, 4:36:05 PM11/7/06
to robert, pytho...@python.org
robert wrote:
> I'd like to use multiple CPU cores for selected time consuming Python
> computations (incl. numpy/scipy) in a frictionless manner.
>
> Interprocess communication is tedious and out of question, so I
> thought about simply using a more Python interpreter instances
> (Py_NewInterpreter) with extra GIL in the same process. I expect to

> be able to directly push around Python Object-Trees between the 2 (or
> more) interpreters by doing some careful locking.
>
> Any hope to come through? If possible, what are the main dangers? Is
> there an example / guideline around for that task? - using ctypes or
> so.
>
> Or is there even a ready made Python module which makes it easy to
> setup and deal with extra Interpreter instances? If not, would it be
> an idea to create such thing in the Python std libs to make Python
> multi-processor-ready. I guess Python will always have a GIL -
> otherwise it would loose lots of comfort in threaded programming

I'd like to mention mod_python, which creates multiple interpreters
inside Apache. It does this transparently and works well. There is no
apparent lock contention between the interpreters, because no Python
objects are shared.

Also, I'd like to make the observation that it is not always necessary
to share the entire interpreter (ala threads) in order to take advantage
of multiple cores. I think Python only needs a nice way to share a
relatively small set of objects using shared memory. POSH goes in that
direction, but I don't think it's simple enough yet.

http://modpython.org/
http://poshmodule.sourceforge.net/

Shane

Ross Ridge

unread,
Nov 7, 2006, 5:48:35 PM11/7/06
to
Martin v. Löwis wrote:
> How would you propose to fix file_repr to prevent such
> a race condition?

Ross Ridge schrieb:


> The race condition you describe is different from the one Joe Seigh
> described. It's caused because without GIL access to the file object
> is no longer thread safe, not because reference counting isn't thread
> safe.

Martin v. Löwis wrote:
> Joe's claim (quoting him) was: "Atomic increment and decrement
> instructions are not by themselves sufficient to make reference
> counting safe."

So give an example where reference counting is unsafe.

Remember that I only said that Joe Seigh's statement didn't make sense,
not that I had some magic solution that would somehow make Python
thread safe without locking.

Ross Ridge

gra...@dscpl.com.au

unread,
Nov 7, 2006, 6:38:51 PM11/7/06
to

Shane Hathaway wrote:
> robert wrote:
> > I'd like to use multiple CPU cores for selected time consuming Python
> > computations (incl. numpy/scipy) in a frictionless manner.
> >
> > Interprocess communication is tedious and out of question, so I
> > thought about simply using a more Python interpreter instances
> > (Py_NewInterpreter) with extra GIL in the same process. I expect to
> > be able to directly push around Python Object-Trees between the 2 (or
> > more) interpreters by doing some careful locking.
> >
> > Any hope to come through? If possible, what are the main dangers? Is
> > there an example / guideline around for that task? - using ctypes or
> > so.
> >
> > Or is there even a ready made Python module which makes it easy to
> > setup and deal with extra Interpreter instances? If not, would it be
> > an idea to create such thing in the Python std libs to make Python
> > multi-processor-ready. I guess Python will always have a GIL -
> > otherwise it would loose lots of comfort in threaded programming
>
> I'd like to mention mod_python, which creates multiple interpreters
> inside Apache. It does this transparently and works well. There is no
> apparent lock contention between the interpreters, because no Python
> objects are shared.

In relation to mod_python, your statement isn't strictly true. One can
create scenarios in mod_python where a Python object created in the
context of one interpreter is then used in another interpreter at a
later point. Generally this isn't an issue because it is part of a
recursive sequence of processing and thus while the second interpreter
is doing stuff with the object the first is stalled waiting for the
second to return. It wouldn't be totally inconceivable though for a
user to create threads which may operate on the object at the same time
in the first interpreter as the object is being used in the second
interpreter.

The situation gets even more complicated when you start to use third
party modules which may be used from multiple interpreters and a
multithreaded Apache MPM is used and where the third party module
performs callbacks into interpreters with some common object.

Thus, mod_python may not be able to benefit from each Python
interpreter instance having its own GIL if that was your point in
referencing mod_python.

Graham

Sandra-24

unread,
Nov 7, 2006, 7:59:29 PM11/7/06
to
On Nov 2, 1:32 pm, robert <no-s...@no-spam-no-spam.com> wrote:
> I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.
>
> Interprocess communication is tedious and out of question, so I thought about simply using a more Python interpreter instances (Py_NewInterpreter) with extra GIL in the same process.

Why not use IronPython? It's up to date (unlike Jython), has no GIL,
and is cross-platform wherever you can get .NET or Mono (UNIX, Macs,
Windows), and you can use almost any scientific library written for the
.NET/Mono platform (there are a lot). Take a look anyway.

-Sandra

"Martin v. Löwis"

unread,
Nov 8, 2006, 12:10:14 AM11/8/06
to Ross Ridge
Ross Ridge schrieb:

> So give an example where reference counting is unsafe.

Nobody claimed that, in that thread. Instead, the claim was
"Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe."

I did give an example, in <4550cc64$0$3462$9b62...@news.freenet.de>.
Even though f_name is reference-counted, it might happen that you get a
dangling pointer. More specifically, even if file_repr *incremented* the
reference counter for f_name, it might still be a dangling pointer.

To quote Joe again:

# Basically there's a race condition where an object containing the
# refcount can be deleted between the time you load a pointer to
# the object and the time you increment what used to be a refcount
# and is possibly something else but definitely undefined.

I don't understand why you can't see that this statement is factually
correct.

Regards,
Martin

Ross Ridge

unread,
Nov 8, 2006, 8:21:03 AM11/8/06
to
Ross Ridge schrieb:
> So give an example where reference counting is unsafe.

Martin v. Löwis wrote:
> Nobody claimed that, in that thread. Instead, the claim was
> "Atomic increment and decrement instructions are not by themselves
> sufficient to make reference counting safe."

So give an example of where atomic increment and decrement instructions
are not by themselves sufficient to make reference counting safe.

> I did give an example, in <4550cc64$0$3462$9b62...@news.freenet.de>.
> Even though f_name is reference-counted, it might happen that you get a
> dangling pointer.

Your example shows how access to the "f_name" member is unsafe, not how
reference counting is unsafe. The same sort of race condition can occur
without reference counting being involved at all. Consider the
"f_fp" member: if one thread tries to use "printf()" on it while
another thread calls "fclose()", then you have the same problem. The
race condition here doesn't happen because reference counting hasn't
been made safe, nor does it happen because stdio isn't thread-safe. It
happens because accessing "f_fp" (without the GIL) is unsafe.

The problem you're describing isn't that reference counting hasn't been
made safe. What you and Joe seem to be trying to say is that atomic
increment and decrement instructions alone don't make accessing shared
structure members safe.

Ross Ridge

Joe Seigh

unread,
Nov 8, 2006, 9:40:13 AM11/8/06
to
Ross Ridge wrote:
> Ross Ridge schrieb:
>
>>So give an example where reference counting is unsafe.
>
>
> Martin v. Löwis wrote:
>
>>Nobody claimed that, in that thread. Instead, the claim was
>>"Atomic increment and decrement instructions are not by themselves
>>sufficient to make reference counting safe."
>
>
[...]

>
> The problem your describing isn't that reference counting hasn't been
> made safe. What you and Joe seem to be trying to say is that atomic
> increment and decrement instructions alone don't make accessing shared
> structure members safe.

How you increment and decrement a reference count is an implementation
issue and whether it's correct or not depends on the semantics of the
refcount based pointer. I.e. what kind of thread safety guarantees
are we ascribing to pointer operations? Java reference (pointer)
guarantees are not the same as C++ Boost shared_ptr guarantees. In
Java simple pointer assignment, "a = b;" is always safe no matter what
any other thread may be doing to a and/or b. With shared_ptr you have
to have a priori knowledge that the refcount for b will never go to
zero during the copy operation. Usually that implies ownership of b.

To get a better feeling for the latter semantics you should take a look
at C++ String implementations. A C++ String COW (copy on write) implementation
using atomic inc/dec is as thread-safe as a non-COW String implementation
because in both cases you have to own the strings you access.

About the term "thread-safe". In Posix it's taken as meaning an operation
on non-shared data isn't affected by other threads. So non-COW strings
are thread-safe, and COW strings are thread-safe if their internal
refcounting is synchronized properly. But note that in both cases,
the implementation is transparent to the user.

So as you say, a lot depends on how you access or want to access shared
structure members, e.g. with or without locking.

robert

unread,
Nov 8, 2006, 11:18:12 AM11/8/06
to

Yes.

To recall the motivation, and to have a real-world example:
The idea is to share/hand over objects between 2 (N) well-separated Python interpreter instances (free-threading) with 2 different GILs.
There is of course the usage condition: * only one interpreter may access (read & write) the hot (tunneled) object tree at a time *
(e.g. application: a numeric calc, and when it is finished (flag), the other interpreter walks the objects again directly (without MPI/pickling)).

But a key problem was that the other thread can have old pointer objects (containers) pointing into the hot object tree - and there is shared use of (immutable) singleton objects (None, 1, 2, ...): the old pointer objects in the other interpreter may disappear at any time, or the pointers may be duplicated.
Thus the refcount of a hot object will count down/up from the interpreter which does not have possession of the hot object tree - even if the pointers are not used for read/write access.

Only and truly if you have atomic Py_INCREF/Py_DECREF is this no problem.

Before all interpreters have dropped the object there will be no accidental disappearance of the object, as Ross Ridge already pointed out.
In addition, concurrent read access to _constant_ objects/sub-trees would be possible, and also concurrent read & write access by using explicit locking!
Thus the original OP requirements would be fulfilled.

I see so far only 5 main additional requirements to offer the possibility of separate GILs / free-threading interpreters:

* a pointer to the current GIL in the thread state, and a dynamic PyThreadState_GET() / current thread state in TLS

* locks for global objects (files etc.) of course, if they are to be supported there (I'd use the free-threading only for mere computations)

* enable the already existing LOCK&UNLOCK macros in obmalloc.c with something fast like (sketch):
  _retry:
      __asm lock inc malloc_lock
      if (malloc_lock != 1) { __asm lock dec malloc_lock; /* yield(); */ goto _retry; }

* a special (LOCK INC) locking dict type for the global dict of extension modules
  (created clearly by Py_InitModule(name, methods) - thus that would also preserve backwards compatibility for extension C code)

* nice tunnel functions to create extra interpreters and to actually tunnel the objects, maybe also offering the fast locking-dict type to enable fast sharing of the hot tunneled object tree.

Speed costs? Probably not much as far as the discussion in this thread sounds...

Of course this option of 2 interpreters - though easy to use - would still be for power use cases only: a Python programming bug doing accidental unlocked concurrent access into a hot tunneled tree can cause a C-level crash. (This cannot happen so far with simple Python threading - you'd only get inconsistent data or a Python exception. But of course you can already crash at C level today by using the Python standard library :-) ).
That danger would be OK for me so far. Conceptually it's not more complicated than using locks correctly in normal Python thread programming - only the effect of bugs will be more critical ...

If one thinks about overcoming the GIL at all - we are probably not far away. Mainly:

* make the locking dict type (and locking list type) the common case - the non-locking obsolete or for optimization only

* lock some other non-constant types which are not already mainly dicts/lists. Most objects whose access functions only change integers etc. and call thread-safe C library functions don't require extra locking

Maybe the separated-GIL/interpreter method can be a bridge to that.
Refcounting probably doesn't necessarily block that road.

It would be really interesting to run an experiment on the speed costs of LOCK INC. I guess the hot spot regarding refcounting will be Py_None's cache line (on multi-core CPUs with separate caches per core).


Robert

robert

unread,
Nov 8, 2006, 11:25:26 AM11/8/06
to
robert wrote:
> Martin v. Löwis wrote:
[..]
>
> Thanks for that info. That is interesting.
> Thus even on x86 currently this LOCK is not used (just
> (op)->ob_refcnt++) )
>
> Reading this I got pinched: In win32ui there are infact Py_INC/DECREF's
> outside of the GIL !
> And I have a severe crash problem with threaded apps - the problem is
> only only on dual cores !
> That pointer probably will end a long search...

In fact it was in win32ui. For months I've searched strenuously for the bug, and this fancy discussion revealed it :-)

robert

unread,
Nov 8, 2006, 11:40:39 AM11/8/06
to

What about speed? Is it true that IronPython is meanwhile almost as fast as CPython?

If this is all really true, it's probably proof that putting LOCK-INC locks (on dicts, lists, mutables ...) into CPython to remove the GIL in the future should not really be as expensive as was taught in the past :-)

Still, adapting to the .NET libraries will be a big step. Is there really anything out there as usable as numpy/scipy? And GUI programming in IronPython ...


(The FAQs on codeplex.com and many docs are currently not readable due to site errors. What about overall stability and the number of IronPython users as of today?)


-robert

robert

unread,
Nov 8, 2006, 11:57:21 AM11/8/06
to
Shane Hathaway wrote:
> of multiple cores. I think Python only needs a nice way to share a
> relatively small set of objects using shared memory. POSH goes in that
> direction, but I don't think it's simple enough yet.
>
> http://poshmodule.sourceforge.net/

Interesting - a solution possibly a little faster than pickling, but maybe only in selected situations. I've already made experiments with pickling through shared memory.

With "x = posh.share(x)" an object tree will be (deep-)copied to shared mem ( as far as objects fullfil some conditions http://poshmodule.sourceforge.net/posh/html/node6.html: is this true for numpy arrays?)
Every object to be inserted in the hot tunnel object tree has to be copied that same style. Thus ~pickling, but somewhat easier to use.
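
Just to make the usage pattern concrete, here is a minimal, untested sketch. Only posh.share() itself is taken from the POSH documentation; the plain os.fork()/os.waitpid() process handling and the assumption that a shared dict containing a list fulfils POSH's shareability conditions are mine - POSH may well provide its own process helpers for this:

import os
import posh

# deep-copy the object tree into shared memory (per the POSH docs)
shared = posh.share({'results': [0, 0, 0, 0]})

pids = []
for i in range(4):
    pid = os.fork()
    if pid == 0:
        # child: each worker writes only its own slot, so no locking is needed here
        shared['results'][i] = i * i
        os._exit(0)
    pids.append(pid)

for pid in pids:
    os.waitpid(pid, 0)

print shared['results']    # expect [0, 1, 4, 9] if the nested list is really shared

(os.fork() is POSIX-only, which also ties in with the Windows problem below.)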

And compiling it for Windows will be quite difficult ...

Robert

sturlamolden

unread,
Nov 8, 2006, 12:41:15 PM11/8/06
to
robert wrote:

> I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.

Threading is not the best way to exploit multiprocessors in this
context. Threads are not the preferred way of exploiting multiple
processors in scientific computing.

Here are a few thoughts on the matter:

1. SciPy uses ATLAS/BLAS and LAPACK. You can compile these libraries
for SMPs. The same goes for FFTW, vendor optimized math kernels, etc.
If most of the CPU time is spent inside these numeric libraries, using
multi-processor versions of these libraries is a much better strategy.

2. The number of CPUs is not the only speed-limiting factor on an SMP.
Use of cache and prefetching is just as important. That can make
multi-processor aware numeric libraries a lot more efficient than
manual multi-threading.

3. One often uses cluster architectures (e.g. Beowulf) instead of SMPs
for scientific computing. MPI works on SMP and clusters. Threads only
work on SMPs.

4. Fortran compilers can recognize parallel array statements in
Fortran 90/95 and exploit multiple processors on an SMP automatically.
NumPy should be able to do the same when it matures. E.g. if you make a
statement like "arr[1,::] = arr[2,::] * arr[3,::]" (see the small sketch
after this list), then this statement could be evaluated in parallel on
multiple CPUs, without any multi-threading on your part. Since the order
in which the multiplications are performed is of no significance, the
work can just as well be spread out to multiple processors in an SMP or
a cluster. NumPy is still immature, but Fortran compilers have done this
for at least two decades.

5. Streaming SIMD extensions (SSE) and similar opcodes: Are you aware
that Pentium III (and newer) processors are pipe-lined to do four
floating-point operations in parallel? You could theoretically
quadruple your flops using the SSE registers, using no threading at
all. (The actual improvement is slightly less, due to some extra
book-keeping required to get the data in and out of the SSE registers.)
Again this requires modifications inside NumPy, not multi-threading.
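
As a small, concrete sketch of the kind of array statement item 4 refers to (plain NumPy, runs on one core today; the point is that a future NumPy, or a multi-processor BLAS underneath as in item 1, could spread the element-wise work over several CPUs without any threading in user code):

import numpy

# 4 rows of one million doubles
arr = numpy.arange(4 * 1000000, dtype=float).reshape(4, 1000000)

# one whole-array statement: elementwise product of rows 2 and 3 into row 1;
# the order of the individual multiplications is irrelevant, which is what
# makes the statement trivially parallelisable
arr[1,::] = arr[2,::] * arr[3,::]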

> If not, would it be an idea to create such thing in the Python std libs to make Python multi-processor-ready. I guess Python will always have a GIL - otherwise it would loose lots of comfort in threaded programming

I would say that the GIL actually has very little effect on Python's
potential in high-performance numeric and scientific computing. It all
depends on the libraries, not on Python per se. Threading is for making
certain tasks more comfortable to write, not so much for computational
speed.

Paul Rubin

unread,
Nov 8, 2006, 1:01:56 PM11/8/06
to
robert <no-...@no-spam-no-spam.invalid> writes:
> What about speed? Is it true that IronPython is almost as fast as CPython by now?
>
> When all this is really true, it's probably proof that putting
> LOCK-INC locks (on dicts, lists, mutables ...) into CPython to remove
> the GIL in the future should not really be as expensive as was taught
> in the past :-)

I don't think IronPython uses locks that way. AFAIK it doesn't use
reference counts, and relies on user-supplied synchronization to keep
stuff like dictionaries safe.

sturlamolden

unread,
Nov 8, 2006, 1:55:47 PM11/8/06
to

sturlamolden wrote:

> 3. One often uses cluster architectures (e.g. Beowulf) instead of SMPs
> for scientific computing. MPI works on SMP and clusters. Threads only
> work on SMPs.

Following up on my previous post, there is a simple Python MPI wrapper
that can be used to exploit multiple processors for scientific
computing. It only works for Numeric, but an adaptation to NumPy should
be easy (there is only one small C file in the source):

http://datamining.anu.edu.au/~ole/pypar/

If you are using Windows, you can get a free implementation of MPI
here:

http://www-unix.mcs.anl.gov/mpi/mpich1/mpich-nt/
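
As a rough illustration of how such a wrapper is used, here is a data-parallel sum sketch. The function names rank/size/send/receive/finalize are what I recall of pypar's API and should be treated as assumptions to check against the installed version; run it under MPI with something like "mpirun -np 4 python sum_demo.py":

import numpy
import pypar

me = pypar.rank()                # this process' id, 0 .. nprocs-1
nprocs = pypar.size()

data = numpy.arange(1000000, dtype=float)
chunk = data[me::nprocs]         # each process takes its own strided slice
partial = float((chunk * chunk).sum())

if me == 0:
    total = partial
    for src in range(1, nprocs):
        total = total + pypar.receive(src)   # collect the workers' partial sums
    print total
else:
    pypar.send(partial, 0)                   # send my partial sum to process 0

pypar.finalize()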

sturlamolden

unread,
Nov 8, 2006, 2:08:51 PM11/8/06
to

"Martin v. Löwis"

unread,
Nov 8, 2006, 4:14:20 PM11/8/06
to
Ross Ridge wrote:

> The problem you're describing isn't that reference counting hasn't been
> made safe. What you and Joe seem to be trying to say is that atomic
> increment and decrement instructions alone don't make accessing shared
> structure members safe.

All I can do is repeat Joe's words literally:


"Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe."

Replace "by themselves" with "alone", and "reference counting"
with "reference counting in the presence of shared structure
members", and you find that your statement and Joe's are equivalent.
*Of course* reference counting is about shared structure members.
That's its primary purpose.

Regards,
Martin

Fernando Perez

unread,
Nov 9, 2006, 2:01:34 AM11/9/06
to pytho...@python.org
sturlamolden wrote:

> Following up on my previous post, there is a simple Python MPI wrapper
> that can be used to exploit multiple processors for scientific
> computing. It only works for Numeric, but an adaptation to NumPy should
> be easy (there is only one small C file in the source):
>
> http://datamining.anu.edu.au/~ole/pypar/
>
> If you are using Windows, you can get a free implementation of MPI
> here:
>
> http://www-unix.mcs.anl.gov/mpi/mpich1/mpich-nt/

It's also worth noting

http://mpi4py.scipy.org

The project just moved over to scipy hosting a few days ago, and the boxes
haven't been unpacked yet. But it is very actively developed, has full
numpy and MPI-2 support, and is looking very nice.
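
As a tiny, hedged sketch of what using it looks like (the names follow the current mpi4py documentation - MPI.COMM_WORLD, Get_rank, the pickle-based send/recv - and may differ in detail from the release available right now):

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    payload = numpy.arange(10, dtype=float)
    comm.send(payload, dest=1)     # pickle-based send of an arbitrary Python object
elif rank == 1:
    payload = comm.recv(source=0)
    print payload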

Cheers,

f

Paul Boddie

unread,
Nov 9, 2006, 6:43:58 AM11/9/06
to
robert wrote:
> Shane Hathaway wrote:
> > of multiple cores. I think Python only needs a nice way to share a
> > relatively small set of objects using shared memory. POSH goes in that
> > direction, but I don't think it's simple enough yet.
> >
> > http://poshmodule.sourceforge.net/
>
> Interesting - a solution possibly a little faster than pickling, but maybe only in selected
> situations. I've already made experiments with pickling through shared memory.

What did you discover in your experiments? I'd certainly suspect that
pickling is going to add an unacceptable degree of overhead in the kind
of application you're attempting to write (using a shared data
structure with random access properties), and I'll admit that I haven't
really had to deal with that kind of situation when using my own
pickle-based parallelisation solutions (which involve communicating
processes).

> With "x = posh.share(x)" an object tree will be (deep-)copied to shared mem ( as far as
> objects fulfil some conditions http://poshmodule.sourceforge.net/posh/html/node6.html: is
> this true for numpy arrays?)

My impression is that POSH isn't maintained any more and that work was
needed to make it portable, as you have observed. Some discussions did
occur on one of the Python development mailing lists about the
possibility of using shared memory together with serialisation
representations faster than pickles (which also don't need to be
managed as live objects by Python), and I think that this could be an
acceptable alternative.

> Every object to be inserted in the hot tunnel object tree has to be copied that same style.
> Thus ~pickling, but somewhat easier to use.

If my understanding of "hot tunnel object tree" is correct, you're
really wanting fast access involving mutual exclusion to some shared
data structure. At this point it becomes worth considering some kind of
distributed object technology (or even a database technology) where you
have to initialise the data structure and then communicate with an
object (or database) to perform operations on it, all for reasons I'll
explain shortly.

In your ideal situation, you say that you'd have the data structure in
the same address space as a number of threads, and each thread would be
able to perform some algorithm on the data structure, but the pattern
of accessing the structure isn't an insignificant concern even where
you can assume that the overheads are generally low: reading and
writing things might be fast in the same address space, but if the
nature of access involves lots of locking, you'll incur quite a penalty
anyway. In other words, if each read or write to the structure involves
acquiring a lock for that operation in isolation, this could
significantly diminish performance, whereas if you can guarantee that
the granularity of locking is more coarse than having to occur upon
each read or write access - that there exist some high-level operations
that require consistency within the data structure - then reasonable
performance might be maintained.
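
A toy sketch of that granularity point, using nothing but the standard library (the dict here just stands in for the shared data structure): the first version pays one lock acquire/release per element access, the second pays once for the whole high-level operation.

import threading

lock = threading.Lock()
tree = {'leaves': range(100000)}    # stand-in for the shared structure

def fine_grained_sum():
    # one lock round-trip per read - the per-access overhead described above
    total = 0
    for i in xrange(len(tree['leaves'])):
        lock.acquire()
        try:
            total += tree['leaves'][i]
        finally:
            lock.release()
    return total

def coarse_grained_sum():
    # one lock round-trip for the whole operation
    lock.acquire()
    try:
        return sum(tree['leaves'])
    finally:
        lock.release()

Timing the two with the timeit module makes the difference in locking overhead easy to see.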

However, if the pattern of concurrent access is restricted to coarse
operations, where some entity has exclusive access to a potentially
large dataset, and where the overhead of the communication of inputs
and outputs to and from that entity is low in comparison to the cost of
performing such coarse operations, and where such operations are
themselves performed infrequently, then such a pattern coincides with the
classic database or distributed object scenario. In other words, you
implement the operations acting on the data structure in a distributed
object (or as a database query or operation) and then invoke such
operations from separate processes.

I hope this makes some sense. Generally, demands for high concurrent
performance using threads often ignore other important properties such
as reliability and scalability. Certainly, threads have an important
place - classically, this involved maintaining responsiveness in
graphical user interfaces - but even then, the background threads were
often detached and not competing for the same resources as the
foreground threads servicing the event loop. The only kinds of
situation I can think of right now which might benefit from uninhibited
random access to shared data structures might be things like
simulations or large-scale multi-user games, where an appropriate data
architecture cannot be decided in advance, but even then there are
approaches which attempt to mitigate the seemingly unpredictable nature
of access to shared data.

Paul

Robin Becker

unread,
Nov 9, 2006, 9:39:27 AM11/9/06
to pytho...@python.org
Paul Boddie wrote:

> My impression is that POSH isn't maintained any more and that work was
> needed to make it portable, as you have observed. Some discussions did
> occur on one of the Python development mailing lists about the
> possibility of using shared memory together with serialisation
> representations faster than pickles (which also don't need to be
> managed as live objects by Python), and I think that this could be an
> acceptable alternative.
>

.......

your impression is correct; when I contacted the author a while ago he
maintained that it was a proof of concept only. When I compiled it today with
python-2.4.2 the example fails with this rather cryptic message:

~/devel/posh/examples:
$ python Matrix.py
Traceback (most recent call last):
  File "Matrix.py", line 6, in ?
    import posh
  File "/usr/local/lib/python2.4/site-packages/posh/__init__.py", line 14, in ?
    import _core
RuntimeError: semaphore set creation failed

but that's most likely something to do with not allocating a semaphore of the
right sort.

I think it uses SysV semaphores, and although FreeBSD 6 has them, perhaps there's
something I need to do to allow them to work.


> Paul
>


--
Robin Becker

robert

unread,
Nov 9, 2006, 2:24:35 PM11/9/06
to

Thus there would be a crash if 2 threads use the global variables (module.__dict__) of a module?
I can't imagine that ... it would break most of the reasons for nice Python threading.

-robert

robert

unread,
Nov 9, 2006, 2:31:33 PM11/9/06
to
sturlamolden wrote:
> robert wrote:
>
>> I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.
>
> Threading is not the best way to exploit multiprocessors in this
> context. Threads are not the preferred way of exploiting multiple
> processors in scientific computing.


You keep pointing to the simple bulk jobs in scientific computing where a file/batch interface is enough:
process X works on input file Y and writes output to file Z, and a manager script collects the results and creates a gnuplot diagram.
Such things of course would not raise the question of this discussion - not even the question of threads at all, or of MPI ...

-robert

"Martin v. Löwis"

unread,
Nov 9, 2006, 3:02:06 PM11/9/06
to
robert wrote:

>>> What about speed? Is it true that IronPython is almost as fast as
>>> CPython by now?
>
> Thus there would be a crash if 2 threads use the global variables
> (module.__dict__) of a module?

IronPython uses the .NET virtual machine. That, in itself, gives
consistency guarantees. Read the ECMA IL specification to find
out what the consistency model is.

Regards,
Martin

robert

unread,
Nov 9, 2006, 3:37:53 PM11/9/06
to

Some of these thoughts make sense when one wants to trick the single GIL a little.
But the length of the text tells us again that we should possibly at least get refcounting thread-safe (within access restrictions) - or get rid of refcounting - or go the full path towards a GIL-free Python.

Besides some programming costs, the speed cost would be that of a few LOCK INCs. Rumor says not much - and almost nothing on non-SMP systems. I'd like to see experimental data regarding the speed; I found nothing on the web so far.

And, optimistically, all the basic data structures (obmalloc's freelist, dicts, queues, ...) could be built up lock-free for further optimizations:
http://64.233.183.104/search?q=cache:KUXW1Ya2duwJ:softwareforums.intel.com/ids/board/message%3Fboard.id%3D42%26message.id%3D68+SMP+%22LOCK+INC%22+speed&hl=de&gl=de&ct=clnk&cd=2


Little test:

--------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

volatile int x=0;
clock_t c0,c1;
time_t t0,t1;

int main (int argc, char **argv) {
    int sum = 0, i, count = 1000000000;
    t0 = time(&t0);
    c0 = clock();
    for (i = 0; i < count; ++i) {
        __asm lock inc x;
        // __asm inc x;
        sum += x;
    }
    c1 = clock();
    t1 = time(&t1);
    // printf("%d\n", sum);
    printf("clocks: %d\n", c1 - c0);
    printf("secs: %d\n", t1 - t0);
    return sum;
}


--------------
0040101F mov eax,3B9ACA00h
13: for (i = 0; i < count; ++i) {
14: __asm lock inc x;
00401024 lock inc dword ptr [_x (00408a00)]
15: sum += x;
0040102B mov edx,dword ptr [_x (00408a00)]
00401031 add esi,edx
00401033 dec eax
00401034 jne main+24h (00401024)
16: }
---------------

Results on a UP (uniprocessor) PIII:

INC version:
clocks: 7520
secs: 7

LOCK INC version:
clocks: 36632
secs: 36


It's probably not much...

-robert

Andrew MacIntyre

unread,
Nov 9, 2006, 4:57:48 PM11/9/06