par i in list:
    updatePartition(i)
There would be no locking and it would be the programmer's
responsibility to ensure that the loop was truly parallel and correct.
The intention of this would be to speed up Python execution on multi-
core platforms. Within a few years we will see 100+ core processors as
standard and we need to be ready for that.
There could also be parallel versions of map, filter and reduce
provided.
BUT...none of this would be possible with the current implementation
of Python with its Global Interpreter Lock, which effectively rules
out true parallel processing.
See: http://jessenoller.com/2009/02/01/python-threads-and-the-global-interpreter-lock/
What do others think?
Jeremy Martin
par i in list:
    for j in iterations:
        updatePartition(i)
        sync
        commitBoundaryValues(i)
        sync
This example is a typical iteration over a grid, e.g. a finite elements
calculation, where the boundary values need to be read by neighbouring
partitions before they are updated. It assumes that the new values of
the boundary values are stored in temporary variables until they can
be safely updated.
Jeremy
> From a user point of view I think that adding a 'par' construct to
> Python for parallel loops would add a lot of power and simplicity, e.g.
>
> par i in list:
>     updatePartition(i)
>
> There would be no locking and it would be the programmer's
> responsibility to ensure that the loop was truly parallel and correct.
What does 'par' actually do there?
Given that it is the programmer's responsibility to ensure that
updatePartition was actually parallelized, couldn't that be written as:
for i in list:
    updatePartition(i)
and save a keyword?
--
Steven
My reading of the OP is that it tells the interpreter that it
can execute any/all iterations of updatePartition(i) in parallel
(or presumably serially in any order) rather than serially in a
strict sequence.
> Given that it is the programmer's responsibility to ensure
> that updatePartition was actually parallelized, couldn't that
> be written as:
>
> for i in list:
>     updatePartition(i)
>
> and save a keyword?
No, because a "for" loop is defined to execute it's iterations
serially in a specific order. OTOH, a "par" loop is required
to execute once for each value, but those executions could
happen in parallel or in any order.
At least that's how I understood the OP.
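To make the distinction concrete, here is a minimal sketch (an illustration only,
not the OP's proposal) of what "par i in list: updatePartition(i)" might amount
to if every iteration were handed to its own thread, assuming the iterations
really are independent and need no locking:

import threading

def par(items, body):
    # One thread per iteration; no locking, no guaranteed order.
    threads = [threading.Thread(target=body, args=(i,)) for i in items]
    for t in threads:
        t.start()
    for t in threads:
        t.join()    # block until every iteration has finished

# e.g. par(partition_ids, updatePartition), where partition_ids stands
# for the OP's 'list'.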
--
Grant
The multiprocessing module in Python 2.6 already does something like what
you are talking about. For example I have used the parallel map of
that module to almost double the speed of a small program of mine.
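For anyone who hasn't seen it, that parallel map looks roughly like this
(the function and its arguments have to be picklable, and by default the
pool starts one worker process per core):

from multiprocessing import Pool

def work(n):
    return 3*n - 2          # stand-in for an expensive, independent calculation

if __name__ == '__main__':
    pool = Pool()           # one worker process per core by default
    results = pool.map(work, range(1000))
    pool.close()
    pool.join()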
Bye,
bearophile
I can try guessing what the OP is thinking just as well as anyone else,
but "in the face of ambiguity, refuse the temptation to guess" :)
It isn't clear to me what the OP expects the "par" construct is supposed
to actually do. Does it create a thread for each iteration? A process?
Something else? Given that the rest of Python will be sequential (apart
from explicitly parallelized functions), and that the OP specifies that
updatePartition still needs to handle its own parallelization, does it
really matter if the calls to updatePartition happen sequentially?
If it's important to make the calls in arbitrary order, random.shuffle
will do that. If there's some other non-sequential and non-random order
to the calls, the OP should explain what it is. What else, if anything,
does par do, that it needs to be a keyword and statement rather than a
function? What does it do that (say) a parallel version of map() wouldn't
do?
The OP also suggested:
"There could also be parallel versions of map, filter and reduce
provided."
It makes sense to talk about parallelizing map(), because you can
allocate a list of the right size to slot the results into as they become
available. I'm not so sure about filter(), unless you give up the
requirement that the filtered results occur in the same order as the
originals.
But reduce()? I can't see how you can parallelize reduce(). By its
nature, it has to run sequentially: it can't operate on the nth item
until it has operated on the (n-1)th item.
--
Steven
That depends on the operation in question. Addition for example would
work. My math-skills are a bit too rusty to qualify the exact nature of
the operation, commutativity springs to my mind.
Diez
Well, if you're willing to impose the additional constraint that f() must
be associative, then you could load the items into a tree, and work your
way up from the bottom of the tree, applying f() pairwise to the left and
right child of each node, propagating upward.
It would take k1 * O(n) to create the (unsorted) tree, and if all the pairs
in each layer really could be done in parallel, k2 * O(lg n) to propagate
the intermediate values. As long as k2 is large compared to k1, you win.
Of course, if the items are already in some random-access container (such
as a list), you don't even need to do the first step, but in the general
case of generating the elements on the fly with an iterable, you do. Even
with an iterable, you could start processing the first elements while
you're still generating the rest of them, but that gets a lot more
complicated and assuming k2 >> k1, of limited value.
If k2 is about the same as k1, then the whole thing is pointless.
But, this would be something to put in a library function, or maybe a
special-purpose Python derivative, such as numpy. Not in the core language.
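A minimal sketch of that tree-shaped evaluation, written sequentially here just
to show the structure; the pairwise applications within each layer are the part
that could be farmed out in parallel, and the whole thing is only valid if f is
associative:

def tree_reduce(f, items):
    values = list(items)
    if not values:
        raise TypeError("tree_reduce() of empty sequence")
    while len(values) > 1:
        # Combine adjacent pairs; these calls are independent of each other.
        reduced = [f(a, b) for a, b in zip(values[0::2], values[1::2])]
        if len(values) % 2:
            reduced.append(values[-1])   # odd item is carried up unchanged
        values = reduced
    return values[0]

# e.g. tree_reduce(operator.add, range(10)) == sum(range(10)) == 45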
That should read "associative" not "commutative".
For instance A+B+C+D could be calculated sequentially as implied by
((A+B)+C)+D
or with some parallelism as implied by
(A+B)+(C+D)
That's an application of the associativity of addition.
Gary Herron
>> But reduce()? I can't see how you can parallelize reduce(). By its
>> nature, it has to run sequentially: it can't operate on the nth item
>> until it has operated on the (n-1)th item.
>
> That depends on the operation in question. Addition for example would
> work.
You'd think so, but you'd be wrong. You can't assume addition is always
commutative.
>>> reduce(operator.add, (1.0, 1e57, -1e57))
0.0
>>> reduce(operator.add, (1e57, -1e57, 1.0))
1.0
> My math-skills are a bit too rusty to qualify the exact nature of
> the operation, commutativity springs to my mind.
And how is reduce() supposed to know whether or not some arbitrary
function is commutative?
--
Steven
>> But reduce()? I can't see how you can parallelize reduce(). By its
>> nature, it has to run sequentially: it can't operate on the nth item
>> until it has operated on the (n-1)th item.
>>
> It can calculate the items in parallel,
I don't understand what calculation you are talking about. Let's take a
simple example:
reduce(operator.sub, [100, 50, 25, 5]) => 100-50-25-5 = 20
What calculations do you expect to do in parallel?
> but the final result must be
> calculated in sequence, although if the final operation is commutative then
> some of them could be done in parallel.
But reduce() can't tell whether the function being applied is commutative
or not. I suppose it could special-case a handful of special cases (e.g.
operator.add for int arguments -- but not floats!) or take a caller-
supplied argument that tells it whether the function is commutative or
not. But in general, you can't assume the function being applied is
commutative or associative, so unless you're willing to accept undefined
behaviour, I don't see any practical way of parallelizing reduce().
--
Steven
I don't recall anybody saying it should know that - do you? The OP wants
to introduce parallel variants, not replace the existing ones.
Diez
I was thinking about calculating the sum of a list of expressions, where
the expressions could be calculated in parallel.
def reduce(operation, sequence, startitem=None, parallelize=False)
should be enough. Approaches such as OpenMP also don't guess, they use
explicit annotations.
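A sketch of what that signature might amount to, assuming that passing
parallelize=True is the caller's assertion that the operation is associative
(the function itself makes no attempt to check):

def preduce(operation, sequence, startitem=None, parallelize=False):
    items = list(sequence)
    if startitem is not None:
        items.insert(0, startitem)
    if not items:
        raise TypeError("preduce() of empty sequence with no start item")
    if not parallelize:
        return reduce(operation, items)      # ordinary left-to-right reduce
    # Caller has promised the operation is associative: evaluate as a tree,
    # where each layer's pairwise applications could run concurrently.
    while len(items) > 1:
        layer = [operation(a, b) for a, b in zip(items[0::2], items[1::2])]
        if len(items) % 2:
            layer.append(items[-1])
        items = layer
    return items[0]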
Diez
You can do this right now with a small amount of work to make
updatePartition a callable which works in parallel, and without the
need for extra syntax. For example, with the pprocess module, you'd
use boilerplate like this:
import pprocess
queue = pprocess.Queue(limit=ncores)
updatePartition = queue.manage(pprocess.MakeParallel(updatePartition))
(See http://www.boddie.org.uk/python/pprocess/tutorial.html#Map for
details.)
At this point, you could use a normal "for" loop, and you could then
"sync" for results by reading from the queue. I'm sure it's a similar
story with the multiprocessing/processing module.
> There would be no locking and it would be the programmer's
> responsibility to ensure that the loop was truly parallel and correct.
Yes, that's the idea.
> The intention of this would be to speed up Python execution on multi-
> core platforms. Within a few years we will see 100+ core processors as
> standard and we need to be ready for that.
In what sense are we not ready? Perhaps the abstractions could be
better, but it's definitely possible to run Python code on multiple
cores today and get decent core utilisation.
> There could also be parallel versions of map, filter and reduce
> provided.
Yes, that's what pprocess.pmap is for, and I imagine that other
solutions offer similar facilities.
> BUT...none of this would be possible with the current implementation
> of Python with its Global Interpreter Lock, which effectively rules
> out true parallel processing.
>
> See:http://jessenoller.com/2009/02/01/python-threads-and-the-global-inter...
>
> What do others think?
That your last statement is false: true parallel processing is
possible today. See the Wiki for a list of solutions:
http://wiki.python.org/moin/ParallelProcessing
In addition, Jython and IronPython don't have a global interpreter
lock, so you have the option of using threads with those
implementations, too.
Paul
Did I really need to spell it out? From context, I'm talking about the
*parallel version* of reduce().
--
Steven
It would be nice if the OP would speak up and tell us what he intended,
so we didn't have to guess what he meant. We're getting further and
further away from his original suggestion of a "par" loop.
If you pass parallize=True, then what? Does it assume that operation is
associative, or take some steps to ensure that it is? Does it guarantee
to perform the operations in a specific order, or will it potentially
give non-deterministic results depending on the order that individual
calculations come back?
As I said earlier, parallelizing map() sounds very plausible to me, but
the approaches that people have talked about for parallelizing reduce()
so far sound awfully fragile and magical to me. But at least I've
learned one thing: given an associative function, you *can* parallelize
reduce using a tree. (Thanks Roy!)
--
Steven
Hi Paul and others,
Thanks for your responses to my original questions.
Paul, thanks for explaining about the pprocess module which appears
very useful. I presume that this is using multiple operating system
processes rather than threads which would probably imply that it is
suitable for coarse grained parallel programming rather than fine-
grained because of overhead in starting up new processes and sharing
objects. (How is that done, by the way?). It probably has advantages
and disadvantages compared with thread based parallelism.
My suggestion is primarily about using multiple threads and sharing
memory - something akin to the OpenMP directives that one of you has
mentioned. To do this efficiently would involve removing the Global
Interpreter Lock, or switching to Jython or Iron Python as you
mentioned.
However I *do* actually want to add syntax to the language. I think
that 'par' makes sense as an official Python construct - we already
have had this in the Occam programming language for twenty-five years.
The reason for this is ease of use. I would like to make it easy for
amateur programmers to exploit natural parallelism in their
algorithms. For instance somebody who wishes to calculate a property
of each member from a list of chemical structures using the Python
Daylight interface: with my suggestion they could potentially get a
massive speed up just by changing 'for' to 'par' or 'map' to 'pmap'.
(Or map with a parallel keyword argument set as suggested). At present
they would have to manually chop up their work and run it as multiple
processes in order to achieve the same - fine for expert programmers
but not reasonable for people working in other domains who wish to use
Python as a utility because of its fantastic productivity and ease of
use.
Let me clarify what I think par, pmap, pfilter and preduce would mean
and how they would be implemented. A par loop is like a for loop,
however the programmer is saying that the order in which the
iterations are performed doesn't matter and they might be performed in
parallel. The python system then has the option to allocate a number
of threads to the task and share out the iterations accordingly
between the threads. (It might be that the programmer should be
allowed to explicitly define the number of threads to use or can
delegate that decision to the system). Parallel pmap and pfilter would
be implemented in much the same way, although the resultant list might
have to be reassembled from the partial results returned from each
thread. As people have pointed out, parallel reduce is a tricky option
because it requires the binary operation to be associative in which
case it can be parallelised by calculating the result using a tree-
based evaluation strategy.
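As an illustration of the pfilter case, a sketch that evaluates the predicate
for every item concurrently (one thread per item here, purely to keep the
sketch short; a real implementation would share the work among a fixed gang of
threads) and then reassembles the survivors in their original order:

import threading

def pfilter(predicate, seq):
    items = list(seq)
    keep = [False] * len(items)

    def worker(index, value):
        keep[index] = bool(predicate(value))   # each index written by one thread only

    threads = [threading.Thread(target=worker, args=(i, v))
               for i, v in enumerate(items)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [v for v, flag in zip(items, keep) if flag]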
I have used all of OpenMP, MPI, and Occam in the past. OpenMP adds
parallelism to programs by the use of special comment strings, MPI by
explicit calls to library routines, and Occam by explicit syntactical
structures. Each has its advantages. I like the simplicity of OpenMP,
the cross-language portability of MPI and the fact the concurrency is
built in to the Occam language. What I am proposing here is a hybrid
of the OpenMP and Occam approaches - a change to the language which is
very natural and yet is easy for programmers to understand.
Concurrency is generally regarded as the hardest concept for
programmers to grasp.
Jeremy
From what context - your persistent arguing that there is no way for reduce
to determine if the passed operation is associative (which no one claimed
to be possible, nor to be needed)? No, from *that* context it wasn't clear.
It's the programmer who decides which version to use.
Diez
> My suggestion is primarily about using multiple threads and sharing
> memory - something akin to the OpenMP directives that one of you has
> mentioned. To do this efficiently would involve removing the Global
> Interpreter Lock, or switching to Jython or Iron Python as you
> mentioned.
>
> However I *do* actually want to add syntax to the language.
Good luck with that. The GIL is not going away any time soon (or
probably ever) and as long as CPython is the "official"
implementation, there are almost zero chances of adding syntax support
for this. Besides, Guido and other py-devs are not particularly keen
on threads as a parallelization mechanism.
George
I can understand you having a preference, but you may have to choose
between fighting over that method or achieving results. I agree with ...
> Good luck with that. The GIL is not going away any time soon (or
> probably ever) and as long as CPython is the "official"
> implementation, there are almost zero chances of adding syntax support
> for this. Besides, Guido and other py-devs are not particularly keen
> on threads as a parallelization mechanism.
Parallel processes can run on multiple processors as well as multiple
cores within a processor. Some problems, like massive search, require
multiple disks (or memories, or IO ports) as well as multiple processing
units. There is debate over how useful massively multicore processors
will actually be and for which types of problems.
tjr
Hi George,
> The GIL is not going away any time soon (or probably ever) and as long as CPython is
> the "official" implementation, there are almost zero chances of adding syntax support
> for this.
What concerns me about this statement is that, if it is true, Python
risks falling behind when other languages which can exploit multicore
effectively start to come to the fore. I know that Microsoft is
actively researching in this area and they are hoping that F# will
offer good ways to exploit multi-core architectures.
As I understand it the reason for the GIL is to prevent problems with
garbage collection in multi-threaded applications. Without it the
reference count method is prone to race conditions. However it seems
like a fairly crude mechanism to solve this problem. Individual
semaphores could be used for each object reference counter, as in
Java.
Jeremy
Hi Terry,
> Parallel processes can run on multiple processors as well as multiple
> cores within a processor. Some problems, like massive search, require
> multiple disks (or memories, or IO ports) as well as multiple processing
> units. There is debate over how useful massively multicore processors
> will actually be and for which types of problems.
I agree with this. My approach is in the same space as OpenMP - a
simple way for users to define shared memory parallelism. There is no
reason why it would not work with multiple disks or IO ports on the
same shared memory server. However for distributed memory hardware it
would be a non-starter. In that case we would need something like the
Message Passing Interface.
Jeremy
Thanks for your interesting response!
> Paul, thanks for explaining about the pprocess module which appears
> very useful. I presume that this is using multiple operating system
> processes rather than threads which would probably imply that it is
> suitable for coarse grained parallel programming rather than fine-
> grained because of overhead in starting up new processes and sharing
> objects. (How is that done, by the way?). It probably has advantages
> and disadvantages compared with thread based parallelism.
Communication via interprocess channels is done using pickled objects.
For large data volumes, the most efficient approach can actually
involve the filesystem. My opinion on the threads vs. processes debate
centres on the relative efficiency of creating processes on systems
which have evolved to minimise the cost of doing so: people seem to
believe that forking a process creates an entirely new copy, but this
is unlikely to be the case on most modern, popular general-purpose
operating systems.
> My suggestion is primarily about using multiple threads and sharing
> memory - something akin to the OpenMP directives that one of you has
> mentioned. To do this efficiently would involve removing the Global
> Interpreter Lock, or switching to Jython or Iron Python as you
> mentioned.
One could always share memory using processes: I think the ease of
doing so is the only argument for using threads, and the caveats
involved obviously lead us back to the global interpreter lock (in the
threaded situation).
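For what it's worth, a small sketch of sharing memory between processes with
the multiprocessing module's shared ctypes arrays, with each worker writing
only its own slice so that no locking is needed:

from multiprocessing import Process, Array

def fill(shared, start, stop):
    for i in range(start, stop):
        shared[i] = 3*i - 2        # disjoint slices, so no lock required

if __name__ == '__main__':
    data = Array('d', 100, lock=False)   # 100 shared doubles, unsynchronised
    workers = [Process(target=fill, args=(data, 0, 50)),
               Process(target=fill, args=(data, 50, 100))]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    print data[10]                       # 28.0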
> However I *do* actually want to add syntax to the language. I think
> that 'par' makes sense as an official Python construct - we already
> have had this in the Occam programming language for twenty-five years.
> The reason for this is ease of use. I would like to make it easy for
> amateur programmers to exploit natural parallelism in their
> algorithms. For instance somebody who wishes to calculate a property
> of each member from a list of chemical structures using the Python
> Daylight interface: with my suggestion they could potentially get a
> massive speed up just by changing 'for' to 'par' or 'map' to 'pmap'.
> (Or map with a parallel keyword argument set as suggested). At present
> they would have to manually chop up their work and run it as multiple
> processes in order to achieve the same - fine for expert programmers
> but not reasonable for people working in other domains who wish to use
> Python as a utility because of its fantastic productivity and ease of
> use.
Well, you have to know that the components actually lend themselves to
parallelisation, and the "chopping up" of the work would be similar to
that involved with one of the parallel processing libraries today: you
call something in each iteration that actually goes off with its own
piece of the data and runs in parallel.
> Let me clarify what I think par, pmap, pfilter and preduce would mean
> and how they would be implemented. A par loop is like a for loop,
> however the programmer is saying that the order in which the
> iterations are performed doesn't matter and they might be performed in
> parallel.
Actually, with pprocess's abstractions, the iteration ordering doesn't
really matter, either, although the processes are dispatched in order.
It's when the results are collected that the ordering matters: the
difference between maps and queues. (I suppose it's similar with other
libraries.)
> The python system then has the option to allocate a number
> of threads to the task and share out the iterations accordingly
> between the threads. (It might be that the programmer should be
> allowed to explicitly define the number of threads to use or can
> delegate that decision to the system). Parallel pmap and pfilter would
> be implemented in much the same way, although the resultant list might
> have to be reassembled from the partial results returned from each
> thread. As people have pointed out, parallel reduce is a tricky option
> because it requires the binary operation to be associative in which
> case it can be parallelised by calculating the result using a tree-
> based evaluation strategy.
The sharing out of tasks to threads or execution contexts is done
using various abstractions in pprocess which probably resemble the
"pool" abstraction in the multiprocessing library: only as tasks are
completed are the free resources allocated to new tasks. A parallel
reduce abstraction doesn't exist in pprocess, but you could roll your
own.
> I have used all of OpenMP, MPI, and Occam in the past. OpenMP adds
> parallelism to programs by the use of special comment strings, MPI by
> explicit calls to library routines, and Occam by explicit syntactical
> structures. Each has its advantages. I like the simplicity of OpenMP,
> the cross-language portability of MPI and the fact the concurrency is
> built in to the Occam language. What I am proposing here is a hybrid
> of the OpenMP and Occam approaches - a change to the language which is
> very natural and yet is easy for programmers to understand.
> Concurrency is generally regarded as the hardest concept for
> programmers to grasp.
It's interesting to hear your experiences on this topic. I still don't
see the need for special syntax; if you were to propose more pervasive
changes, like arbitrary evaluation ordering for function arguments or
the ability to prevent/detect side-effects, the motivation would be
clearer, I think.
Paul
> However I *do* actually want to add syntax to the language. I think that
> 'par' makes sense as an official Python construct - we already have had
> this in the Occam programming language for twenty-five years. The reason
> for this is ease of use. I would like to make it easy for amateur
> programmers to exploit natural parallelism in their algorithms. For
> instance somebody who wishes to calculate a property of each member from
> a list of chemical structures using the Python Daylight interface: with
> my suggestion they could potentially get a massive speed up just by
> changing 'for' to 'par' or 'map' to 'pmap'. (Or map with a parallel
> keyword argument set as suggested). At present they would have to
> manually chop up their work and run it as multiple processes in order to
> achieve the same - fine for expert programmers but not reasonable for
> people working in other domains who wish to use Python as a utility
> because of its fantastic productivity and ease of use.
There seems to be some discrepancy between this, and what you wrote in
your first post:
"There would be no locking and it would be the programmer's
responsibility to ensure that the loop was truly parallel and correct."
So on the one hand, you want par to be utterly simple-minded and to do no
locking. On the other hand you want it so simple to use that amateurs can
mechanically replace 'for' with 'par' in their code and everything will
Just Work, no effort or thought required. But those two desires are
inconsistent.
Concurrency is an inherently complicated problem: deadlocks and race
conditions abound, and are notoriously hard to reproduce, let alone
debug. If par is simple, and does no locking, then the programmer needs
to worry about those complications. If you want programmers to ignore
those complications, either (1) par needs to be very complicated and
smart, to do the Right Thing in every case, or (2) you're satisfied if
par produces buggy code when used in the fashion you recommend.
The third option is, make par really simple and put responsibility on the
user to write code which is concurrent. I think that's the right
solution, but it means a simplistic "replace `for` with `par` and your
code will run faster" will not work. It might run faster three times out
of five, but the other two times it will hang in a deadlock, or produce
incorrect results, or both.
--
Steven
Not really. Its main purpose is to prevent context switches from
happening in the middle of some C code (essentially it makes all C
code a "critical" section, except when the C code explicitly releases
the GIL). This tremendously simplifies the task of writing extension
modules, not to mention the Python core.
> Without it the
> reference count method is prone to race conditions. However it seems
> like a fairly crude mechanism to solve this problem. Individual
> semaphores could be used for each object reference counter, as in
> Java.
Java doesn't use reference counting.
Individual locks on the refcounts would be prohibitively expensive in
Python, a cost that Java doesn't have.
Even if you decided to accept the penalty and add locking to
refcounts, you still have to be prepared for context switching at any
time when writing C code, which means in practice you have to lock any
object that's being accessed--that's in addition to the refcount lock.
I am not disagreeing with your assessment in general, it would be
great if Python were better able to take advantage of multiple cores,
but it's not as simple a thing as you're making it out to be.
Carl Banks
> Let me clarify what I think par, pmap, pfilter and preduce would mean
> and how they would be implemented.
[...]
Just for fun, I've implemented a parallel-map function, and done a couple
of tests. Comments, criticism and improvements welcome!
import threading
import Queue
import random
import time

def f(arg):  # Simulate a slow function.
    time.sleep(0.5)
    return 3*arg-2

class PMapThread(threading.Thread):
    def __init__(self, clients):
        super(PMapThread, self).__init__()
        self._clients = clients
    def start(self):
        super(PMapThread, self).start()
    def run(self):
        while True:
            try:
                data = self._clients.get_nowait()
            except Queue.Empty:
                break
            target, where, func, arg = data
            result = func(arg)
            target[where] = result

class VerbosePMapThread(PMapThread):
    def __init__(self, clients):
        super(VerbosePMapThread, self).__init__(clients)
        print "Thread %s created at %s" % (self.getName(), time.ctime())
    def start(self):
        super(VerbosePMapThread, self).start()
        print "Thread %s starting at %s" % (self.getName(), time.ctime())
    def run(self):
        super(VerbosePMapThread, self).run()
        print "Thread %s finished at %s" % (self.getName(), time.ctime())

def pmap(func, seq, verbose=False, numthreads=4):
    size = len(seq)
    results = [None]*size
    if verbose:
        print "Initiating threads"
        thread = VerbosePMapThread
    else:
        thread = PMapThread
    datapool = Queue.Queue(size)
    for i in xrange(size):
        datapool.put( (results, i, func, seq[i]) )
    threads = [thread(datapool) for i in xrange(numthreads)]
    if verbose:
        print "All threads created."
    for t in threads:
        t.start()
    # Block until all threads are done.
    while any([t.isAlive() for t in threads]):
        if verbose:
            time.sleep(0.25)
            print results
    return results
And here are the timing results:
>>> from timeit import Timer
>>> setup = "from __main__ import pmap, f; data = range(50)"
>>> min(Timer('map(f, data)', setup).repeat(repeat=5, number=3))
74.999755859375
>>> min(Timer('pmap(f, data)', setup).repeat(repeat=5, number=3))
20.490942001342773
--
Steven
My only comment would be that your "slow function" might not be
a very good simulation of the general case, since it uses
time.sleep() which releases the GIL:
> def f(arg):  # Simulate a slow function.
>     time.sleep(0.5)
>     return 3*arg-2
Any Python function that isn't calling a library function
written in C that releases the GIL won't show any speedup will
it? I don't have a multi-core machine to try it on, but what
happens when you replace your "slow function" code with
something that actually burns CPU using pure-Python code
instead of blocking on a timer in the OS?
--
Grant
Hi Steven,
> you want it so simple to use that amateurs can mechanically replace 'for' with 'par' in their
> code and everything will Just Work, no effort or thought required.
Yes I do want the par construction to be simple, but of course you
can't just replace a for loop with a par loop in the general case.
This issue arises when people use OpenMP: you can take a correct piece
of code, add a comment to indicate that a loop is 'parallel', and if
you get it wrong the code will no longer work correctly. With my 'par'
construct the programmer's intention is made explicit in the code,
rather than by a compiler directive and so I think that is clearer
than OpenMP.
As I wrote before, concurrency is one of the hardest things for
professional programmers to grasp. For 'amateur' programmers we need
to make it as simple as possible, and I think that a parallel loop
construction and the dangers that lurk within would be reasonably
straightforward to explain: there are no locks to worry about, no
message passing. The only advanced concept is the 'sync' keyword,
which would be used to rendezvous all the threads. That would only be
used to speed up certain codes in order to avoid having to repeatedly
shut down and start up gangs of threads.
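For reference, the rendezvous behaviour described for 'sync' is essentially a
reusable barrier; a sketch of how one can be built from threading.Condition
(hand-rolled here purely to show the idea):

import threading

class Barrier(object):
    """Reusable rendezvous point for a fixed-size gang of threads."""
    def __init__(self, nthreads):
        self._nthreads = nthreads
        self._count = 0
        self._generation = 0
        self._cond = threading.Condition()

    def sync(self):
        self._cond.acquire()
        try:
            generation = self._generation
            self._count += 1
            if self._count == self._nthreads:
                # Last thread to arrive releases the whole gang.
                self._count = 0
                self._generation += 1
                self._cond.notifyAll()
            else:
                while generation == self._generation:
                    self._cond.wait()
        finally:
            self._cond.release()

# Each thread in the gang would call barrier.sync() wherever the proposed
# 'sync' statement appears.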
Jeremy
Hi Carl,
> I am not disagreeing with your assessment in general, it would be
> great if Python were better able to take advantage of multiple cores,
> but it's not as simple a thing as you're making it out to be.
Thanks for explaining a few things to me. So it would seem that
replacing the GIL with something which allows better scalability of
multi-threaded applications would be very complicated. The article by
Jesse Noller which I referenced in my original posting includes the
following:
"In 1999 Greg Stein created a patch set for the interpreter that
removed the GIL, but added granular locking around sensitive
interpreter operations. This patch set had the direct effect of
speeding up threaded execution, but made single threaded execution two
times slower."
Source: http://jessenoller.com/2009/02/01/python-threads-and-the-global-interpreter-lock/
That was ten years ago - do you have any idea as to how things have
been progressing in this area since then?
Jeremy
> "In 1999 Greg Stein created a patch set for the interpreter that
> removed the GIL, but added granular locking around sensitive
> interpreter operations. This patch set had the direct effect of
> speeding up threaded execution, but made single threaded execution two
> times slower."
>
> Source: http://jessenoller.com/2009/02/01/python-threads-and-the-global-interpreter-lock/
>
> That was ten years ago - do you have any idea as to how things have
> been progressing in this area since then?
The slowdown may be less severe, but known attempts to replace, rather
than merely remove, the GIL slow down single-threaded execution. Otherwise,
the GIL would be gone. No one has been willing to maintain a multi-thread
fork/branch.
Hi Steven,
I am impressed by this - it shows the potential speedup that pmap
could give, although the GIL would be a problem as things stand for
speeding up pure Python code. Do Jython and IronPython include the
threading module?
Jeremy
I have coded up a (slightly) more realistic example. Here is some code to
implement the numerical solution to Laplace's equation, which can
calculate the values of a potential field across a rectangular region
given fixed boundary values:
xmax = 200
ymax = 200
niterations = 200

# Initialisation
old = [[0.0 for y in range(ymax)] for x in range(xmax)]
for x in range(xmax):
    old[x][0] = 1.0
for y in range(ymax):
    old[0][y] = 1.0
new = [[old[x][y] for y in range(ymax)] for x in range(xmax)]

# Iterations
for i in range(1, 100):
    print "Iteration: ", i
    for x in range(1, xmax-1):
        for y in range(1, ymax-1):
            new[x][y] = \
                0.25*(old[x-1][y] + old[x+1][y] + old[x][y-1] + old[x][y+1])
    # Swap over the new and old storage arrays
    tmp = old
    old = new
    new = tmp

# Print results
for y in range(ymax):
    for x in range(xmax):
        print str(old[x][y]).rjust(7),
    print
Parallelising this with the par construct would require a single
alteration to the iteration section:
for i in range(1, 100):
    print "Iteration: ", i
    par x in range(1, xmax-1):
        for y in range(1, ymax-1):
            new[x][y] = \
                0.25*(old[x-1][y] + old[x+1][y] + old[x][y-1] + old[x][y+1])
    # Swap over the new and old storage arrays
    tmp = old
    old = new
    new = tmp
The par command tells Python that it may choose to fire up multiple
threads and automatically partition the data between them. So, for
instance, if there were ten threads created each one would work on a
sub-range of x values: thread 1 takes x from 1 to 20, thread 2 takes
x from 21 to 40, etc.
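A sketch of that partitioning, using a hypothetical helper (par_range is not
part of any proposal or library) that splits the x range into one contiguous
chunk per thread and runs a row-update callable over each chunk in its own
thread:

import threading

def par_range(start, stop, nthreads, body):
    chunk = (stop - start + nthreads - 1) // nthreads   # ceiling division
    def worker(lo, hi):
        for x in range(lo, hi):
            body(x)
    threads = []
    for k in range(nthreads):
        lo = start + k*chunk
        hi = min(lo + chunk, stop)
        if lo < hi:
            threads.append(threading.Thread(target=worker, args=(lo, hi)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# The 'par x' loop above would then behave like:
#   par_range(1, xmax-1, 10, lambda x: update_row(x, old, new))
# where update_row is a (hypothetical) function containing the inner y loop.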
However with this approach there is an overhead in each iteration of
starting up the threads and shutting them down again. Depending on the
impact of this overhead, it might be better to keep the threads
running between iterations by modifying the code like this, adding a
'sync' command to synchronise the threads at the end of each iteration
and also making sure that only one of the threads performs the swap of
the data arrays.
par x in range(1, xmax-1):
    for i in range(1, 100):
        if __thread__ == 0: print "Iteration: ", i
        for y in range(1, ymax-1):
            new[x][y] = \
                0.25*(old[x-1][y] + old[x+1][y] + old[x][y-1] + old[x][y+1])
        # Swap over the new and old storage arrays
        sync
        if __thread__ == 0:
            tmp = old
            old = new
            new = tmp
        sync
Jeremy
Jython does, and AFAIK IronPython does too. Jython also has no GIL, I think;
for IronPython I don't know.
Diez
Hi, joining late.
My first guess is that you haven't considered the existing syntactic
variants.
for i in list:
    fireandforget( updatePartition )( i )

for i in list:
    with fireandforget( ) as x:
        x.call( updatePartition )( i )

@fireandforget
def fun( i ):
    updatePartition( i )

for i in list:
    fun( i )
Are you suggesting only syntactic sugar, or a true semantic change to
make parallel programming easier?
>> you want it so simple to use that amateurs can mechanically replace
>> 'for' with 'par' in their code and everything will Just Work, no effort
>> or thought required.
>
> Yes I do want the par construction to be simple, but of course you can't
> just replace a for loop with a par loop in the general case.
But that's exactly what you said you wanted people to be able to do:
"with my suggestion they could potentially get a massive speed up just by
changing 'for' to 'par' or 'map' to 'pmap'."
I am finding this conversation difficult because it seems to me you don't
have a consistent set of requirements.
> This issue
> arises when people use OpenMP: you can take a correct piece of code, add
> a comment to indicate that a loop is 'parallel', and if you get it wrong
> the code will no longer work correctly.
How will 'par' be any different? It won't magically turn code with
deadlocks into bug-free code.
> With my 'par' construct the
> programmer's intention is made explicit in the code, rather than by a
> compiler directive and so I think that is clearer than OpenMP.
A compiler directive is just as clear about the programmer's intention as
a keyword. Possibly even more so.
#$ PARALLEL-LOOP
for x in seq:
    do(x)
Seems pretty obvious to me. (Not that I'm suggesting compiler directives
is a good solution to this problem.)
> As I wrote before, concurrency is one of the hardest things for
> professional programmers to grasp. For 'amateur' programmers we need to
> make it as simple as possible,
The problem is that "as simple as possible" is Not Very Simple. There's
no getting around the fact that concurrency is inherently complex. In
some special cases, you can keep it simple, e.g. parallel-map with a
function that has no side-effects. But in the general case, no, you can't
avoid dealing with the complexity, at least a little bit.
> and I think that a parallel loop
> construction and the dangers that lurk within would be reasonably
> straightforward to explain: there are no locks to worry about, no
> message passing.
It's *already* easy to explain. And having explained it, you still need
to do something about it. You can't just say "Oh well, I've had all the
pitfalls explained to me, so now I don't have to actually do anything
about avoiding those pitfalls". You still need to actually avoid them.
For example, you can choose one of four tactics:
(1) the loop construct deals with locking;
(2) the caller deals with locking;
(3) nobody deals with locking, therefore the code is buggy and risks
deadlocks; or
(4) the caller is responsible for making sure he never shares data while
looping over it.
I don't think I've missed any possibilities. You have to pick one of
those four.
> The only advanced concept is the 'sync' keyword, which
> would be used to rendezvous all the threads. That would only be used to
> speed up certain codes in order to avoid having to repeatedly shut down
> and start up gangs of threads.
So now you want a second keyword as well.
--
Steven
> On 2009-05-19, Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au>
> wrote:
>> On Mon, 18 May 2009 02:27:06 -0700, jeremy wrote:
>>
>>> Let me clarify what I think par, pmap, pfilter and preduce would mean
>>> and how they would be implemented.
>> [...]
>>
>> Just for fun, I've implemented a parallel-map function, and done a
>> couple of tests. Comments, criticism and improvements welcome!
>
> My only comment would be that your "slow function" might not be a very
> good simulation of the general case, since it uses time.sleep() which
> releases the GIL:
I didn't expect my code to magically overcome fundamental limitations of
the CPython interpreter :)
>> def f(arg):  # Simulate a slow function.
>>     time.sleep(0.5)
>>     return 3*arg-2
>
> Any Python function that isn't calling a library function written in C
> that releases the GIL won't show any speedup will it?
Not necessarily. Here's another function, that uses a loop instead of
sleep.
def g(arg, SIZE=8*10**6):
    # Default SIZE is chosen so that on my machine, the loop
    # takes approximately 0.5 second.
    for x in xrange(SIZE):
        pass
    return 3*arg-2
>>> setup = 'from __main__ import pmap, g; data = range(50)'
>>> min(Timer('map(g, data)', setup).repeat(repeat=5, number=3))
65.093590974807739
>>> min(Timer('pmap(g, data)', setup).repeat(repeat=5, number=3))
20.268381118774414
However, if the function is fast enough that the GIL doesn't get a chance
to be released often enough, then pmap() is a pessimization:
>>> def g(arg):
...     return 3*arg+2
...
>>> min(Timer('map(g, data)', setup).repeat(repeat=5, number=3))
0.00012803077697753906
>>> min(Timer('pmap(g, data)', setup).repeat(repeat=5, number=3))
19.960626125335693
Concurrency doesn't come for free. There are setup and teardown costs,
and management costs. Presumably if pmap() was written more cleverly, in
C, those costs could be reduced, but not eliminated.
> I don't have a
> multi-core machine to try it on, but what happens when you replace your
> "slow function" code with something that actually burns CPU using
> pure-Python code instead of blocking on a timer in the OS?
Two simple work-arounds are:
* use Jython or IronPython; or
* insert time.sleep(0.000001) into your function at various points.
--
Steven
I wonder if the compiler can check that a given function doesn't
change any data. Then:
@pure
def f(x):
    return x*sqrt(x) + 3   # does not mutate any data

@pure
def g(x): ...   # likewise

s = parallel_dot_product(parallel_map(f, vec), parallel_map(g, vec))
You can do the basics of this using the 'ast' module. Just check that
no nodes in the ast tree are Assign nodes, including augmented
assign. Then 'f' is defined as:
f= pure( '''
return x*sqrt(x) + 3 # does not mutate any data
''' )
Untested. However, you do need to check that the subchildren don't
make mutations. This adds hooks to the AST, since names can change at
run-time. Also, it's only defined for functions that call functions
that are all pure Python and defined using 'pure' as well, without
inspecting the bytecode. At this point one might look into
metaprogramming and script autogeneration.
The 'pure' wrapper doesn't fundamentally change the semantics of the
function; it is more like a pre- and post-condition check.
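A rough sketch of that check using the ast module, subject to exactly the
caveats mentioned: it only rejects direct assignments in the supplied body
text, and says nothing about what the functions the body calls might do:

import ast

def pure(body_source):
    # Wrap the body text in a one-argument function and parse it.
    lines = body_source.strip().splitlines()
    source = "def _pure_func(x):\n" + "".join("    %s\n" % ln for ln in lines)
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Assign, ast.AugAssign)):
            raise ValueError("body contains an assignment; refusing to treat it as pure")
    namespace = {}
    exec(compile(tree, "<pure>", "exec"), namespace)
    return namespace["_pure_func"]

# f = pure("return 3*x - 2")    # f(4) == 10
# pure("x += 1")                # raises ValueError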
> On May 19, 11:20 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
>> Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> writes:
>> > (4) the caller is responsible for making sure he never shares data
>> > while looping over it.
>>
>> > I don't think I've missed any possibilities. You have to pick one of
>> > those four.
>>
>> I wonder if the compiler can check that a given function doesn't change
>> any data. Then:
>>
>> @pure
>> def f(x):
>>     return x*sqrt(x) + 3 # does not mutate any data
>>
>> @pure
>> def g(x): ... # likewise
>>
>> s = parallel_dot_product(parallel_map(f, vec), parallel_map(g,vec))
>
> You can do the basics of this using the 'ast' module. Just check that
> no nodes in the ast tree are Assign nodes, including augmented assign.
> Then 'f' is defined as:
>
> f= pure( '''
> return x*sqrt(x) + 3 # does not mutate any data
> ''' )
>
> Untested.
Can you explain how you can tell that there are no side-effects from
x*sqrt(x)+3 ? What if I have this?
class Funny(object):
    def __add__(self, other):
        global parrot
        parrot += 1
        return 5 + other

x = Funny()
--
Steven
Hi Steven,
You wrote:
> I am finding this conversation difficult because it seems to me you don't have a consistent set of requirements".
I think that my position has actually been consistent throughout this
discussion about what I would like to achieve. However I have learned
more about the inner workings of python than I knew before which have
made it clear that it would be difficult to implement (in CPython at
least). And also I never intended to present this as a fait accompli -
the intention was to start a debate as we have been doing. You also
wrote
> So now you want a second keyword as well
I actually described the 'sync' keyword in my second email before
anybody else contributed.
I *do* actually know a bit about concurrency and would never imply
that *any* for loop could be converted to a parallel one. The
intention of my remark "with my suggestion they could potentially get
a massive speed up just by changing 'for' to 'par' or 'map' to
'pmap'." is that it could be applied in the particular circumstances
where there are no dependencies between different iterations of the
loop.
Regarding your implementation strategies, mine would be related to
this one:
> (4) the caller is responsible for making sure he never shares data while
> looping over it.
However in my world there is no problem with threads sharing data as
long as they do not change the values. So the actual rule would be
something like:
5. The caller is responsible for making sure that one iteration of the
parallel loop never tries to write to a variable that another
iteration may read, unless separated by a 'sync' event.
This shows why the sync event is needed - to avoid race conditions on
shared variables. It is borrowed from the BSP paradigm - although that
is a distributed memory approach. Without the sync clause, rule 5 would
just be the standard way of defining a parallelisable loop.
Jeremy
P.S. I have a couple of additional embellishments to share at this
stage:
1. Parallel slackness - this is a concept where one creates many more
threads than there are available cores to cover up for a load
imbalance. So you might use this in a case where the iterations you
wish to parallelise take a variable amount of time. I think my par
concept needs a mechanism for the user to specify how many threads to
create - overriding any decision made by Python, e.g.
par 100 i in list:   # Create 100 threads no matter how many cores are available.
2. Scope of the 'sync' command. It was pointed out to me by a
colleague that I need to define what happens with sync when there are
nested par loops. I think that it should be defined to apply to the
innermost par construct which encloses the statement.
Yes, in general you need whole-program analysis with Python to know if
there are any side-effects or not. That said, with a process forking
mechanism where modified "globals" are not global beyond each process,
you should be able to guard against side-effects more effectively.
Paul
I was going to write something like this, but you've beat me to it :)
Slightly different though; rather than have pmap collate everything
together then return it, have it yield results as and when it gets
them and stop iteration when it's done, and rename it to par to keep
the OP happy and you should get something like what he initially
requests (I think):
total = 0
for score in par(f, data):
    total += score
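A sketch of that variation, assuming the same kind of worker-thread pool as
Steven's pmap but feeding results back through a queue so the caller can
consume them as soon as they arrive (and therefore in no particular order):

import threading
import Queue

def par(func, seq, numthreads=4):
    items = list(seq)
    tasks = Queue.Queue()
    results = Queue.Queue()
    for item in items:
        tasks.put(item)

    def worker():
        while True:
            try:
                arg = tasks.get_nowait()
            except Queue.Empty:
                break
            results.put(func(arg))

    for _ in range(numthreads):
        threading.Thread(target=worker).start()
    for _ in range(len(items)):
        yield results.get()      # blocks until the next finished result arrives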
Iain
>> Any Python function that isn't calling a library function written in C
>> that releases the GIL won't show any speedup will it?
>
> Not necessarily. Here's another function, that uses a loop instead of
> sleep.
>
> def g(arg, SIZE=8*10**6):
>     # Default SIZE is chosen so that on my machine, the loop
>     # takes approximately 0.5 second.
>     for x in xrange(SIZE):
>         pass
>     return 3*arg-2
>
>
>>>> setup = 'from __main__ import pmap, g; data = range(50)'
>>>> min(Timer('map(g, data)', setup).repeat(repeat=5, number=3))
> 65.093590974807739
>>>> min(Timer('pmap(g, data)', setup).repeat(repeat=5, number=3))
> 20.268381118774414
I find that surprising. Evidently my understanding of the GIL
is wrong.
>> I don't have a multi-core machine to try it on, but what
>> happens when you replace your "slow function" code with
>> something that actually burns CPU using pure-Python code
>> instead of blocking on a timer in the OS?
>
> Two simple work-arounds are:
>
> * use Jython or IronPython; or
>
> * insert time.sleep(0.000001) into your function at various points.
While the latter will allow thread switching, I don't see how
it will increase thread parallelism.
--
Grant Edwards (grante at visi.com)
It depends on whether you want the outputs to correspond to the inputs
in the resulting sequence. If pmap is supposed to behave like map, you
want the positions of each input and corresponding output to match. As
I wrote earlier, in pprocess you distinguish between these cases by
employing maps (for input/output correspondence) and queues (for
"first ready" behaviour).
I don't recall whether the Map class in pprocess blocks for all the
data to be returned, or whether you can consume any outputs that are
at the start of the output sequence (and then block until output
arrives at the next available position), but this would be a logical
enhancement.
Paul
While I agree that the GIL greatly simplifies things for the interpreter, I
don't understand this statement. In practice, you should lock all critical
sections if you expect your code to be used in a multithreading environment.
That can't be different from what Java, C# or any other languages do,
including C++. Why is that so expensive in python extensions, that it is used
as an argument against removing the GIL?
--
Luis Zarrabeitia (aka Kyrie)
Fac. de Matemática y Computación, UH.
http://profesores.matcom.uh.cu/~kyrie
I wasn't really arguing that locking individual objects was a
significant penalty in computer time, only in programmer time. The
locks on reference counts are what's expensive.
Also, I'm not using it as an argument against removing the GIL. I
want to remove the GIL. I'm only pointing out that removing the GIL
is not easy, and once it's removed there is a cost.
Carl Banks
Ah, allright then. Thanks for the clarification.
Python is intended to be simple/easy to integrate with random C
libraries. Therefore you have to write explicit code from the C side in
order to drop the GIL.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
"A foolish consistency is the hobgoblin of little minds, adored by little
statesmen and philosophers and divines." --Ralph Waldo Emerson
Erm, that doesn't really answer the question. If there were no GIL, the C
code called from python would be just as unsafe as the C code called from C.
And if "not thread-safe, you take care of the locking" notices are enough for
the original libraries (and a lot of other languages, and even python
constructs), the C extension could always grab the locks.
There must be another reason (i.e, the refcounts) to argue _for_ the GIL,
because this one just seems to be just an attempt to fix unsafe code when
called from python. And that was my point. Refcounts + interpreter simplicity
seem to imply the need for a GIL, but to make unsafe code safe without fixing
said code (or even thinking about it) is a weird goal... specially if it
became the only reason for a GIL. After all, one could argue for that goal in
almost all languages.
The designers of Python made a design decision(**) that extension
writers would not have to take care of locking. They could have made
a different decision, they just didn't.
> There must be another reason (i.e, the refcounts) to argue _for_ the GIL,
Why?
> because this one just seems to be just an attempt to fix unsafe code when
> called from python.
I think you are being unfair in calling it unsafe.
Suppose if I were to call PyList_Append from a C extension. It's not
necessary for me to guard the list I'm calling it on with a lock,
because only the GIL thread is allowed to call most Python API or
otherwise access objects. But you seem to be suggesting that since I
didn't guard the list with a lock it is "unsafe", even though the GIL
is sufficient?
No, I totally disagree. The code is not "unsafe" and the GIL doesn't
"fix" it. The code is jsut
> And that was my point. Refcounts + interpreter simplicity
> seem to imply the need for a GIL, but to make unsafe code safe without fixing
> said code (or even thinking about it) is a weird goal...
Why? Do you seriously not see the benefit of simplifying the work of
extension writers and core maintainers? You don't have to agree that
it's a good trade-off but it's a perfectly reasonable goal.
I highly suspect Aahz here would argue for a GIL even without the
refcount issue, and even though I wouldn't agree, there's nothing
weird or unreasonable about the argument, it's just a different
viewpoint.
> specially if it
> became the only reason for a GIL. After all, one could argue for that goal in
> almost all languages.
"B-B-B-But other languages do it that way!" is not a big factor in
language decisions in Python.
Carl Banks
(**) - To be fair, Python didn't originally support threads, so there
was a lot of code that would have become unsafe had threading been
added without the GIL, and that probably influenced the decision to
use a GIL, but I'm sure that wasn't the only reason. Would Python have a
GIL if threading had been there from the start? Who knows.
> On May 20, 4:07 pm, Luis Zarrabeitia <ky...@uh.cu> wrote:
> > On Wednesday 20 May 2009 06:16:39 pm Aahz wrote:
>
> The designers of Python made a design decision(**) that extension
> writers would not have to take care of locking. They could have made
> a different decision, they just didn't.
Well, then, maybe that's the only Python decision so far that I don't agree with.
And I'm not criticizing it... but I'm questioning it, because I honestly don't
understand it.
> > There must be another reason (i.e, the refcounts) to argue _for_ the GIL,
>
> Why?
1- refcounts is a _very_ strong reason to argue for the GIL. Add to that
simplifying CPython implementation, and you don't really need another one on top
of that, at least not to convince me, but
2- in [almost] every other language, _you_ have to be aware of the critical
sections when multithreading. Even in pure Python, you have to use locks. Don't you
find at least a bit odd the argument that in the particular case you are writing
a C extension from python, you should be relieved of the burden of locking?
> > because this one just seems to be just an attempt to fix unsafe code when
> > called from python.
>
> I think you are being unfair in calling it unsafe.
I think I was unfair calling it a "fix". It sounds like it was broken, my bad. I
should have used 'not multithread-ready'.
> Suppose if I were to call PyList_Append from a C extension. It's not
> necessary for me to guard the list I'm calling it on with a lock,
> because only the GIL thread is allowed to call most Python API or
> otherwise access objects. But you seem to be suggesting that since I
> didn't guard the list with a lock it is "unsafe", even though the GIL
> is sufficient?
Certainly not. If you program under the assumption that you have a GIL, of
course it is not unsafe. Not your code, anyway. But, why is PyList_Append not
multithread-ready? (or rather, why would PyList_Append require a _global_
lock to be multithread-ready, instead of a more local lock?) Of course, given
that the GIL exists, to use it is the easier solution, but for this particular
case, it feels like discarding multithreading just to avoid locking.
> No, I totally disagree. The code is not "unsafe" and the GIL doesn't
> "fix" it. The code is jsut
[I think there was a word missing from that sentence...]
> > And that was my point. Refcounts + interpreter simplicity
> > seem to imply the need for a GIL, but to make unsafe code safe without
> fixing
> > said code (or even thinking about it) is a weird goal...
>
> Why? Do you seriously not see the benefit of simplifying the work of
> extension writers and core maintainers? You don't have to agree that
> it's a good trade-off but it's a perfectly reasonable goal.
I do agree that, at least for the core maintainers, it is a good trade-off and a
reasonable goal to keep CPython simple. And if that has, as a positive side
effect for extension writers, that their code becomes easier, so be it. What I
don't agree - rather, what I don't understand, is why it is presented in the
opposite direction.
> I highly suspect Aahz here would argue for a GIL even without the
> refcount issue, and even though I wouldn't agree, there's nothing
> weird or unreasonable about the argument, it's just a different
> viewpoint.
I consider it weird, at least. If I were to say that python should not allow
multithreading to simplify the lives of pure-python programmers, I hope I would
be shot down and ignored. But somehow I must consider perfectly natural the idea
of not allowing[1] multithreading when building C extensions, to simplify the
lives of extension programmers.
> > specially if it
> > became the only reason for a GIL. After all, one could argue for that goal
> in
> > almost all languages.
>
> "B-B-B-But other languages do it that way!" is not a big factor in
> language decisions in Python.
That's not what I said. We are not talking about the _language_, but about one
very specific implementation detail. Not even that, I'm talking about one of the
reasons presented in favor of that specific implementation detail (while
agreeing with the others). The fact that the reason I'm having trouble with is
valid for almost any other language, and none of them have a GIL-like construct
(while still being successful, and not being exceptionally hard to build native
modules for) just suggests that _that_ particular reason for that particular
implementation detail is not a very strong one, even if all other reasons are.
> (**) - To be fair, Python didn't originally support threads, so there
> was a lot of code that would have become unsafe had threading been
> added without the GIL, and that probably influenced the decision to
> use a GIL, but I'm sure that wasn't only reason. Would Python have a
> GIL if threading had been there from the start? Who knows.
(This is a very good remark. Maybe here lies the whole answer to my question.
We may be dragging the heavy chain of backward compatibility with existing
extension modules, that is just too costly to break.)
Regards,
Luis
[1] Ok, I know that is not exactly what the GIL does.
--
Luis Zarrabeitia
Facultad de Matemática y Computación, UH
http://profesores.matcom.uh.cu/~kyrie
On May 20, 8:10 pm, Luis Alberto Zarrabeitia Gomez <ky...@uh.cu>
wrote:
> 2- in [almost] every other language, _you_ have to be aware of the critical
> sections when multithreading.
[snip]
> That's not what I said. We are not talking about the _language_, but about one
> very specific implementation detail. Not even that, I'm talking about one of the
> reasons presented in favor of that specific implementation detail (while
> agreeing with the others). The fact that the reason I'm having trouble with is
> valid for almost any other language, and none of them have a GIL-like construct
> (while still being successful, and not being exceptionally hard to build native
> modules for) just suggests that _that_ particular reason for that particular
> implementation detail is not a very strong one, even if all other reasons are.
No other languages have nesting by indentation (while still being
reasonably successful)....
etc
Comparisons to other languages are useless here. In many cases Python
does things differently from most other languages and usually it's
better off for it.
The fact that other languages do something differently doesn't mean
that other way's better, in fact it really doesn't mean anything at
all.
Carl Banks
How about only acquiring a lock when a "safe" user extension is called?
There could also be an "unsafe" extension interface with no lock. The
more important stdlib functions would be (carefully) written against the
unsafe interface. Ordinary Python code would not need a lock.
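(A rough pure-Python sketch of that idea - the names here are made up, and a
real version would have to live in the C API rather than in Python.)

import threading

_extension_lock = threading.Lock()   # single lock shared by all "safe" calls

def safe_extension(func):
    """Route a 'safe' extension function through the shared lock."""
    def wrapper(*args, **kwargs):
        with _extension_lock:
            return func(*args, **kwargs)
    return wrapper

def unsafe_extension(func):
    """'Unsafe' interface: the extension does its own locking, or none."""
    return func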
Nobody likes the GIL, but it just has to be there or things start crumbling...
Nobody would actually argue _for_ the GIL; they just know from experience
that the people who have successfully removed the GIL in the past have
always produced unsatisfactory solutions.
Among the problems are slowing down single-threaded code, locking issues, etc.
These are not really arguments _for_ the GIL, but a challenge for the next
people who try to remove it to think about before dedicating the rest of
their lives to removing the GIL, only to find that nobody likes the
solution...
> I don't have any reply to this post except for the following excerpts:
>
> On May 20, 8:10 pm, Luis Alberto Zarrabeitia Gomez <ky...@uh.cu>
> wrote:
> > 2- in [almost] every other language, _you_ have to be aware of the critical
> > sections when multithreading.
> [snip]
> > That's not what I said. We are not talking about the _language_, but about one
> > very specific implementation detail. Not even that, I'm talking about one of the
> > reasons presented in favor of that specific implementation detail (while
> > agreeing with the others).
[...]
>
> No other languages have nesting by indentation (while still being
> reasonably successful)....
> etc
>
> Comparisons to other languages are useless here. In many cases Python
> does things differently from most other languages and usually it's
> better off for it.
You seem to have missed that I'm not talking about the language but about a
specific implementation detail of CPython. I thought that my poor choice of
words in that sentence was completely clarified by the paragraphs that followed,
but apparently it wasn't. In my "2-" point, maybe I should've said instead: "in
[almost] every language, INCLUDING (proper) PYTHON, you have to be aware of
critical sections when multithreading".
> The fact that other languages do something differently doesn't mean
> that other way's better, in fact it really doesn't mean anything at
> all.
No, it doesn't mean that it's better, and I didn't say it was. But it _does_
show that it is _possible_. For an argument about how hard it would be to write
native extensions if there were no GIL, the fact that there are many other
GIL-less platforms[1] where it is not painful to write native extensions is a
valid counter-example. And the fact that, of all those languages and platforms
(including Python), the only one where I hear that explicit, granular locking is
too hard or whatever[2] is CPython's /native/ extensions, is what I found weird.
Regards,
Luis
[1] Including the python implementations for .net and java.
[2] Really, this was my question and nothing more. I was not comparing; I was
trying to understand what that "whatever" was that made it so hard for CPython.
And your footnote in the previous message gave me a reasonable explanation.
I don't think that can be right - that shows python working without
the GIL contention. So unless you ran it under IronPython?
Here is what happens when I run it under CPython 2.5 on my dual core
laptop. I made SIZE=10**6 because I got bored of waiting ;-)
map
9.85280895233
pmap
28.4256689548
So the pmap took nearly 3 times as long. I expect this is because the
task was divided into 5 sections, each competing madly for the GIL.
I ran the same script under the latest Jython beta, which was very
interesting: pmap showed a slight improvement over map, and both were
faster than CPython!
$ jython2.5rc2/jython pmap.py
map
6.242000103
pmap
5.88800001144
--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick
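(The pmap.py script being benchmarked above isn't reproduced in the thread;
a thread-based pmap along these lines - a minimal sketch, not the actual
script - shows why CPU-bound chunks end up fighting over the GIL in CPython.)

import threading

def pmap(func, items, numthreads=5):
    """Split the work across threads and apply func to every item.
    In CPython the CPU-bound chunks all contend for the GIL, so this
    can easily be slower than a plain map()."""
    results = [None] * len(items)

    def worker(offset):
        for i in range(offset, len(items), numthreads):
            results[i] = func(items[i])

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(numthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results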
> On 20 May, 03:43, Steven D'Aprano
> <ste...@REMOVE.THIS.cybersource.com.au> wrote:
>> On Tue, 19 May 2009 03:57:43 -0700, jeremy wrote:
>> > As I wrote before, concurrency is one of the hardest things for
>> > professional programmers to grasp. For 'amateur' programmers we need
>> to
>> > make it as simple as possible,
Here, I think, is the fatal flaw in your plan. As Steven pointed out,
concurrency isn't simple. All you are actually doing is making it
easier for 'amateur' programmers to write hard-to-debug buggy code,
since you seem to be trying to avoid making them think about how to
write parallelisable code at all.
> I *do* actually know a bit about concurrency and would never imply
> that *any* for loop could be converted to a parallel one. The
> intention of my remark "with my suggestion they could potentially get
> a massive speed up just by changing 'for' to 'par' or 'map' to
> 'pmap'." is that it could be applied in the particular circumstances
> where there are no dependencies between different iterations of the
> loop.
If you can read this newsgroup for a week and still put your hand on
your heart and say that programmers will check that there are no
dependencies before swapping 'par' for 'for', I want to borrow your
rose-coloured glasses. That's not to say this isn't the right solution,
but you must be aware that people will screw this up very, very
regularly, and making the syntax easy will only up the frequency of
screw-ups.
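(For reference, the dependency-free case being described - independent
iterations that can run in any order - is roughly what the existing
multiprocessing module already offers; updatePartition here is just a
stand-in for some per-partition computation.)

from multiprocessing import Pool

def updatePartition(i):
    # purely local work; safe to parallelize only because no iteration
    # reads or writes state belonging to another iteration
    return i * i

if __name__ == '__main__':
    pool = Pool()                       # one worker process per core by default
    results = pool.map(updatePartition, range(100))
    pool.close()
    pool.join()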
> This shows why the sync event is needed - to avoid race conditions on
> shared variables. It is borrowed from the BSP paradigm - although that
is a distributed memory approach. Without the sync clause, rule 5 would
> just be the standard way of defining a parallelisable loop.
Pardon my cynicism but sync would appear to have all the disadvantages
of message passing (in terms of deadlock opportunities) with none of the
advantages (like, say, actual messages). The basic single sync you put
forward may be coarse-grained enough to be deadlock-proof, but I would
need to be more convinced of that than I am at the moment before I was
happy.
> P.S. I have a couple of additional embellishments to share at this
> stage:
[snip]
> 2. Scope of the 'sync' command. It was pointed out to me by a
> colleague that I need to define what happens with sync when there are
> nested par loops. I think that it should be defined to apply to the
> innermost par construct which encloses the statement.
What I said before about deadlock-proofing? Forget it. There's hours
of fun to be had once you introduce scoping, not to mention the fact
that your inner loops can't now be protected against common code in the
outer loop accessing the shared variables.
--
Rhodri James *-* Wildebeeste Herder to the Masses
Hi Rhodri,
> If you can read this newsgroup for a week and still put your hand on
> your heart and say that programmers will check that there are no
> dependencies before swapping 'par' for 'for', I want to borrow your
> rose-coloured glasses.
I think this depends on whether we think that Python is a language for
people who we trust to know what they are doing (like Perl) or whether
it is a language for people we don't trust to get things right (like
Java). I suspect it probably lies somewhere in the middle.
Actually the 'sync' command could lead to deadlock potentially:
par i in range(2):
    if i == 1:
        sync
In this case there are two threads (or virtual threads): one thread
waits for a sync, the other does not, hence deadlock.
My view about deadlock avoidance is that it should not be built into
the language - that would make everything too restrictive - instead
people should use design patterns which guarantee freedom from
deadlock.
See http://www.wotug.org/docs/jeremy-martin/index.shtml
Jeremy
> I think this depends on whether we think that Python is a language for
> people who we trust to know what they are doing (like Perl) or whether
> it is a language for people we don't trust to get things right (like
> Java). I suspect it probably lies somewhere in the middle.
So do I *in general*, but your design principle -- make it easy -- came
down firmly in the first camp and, as I said, I come down in the second
where parallel processing is concerned. I've spent enough years weeding
bugs out of my insufficiently carefully designed Occam programs to have
the opinion that "easy" is the very last thing you want to make it.
> Actually the 'sync' command could lead to deadlock potentially:
>
> par i in range(2):
>     if i == 1:
>         sync
Hmm. I was assuming you had some sort of implicit rendez-vous sync
at the end of the PAR. Yes, this does make it very easy for the
freshly-enabled careless programmer to introduce deadlocks and
never realise it.
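(For comparison, the same mistake expressed with today's threading.Barrier
(Python 3.2+), which is roughly what an implicit rendezvous 'sync' would be;
the timeout is only there so the demo fails instead of hanging forever.)

import threading

NUM_TASKS = 2
barrier = threading.Barrier(NUM_TASKS, timeout=2)    # stand-in for 'sync'

def task(i):
    if i == 1:
        try:
            barrier.wait()               # only this task ever syncs...
        except threading.BrokenBarrierError:
            print('task %d: stuck at sync - deadlock' % i)
    # ...the i == 0 task never calls wait(), so the barrier never completes

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()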
Why would it, or why would it need to? A Python that understands the ``par''
keyword is supposed to know it can play some tricks with
optimizing reduce() if the specific function is commutative.
>--
>Steven
Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
>>And how is reduce() supposed to know whether or not some arbitrary
>>function is commutative?
>
> Why would it, or why would it need to? A Python that understands the ``par'' keyword
> is supposed to know it can play some tricks with optimizing reduce() if
> the specific function is commutative.
Fine. Move the smarts out of reduce() into the compiler. How is the
compiler supposed to know if an arbitrary function is commutative?
--
Steven
Unit testing.
I think this kind of gets to the heart of it, here - parallelization
is not a toy, runs with scissors, and will eat your dog, but so will
everything else if you don't know what you're doing. I just don't
see this as being a valid argument - maybe I'm wrong.
In the spirit of continuing to be wrong, would it be possible to
just fork off and synchronize the passed-in variables at the end
of the block via a container in shared memory? That sounds to me
like the simple, safe route, if a route must be taken.
Geremy Condra
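(One way to read that suggestion with the existing multiprocessing module:
fork one process per partition, have each write only its own slot of a
shared-memory array, and "synchronize" simply by joining at the end of the
block. updatePartition is again a stand-in.)

from multiprocessing import Process, Array

def updatePartition(i, shared):
    shared[i] = i * i                    # each worker touches only its own slot

if __name__ == '__main__':
    shared = Array('d', 10)              # shared-memory container (lock included)
    procs = [Process(target=updatePartition, args=(i, shared))
             for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()                         # the "sync at the end of the block"
    print(list(shared))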
(Correction: the characteristic we really care about is associativity,
not commutativity.)
I really shouldn't reply to this, because I fear this is just going to
drag me down into a bottomless pit, but here goes...
I don't see how unit tests can possibly help. There's an infinite number
of possible functions that could be passed to parallel reduce(), and you
can't test them all. Even if you lower your sights and just aim to
recognise a tiny subset of associative functions, you end up with code
like this:
def par_reduce(func, *args):
    if func in (operator.add, operator.mul):
        if isinstance(args[0], (int, long)):
            return _associative_par_reduce(func, *args)
    return _nonassociative_par_reduce(func, *args)
You can't even recognise functions like lambda x,y: x+y. In case you're
thinking you can, no, "if func == lambda x,y: x+y" won't work:
>>> (lambda x,y: x+y) == (lambda x,y: x+y)
False
I suppose if you're willing to special case a tiny handful of functions,
that's a solution of sorts, but I did ask about arbitrary functions.
Perhaps you're thinking of another approach: put the unit tests inside
the par_reduce() function, and use them to determine at runtime whether
or not the argument func is associative.
def par_reduce(func, *args):
    try:
        pass  # a mass of unittests
    except Exception:
        # are we catching too much? are we masking bugs in the tests?
        return _nonassociative_par_reduce(func, *args)
    return _associative_par_reduce(func, *args)
This would be impractical, because the cost would be enormous. But even
putting that aside, it still wouldn't work. I don't see how you could
know what is a valid unittest for an arbitrary function, but suppose you
could, somehow. Then what? True enough, if you run lots of tests, and it
fails one, then you know that func is not associative. But having passed
all those tests, you can't know if your tests covered all the possible
cases.
E.g. for floats, you can run some tests and decide that addition looks
associative:
>>> 0.1 + (0.1 + 0.1) == (0.1 + 0.1) + 0.1
True
>>> 0.1 + (0.9 + 0.1) == (0.1 + 0.9) + 0.1
True
>>> 0.1e67 + (0.9 + 0.3) == (0.1e67 + 0.9) + 0.3
True
>>> 0.1e54 + (0.9 + 0.1) == (0.1e54 + 0.9) + 0.1
True
but you'd be wrong:
>>> 1e54 + (-1e54 + 0.1) == (1e54 + -1e54) + 0.1
False
No, you can't expect to discover experimentally whether an arbitrary
function is associative. You would need to analyse what the function
does, which means in practice the *programmer* needs to decide what to
do, not reduce().
--
Steven
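(One sketch of "the programmer decides": let the caller assert associativity
explicitly and fall back to the ordinary sequential reduce otherwise. With
pure-Python functions the GIL still limits any gain in CPython; the API shape
is the point here, not the speedup.)

import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def par_reduce(func, seq, associative=False):
    # The caller, not the library, declares that func is associative.
    seq = list(seq)
    if not associative or len(seq) < 4:
        return reduce(func, seq)
    mid = len(seq) // 2
    with ThreadPoolExecutor(max_workers=2) as ex:
        left = ex.submit(reduce, func, seq[:mid])
        right = ex.submit(reduce, func, seq[mid:])
    return func(left.result(), right.result())

print(par_reduce(operator.add, range(100), associative=True))    # 4950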
>> But reduce()? I can't see how you can parallelize reduce(). By its
>> nature, it has to run sequentially: it can't operate on the nth item
>> until it has operated on the (n-1)th item.
>
> That depends on the operation in question. Addition for example would
> work. My math-skills are a bit too rusty to qualify the exact nature of
> the operation, commutativity springs to my mind.
Associativity:
((A op B) op C) = (A op (B op C))
So for example
A op B op C op D
could be grouped into
(A op B) op (C op D)
and the two parenthesized subexpressions evaluated concurrently. But even
applied recursively, this only shortens the chain of n-1 sequential op's
into a tree whose depth is logarithmic in the number of op's.
> threads = [PMapThread(datapool) for i in xrange(numthreads)]
Shouldn’t that “PMapThread” be “thread”?
It is worth noting, I think, that many realistic operations are not
associative. If op combines a summary and an item to create an updated
summary, it will probably croak or give a wrong answer when given a
summary and a summary. Combining two summaries might be possible, but
would be a different operation.
For a specific example, consider list.append(some_list, some_item)
versus list.extend(some_list, another_list). Imagine for the moment
that these methods returned some_list. Then
[].append(a).append(b).append(c).append(d) # parentheses left out
would have to be conceptually converted to the associative
[].extend([a]).extend([b]).extend([c]).extend([d])
before being grouped something like
(([].extend([a])).extend([b].extend([c]))).extend([d])
Of course, one would not do the conversion to the slower associative
operation unless one knew that there would be a compensating speedup due
to the grouping.
Terry Jan Reedy
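(Concretely, the usual workaround is a two-level reduction: fold items into
per-chunk summaries with the item-wise operation, then merge the summaries
with the summary-wise one. A minimal sequential sketch - only the per-chunk
folds would be candidates for running concurrently.)

def chunked_build(items, chunksize=4):
    chunks = [items[i:i + chunksize] for i in range(0, len(items), chunksize)]
    partials = []
    for chunk in chunks:                 # these folds are independent of each other
        part = []
        for item in chunk:
            part.append(item)            # summary-and-item operation (append)
        partials.append(part)
    result = []
    for part in partials:
        result.extend(part)              # summary-and-summary operation (extend)
    return result

assert chunked_build(list(range(10))) == list(range(10))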