
Multi-threading on Multi-CPU machines


Garry Taylor

Jul 8, 2002, 11:02:54 AM
Hello,
I am attempting to make a multi-threading function in one of my
programs in an effort to gain a speed increase, but I'm getting quite
the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
me as to why? My code is below:
--------
import thread
import time

ThreadCounter = 0
Iterations = 100

def Threader():
    global ThreadCounter
    global Iterations
    Counter = 0
    temptime = time.time()
    while Counter < Iterations:
        Counter = Counter + 1
        thread.start_new_thread(TakesTime, ())

    while ThreadCounter < Iterations:
        pass

    print "Threaded: " + str(time.time() - temptime)

def TakesTime():
    global ThreadCounter
    Text = "Test"
    Counter = 0
    while Counter < 20:
        Text = Text + Text
        Counter = Counter + 1
    ThreadCounter = ThreadCounter + 1

def NoThreader():
    global Iterations
    temptime = time.time()
    Counter = 0
    while Counter < Iterations:
        Counter = Counter + 1
        TakesTime()

    print "Non-Threaded: " + str(time.time() - temptime)

Threader()
NoThreader()
--------

This does the same thing, threaded and then not, but on all of my
machines the multi-threaded version is slower. What can I do about this?

Thanks

garry

Steven

Jul 8, 2002, 12:27:17 PM

"Garry Taylor" <gta...@lowebroadway.com> wrote in message
news:f0fd5987.02070...@posting.google.com...

> Hello,
> I am attempting to make a multi-threading function in one of my
> programs in an effort to gain a speed increase, but I'm getting quite
> the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> me as to why, my code is below:
.....

>
> This does the same thing, threaded and then not, but on all of my
> machines, the multi-threaded is slower, what can I do about this?

Does the Python interpreter run on both CPUs? That is, do the Python
threads execute concurrently on both CPUs? I'd imagine they would, but I'm
just wondering...

I tested on my machine, a single Athlon XP 1900, and got times of about
3.6 and 3.4 seconds for the Threader and NoThreader versions respectively.
I'd attribute a fair bit of that to the creation time for a thread, though I
don't have any real timing to back that up. I did change the program slightly
so that the function passed did no work, so it was basically just the overhead
of setting up and finishing a thread, and the threaded version took much longer
than the non-threaded one (as you'd expect on a single-CPU machine).
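
Here is a rough sketch of that "no work" comparison (my own illustration, not
the exact change I made): it times 100 bare thread spawns against 100 plain
calls, with a lock only there to keep the completion counter honest. Names
and counts are illustrative.
--------
import thread, time

Iterations = 100
lock = thread.allocate_lock()
done = [0]

def noop():
    # does no work at all; the lock just keeps the counter accurate
    lock.acquire()
    done[0] = done[0] + 1
    lock.release()

start = time.time()
for i in range(Iterations):
    thread.start_new_thread(noop, ())
while 1:
    lock.acquire()
    finished = done[0]
    lock.release()
    if finished >= Iterations:
        break
print "100 thread spawns:", time.time() - start

done[0] = 0
start = time.time()
for i in range(Iterations):
    noop()
print "100 plain calls:  ", time.time() - start
--------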

As it is, although you can have a CPU running each task, you still do very
frequent checking of variables that are global to the whole thing. That
constant polling of ThreadCounter is going to introduce an overhead that
isn't present in the non-threaded version, as is the increment operation in
each thread; the non-threaded version doesn't have the overhead of updating
that variable from whichever CPU each thread is sitting on.

If there were a way to run half the threads on one CPU and half on the
other, with no communication back to the 'parent' until the very end, when
all of each spawner's threads had completed, then you might see a
performance increase.

I'm sure someone with a better knowledge of threading on Python can give you
a better answer...

Steven


Joseph A Knapka

Jul 8, 2002, 4:57:52 PM
Garry Taylor wrote:
>
> Hello,
> I am attempting to make a multi-threading function in one of my
> programs in an effort to gain a speed increase, but I'm getting quite
> the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> me as to why,

Yes. CPython threads cannot utilize multiple CPUs, due to the
Global Interpreter Lock, which may only be acquired by one
thread at a time. Apparently Jython threads do not have
this limitation, as the GIL doesn't exist in Jython, or so
I'm told. So if you simply ran your program under Jython
you might see an improvement.

Cheers,

-- Joe

Aahz

Jul 9, 2002, 12:14:01 AM
In article <f0fd5987.02070...@posting.google.com>,

Garry Taylor <gta...@lowebroadway.com> wrote:
>
>I am attempting to make a multi-threading function in one of my
>programs in an effort to gain a speed increase, but I'm getting quite
>the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
>me as to why, my code is below:

Pure Python code will always slow down when threaded; in order to gain a
speedup, you must call an extension that releases the GIL. All I/O
functions in Python release the GIL, for example. For more info, see
the slides on my home page.
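
A small sketch of that point (mine, not from the slides): time.sleep, like
real I/O, releases the GIL, so several threads doing it overlap, unlike the
pure-Python loops in Garry's test.
--------
import threading, time

def waits():
    time.sleep(1)          # like real I/O, sleep releases the GIL

def measure(n):
    threads = [threading.Thread(target=waits) for i in range(n)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

print "5 threaded sleeps took about", measure(5), "seconds"   # about 1, not 5
--------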
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

Project Vote Smart: http://www.vote-smart.org/

Garry Taylor

Jul 9, 2002, 5:02:38 AM
Joseph A Knapka <jkn...@earthlink.net> wrote in message news:<3D29FC43...@earthlink.net>...

Thank you both for your answers. Unfortunately, running under Jython is
not an option, as the whole program I am writing runs to about 5,000 lines
and uses lots of Python modules which I don't really fancy trying to get
working under Jython.

So, am I correct in thinking that there is nothing I can do about
this, and still use standard Python? I understand that Solaris has a
very good threading library, but from the comments above, I assume
that this would make no difference? Do you have any tips/ideas how I
could make use of multiple processors in a Python program?

Thanks again

Garry

Alex Martelli

Jul 9, 2002, 5:29:51 AM
Garry Taylor wrote:
...

> that this would make no difference? Do you have any tips/ideas how I
> could make use of multiple processors in a Python program?

Use multiple *processes* rather than multiple *threads* within just
one process. Multiple processes running on the same machine can
share data very effectively via module mmap (you do need separate
process-synchronization mechanisms if the shared data structures
need to be written, of course), and you can use other fast same-machine
mechanisms such as pipes, in addition of course to general distributed
programming approaches that offer further scalability since they also
run on multiple machines on the same network as well as within a
single machine (pyro, corba, etc etc). Optimal-performance architectures
will be different for multiple processes than for a single multi-thread
process (and different for really-distributed versus single-machine),
but the key issue tends always to be, who shares / sends what data
with/to whom. If your problem is highly parallelizable anyway, the
architectural distinction between multithread, multiprocess and
distributed can boil down to using larger "slices" to farm out to
workers to reduce the per-slice communication overhead, sometimes.

Say for example that your task is to perform some pointwise
computation cpuintensivefunction(x) on each point x of some
huge array (assume without loss of generality the array is
one-dimensional -- the pointwise assumption allows that).

With a multithreaded approach you might keep the array in memory
and have the main thread farm out work requests to worker threads
via a bounded queue. You want the queue a bit larger than the
number of worker threads, and you can determine the optimal size
for a work request (could be one item, or maybe two, or, say, 4)
via some benchmarking. Upon receiving a work request from the
Queue, a worker thread would:
-- get a local copy of the relevant points from the
large array,
-- enter the C-coded computation function which
-- releases the GIL,
-- does the computations, producing the new points,
-- acquires the GIL again,
-- put back the resulting new points to the same area
of the large array where the input came from,
then go back to peel one more work request from the Queue.

If you can't release the GIL during the computation, e.g.
because your computation is in Python or anyway requires you
to interact with the interpreter, then multithreading will
give no speedup and should not be used for that purpose.
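
Here is a minimal sketch of the bounded-queue plumbing described above. The
worker body is plain Python, so, as just noted, it will not actually run in
parallel; in a real program compute() would be a C-coded function that
releases the GIL. All names and the slice size of 4 are illustrative.
--------
import Queue, threading

NWORKERS = 2
work_queue = Queue.Queue(NWORKERS + 2)   # a bit larger than the worker count
large_array = range(100)                 # the shared array of points
lock = threading.Lock()                  # protects writes back to the array

def compute(points):
    # stand-in for the C-coded function that would release the GIL
    return [x * x for x in points]

def worker():
    while 1:
        request = work_queue.get()
        if request is None:              # sentinel: no more work
            break
        start, stop = request
        local = large_array[start:stop]  # local copy of the relevant points
        results = compute(local)
        lock.acquire()
        large_array[start:stop] = results   # put results back where they came from
        lock.release()

threads = [threading.Thread(target=worker) for i in range(NWORKERS)]
for t in threads:
    t.start()
for start in range(0, len(large_array), 4):   # work requests of 4 points each
    work_queue.put((start, start + 4))
for t in threads:
    work_queue.put(None)                 # one sentinel per worker
for t in threads:
    t.join()
print large_array[:8]
--------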

A similar architecture might work for a single-machine multi-process
design IF multiple processes can use mmap to read and
write different regions of a shared-memory array at the same
time, without locking (I don't think mmap ensures that on all
platforms, alas). "Get the next work request" would become
a bit less simple than just peeling an item off a queue, which
makes it likely that a rather larger size of work request
might be optimal -- depending on what guarantees you can count
on for simultaneous reads and writes from/to pipes or message
queues, those might provide the Queue equivalents.
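
A small sketch of that single-machine multi-process idea, assuming (as
discussed, this may not hold everywhere) that writing disjoint regions of a
file-backed mmap shared across os.fork is safe without locking. For brevity,
each child is statically assigned its own points rather than pulling work
requests; NWORKERS, NPOINTS and cpuintensivefunction are placeholders.
--------
import mmap, os, struct, tempfile

NWORKERS = 2
NPOINTS = 8
ITEM = struct.calcsize('d')             # bytes per double

def cpuintensivefunction(x):
    return x * x                        # stand-in for the real pointwise work

# back the shared array with a temporary file of the right size
tmp = tempfile.TemporaryFile()
tmp.write('\0' * (NPOINTS * ITEM))
tmp.flush()
shared = mmap.mmap(tmp.fileno(), NPOINTS * ITEM)   # MAP_SHARED by default on Unix

for worker in range(NWORKERS):
    if os.fork() == 0:
        # child: fill in only this worker's own, disjoint points
        for i in range(worker, NPOINTS, NWORKERS):
            shared[i * ITEM:(i + 1) * ITEM] = struct.pack('d',
                    cpuintensivefunction(float(i)))
        os._exit(0)

for worker in range(NWORKERS):
    os.wait()                           # wait for every child to finish

print [struct.unpack('d', shared[i * ITEM:(i + 1) * ITEM])[0]
       for i in range(NPOINTS)]
--------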

Alternatively, wrap the data with a dedicated process which
knows how to respond to requests for "next still-unassigned
slice of work please" and (no return-acknowledgment needed)
"here's the new computed data for the slice at coordinate X".
pyro might be a good mechanism for such a task, and it would
scale from one multi-CPU running multiple processes to a
network (you might want to build in sanity checking, most
particularly for the network case -- if a node goes down,
then after a while without a response from it the slices that
had been assigned to it should be farmed out to others...).
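
A compact sketch of the dispenser idea. pyro would be the natural choice;
here the standard library's SimpleXMLRPCServer merely stands in to show the
shape of the two requests, and all names are illustrative.
--------
from SimpleXMLRPCServer import SimpleXMLRPCServer

NPOINTS = 1000
SLICE = 50
data = [0.0] * NPOINTS
next_start = [0]

def next_unassigned_slice():
    # hand out the next still-unassigned slice, or -1 when none remain
    start = next_start[0]
    if start >= NPOINTS:
        return -1
    next_start[0] = start + SLICE
    return start

def store_result(start, values):
    # accept the computed data for the slice at coordinate `start`
    data[start:start + len(values)] = values
    return 0    # XML-RPC methods must return something

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(next_unassigned_slice)
server.register_function(store_result)
server.serve_forever()
--------
A worker would then loop with xmlrpclib.ServerProxy("http://localhost:8000"),
calling next_unassigned_slice() and store_result(); since this server handles
one request at a time, the slice hand-out needs no extra locking.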


Of course, most parallel-computing cases are far more
intricate than simple albeit CPU-intensive computations
on a pointwise basis, but I hope this very elementary
overview can still help!-)


Alex

Duncan Booth

Jul 9, 2002, 8:50:29 AM
gta...@lowebroadway.com (Garry Taylor) wrote in
news:f0fd5987.02070...@posting.google.com:

> So, am I correct in thinking that there is nothing I can do about
> this, and still use standard Python? I understand that Solaris has a
> very good threading library, but from the comments above, I assume
> that this would make no difference? Do you have any tips/ideas how I
> could make use of multiple processors in a Python program?
>

Can you split your program into several communicating processes? Each
process has its own GIL, so if you can run multiple processes they can make
better use of the CPUs.
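
For example, a minimal sketch of two communicating processes (my own
illustration, using os.fork and a pipe; the work function is borrowed from
Garry's post):
--------
import os

def takes_time(n):
    text = "Test"
    for i in range(n):
        text = text + text
    return len(text)

read_end, write_end = os.pipe()
if os.fork() == 0:
    # child process: its own interpreter, its own GIL
    os.close(read_end)
    result = takes_time(20)
    os.write(write_end, str(result) + "\n")
    os._exit(0)

# parent process: does the other half of the work meanwhile
os.close(write_end)
my_result = takes_time(20)
child_result = int(os.fdopen(read_end).readline())
os.wait()
print "parent got", my_result, "child sent", child_result
--------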

The only other option really is to see if you can isolate CPU intensive
sections and rewrite them in C, then you might be able to release the GIL
enough to get a useful speedup.

Then again it may be possible to get enough speed improvement by modifying
existing code. I find it can be quite hard working out exactly where Python
is spending all its time. Do you know where in your current code most of
the CPU is actually used?
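
One quick way to find out is the standard profile module -- a sketch meant to
be dropped at the bottom of Garry's script in place of the two timing calls:
--------
import profile
profile.run('NoThreader()')    # prints a table of calls and where the time goes
--------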

--
Duncan Booth dun...@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?

anton wilson

Jul 9, 2002, 9:52:59 AM

> With a multithreaded approach you might keep the array in memory
> and have the main thread farm out work requests to worker threads
> via a bounded queue. You want the queue a bit larger than the
> number of worker threads, and you can determine the optimal size
> for a work request (could be one item, or maybe two, or, say, 4)
> via some benchmarking. Upon receiving a work request from the
> Queue, a worker thread would:
> -- get a local copy of the relevant points from the
> large array,
> -- enter the C-coded computation function which
> -- releases the GIL,
> -- does the computations, producing the new points,
> -- acquires the GIL again,


If the bounded queue were declared in a C extension module, would a thread
doing the calculations really have to reacquire the GIL every time that thread
accessed this C data structure? Could mutexes be used instead?

Christopher Saunter

Jul 9, 2002, 5:31:11 AM
Garry Taylor (gta...@lowebroadway.com) wrote:
: Hello,

: I am attempting to make a multi-threading function in one of my
: programs in an effort to gain a speed increase, but I'm getting quite
: the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
: me as to why, my code is below:

<snip code>

Hi Garry,
As other people have said, 'native' Python code does not benefit
from multiple CPUs in one PC due to the GIL. Depending on what you are
doing with your threads, you may be able to utilise more than one
processor by splitting the threads into multiple programs, running them
simultaneously and communicating between them somehow (MPI etc.) - from
what I have seen it requires a little more effort, but can work well.
This is mainly useful for 'number crunching' threads etc.

---

cds

Alex Martelli

Jul 9, 2002, 11:28:47 AM
anton wilson wrote:

>
>> With a multithreaded approach you might keep the array in memory
>> and have the main thread farm out work requests to worker threads
>> via a bounded queue. You want the queue a bit larger than the
>> number of worker threads, and you can determine the optimal size
>> for a work request (could be one item, or maybe two, or, say, 4)
>> via some benchmarking. Upon receiving a work request from the
>> Queue, a worker thread would:
>> -- get a local copy of the relevant points from the
>> large array,
>> -- enter the C-coded computation function which
>> -- releases the GIL,
>> -- does the computations, producing the new points,
>> -- acquires the GIL again,
>
>
> If the bounded queue were declared in a C extension module, would a thread
> doing the calculations really have to reacquire the GIL every time that
> thread accessed this C data structure? Could mutexes be used instead?

C code talking to other C code, with Python *nowhere* in the picture,
does not need the GIL but may make its own arrangements. However, it's
hard to see how the Python data placed in the queue would get turned
into C-usable data WITHOUT using some of the Python API -- whenever ANY
use of the Python API is made, the thread making such use must hold the
GIL (of course Python can't _guarantee_ that EVERY such GIL-less use
will crash the program, burn the CPU AND raze the machine room to the
ground, unfortunately, but you should still program AS IF that was
the case).

Given that a C-coded function is called from Python, it IS holding the
GIL when it starts executing -- what it must do is to RELEASE the GIL
as soon as it's finished doing calls to the Python API in order to let
other threads use the Python interpreter, then acquire the GIL again
before it can return control to the Python that called it. There is
no benefit that I can see in duplicating the Queue module in C with
all the attendant locking headaches &c -- moving the loop itself into
C seems to be a tiny, irrelevant speedup anyway.


Alex

Tim Churches

Jul 9, 2002, 3:55:38 PM

As someone else suggested, consider using MPI, which can be used to
parallelise code on shared memory SMP machines as well as networked
clusters. Installing user-mode LAM/MPI is very easy, although other
forms of MPI such as MPICH may be a bit more difficult. However, once
you have MPI installed, there are a number of Python MPI modules around
which make using it a cinch. May I recommend PyPar, by Ole Nielsen
at the Australian National University, as being particularly easy to use?
See http://datamining.anu.edu.au/~ole/pypar/ -- I am pretty sure Ole has
been using MPI and PyPar on multi-CPU Solaris machines as well as Linux
Beowulf clusters and the hybrid shared/distributed-memory APAC
supercomputer at ANU.
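
For what it's worth, a tiny sketch of what a PyPar program tends to look
like, based on the examples on the page above (check there for the exact
interface before relying on this):
--------
import pypar                        # assumes LAM/MPI or MPICH underneath

myid = pypar.rank()                 # which process am I?
numproc = pypar.size()              # how many processes in total?

if myid == 0:
    # master: collect one result from each worker process
    for source in range(1, numproc):
        print "got:", pypar.receive(source)
else:
    # worker: do this process's share of the job, then report back
    pypar.send("result from process %d" % myid, 0)

pypar.finalize()
--------
You would launch it with your MPI implementation's mpirun, along the lines of
"mpirun -np 4 python script.py".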

Tim C

>
> Thanks again
>
> Garry
> --
> http://mail.python.org/mailman/listinfo/python-list


Garry Taylor

Jul 10, 2002, 5:59:52 AM
aa...@pythoncraft.com (Aahz) wrote in message news:<agdnu9$cne$1...@panix1.panix.com>...

> In article <f0fd5987.02070...@posting.google.com>,
> Garry Taylor <gta...@lowebroadway.com> wrote:
> >
> >I am attempting to make a multi-threading function in one of my
> >programs in an effort to gain a speed increase, but I'm getting quite
> >the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> >me as to why, my code is below:
>
> Pure Python code will always slow down when threaded; in order to gain a
> speedup, you must call an extension that releases the GIL. All I/O
> functions in Python release the GIL, for example. For more info, see
> the slides on my home page.

Thanks to those who replied. I think releasing the GIL would appear to be
my best bet, as I don't want to add yet another dependency (i.e. MPI and
PyPar) to the program. Also, by 'shared memory' I take it you mean NUMAFlex
machines and similar rather than a little 2x1GHz P4 Dell server?

The kind of machines my program will run on will max out at 4-way, I
would expect, and it's not math-intensive or anything; I just want to
speed up fairly average tasks, which only take around 10 seconds on a
single processor.

Thanks again

Garry
