After reading http://www.python.org/dev/peps/pep-0371/ I was under impression that performance of multiprocessing package is similar to that of thread / threading. However, to familiarize myself with both packages I wrote my own test of spawning and returning 100,000 empty threads or processes (while maintaining at most 100 processes / threads active at any one time), respectively.
The results I got are very different from the benchmark quoted in PEP 371. On twin Xeon machine the threaded version executed in 5.54 secs, while multiprocessing version took over 222 secs to complete!
Am I doing smth wrong in code below? Or do I have to use multiprocessing.Pool to get any decent results?
def testth(self, tid): if tid % 1000 == 0: print "== Thread %d working ==" % tid self.tlock.acquire() self.reslist.append(tid) self.tactivnum -= 1 self.tlock.release()
def calc_100thousand(self): tid = 1 while tid <= 100000: while self.tactivnum > 99: time.sleep(0.01) self.tlock.acquire() self.tactivnum += 1 self.tlock.release() t = thread.start_new_thread(self.testth, (tid,)) tid += 1 while self.tactivnum > 0: time.sleep(0.01)
def testp(pid): if pid % 1000 == 0: print "== Process %d working ==" % pid
def palivelistlen(plist): pll = 0 for p in plist: if p.is_alive(): pll += 1 else: plist.remove(p) p.join() return pll
def testp_100thousand(): pid = 1 proclist = [] while pid <= 100000: while palivelistlen(proclist) > 99: time.sleep(0.01) p = multiprocessing.Process(target=testp, args=(pid,)) p.start() proclist.append(p) pid += 1 print "=== Main thread waiting for all processes to finish ===" for p in proclist: p.join()
> After readinghttp://www.python.org/dev/peps/pep-0371/I was under > impression that performance of multiprocessing package is similar to > that of thread / threading. However, to familiarize myself with both > packages I wrote my own test of spawning and returning 100,000 empty > threads or processes (while maintaining at most 100 processes / threads > active at any one time), respectively.
> The results I got are very different from the benchmark quoted in PEP > 371. On twin Xeon machine the threaded version executed in 5.54 secs, > while multiprocessing version took over 222 secs to complete!
> Am I doing smth wrong in code below? Or do I have to use > multiprocessing.Pool to get any decent results?
Oooh, 100000 processes! You're fortunate that your OS handled them in finite time.
[quick browsing through the code]
Ah, so there are 100 processes at time. 200secs still don't sound strange.
This is very different from PEP 371. It appears that the PEP 371 code was written on Mac OS X. The conclusion I get from comparing above costs sis that OS X must have very low cost of creating the process, at least when compared to Linux, not that multiprocessing is a viable alternative to thread / threading module. :-(
mk wrote: > Am I doing smth wrong in code below? Or do I have to use > multiprocessing.Pool to get any decent results?
You have missed an important point. A well designed application does neither create so many threads nor processes. The creation of a thread or forking of a process is an expensive operation. You should use a pool of threads or processes.
The limiting factor is not the creation time but the communication and synchronization overhead between multiple threads or processes.
Christian Heimes wrote: > mk wrote: >> Am I doing smth wrong in code below? Or do I have to use >> multiprocessing.Pool to get any decent results?
> You have missed an important point. A well designed application does > neither create so many threads nor processes.
Except I was not developing "well designed application" but writing the test the goal of which was measuring the thread / process creation cost.
> The creation of a thread > or forking of a process is an expensive operation.
Sure. The point is, how expensive? While still being relatively expensive, it turns out that in Python creating a thread is much, much cheaper than creating a process via multiprocessing on Linux, while this seems to be not necessarily true on Mac OS X.
> You should use a pool > of threads or processes.
Probably true, except, again, that was not quite the point of this exercise..
> The limiting factor is not the creation time but the communication and > synchronization overhead between multiple threads or processes.
In article <mailman.6337.1230563873.3487.python-l...@python.org>, Christian Heimes <li...@cheimes.de> wrote:
> You have missed an important point. A well designed application does > neither create so many threads nor processes. The creation of a thread > or forking of a process is an expensive operation. You should use a pool > of threads or processes.
It's worth noting that forking a new process is usually a much more expensive operation than creating a thread. Not that I would want to create 100,000 of either!
Not everybody realizes it, but threads eat up a fair chunk of memory (you get one stack per thread, which means you need to allocate a hunk of memory for each stack). I did a quick look around; 256k seems like a common default stack size. 1 meg wouldn't be unheard of.
> After readinghttp://www.python.org/dev/peps/pep-0371/I was under > impression that performance of multiprocessing package is similar to > that of thread / threading. However, to familiarize myself with both > packages I wrote my own test of spawning and returning 100,000 empty > threads or processes (while maintaining at most 100 processes / threads > active at any one time), respectively.
> The results I got are very different from the benchmark quoted in PEP > 371. On twin Xeon machine the threaded version executed in 5.54 secs, > while multiprocessing version took over 222 secs to complete!
> Am I doing smth wrong in code below? Or do I have to use > multiprocessing.Pool to get any decent results?
I'm running a 1.6 GHz. I only ran 10000 empty threads and 10000 empty processes. The threads were the ones you wrote. The processes were empty executables written in a lower language, also run 100 at a time, started with 'subprocess', not 'multiprocessing'. The threads took 1.2 seconds. The processes took 24 seconds.
The processes you wrote had only finished 3000 after several minutes.
There's smth wrong with numbers posted in PEP. This is what I got on 4-socket Xeon (+ HT) with Python 2.6.1 on Debian (Etch), with kernel upgraded to 2.6.22.14:
Roy Smith <r...@panix.com> writes: > In article <mailman.6337.1230563873.3487.python-l...@python.org>, > Christian Heimes <li...@cheimes.de> wrote:
>> You have missed an important point. A well designed application does >> neither create so many threads nor processes. The creation of a thread >> or forking of a process is an expensive operation. You should use a pool >> of threads or processes.
> It's worth noting that forking a new process is usually a much more > expensive operation than creating a thread.
If by "forking" you mean an actual fork() call, as opposed to invoking a different executable, the difference is not necessarily that great. Modern Unix systems tend to implement a 1:1 mapping between threads and kernel processes, so creating a thread and forking a process require similar amount of work.
On my system, as measured by timeit, spawning and joining a thread takes 111 usecs, while forking and waiting for a process takes 260. Slower, but not catastrophically so.
> Not that I would want to create 100,000 of either!
Agreed.
> Not everybody realizes it, but threads eat up a fair chunk of memory > (you get one stack per thread, which means you need to allocate a > hunk of memory for each stack). I did a quick look around; 256k > seems like a common default stack size. 1 meg wouldn't be unheard > of.
Note that this memory is virtual memory, so it doesn't use up the physical RAM until actually used. I've seen systems running legacy Java applications that create thousands of threads where *virtual* memory was the bottleneck.
On Tue, Dec 30, 2008 at 12:52 AM, mk <mrk...@gmail.com> wrote: > Hello everyone,
> After reading http://www.python.org/dev/peps/pep-0371/ I was under > impression that performance of multiprocessing package is similar to that of > thread / threading. However, to familiarize myself with both packages I > wrote my own test of spawning and returning 100,000 empty threads or > processes (while maintaining at most 100 processes / threads active at any > one time), respectively.
> The results I got are very different from the benchmark quoted in PEP 371. > On twin Xeon machine the threaded version executed in 5.54 secs, while > multiprocessing version took over 222 secs to complete!
> Am I doing smth wrong in code below? Or do I have to use > multiprocessing.Pool to get any decent results?
The overhead in starting OS level processes is quite high. This is why event-driven, single process servers can perform far better than ones that fork (spawn multiple processes) per request.
As others have mentioned, it's not suprising that spawning even 100 processes took some time.
Bottom line: multiprocessing should not be used this way. (nor should threading).
> On Tue, Dec 30, 2008 at 12:52 AM, mk <mrk...@gmail.com> wrote: > > Hello everyone,
> > After readinghttp://www.python.org/dev/peps/pep-0371/I was under > > impression that performance of multiprocessing package is similar to that of > > thread / threading. However, to familiarize myself with both packages I > > wrote my own test of spawning and returning 100,000 empty threads or > > processes (while maintaining at most 100 processes / threads active at any > > one time), respectively. snip > As others have mentioned, it's not suprising > that spawning even 100 processes took some > time.
> Bottom line: multiprocessing should not be used this way. > (nor should threading).
The OP may be interested in Erlang, which Wikipedia (end-all, be-all) claims is a 'distribution oriented language'.
You might also find it interesting to examine a theoretical OS that is optimized for process overhead. In other words, what is the minimum overhead possible? Can processes be as small as threads? Can entire threads be only a few bytes (words) big?
Also, could generators provide any of the things you need with your multiple threads? You could, say, call 'next()' on many items in a list, and just remove them on StopIteration.
On Tue, Dec 30, 2008 at 11:34 AM, Aaron Brady <castiro...@gmail.com> wrote: > The OP may be interested in Erlang, which Wikipedia (end-all, be-all) > claims is a 'distribution oriented language'.
I would suggest to the OP that he take a look at circuits (1) an event framework with a focus on component architectures and distributed processing.
I'm presently looking at Virtual Synchrony and other distributed processing architectures - but circuits is meant to be general purpose enough to fit event-driven applications/systems.
On Dec 29, 7:40 pm, "James Mills" <prolo...@shortcircuit.net.au> wrote:
> On Tue, Dec 30, 2008 at 11:34 AM, Aaron Brady <castiro...@gmail.com> wrote: > > The OP may be interested in Erlang, which Wikipedia (end-all, be-all) > > claims is a 'distribution oriented language'. snip > I'm presently looking at Virtual Synchrony and > other distributed processing architectures - but > circuits is meant to be general purpose enough > to fit event-driven applications/systems.
I noticed a while ago that threads can be used to simulate generators. 'next', 'send', and 'yield' are merely replaced by synchronizor calls (coining the term).
Not the other way around, though, unless the generator guarantees a yield frequently. 'settrace' anyone?
On Tue, Dec 30, 2008 at 12:52 PM, Aaron Brady <castiro...@gmail.com> wrote: > On Dec 29, 7:40 pm, "James Mills" <prolo...@shortcircuit.net.au> > wrote: >> On Tue, Dec 30, 2008 at 11:34 AM, Aaron Brady <castiro...@gmail.com> wrote: >> > The OP may be interested in Erlang, which Wikipedia (end-all, be-all) >> > claims is a 'distribution oriented language'. > snip >> I'm presently looking at Virtual Synchrony and >> other distributed processing architectures - but >> circuits is meant to be general purpose enough >> to fit event-driven applications/systems.
> I noticed a while ago that threads can be used to simulate > generators. 'next', 'send', and 'yield' are merely replaced by > synchronizor calls (coining the term).
> Not the other way around, though, unless the generator guarantees a > yield frequently. 'settrace' anyone?
Aaron, circuits doesn't use generators :) What did your comment have to do with this ?
I have often seen generators used to facilitate coroutine and coooperative programming though.
> On Tue, Dec 30, 2008 at 12:52 PM, Aaron Brady <castiro...@gmail.com> wrote: > > On Dec 29, 7:40 pm, "James Mills" <prolo...@shortcircuit.net.au> > > wrote: > >> On Tue, Dec 30, 2008 at 11:34 AM, Aaron Brady <castiro...@gmail.com> wrote: > >> > The OP may be interested in Erlang, which Wikipedia (end-all, be-all) > >> > claims is a 'distribution oriented language'. > > snip > >> I'm presently looking at Virtual Synchrony and > >> other distributed processing architectures - but > >> circuits is meant to be general purpose enough > >> to fit event-driven applications/systems.
> > I noticed a while ago that threads can be used to simulate > > generators. 'next', 'send', and 'yield' are merely replaced by > > synchronizor calls (coining the term).
> > Not the other way around, though, unless the generator guarantees a > > yield frequently. 'settrace' anyone?
> Aaron, circuits doesn't use generators :) > What did your comment have to do with this ?
> I have often seen generators used to > facilitate coroutine and coooperative > programming though.
> cheers > James
James, Hi. I'm glad you asked; I never know how "out there" my comments are (but surmise that feedback is always a good thing). What I was thinking was, I didn't know Virtual Synchrony, and I've never used Erlang, but I'm interested in concurrency especially as it pertains to units of work, division of labor, and division of context; and generators are another way to divide context. So: I wanted to put more of my background and interests on the table. What I said wasn't directly relevant, I see. But it's not like I "dissertated" (discussed) the Tibettan-Austrian spice trade. I think I just want to say stuff about threading! Maybe I'm just excited to meet people who share my interests... not unheard of.
In Economics, you can divide a market into vertical and horizontal dimensions, vertical being the chain from raw resources to finished products, and horizontal being market coverage. With a similar division in tasks, horizontal units would handle incoming events, prepare them, then pass them on to a next unit, which processes a little, then passes it on, like an assembly line (bucket brigade/ alternating current); and vertical units would take one incoming event, and see it through start to finish (direct current). You don't have to use that depiction.
The terminology is transposed from that of a distributed matrix multiplication task depiction, where horizontal works start-to-finish, and vertical takes handoffs from its predecessor.
'Circuits' doesn't use generators. I think generators are an underexplored technique. Is 'circuits' assembly line or start-to- finish, if my analogy makes any sense? 'Circuits' is event-driven, but I don't see any difference between 'event-driven' and multithreaded in general. (I think contrast can create a good picture and a clear understanding.) What is special about an 'event-driven' architecture? Are you distinguishing blocking from polling?
On Wed, Dec 31, 2008 at 12:29 AM, Aaron Brady <castiro...@gmail.com> wrote: > James, Hi. I'm glad you asked; I never know how "out there" my > comments are (but surmise that feedback is always a good thing). What > I was thinking was, I didn't know Virtual Synchrony, and I've never > used Erlang, but I'm interested in concurrency especially as it > pertains to units of work, division of labor, and division of context; > and generators are another way to divide context. So: I wanted to put > more of my background and interests on the table. What I said wasn't > directly relevant, I see. But it's not like I > "dissertated" (discussed) the Tibettan-Austrian spice trade. I think > I just want to say stuff about threading! Maybe I'm just excited to > meet people who share my interests... not unheard of.
Glad to see others also interested in these topics :)
(snip)
> 'Circuits' doesn't use generators. I think generators are an > underexplored technique. Is 'circuits' assembly line or start-to- > finish, if my analogy makes any sense? 'Circuits' is event-driven, > but I don't see any difference between 'event-driven' and > multithreaded in general. (I think contrast can create a good picture > and a clear understanding.) What is special about an 'event-driven' > architecture? Are you distinguishing blocking from polling?
I'll shortly be releasing circuits-1.0 today hopefully. To answer your question, circuits is inspired by a software architecture that my most favoured lecturer a few years back was teaching. That is: * Behaviour Trees (design) and consequently: * The concept of "everything is a Component" * Systems and Sub-Systems are built upon Components and * Everything is an event. and * An emergent property of such systems are "Behaviour".
That being said, circuits employs both an event-driven approach as well as a Component architecture. In your analogy it is both horizontal and vertical.
As I continue to develop circuits and improve it's core design as well as building it's ever growing set of Components, I try to keep it as general as possible - my main aim though is distributed processing and architectures. (See the primes example).
> As I continue to develop circuits and improve it's > core design as well as building it's ever growing set > of Components, I try to keep it as general as > possible - my main aim though is distributed > processing and architectures. (See the primes example).
Aaron, just wanted to demonstrate to you the example (primes) that I mentioned above:
Total Primes: 1001 (23/s after 44.16s) Total Events: 43373 (983/s after 44.16s) Distribution: c1096b40-7606-4ba7-9593-1385e14ef339: 348 8313b43f-d45d-4a0a-8d87-e6a93d3dfb0b: 653
The example uses circuits to distribute work amongst any arbitrary number of nodes connected to the system. This is all manually implemented by simply taking advantage of the circuits framework with my basic understanding ( so far ) of distributed processing, virtual synchrony, and other techniques.
This is the same thing just run on a single node (no distribution):
Total Primes: 1001 (17/s after 62.13s) Total Events: 28579 (460/s after 62.13s) Distribution: 701be749-9833-40a4-9181-5ee18047b1ad: 1001
As you can see, running 2 instances almost halves the time. If you do try this though, you'll note that the CPU isn't being used very heavily at all - This could be improved - but the example merely demonstrates distributed event processing and syncronization.
>You have missed an important point. A well designed application does >neither create so many threads nor processes. The creation of a thread >or forking of a process is an expensive operation.
That actually depends on the operating system. As one example, thread creation on Windows is not all that expensive.
>You should use a pool of threads or processes.
Even so, this is good advise. -- Tim Roberts, t...@probo.com Providenza & Boekelheide, Inc.
> Christian Heimes wrote: > > mk wrote: > >> Am I doing smth wrong in code below? Or do I have to use > >> multiprocessing.Pool to get any decent results?
> > You have missed an important point. A well designed application does > > neither create so many threads nor processes.
> Except I was not developing "well designed application" but writing the > test the goal of which was measuring the thread / process creation cost.
> > The creation of a thread > > or forking of a process is an expensive operation.
> Sure. The point is, how expensive? While still being relatively > expensive, it turns out that in Python creating a thread is much, much > cheaper than creating a process via multiprocessing on Linux, while this > seems to be not necessarily true on Mac OS X.
> > You should use a pool > > of threads or processes.
> Probably true, except, again, that was not quite the point of this > exercise..
> > The limiting factor is not the creation time but the communication and > > synchronization overhead between multiple threads or processes.
> Which I am probably going to test as well.
I had an idea. You could use 'multiprocessing' for synchronization, and just use an mmap for the actual communication. (I think it's a good idea.)
Aaron Brady <castiro...@gmail.com> writes: > I had an idea. You could use 'multiprocessing' for synchronization, > and just use an mmap for the actual communication. (I think it's a > good idea.)
On Dec 31, 6:19 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> Aaron Brady <castiro...@gmail.com> writes: > > I had an idea. You could use 'multiprocessing' for synchronization, > > and just use an mmap for the actual communication. (I think it's a > > good idea.)
I thought the same thing! Paul Boddie introduced me to POSH some months ago. I don't recall the implementation of synchro. objects in POSH, but IIRC, 'multiprocessing' uses native OS handles opening in multiple processes. It could be that there would be substantial overlap, or the trade-offs could be significant. For one, the above combination only permits strings to be shared, not objects.
> Aaron Brady <castiro...@gmail.com> writes: >> I had an idea. You could use 'multiprocessing' for synchronization, >> and just use an mmap for the actual communication. (I think it's a >> good idea.)
mk <mrk...@gmail.com> wrote: > After reading http://www.python.org/dev/peps/pep-0371/ I was under > impression that performance of multiprocessing package is similar to > that of thread / threading. However, to familiarize myself with both > packages I wrote my own test of spawning and returning 100,000 empty > threads or processes (while maintaining at most 100 processes / threads > active at any one time), respectively.
> The results I got are very different from the benchmark quoted in PEP > 371. On twin Xeon machine the threaded version executed in 5.54 secs, > while multiprocessing version took over 222 secs to complete!
> Am I doing smth wrong in code below?
Yes!
The problem with your code is that you never start more than one process at once in the multiprocessing example. Just check ps when it is running and you will see.
My conjecture is that this is due to the way fork() works under unix. I think that when the parent forks it yields the CPU to the child. Because you are giving the child effectively no work to do it returns immediately, re-awakening the parent, thus serialising your jobs.
If you give the children some work to do you'll see a quite different result. I gave each child time.sleep(1) to do and cut down the total number to 10,000.
$ ./test_multiprocessing.py == Process 1000 working == == Process 2000 working == == Process 3000 working == == Process 4000 working == == Process 5000 working == == Process 6000 working == == Process 7000 working == == Process 8000 working == == Process 9000 working == == Process 10000 working == === Main thread waiting for all processes to finish === Total time: 101.382129192
$ ./test_threading.py == Thread 1000 working == == Thread 2000 working == == Thread 3000 working == == Thread 4000 working == == Thread 5000 working == == Thread 6000 working == == Thread 7000 working == == Thread 8000 working == == Thread 9000 working == == Thread 10000 working == Total time: 100.659118176
So almost identical results and as expected - we ran 10,000 sleep(1)s in 100 seconds so we must have been running 100 simultaneously.
If you replace the "time.sleep(1)" with "for _ in xrange(1000000): pass" you get this much more interesting answer on my dual core linux laptop, showing nicely the effect of the contention on the python global interpreter lock and how multiprocessing avoids it.
$ ./test_multiprocessing.py == Process 1000 working == == Process 2000 working == == Process 3000 working == == Process 4000 working == == Process 5000 working == == Process 6000 working == == Process 7000 working == == Process 8000 working == == Process 9000 working == == Process 10000 working == === Main thread waiting for all processes to finish === Total time: 266.808327913
$ ./test_threading.py == Thread 1000 working == == Thread 2000 working == == Thread 3000 working == == Thread 4000 working == == Thread 5000 working == == Thread 6000 working == == Thread 7000 working == == Thread 8000 working == == Thread 9000 working == == Thread 10000 working == Total time: 834.81882
En Sat, 03 Jan 2009 11:31:12 -0200, Nick Craig-Wood <n...@craig-wood.com> escribió:
> mk <mrk...@gmail.com> wrote: >> The results I got are very different from the benchmark quoted in PEP >> 371. On twin Xeon machine the threaded version executed in 5.54 secs, >> while multiprocessing version took over 222 secs to complete!
>> Am I doing smth wrong in code below?
> Yes!
> The problem with your code is that you never start more than one > process at once in the multiprocessing example. Just check ps when it > is running and you will see.
Oh, very good analysis! Those results were worriying me a little.