I know there's a performance penalty for running Python on a multicore CPU, but how bad is it? I've read the key paper ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate if the GIL just limited Python to running on one CPU at a time, but it's worse than that; there's excessive overhead due to a lame locking implementation. Running CPU-bound multithreaded code on a dual-core CPU runs HALF AS FAST as on a single-core CPU, according to Beasley.
My main server application, which runs "sitetruth.com" has both multiple processes and multiple threads in each process. The system rates web sites, which involves reading and parsing up to 20 pages from each domain. Analysis of each domain is performed in a separate process, but each process uses multiple threads to read process several web pages simultaneously.
Some of the threads go compute-bound for a second or two at a time as they parse web pages. Sometimes two threads (but never more than three) in the same process may be parsing web pages at the same time, so they're contending for CPU time.
So this is nearly the worst case for the lame GIL lock logic. Has anyone tried using "affinity" ("http://pypi.python.org/pypi/affinity") to lock each Python process to a single CPU? Does that help?
> I know there's a performance penalty for running Python on a >multicore CPU, but how bad is it? I've read the key paper >("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate >if the GIL just limited Python to running on one CPU at a time, >but it's worse than that; there's excessive overhead due to >a lame locking implementation. Running CPU-bound multithreaded >code on a dual-core CPU runs HALF AS FAST as on a single-core >CPU, according to Beasley.
It's not clear that Beasley's performance numbers apply to any platform except OS X, which has a particularly poor implementation of the threading primitives CPython uses to implement the GIL.
You should check to see if it actually applies to your deployment environment.
The GIL has been re-implemented recently. Python 3.2, I think, will include the new implementation, which should bring OS X performance up to the level of other platforms. It may also improve certain other aspects of thread switching.
On Feb 3, 9:02 am, John Nagle <na...@animats.com> wrote:
> I know there's a performance penalty for running Python on a > multicore CPU, but how bad is it? I've read the key paper > ("www.dabeaz.com/python/GIL.pdf"), of course.
> On Feb 3, 9:02 am, John Nagle<na...@animats.com> wrote: >> I know there's a performance penalty for running Python on a >> multicore CPU, but how bad is it? I've read the key paper >> ("www.dabeaz.com/python/GIL.pdf"), of course.
> It's a shame that Python 3.x is dead to you, otherwise you'd be able > to enjoy the new GIL implementation in 3.2: http://www.dabeaz.com/python/NewGIL.pdf
The patch was rejected for 2.7 (and earlier) because it could break code as explained in the discussion. One would have to apply and compile their own binary.
John Nagle <na...@animats.com> writes: > Analysis of each domain is > performed in a separate process, but each process uses multiple > threads to read process several web pages simultaneously.
> Some of the threads go compute-bound for a second or two at a time as > they parse web pages.
You're probably better off using separate processes for the different pages. If I remember, you were using BeautifulSoup, which while very cool, is pretty doggone slow for use on large volumes of pages. I don't know if there's much that can be done about that without going off on a fairly messy C or C++ coding adventure. Maybe someday someone will do that.
Paul Rubin wrote: > John Nagle <na...@animats.com> writes: >> Analysis of each domain is >> performed in a separate process, but each process uses multiple >> threads to read process several web pages simultaneously.
>> Some of the threads go compute-bound for a second or two at a time as >> they parse web pages.
> You're probably better off using separate processes for the different > pages. If I remember, you were using BeautifulSoup, which while very > cool, is pretty doggone slow for use on large volumes of pages. I don't > know if there's much that can be done about that without going off on a > fairly messy C or C++ coding adventure. Maybe someday someone will do > that.
I already use separate processes for different domains. I could live with Python's GIL as long as moving to a multicore server doesn't make performance worse. That's why I asked about CPU dedication for each process, to avoid thrashing at the GIL.
There's enough intercommunication between the threads working on a single site that it's a pain to do them as subprocesses. And I definitely don't want to launch subprocesses for each page; the Python load time would be worse than the actual work. The subprocess module assumes you're willing to launch a subprocess for each transaction.
The current program organization is that there's a scheduler process which gets requests, prioritizes them, and runs the requested domains through the site evaluation mill. The scheduler maintains a pool of worker processes which get work request via their input pipe, in Pickle format, and return results, again in Pickle format. When not in use, the worker processes sit there dormant, so there's no Python launch cost for each transaction. If a worker process crashes, the scheduler replaces it with a fresh one, and every few hundred uses, each worker process is replaced with a fresh copy, in case Python has a memory leak. It's a lot like the way FCGI works.
Scheduling is managed using an in-memory table in MySQL, so the load can be spread over a cluster if desired, with a scheduler process on each machine.
So I already have a scalable architecture. The only problem is excess overhead on multicore CPUs.
John Nagle wrote: > Paul Rubin wrote: >> John Nagle <na...@animats.com> writes: >>> Analysis of each domain is >>> performed in a separate process, but each process uses multiple >>> threads to read process several web pages simultaneously.
>>> Some of the threads go compute-bound for a second or two at a time as >>> they parse web pages.
>> You're probably better off using separate processes for the different >> pages. If I remember, you were using BeautifulSoup, which while very >> cool, is pretty doggone slow for use on large volumes of pages. I don't >> know if there's much that can be done about that without going off on a >> fairly messy C or C++ coding adventure. Maybe someday someone will do >> that.
> I already use separate processes for different domains. I could > live with Python's GIL as long as moving to a multicore server > doesn't make performance worse. That's why I asked about CPU dedication > for each process, to avoid thrashing at the GIL.
I believe it's already been said that the GIL thrashing is mostly MacOS specific. You might also find something in the affinity module
Steve Holden wrote: > John Nagle wrote: >> Paul Rubin wrote: >>> John Nagle <na...@animats.com> writes: >>>> Analysis of each domain is >>>> performed in a separate process, but each process uses multiple >>>> threads to read process several web pages simultaneously.
>>>> Some of the threads go compute-bound for a second or two at a time as >>>> they parse web pages. >>> You're probably better off using separate processes for the different >>> pages. If I remember, you were using BeautifulSoup, which while very >>> cool, is pretty doggone slow for use on large volumes of pages. I don't >>> know if there's much that can be done about that without going off on a >>> fairly messy C or C++ coding adventure. Maybe someday someone will do >>> that. >> I already use separate processes for different domains. I could >> live with Python's GIL as long as moving to a multicore server >> doesn't make performance worse. That's why I asked about CPU dedication >> for each process, to avoid thrashing at the GIL.
> I believe it's already been said that the GIL thrashing is mostly MacOS > specific. You might also find something in the affinity module
No, the original analysis was MacOS oriented, but the same mechanism applies for fighting over the GIL on all platforms. There was some pontification that it might be a MacOS-only issue, but no facts were presented. It might be cheaper on C implementations with mutexes that don't make system calls for the non-blocking cases.
John Nagle <na...@animats.com> writes: > There's enough intercommunication between the threads working on > a single site that it's a pain to do them as subprocesses. And I > definitely don't want to launch subprocesses for each page; the > Python load time would be worse than the actual work. The > subprocess module assumes you're willing to launch a subprocess > for each transaction.
Why not just use socketserver and have something like a fastcgi?
On Feb 4, 10:46 am, John Nagle <na...@animats.com> wrote:
> There's enough intercommunication between the threads working on > a single site that it's a pain to do them as subprocesses. And I > definitely don't want to launch subprocesses for each page; the > Python load time would be worse than the actual work. The > subprocess module assumes you're willing to launch a subprocess > for each transaction.
You could perhaps use a process pool inside each domain worker to work on the pages? There is multiprocessing.Pool and other implementations.
> John Nagle writes: >> Analysis of each domain is >> performed in a separate process, but each process uses multiple >> threads to read process several web pages simultaneously.
>> Some of the threads go compute-bound for a second or two at a time as >> they parse web pages.
> You're probably better off using separate processes for the different > pages. If I remember, you were using BeautifulSoup, which while very > cool, is pretty doggone slow for use on large volumes of pages. I don't > know if there's much that can be done about that without going off on a > fairly messy C or C++ coding adventure. Maybe someday someone will do > that.
Well, if multi-core performance is so important here, then there's a pretty simple thing the OP can do: switch to lxml.
Stefan Behnel <stefan...@behnel.de> writes: > Well, if multi-core performance is so important here, then there's a pretty > simple thing the OP can do: switch to lxml.
Well, lxml is uses libxml2, a fast XML parser written in C, but AFAIK it only works on well-formed XML. The point of Beautiful Soup is that it works on all kinds of garbage hand-written legacy HTML with mismatched tags and other sorts of errors. Beautiful Soup is slower because it's full of special cases and hacks for that reason, and it is written in Python. Writing something that complex in C to handle so much potentially malicious input would be quite a lot of work to write at all, and very difficult to ensure was really safe. Look at the many browser vulnerabilities we've seen over the years due to that sort of problem, for example. But, for web crawling, you really do need to handle the messy and wrong HTML properly.
Le Tue, 02 Feb 2010 15:02:49 -0800, John Nagle a écrit :
> I know there's a performance penalty for running Python on a multicore > CPU, but how bad is it? I've read the key paper > ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate if > the GIL just limited Python to running on one CPU at a time, but it's > worse than that; there's excessive overhead due to a lame locking > implementation. Running CPU-bound multithreaded code on a dual-core CPU > runs HALF AS FAST as on a single-core CPU, according to Beasley.
That's on certain types of workloads, and perhaps on certain OSes, so you should try benching your own workload to see whether it applies.
Two closing remarks: - this should (hopefully) be fixed in 3.2, as exarkun noticed - instead of spawning one thread per Web page, you could use Twisted or another event loop mechanism in order to process pages serially, in the order of arrival
Paul Rubin <no.em...@nospam.invalid> writes: > Stefan Behnel <stefan...@behnel.de> writes: >> Well, if multi-core performance is so important here, then there's a pretty >> simple thing the OP can do: switch to lxml.
> Well, lxml is uses libxml2, a fast XML parser written in C, but AFAIK it > only works on well-formed XML. The point of Beautiful Soup is that it > works on all kinds of garbage hand-written legacy HTML with mismatched > tags and other sorts of errors. Beautiful Soup is slower because it's > full of special cases and hacks for that reason, and it is written in > Python. Writing something that complex in C to handle so much > potentially malicious input would be quite a lot of work to write at > all, and very difficult to ensure was really safe. Look at the many > browser vulnerabilities we've seen over the years due to that sort of > problem, for example. But, for web crawling, you really do need to > handle the messy and wrong HTML properly.
If the difference is great enough, you might get a benefit from analyzing all pages with lxml and throwing invalid pages into a bucket for later processing with BeautifulSoup.
On Mon, 2010-02-08 at 01:10 -0800, Paul Rubin wrote: > Stefan Behnel <stefan...@behnel.de> writes: > > Well, if multi-core performance is so important here, then there's a pretty > > simple thing the OP can do: switch to lxml.
> Well, lxml is uses libxml2, a fast XML parser written in C, but AFAIK it > only works on well-formed XML. The point of Beautiful Soup is that it > works on all kinds of garbage hand-written legacy HTML with mismatched > tags and other sorts of errors. Beautiful Soup is slower because it's > full of special cases and hacks for that reason, and it is written in > Python. Writing something that complex in C to handle so much > potentially malicious input would be quite a lot of work to write at > all, and very difficult to ensure was really safe. Look at the many > browser vulnerabilities we've seen over the years due to that sort of > problem, for example. But, for web crawling, you really do need to > handle the messy and wrong HTML properly.
Actually, lxml has an HTML parser which does pretty well with the standard level of broken one finds most often on the web. And, when it falls down, it's easy to integrate BeautifulSoup as a slow backup for when things go really wrong (as J Kenneth King mentioned earlier):
At least in my experience, I haven't actually had to parse anything that lxml couldn't handle yet, however. -- John Krukoff <jkruk...@ltgc.com> Land Title Guarantee Company
John Nagle wrote: > I know there's a performance penalty for running Python on a > multicore CPU, but how bad is it? I've read the key paper > ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate > if the GIL just limited Python to running on one CPU at a time, > but it's worse than that; there's excessive overhead due to > a lame locking implementation. Running CPU-bound multithreaded > code on a dual-core CPU runs HALF AS FAST as on a single-core > CPU, according to Beasley.
I couldn't reproduce these results on Linux. Not sure what "HALF AS FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I couldn't reproduce.
If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz) machine, I get 30s for the sequential execution, 40s for the multi-threaded case, and 32s for the multi-threaded case when pinning the Python process to a single CPU (using taskset(1)).
So it's 6% overhead for threading, and 25% penalty for multicore CPUs - far from the 100% you seem to expect.
On Sun, 2010-02-21 at 22:22 +0100, Martin v. Loewis wrote: > John Nagle wrote: > > I know there's a performance penalty for running Python on a > > multicore CPU, but how bad is it? I've read the key paper > > ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate > > if the GIL just limited Python to running on one CPU at a time, > > but it's worse than that; there's excessive overhead due to > > a lame locking implementation. Running CPU-bound multithreaded > > code on a dual-core CPU runs HALF AS FAST as on a single-core > > CPU, according to Beasley.
> I couldn't reproduce these results on Linux. Not sure what "HALF AS > FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I > couldn't reproduce.
> If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz) > machine, I get 30s for the sequential execution, 40s for the > multi-threaded case, and 32s for the multi-threaded case when pinning > the Python process to a single CPU (using taskset(1)).
> So it's 6% overhead for threading, and 25% penalty for multicore CPUs - > far from the 100% you seem to expect.
It's far from scientific, but I've seen behaviour that's close to a 100% performance penalty on a dual-core linux system:
Short story: a particular test suite of mine used to run in around 25 seconds, but a bit of ctypes magic to set thread affinity dropped the running time to under 13 seconds.
> Short story: a particular test suite of mine used to run in around 25 > seconds, but a bit of ctypes magic to set thread affinity dropped the > running time to under 13 seconds.
Indeed, it's not scientific - but with a few more details, you could improve it quite a lot: what specific Linux distribution (the posting doesn't even say it's Linux), what specific Python version had you been using? (less important) what CPUs? If you can: what specific test suite?
A lot of science is about repeatability. Making a systematic study is (IMO) over-valued - anecdotal reports are useful, too, as long as they allow for repeatable experiments.
> Short story: a particular test suite of mine used to run in around 25 > seconds, but a bit of ctypes magic to set thread affinity dropped the > running time to under 13 seconds.
Indeed, it's not scientific - but with a few more details, you could improve it quite a lot: what specific Linux distribution (the posting doesn't even say it's Linux), what specific Python version had you been using? (less important) what CPUs? If you can: what specific test suite?
A lot of science is about repeatability. Making a systematic study is (IMO) over-valued - anecdotal reports are useful, too, as long as they allow for repeatable experiments.
On Sun, 2010-02-21 at 23:05 +0100, Martin v. Loewis wrote: > > It's far from scientific, but I've seen behaviour that's close to a 100% > > performance penalty on a dual-core linux system:
> > Short story: a particular test suite of mine used to run in around 25 > > seconds, but a bit of ctypes magic to set thread affinity dropped the > > running time to under 13 seconds.
> Indeed, it's not scientific - but with a few more details, you could > improve it quite a lot: what specific Linux distribution (the posting > doesn't even say it's Linux), what specific Python version had you been > using? (less important) what CPUs? If you can: what specific test suite?
I'm on Ubuntu Karmic, Python 2.6.4, an AMD Athlon 7750 dual core.
Unfortunately the test suite is for a proprietary application. I've been able to reproduce similar behaviour with an open-source test suite, using the current trunk of the "pyfilesystem" project:
In this project "OSFS" is an object-oriented interface to the local filesystem. The test case "TestOSFS.test_cases_in_separate_dirs" runs three theads, each doing a bunch of IO in a different directory.
Running the tests normally:
rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs . ---------------------------------------------------------------------- Ran 1 test in 9.787s
That's the best result from five runs - I saw it go as high as 12 seconds. Watching it in top, I see CPU usage at around 150%.
Now using threading2 to set the process cpu affinity at the start of the test run:
rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs . ---------------------------------------------------------------------- Ran 1 test in 3.792s
Again, best of five. The variability in times here is much lower - I never saw it go above 4 seconds. CPU usage is consistently 100%.