My main server application, which runs "sitetruth.com",
has both multiple processes and multiple threads in each process.
The system rates web sites, which involves reading and parsing
up to 20 pages from each domain. Analysis of each domain is
performed in a separate process, but each process uses multiple
threads to read and parse several web pages simultaneously.
Some of the threads go compute-bound for a second or two at a time as
they parse web pages. Sometimes two threads (but never more than three)
in the same process may be parsing web pages at the same time, so
they're contending for CPU time.
So this is nearly the worst case for the lame GIL lock logic.
Has anyone tried using "affinity" ("http://pypi.python.org/pypi/affinity")
to lock each Python process to a single CPU? Does that help?
John Nagle
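For what it's worth, on Linux the pinning that the "affinity" package provides
is also available from the standard library in Python 3.3+ as
os.sched_setaffinity (the package wraps the same underlying
sched_setaffinity(2) call, and is what you'd use on the 2.x of the time).
A minimal, Linux-only sketch of pinning the current process to one core:

```python
import os

# Linux-only sketch: pin the current process (and hence all its threads)
# to a single CPU. This is the same effect the "affinity" package gives,
# via the same underlying sched_setaffinity(2) system call.

def pin_to_one_cpu():
    allowed = os.sched_getaffinity(0)   # CPUs this process may run on
    chosen = min(allowed)               # pick the lowest-numbered one
    os.sched_setaffinity(0, {chosen})   # restrict the whole process to it
    return chosen

if __name__ == "__main__":
    cpu = pin_to_one_cpu()
    print("pinned to CPU", cpu)
```

With all threads confined to one core, they time-slice instead of fighting
over the GIL across cores, which is exactly the effect being asked about.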
It's not clear that Beazley's performance numbers apply to any platform
except OS X, which has a particularly poor implementation of the
threading primitives CPython uses to implement the GIL.
You should check to see if it actually applies to your deployment
environment.
The GIL has been re-implemented recently. Python 3.2, I think, will
include the new implementation, which should bring OS X performance up
to the level of other platforms. It may also improve certain other
aspects of thread switching.
Jean-Paul
It's a shame that Python 3.x is dead to you, otherwise you'd be able
to enjoy the new GIL implementation in 3.2: http://www.dabeaz.com/python/NewGIL.pdf
Actually, it looks like you probably still can:
+ patch for 2.5.4: http://thread.gmane.org/gmane.comp.python.devel/109929
+ patch for 2.7? http://bugs.python.org/issue7753
(Can't comment on affinity, though, sorry)
The patch was rejected for 2.7 (and earlier) because it could break code,
as explained in the discussion. One would have to apply the patch and
compile their own binary.
You're probably better off using separate processes for the different
pages. If I remember, you were using BeautifulSoup, which while very
cool, is pretty doggone slow for use on large volumes of pages. I don't
know if there's much that can be done about that without going off on a
fairly messy C or C++ coding adventure. Maybe someday someone will do
that.
I already use separate processes for different domains. I could
live with Python's GIL as long as moving to a multicore server
doesn't make performance worse. That's why I asked about CPU dedication
for each process, to avoid thrashing at the GIL.
There's enough intercommunication between the threads working on
a single site that it's a pain to do them as subprocesses. And I
definitely don't want to launch subprocesses for each page; the
Python load time would be worse than the actual work. The
subprocess module assumes you're willing to launch a subprocess
for each transaction.
The current program organization is that there's a scheduler
process which gets requests, prioritizes them, and runs the requested
domains through the site evaluation mill. The scheduler maintains a
pool of worker processes which get work requests via their input pipes
in pickle format, and return results, again in pickle format. When not in
use, the worker processes sit there dormant, so there's no Python
launch cost for each transaction. If a worker process crashes, the
scheduler replaces it with a fresh one, and every few hundred uses,
each worker process is replaced with a fresh copy, in case Python
has a memory leak. It's a lot like the way
FCGI works.
Scheduling is managed using an in-memory
table in MySQL, so the load can be spread over a cluster if desired,
with a scheduler process on each machine.
So I already have a scalable architecture. The only problem
is excess overhead on multicore CPUs.
John Nagle
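That scheduler/worker shape can be sketched with the multiprocessing module
(an illustration only, not the actual sitetruth code; multiprocessing's
Connection objects do the pickling over the pipe):

```python
import multiprocessing

# Sketch of the scheduler/worker layout described above: a long-lived
# worker reads pickled requests from its pipe and writes pickled results
# back, so there is no interpreter start-up cost per request.

def worker(conn):
    while True:                       # dormant until a request arrives
        try:
            request = conn.recv()     # unpickles the next request
        except EOFError:
            break                     # scheduler closed the pipe: exit
        # ... fetch and rate up to 20 pages for request["domain"] here ...
        conn.send({"domain": request["domain"], "rating": "ok"})

if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")   # Linux; keeps the sketch simple
    parent, child = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child,))
    proc.start()
    child.close()                     # child end now lives only in the worker
    parent.send({"domain": "example.com"})
    print(parent.recv())              # -> {'domain': 'example.com', 'rating': 'ok'}
    parent.close()                    # EOF tells the worker to shut down
    proc.join()
```

A real scheduler would hold many such pipes and replace crashed or aged
workers, as described above.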
You could try using http://pypi.python.org/pypi/affinity/0.1.0
to ensure that each process in your pool runs on only one processor.
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/
No, the original analysis was MacOS oriented, but the same mechanism
applies for fighting over the GIL on all platforms. There was some
pontification that it might be a MacOS-only issue, but no facts
were presented. It might be cheaper on C implementations with mutexes
that don't make system calls for the non-blocking cases.
John Nagle
Why not just use socketserver and have something like a fastcgi?
You could perhaps use a process pool inside each domain worker to work
on the pages? There is multiprocessing.Pool and other
implementations.
For example, in this library you can s/ThreadPool/ProcessPool/g and
this example would work:
<http://www.onideas.ws/stream.py/#retrieving-web-pages-concurrently>.
If you want to DIY, with multiprocessing.Lock/Pipe/Queue, I don't
understand why it would be more of a pain to write your threads as
processes.
// aht
http://blog.onideas.ws
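A toy version of that suggestion with the stdlib multiprocessing.Pool, where
parse_page is a made-up stand-in for the real CPU-bound parse:

```python
from multiprocessing import get_context

# Toy version of the ProcessPool idea: CPU-bound "parsing" is farmed out
# to worker processes, so each parse runs on its own core without GIL
# contention. parse_page is a stand-in for the real HTML parser.

def parse_page(html):
    return html.count("<")   # cheap stand-in for real parsing work

if __name__ == "__main__":
    pages = ["<html><body><p>one</p></body></html>",
             "<html><body><p>two</p><p>three</p></body></html>"]
    # The "fork" context keeps the sketch runnable as a plain script on Linux.
    with get_context("fork").Pool(processes=2) as pool:
        print(pool.map(parse_page, pages))   # -> [6, 8]
```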
Well, if multi-core performance is so important here, then there's a pretty
simple thing the OP can do: switch to lxml.
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Stefan
Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML. The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors. Beautiful Soup is slower because it's
full of special cases and hacks for that reason, and it is written in
Python. Writing something that complex in C to handle so much
potentially malicious input would be a lot of work, and it would be
very difficult to ensure it was really safe. Look at the many
browser vulnerabilities we've seen over the years due to that sort of
problem, for example. But, for web crawling, you really do need to
handle the messy and wrong HTML properly.
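As an illustration of what "tolerant" means here: even the stdlib HTML parser
(html.parser in current Pythons) accepts mismatched and unclosed tags without
raising, and simply reports the events it sees, whereas a strict XML parser
would refuse the document outright:

```python
from html.parser import HTMLParser

# Tolerant parsing: like BeautifulSoup, and unlike a strict XML parser,
# the stdlib HTMLParser does not raise on broken markup.

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)   # record each start tag as it is seen

broken = "<html><body><p>unclosed <b>bold <i>nested wrong</b></i></body>"
collector = TagCollector()
collector.feed(broken)   # no exception despite the mismatched tags
print(collector.tags)    # -> ['html', 'body', 'p', 'b', 'i']
```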
That's on certain types of workloads, and perhaps on certain OSes, so you
should try benching your own workload to see whether it applies.
Two closing remarks:
- this should (hopefully) be fixed in 3.2, as exarkun noticed
- instead of spawning one thread per Web page, you could use Twisted or
another event loop mechanism in order to process pages serially, in the
order of arrival
Regards
Antoine.
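Twisted was the natural choice at the time; with a current Python the same
idea can be sketched with the stdlib asyncio loop. Here fetch and its delays
are simulated stand-ins for real non-blocking downloads:

```python
import asyncio

# Sketch of the event-loop suggestion: downloads overlap, but the
# CPU-bound parse runs serially in the single event-loop thread, in
# order of arrival, so parser threads never contend for the GIL.

async def fetch(url, delay):
    await asyncio.sleep(delay)        # stand-in for a non-blocking download
    return url, "<html>%s</html>" % url

def parse(html):
    return html.count("<")            # stand-in for the real parse

async def main():
    tasks = [asyncio.create_task(fetch("http://example.com/%d" % i, 0.01 * i))
             for i in range(3)]
    results = []
    for finished in asyncio.as_completed(tasks):   # in order of arrival
        url, html = await finished
        results.append((url, parse(html)))         # parsed serially
    return results

if __name__ == "__main__":
    print(asyncio.run(main()))
```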
If the difference is great enough, you might get a benefit from
analyzing all pages with lxml and throwing invalid pages into a bucket
for later processing with BeautifulSoup.
Actually, lxml has an HTML parser which does pretty well with the
standard level of broken one finds most often on the web. And, when it
falls down, it's easy to integrate BeautifulSoup as a slow backup for
when things go really wrong (as J Kenneth King mentioned earlier):
http://codespeak.net/lxml/lxmlhtml.html#parsing-html
At least in my experience, I haven't actually had to parse anything that
lxml couldn't handle yet, however.
--
John Krukoff <jkru...@ltgc.com>
Land Title Guarantee Company
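The fallback pattern itself is tiny. A shape-only sketch, where fast_parse
and slow_parse are hypothetical stand-ins for lxml.html and BeautifulSoup
respectively (neither library is assumed here):

```python
# Shape of the "fast parser first, slow parser as backup" pattern.
# fast_parse stands in for a fast C-backed parser that raises on input
# it cannot handle; slow_parse stands in for a tolerant pure-Python one.
# Both are hypothetical stand-ins, not real library APIs.

class ParseError(Exception):
    pass

def fast_parse(html):
    if "garbage" in html:             # pretend this input defeats the fast parser
        raise ParseError("too broken for the fast parser")
    return ("fast", html.count("<"))

def slow_parse(html):
    return ("slow", html.count("<"))  # tolerant but slower

def parse(html):
    try:
        return fast_parse(html)
    except ParseError:
        return slow_parse(html)       # fall back only when needed

print(parse("<p>fine</p>"))           # -> ('fast', 2)
print(parse("<p>garbage</p>"))        # -> ('slow', 2)
```

Since the fast path handles the common case, the slow fallback rarely runs
and the average cost stays close to the fast parser's.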
I couldn't reproduce these results on Linux. Not sure what "HALF AS
FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I
couldn't reproduce.
If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz)
machine, I get 30s for the sequential execution, 40s for the
multi-threaded case, and 32s for the multi-threaded case when pinning
the Python process to a single CPU (using taskset(1)).
So it's 6% overhead for threading, and 25% penalty for multicore CPUs -
far from the 100% you seem to expect.
Regards,
Martin
It's far from scientific, but I've seen behaviour that's close to a 100%
performance penalty on a dual-core linux system:
http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2
Short story: a particular test suite of mine used to run in around 25
seconds, but a bit of ctypes magic to set thread affinity dropped the
running time to under 13 seconds.
Cheers,
Ryan
--
Ryan Kelly
http://www.rfk.id.au | This message is digitally signed. Please visit
ry...@rfk.id.au | http://www.rfk.id.au/ramblings/gpg/ for details
Indeed, it's not scientific - but with a few more details, you could
improve it quite a lot: what specific Linux distribution (the posting
doesn't even say it's Linux), what specific Python version had you been
using? (less important) what CPUs? If you can: what specific test suite?
A lot of science is about repeatability. Making a systematic study is
(IMO) over-valued - anecdotal reports are useful, too, as long as they
allow for repeatable experiments.
Regards,
Martin
I'm on Ubuntu Karmic, Python 2.6.4, an AMD Athlon 7750 dual core.
Unfortunately the test suite is for a proprietary application. I've
been able to reproduce similar behaviour with an open-source test suite,
using the current trunk of the "pyfilesystem" project:
http://code.google.com/p/pyfilesystem/
In this project "OSFS" is an object-oriented interface to the local
filesystem. The test case "TestOSFS.test_cases_in_separate_dirs" runs
three theads, each doing a bunch of IO in a different directory.
Running the tests normally:
rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 9.787s
That's the best result from five runs - I saw it go as high as 12
seconds. Watching it in top, I see CPU usage at around 150%.
Now using threading2 to set the process cpu affinity at the start of the
test run:
rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 3.792s
Again, best of five. The variability in times here is much lower - I
never saw it go above 4 seconds. CPU usage is consistently 100%.