What would be best practice for speeding up a large number of HTTP GET
requests done via urllib? Until now they are made in sequence, each request
taking up to one second. The results must be merged into a list, but the
original order need not be preserved.
I think speed could be improved by parallelizing, e.g. with multiple
threads.
Are there any Python best practices, or even existing modules, for creating
and handling a task queue with a fixed number of concurrent threads?
Thanks and regards!
I believe code of this type has been published here in various threads.
The fairly obvious thing to do is use a Queue.Queue for tasks and
another for results, with a pool of threads that read, fetch, and write.
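A minimal sketch of that pattern (written with the Python 3 module names
queue/urllib.request; in Python 2 these are Queue and urllib). The
run_pool helper and its parameters are illustrative, not an established
API:

```python
import threading
import urllib.request
from queue import Queue

def run_pool(urls, num_workers=4, fetch=None):
    """Fetch every URL with a fixed-size pool of worker threads.

    Results are collected on a second Queue, so their order is arbitrary.
    """
    if fetch is None:
        # Default fetcher; any callable taking a URL can be substituted.
        fetch = lambda url: urllib.request.urlopen(url).read()
    tasks, results = Queue(), Queue()

    def worker():
        while True:
            url = tasks.get()
            if url is None:        # sentinel: no more work for this thread
                return
            results.put(fetch(url))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    # Drain the result queue into a list; order is not preserved.
    return [results.get() for _ in range(results.qsize())]
```

Passing a stub for fetch makes the pool easy to test without touching the
network.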
Using multiple threads is one approach. There are a few thread pool
implementations lying about; one is part of Twisted,
<http://twistedmatrix.com/documents/current/api/twisted.python.threadpool.ThreadPool.html>.
Another approach is to use non-blocking or asynchronous I/O to make
multiple requests without using multiple threads. Twisted can help you
out with this, too. There are two async HTTP client APIs available. The
older one:
http://twistedmatrix.com/documents/current/api/twisted.web.client.getPage.html
http://twistedmatrix.com/documents/current/api/twisted.web.client.HTTPClientFactory.html
And the newer one, introduced in 9.0:
http://twistedmatrix.com/documents/current/api/twisted.web.client.Agent.html
Jean-Paul
> The fairly obvious thing to do is use a queue.queue for tasks and another
> for results and a pool of threads that read, fetch, and write.
Thanks, indeed.
Is a list thread-safe, or do I need to lock when adding the results of my
worker threads to a list? The order of the elements in the list does not
matter.
Jens
The built-in list type is thread-safe, but it doesn't provide the waiting
features that queue.Queue provides.
Regards
Antoine.
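To illustrate Antoine's point: under CPython, list.append is atomic (the
GIL ensures no appended item is lost), so a plain list works for collecting
results; it just can't block a consumer until a result arrives. A small,
self-contained demonstration with dummy work instead of HTTP:

```python
import threading

results = []  # shared result list; list.append is atomic under CPython's GIL

def worker(chunk):
    for n in chunk:
        results.append(n * n)   # no explicit lock needed for append itself

# Four threads each append 100 results concurrently.
threads = [threading.Thread(target=worker, args=(range(k * 100, (k + 1) * 100),))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every append survived; only the ordering is unpredictable.
assert len(results) == 400
```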
> Terry said "queue". not "list". Use the Queue class (it's thread-safe)
> in the "Queue" module (assuming you're using Python 2.x; in Python 3.x
> it's called the "queue" module).
Yes yes, I know. I use a queue to implement the thread pool, and that works
all right.
But each worker thread calculates a result and needs to make it available
to the application in the main thread again. Therefore, it appends its
result to a common list. This seems to work as well, but I was thinking of
possible conflicts that could occur when two threads append their results
to that same result list at the same moment.
Regards,
Jens
If you *do* take them off, then use a Queue.
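If the main thread does want to take results off as they are produced
(rather than after all workers finish), a result Queue lets it block on
get(). A minimal sketch, with a dummy worker standing in for the real
HTTP fetch:

```python
import threading
from queue import Queue   # the module is named "Queue" in Python 2

results = Queue()

def worker(n):
    # Stand-in for the real fetch; hands its result to the main thread.
    results.put(n * n)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()

# The main thread blocks on get() and consumes results as they arrive,
# without waiting for every worker to finish first.
collected = sorted(results.get() for _ in range(5))

for t in threads:
    t.join()
```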
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/