I would like to keep using urllib for my web interface (as opposed to doing
low-level asynchronous sockets), so my first thought is to use threads.
The "threading" module doesn't allow stopping, so that seems useless.
But the "thread" module DOES. Am I going to get myself into trouble starting
and stopping threads that are busy doing network I/O? If the number of
parallel connections remains small (around 10), would you maybe suggest
separate processes (which CAN be stopped)?
Has anyone out there written a robust web crawler in Python that got
around this problem?
Thanks for any advice.
-Dustin.
Try timeoutsocket.py
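timeoutsocket.py was a third-party module. Later Python versions accept a timeout directly in the standard library, so a hung download can give up on its own instead of needing to be killed. A minimal sketch, assuming Python 3; `fetch` and the silent local listener are names made up for this demonstration, not part of the original suggestion:

```python
import socket
import urllib.request

def fetch(url, timeout=5.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        # The timeout applies to blocking socket operations (connect, read).
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket.timeout are both OSError subclasses.
        return None

# Demonstration: a local listener that never answers, standing in for a
# hung web server. No outside network access is needed.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
port = silent.getsockname()[1]

result = fetch("http://127.0.0.1:%d/" % port, timeout=0.5)
print(result)
silent.close()
```

With a timeout on every request, a crawler thread never stays blocked for longer than the chosen limit, which removes much of the need to stop it from outside.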
--
--- Aahz <*> (Copyright 2001 by aa...@pobox.com)
Androgynous poly kinky vanilla queer het Pythonista http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6
"You do not make history; you can only hope to survive it." --G'kar
It DOES? How? The docs I have say that a thread can stop ITSELF by
raising SystemExit or calling sys.exit(). But nowhere
have I seen the ability for one thread to stop
another thread. Please correct me if I'm wrong;
I would certainly be happy if I were wrong and
threads could be stopped.
Past posts on this group have suggested that the
subject thread should poll for an indication
that it should stop itself, and that one thread
trying to stop another is
a sign of poor design.
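The polling approach those posts describe can be sketched as follows, assuming Python 3. The names `stop_requested`, `crawl` and `pages_done` are illustrative only, and the short sleep stands in for fetching one page:

```python
import threading
import time

stop_requested = threading.Event()
pages_done = []

def crawl(worker_id):
    """Do small units of work, checking the stop flag between units."""
    while not stop_requested.is_set():
        time.sleep(0.01)            # stands in for fetching one page
        pages_done.append(worker_id)

workers = [threading.Thread(target=crawl, args=(n,)) for n in range(3)]
for w in workers:
    w.start()

time.sleep(0.1)
stop_requested.set()                # ask, rather than force, the workers to stop
for w in workers:
    w.join(timeout=2)
print("workers still alive:", sum(w.is_alive() for w in workers))
```

This only works when each unit of work returns promptly; a worker stuck inside a blocking call never gets back to the check.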
I would disagree, and ask if anyone can offer a
more rational explanation why Python threads
are designed to be not stoppable (if I am not
mistaken about this.)
Polling for a "stop" indication cannot be
done reliably for any thread using code not written
by the programmer who wants to achieve a reliable stop:
The thread may block, or call some library code
that is not under the programmer's control that
fails to poll, or blocks without the programmer's
knowledge. Even for the code that is under the
programmer's control, it can be difficult to
design arbitrary algorithms that always poll,
and such coding certainly is more difficult to
understand and maintain.
IMHO, it would be far better, if as in Modula-3,
one thread could "alert" another thread, thereby
raising an "alerted" exception in that thread, which
if not caught and otherwise dealt with would end that
thread.
- Parzival
Yes, and I totally agree with you that there should be some "safe" mechanism set
up for threads to signal each other. But why is it safe to stop a process? Unix gives
you the option of CTRL-C'ing it, so obviously it must be somewhat safe (even if
it's busy doing file I/O, or otherwise using resources, which is a common argument
for not allowing programmers to stop threads).
It seems like I'm being forced to do multiple processes and interprocess-communication,
which seems dumb, cause threading seems like a better solution...
Since Python uses each platform's native notion of threads, it's restricted
to what native platform threads support. Python isn't an operating system.
That is, Python doesn't *implement* threads. Instead it wraps a portable API
*around* platform threads. It's unusual to see reliable ways for one thread
to kill another; even Java (eventually) gave up on that.
> But why is it safe to stop a process?
Primarily because an OS saves enough information about processes and their
relation to system resource state to make it *possible* to clean up after
killed processes safely. Thread packages typically do not save enough info
about threads and their use of system resources to do the same; in return,
because threads aren't bogged down with so much hidden recovery state,
they're typically nimbler than processes. The things that make threads
lightweight are the baggage they *don't* carry.
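The process route asked about earlier can be sketched with the standard `subprocess` module, assuming Python 3. The child here just sleeps, standing in for a stuck download; a real crawler would run its fetch code in the child instead:

```python
import subprocess
import sys

# A child interpreter that hangs, standing in for a stuck download.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

child.terminate()           # SIGTERM on POSIX, TerminateProcess on Windows
child.wait(timeout=10)      # reap the child; the OS reclaims its resources
print("exit status:", child.returncode)
```

The parent does no bookkeeping on the child's open files or sockets; the operating system closes them when the process goes away.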
> ...
> It seems like I'm being forced to do multiple processes and
> interprocess-communication, which seems dumb, cause threading
> seems like a better solution...
If method A has cool feature X and method B has cool feature Y, it's not
always the case that the lack of a method C combining X and Y (and
without introducing new drawbacks) is simply due to a dumb universe picking
on you <wink>.
although-that's-always-my-first-guess-too-ly y'rs - tim
On Wed, 23 May 2001 04:54:46 GMT, Parzival Herzog <pa...@home.com> wrote:
>Past posts on this group have suggested that the
>subject thread should poll for an indication
>that it should stop itself, and that one thread
>trying to stop another thread from another is
>a sign of poor design.
>
>I would disagree, and ask if anyone can offer a
>more rational explanation why Python threads
>are designed to be not stoppable (if I am not
>mistaken about this.)
thread a grabs the semaphore protecting the global socket.
thread b kills thread a.
thread c tries to grab the semaphore protecting the global socket.
*DEADLOCK!*.
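The scenario can be reproduced without actually killing anything, assuming Python 3: a thread that ends while still holding a `threading.Lock` leaves it locked, much as a killed thread would. A minimal sketch; the timeout exists only so the demonstration terminates:

```python
import threading

lock = threading.Lock()

def thread_a():
    lock.acquire()
    # Thread a ends here without releasing, as if it had been killed
    # while holding the lock.

a = threading.Thread(target=thread_a)
a.start()
a.join()

# Thread c's attempt, run here in the main thread. The timeout keeps the
# demonstration from hanging; a plain acquire() would wait forever.
got_lock = lock.acquire(timeout=0.2)
print("acquired:", got_lock)
```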
Now, there's the possibility of creating a semaphore wrapper class that,
when you request the global semaphore, can detect that thread a has died
and give the semaphore to thread c. Unfortunately, there's no
standard way to know whether a particular thread is still alive. For
example, on FreeBSD a thread runs within the parent process, while on
Linux a thread is just a process that happens to share its memory
space with another process.
The only way I've really found of doing this is shared memory and processes,
where it is Posixly positive that you can detect that a process has died.
Unfortunately, Python is excruciatingly painful when using the Posix
shared memory model :-(.
Threads are evil. They are evil, wicked, nasty things. They are evil,
wicked nasty things because there's no easy way to free the resources
that a thread has allocated if the thread dies or is killed, unless
you do some very tedious book-keeping. I thought doing that tedious
book-keeping was why operating systems were invented, especially the
whole notion of a "process", where the allocated resources are
automagically de-allocated when the process dies. Unfortunately, some
benighted notion of "efficiency" has made threading the de-facto
"standard" on most platforms for programs that must do multiple tasks,
even though, on Linux at least, a thread spawn and a process fork are
the exact same freakin' kernel call (just a different parameter), and
take virtually the same amount of time (both create a copy of the page
tables, the only big difference is that the process fork sets the
copy-on-write flags for the pages).
What we don't want is the present requirement to insert polling code everywhere
it would be needed to guarantee an adequately prompt response to an asynchronous condition
arising in another thread, since this really cannot be done in general anyway.
I don't see what the platform OS issue with that would be; it seems to me to be
a Python interpreter issue. Several Modula-3 implementations do it in a machine-
and OS-independent way, and they don't even have the advantage of an interpreter
loop.
- Parzival
"Tim Peters" <tim...@home.com> wrote in message news:mailman.990596972...@python.org...
You can set a threading.Event in the last finally clause of the thread,
see below.
>
> Threads are evil. They are evil, wicked, nasty things. They are evil,
> wicked nasty things because there's no easy way to free the resources
> that a thread has allocated if the thread dies or is killed, unless
> you do some very tedious book-keeping. I thought doing that tedious
This bookkeeping is not at all tedious in Python.
A thread exits by raising an exception. This means that all
pending finally clauses _will_ be executed:
    import threading

    exitEvent = threading.Event()
    f = open(filename)
    try:
        someCodeThatMightCallExit()  # and thereby raise the thread exit exception
    finally:
        f.close()
        exitEvent.set()
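A self-contained variant of the same pattern, assuming Python 3, can be run to confirm that the finally clause executes when a thread calls sys.exit(). The `log` list is an illustrative stand-in for the file being opened and closed:

```python
import sys
import threading

exit_event = threading.Event()
log = []

def worker():
    log.append("resource opened")       # stands in for open(filename)
    try:
        sys.exit()                      # raises SystemExit in this thread only
    finally:
        log.append("resource closed")   # stands in for f.close()
        exit_event.set()

t = threading.Thread(target=worker)
t.start()
finished = exit_event.wait(timeout=2)   # main thread waits for the worker's signal
t.join(timeout=2)
print(log, finished)
```

The SystemExit ends only the worker thread; the main thread carries on after the event is set.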
> book-keeping was why operating systems were invented, especially the
> whole notion of a "process", where the allocated resources are
> automagically de-allocated when the process dies. Unfortunately, some
Hmmm, an OS in python?
> benighted notion of "efficiency" has made threading the de-facto
> "standard" on most platforms for programs that must do multiple tasks,
> even though, on Linux at least, a thread spawn and a process fork are
> the exact same freakin' kernel call (just a different parameter), and
> take virtually the same amount of time (both create a copy of the page
> tables, the only big difference is that the process fork sets the
> copy-on-write flags for the pages).
You can do a sys.exit() and leave the cleanup to your favourite OS.
I prefer to reserve that for the main thread after
it has detected that all other threads have exited:
    exitEvent.wait()
    sys.exit()
FWIW: the Python thread model is very close to the Java thread model:
http://java.sun.com/j2se/1.3/docs/api/java/lang/Thread.html
and for more on the problems of killing threads:
http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html
I think the similarity is no coincidence, but I don't know anything about the
history of Python threads. The similarity probably saved a lot of
headaches for implementing JPython threads.
Regards,
Ype
--
email at xs4all.nl
> thread a grabs the semaphore protecting the global socket.
> thread b kills thread a.
> thread c tries to grab the semaphore protecting the global socket.
>
> *DEADLOCK!*.
Don't want this. Want:
---Thread a:
    try:
        try:
            lock.acquire()
            doSomething()
        finally:
            lock.release()
    except thread.alerted, whatHappened:
        print "I goofed:", whatHappened
        sys.exit()
---Thread b:
    thread_a.alert("You forgot your keys, you silly silly goose.")
---Thread c:
    try:
        lock.acquire()
        doSomethingElse()
    finally:
        lock.release()
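`thread.alerted` and `alert()` above are a wish, not an existing API. Later CPython releases expose something close through the C API function PyThreadState_SetAsyncExc, reachable from Python via ctypes. The sketch below assumes CPython 3; the `Alerted` class and the `alert` helper are hypothetical names, and other Python implementations may not support this at all:

```python
import ctypes
import threading
import time

class Alerted(Exception):
    """Hypothetical stand-in for the proposed thread.alerted."""

def alert(thread, exc_type=Alerted):
    """Schedule exc_type to be raised in another thread (CPython only)."""
    tid = ctypes.c_ulong(thread.ident)
    modified = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        tid, ctypes.py_object(exc_type))
    if modified == 0:
        raise ValueError("no such thread")
    if modified > 1:
        # Should not happen; cancel the request rather than hit several threads.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(tid, None)
        raise SystemError("alert affected more than one thread")

lock = threading.Lock()
events = []
holding = threading.Event()

def thread_a():
    try:
        with lock:                   # released on the way out, like try/finally
            holding.set()
            while True:
                time.sleep(0.01)     # doSomething()
    except Alerted:
        events.append("a was alerted")

a = threading.Thread(target=thread_a, daemon=True)
a.start()
holding.wait(timeout=2)
alert(a)                             # thread b's role
a.join(timeout=2)

got_lock = lock.acquire(timeout=1)   # thread c's role
print(events, got_lock)
```

Thread c gets the lock because thread a released it while unwinding. The exception can still arrive at an awkward moment, for example just after a lock is acquired but before the protecting block is entered, so this narrows the deadlock problem rather than eliminating it.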
> Threads are evil. They are evil, wicked, nasty things. They are evil,
> wicked nasty things because there's no easy way to free the resources
> that a thread has allocated if the thread dies or is killed, unless
If threads are so evil, what are they doing in Python? Ruining fair maidens?
Shame!
> you do some very tedious book-keeping. I thought doing that tedious
> book-keeping was why operating systems were invented, especially the
> whole notion of a "process", where the allocated resources are
> automagically de-allocated when the process dies.
Actually, OS processes are not that good at releasing resources: witness
the files whose buffers are not flushed, and the network connections
that hang...
IMHO, try - finally is what is good for freeing resources with a minimum of
tedium.
I believe processes were invented to support separate programs
running in separate address spaces, so that 1) they could be time-multiplexed,
and 2) while running "simultaneously", they would be isolated from each
other (i.e., the OS could guarantee that a process could continue running
no matter what evil or ignorant deeds were done by other processes).
The unix tradition of using multiple processes within the same program
is a horrifically cumbersome kludge, not a convenience, and threads were
invented to get us out of that kludge, by removing the complexities of
communicating across separate address spaces within the same program.
- argumentative-ly yours,
Parzival
> thread a grabs the semaphore protecting the global socket.
> thread b kills thread a.
> thread c tries to grab the semaphore protecting the global socket.
>
> *DEADLOCK!*.
Concurrent Haskell allows throwing exceptions to other threads.
This includes the ability to kill threads and to defend from killing.
Catching such exceptions doesn't eliminate all races, so there is also
the possibility to block and unblock asynchronous exceptions around
really critical sections.
I believe that Python doesn't allow killing threads because it's
hard to provide in all threading environments. It is possible to
avoid deadlocks when threads are killed (but it's also possible to
not avoid them).
> Threads are evil.
Not at all (if done right).
> They are evil, wicked nasty things because there's no easy way to
> free the resources that a thread has allocated if the thread dies
> or is killed, unless you do some very tedious book-keeping.
Garbage collection with finalizers and modelling killing threads as
exceptions take care of this.
> Unfortunately, some benighted notion of "efficiency" has made
> threading the de-facto "standard" on most platforms for programs that
> must do multiple tasks, even though, on Linux at least, a thread
> spawn and a process fork are the exact same freakin' kernel call
> (just a different parameter), and take virtually the same amount of
> time (both create a copy of the page tables, the only big difference
> is that the process fork sets the copy-on-write flags for the pages).
The Concurrent Haskell implementation doesn't use OS threads. Its threads
are much faster than Linux threads.
--
__("< Marcin Kowalczyk * qrc...@knm.org.pl http://qrczak.ids.net.pl/
\__/
^^ SYGNATURA ZASTĘPCZA
QRCZAK
This comes up regularly, but until somebody wants it enough to write a PEP,
and somebody else enough to implement the PEP, nothing will change.
> ...
> I don't see what the platform OS issue with that would be, it
> seems to me to be a Python interpreter issue. Several Modula-3
> implementations do it in a machine and OS independent way, and
> they don't even have the advantage of an interpreter loop.
The PEP should flesh out how to accomplish that wrt Python's internals, then,
if the PEP author believes Modula-3's approach is an adequate solution.
> Dustin, (and I, and perhaps others) want not to have the OS kill
> some OS thread, but a way to signal a python program running in a
> thread, e.g. raise an exception handled by the subject thread, say
> "thread.alerted" that alerts that thread. The thread can handle the
> exception in any way it wants, including to terminate itself.
Note that this still devolves into needing some support at the OS
thread level for precisely that sort of interruption; otherwise
your Python thread can be blocked in an OS service, which will prevent
Python from delivering the exception.
Of course, one could make such exception handling conditional on
whatever underlying support the OS-specific thread module could
deliver.
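The blocking caveat can be observed with CPython's PyThreadState_SetAsyncExc, which schedules an exception in another thread but delivers it only once that thread is running Python code again. A sketch, assuming CPython 3; `Alerted` is a hypothetical exception name and time.sleep stands in for a blocking OS service:

```python
import ctypes
import threading
import time

class Alerted(Exception):
    """Hypothetical stand-in for the proposed thread.alerted."""

delay = []
about_to_block = threading.Event()

def worker():
    start = time.monotonic()
    try:
        about_to_block.set()
        time.sleep(0.5)          # stands in for a blocking OS call
        while True:              # back in Python code: delivery can happen
            time.sleep(0.01)
    except Alerted:
        delay.append(time.monotonic() - start)

t = threading.Thread(target=worker, daemon=True)
t.start()
about_to_block.wait(timeout=2)
time.sleep(0.1)                  # the worker is now inside the blocking call

# CPython-specific: schedule Alerted to be raised in the worker thread.
ctypes.pythonapi.PyThreadState_SetAsyncExc(
    ctypes.c_ulong(t.ident), ctypes.py_object(Alerted))
t.join(timeout=5)
print("alert noticed after %.2f s" % delay[0])
```

The alert is sent about 0.1 s in, yet the worker only notices it after the 0.5 s call has returned, which is the behaviour described above.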
--
-- David
--
/-----------------------------------------------------------------------\
\ David Bolen \ E-mail: db...@fitlinxx.com /
| FitLinxx, Inc. \ Phone: (203) 708-5192 |
/ 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \
\-----------------------------------------------------------------------/
I would think that relying on garbage collection with finalizers as a way
of freeing up system resources is not an infallible strategy in the general
case since it is possible to use up critical system resources before your
memory is gone and before a garbage collector runs.
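The difference between explicit release and relying on a finalizer can be sketched in Python 3 (CPython behaviour assumed). `Connection` is a hypothetical scarce resource that merely counts how many instances are open:

```python
import gc

class Connection:
    """Hypothetical scarce resource, such as a socket or file handle."""
    open_count = 0

    def __init__(self):
        Connection.open_count += 1
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            Connection.open_count -= 1

    def __del__(self):
        self.close()     # finalizer: a safety net with no timing guarantee

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

# Explicit release: the resource is freed as soon as the block ends.
with Connection() as conn:
    pass
after_with = Connection.open_count

# Relying on the finalizer: an object caught in a reference cycle stays
# open until the cycle collector runs.
gc.disable()                     # keeps the demonstration deterministic
leaked = Connection()
leaked.me = leaked               # reference cycle
del leaked
before_gc = Connection.open_count
gc.collect()
after_gc = Connection.open_count
gc.enable()

print(after_with, before_gc, after_gc)
```

Between the `del` and the collection, the resource is still held even though nothing can reach it, and a program can run out of such resources well before memory pressure triggers a collection.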
<snip>
Jim