Is there a way for the program that created and started a thread to also
stop it?
(My use case is a time-out.)
E.g.
thread = threading.Thread(target=Loop.testLoop)
thread.start()  # This thread is expected to finish within a second
thread.join(2)  # Or time.sleep(2)?
if thread.isAlive():
    # thread has probably encountered a problem and hangs
    # What should go here to stop the thread?
Note that I don't want to change the targets (too much), as many possible
targets exist, totalling thousands of lines of code.
Thanks,
Hans
No, Python has no threadicide method, and its absence is not an
oversight. Threads often have important business left to do, such
as releasing locks on shared data; killing them at arbitrary times
tends to leave the system in an inconsistent state.
> if thread.isAlive():
> # thread has probably encountered a problem and hangs
> # What should be here to stop thread ??????
At this point, it's too late. Try to code so your threads don't hang.
Python does let you arrange for threads to die when you want to
terminate the program, with threading's Thread.setDaemon().
--
--Bryan
>>> "threadicide" method
I like this word...
Michel Claveau
> Python has no threadicide method, and its absence is not an
> oversight. Threads often have important business left to do, such
> as releasing locks on shared data; killing them at arbitrary times
> tends to leave the system in an inconsistent state.
Perhaps another reason to avoid threads and use processes instead?
If the processes are sharing resources, the exact same problems arise.
My problem with the fact that Python doesn't have some type of "thread
killer" is that, again, the only solution involves some type of polling
loop, i.e. "your thread of execution must be written so that it
periodically checks for a kill condition". This really sucks, not just
because polling is a ridiculous practice, but because it forces programmers
in many situations to go through a lengthy process of organizing operations
into a list. Say I have threads that share a bunch of common
memory (yes, I'm saying this exclusively to get the process users off my
back) and that execute a series of commands on remote nodes using rsh or
something. If I've constructed my system using threads, I need to
neatly go and dump all operations into some sort of list so that I can
implement a polling mechanism, i.e.
opList = [op1, op2, op3, op4]
for op in opList:
    checkMessageQueue()
    op()
That works if you can easily create an opList. If you want good
response time this can become quite ugly, especially if you have a lot
going on. Say I have a function I want to run in a thread:
# Just pretend for the sake of argument that 'op' actually means
# something and is a lengthy operation
def func_to_thread():
    os.system('op 1')
    os.system('op 2')
    os.system('op 3')
# In order to make this killable with reasonable response time we have
# to organize each of our ops into a function or something equally
# annoying
def op_1():
    os.system('op 1')

def op_2():
    os.system('op 2')

def op_3():
    os.system('op 3')

opList = [op_1, op_2, op_3]

def to_thread():
    for op in opList:
        checkMessageQueue()
        op()
So this whole "hey mr. nice thread, please die for me" concept gets
ugly quickly in complex situations and doesn't scale well at all.
Furthermore, say you have a complex system where users can write
pluggable modules. If a module gets stuck inside of some screwed-up
loop and is unable to poll for messages, there's no way to kill the
module without killing the whole system. Have any of you guys thought of
a way around this scenario?
--
Carl J. Van Arsdall
cvana...@mvista.com
Build and Release
MontaVista Software
Communication through Queue.Queue objects can help. But if you research
the history of this design decision in the language you should discover
there are fairly sound reasons for not allowing arbitrary "threadicide".
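A minimal sketch of that idea (the "quit" sentinel and message names here
are made up; the module is named Queue in the Python of this thread's era):

```python
import queue      # named Queue at the time of this thread
import threading

control = queue.Queue()
done = []

def worker():
    # The worker pulls messages from the queue and shuts down
    # cooperatively when it sees the (made-up) "quit" sentinel.
    while True:
        msg = control.get()
        if msg == "quit":
            break
        done.append(msg)  # stand-in for real work on the message

t = threading.Thread(target=worker)
t.start()
control.put("job-1")
control.put("quit")
t.join()
```

The catch is that the worker only notices "quit" between work items; this
is still cooperative shutdown, not threadicide.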
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden
And what happens if the thread was halfway through a malloc call and
the data structures used to manage the state of the heap are in an
inconsistent state when the interrupt occurs?
This has been discussed many many times in the context of many many
languages and threading libraries. If you're really interested, do
the investigation Steve suggested. You'll find plenty of material.
Jean-Paul
I've been digging around with Queue.Queue and have yet to come across
any solution to this problem. Queue.Queue just offers a pretty package
for passing around data; it doesn't solve the "polling" problem. I
wonder why malloc() calls can't be done in an atomic state (along with
other operations that should be atomic; maybe that's a question for the OS
guys, I dunno). Using Queue.Queue still puts me in a horribly inflexible
"polling" scenario. Yes, I understand many of the reasons why we don't
have "threadicide", and why it was even removed from Java. What I don't
understand is why we can't come up with something a bit better. Why
couldn't a thread relinquish control when it's safe to do so? While the
interpreter is busy doing malloc()s a thread receives a control message;
the thread waits until it knows it's no longer in an atomic state and
gives control to the message handler when it can. It's about setting up
a large system that is controllable without the painstaking process of
putting message-polling loops all over the place. The main thread in a
Python program can set up a signal handler, accept signals, and that
signal can happily go ahead and kill a Python interpreter. Why can't
this concept be taken further and introduced into threading?
There is no system that is completely interruptible; there will always
be a state in which it is not safe to interrupt, but many systems work
around this just fine with cautious programming. Has anyone considered
an event-driven approach to sending control messages to threads?
-c
--
Carl J. Van Arsdall
cvana...@mvista.com
> There is no system that is completely interruptible, there will always
> be a state in which it is not safe to interrupt, but many systems work
> around this just fine with cautious programming. Has anyone considered
> an event driven approach to sending control messages to threads?
>
The big problem (as you'll see when you ...) is providing facilities
that are platform-independent.
>> There is no system that is completely interruptible, there will always
>> be a state in which it is not safe to interrupt, but many systems work
>> around this just fine with cautious programming. Has anyone considered
>> an event driven approach to sending control messages to threads?
>>
>>
> The big problem (as you'll see when you ...) is providing facilities
> that are platform-independent.
>
>
Ah, I could see that. I think before I can make any suggestions on this
front I need to start reading python source code.
Gracias,
Carl
Right. Queue.Queue doesn't even try to solve this problem.
>Queue.Queue just offers a pretty package
>for passing around data, it doesn't solve the "polling" problem.
Exactly correct.
>I
>wonder why malloc()'s can't be done in an atomic state (along with other
>operations that should be atomic, maybe that's a question for OS guys, I
>dunno).
(Note that malloc() is just a nice example - afaik it is threadsafe on
all systems which support threading at all)
Because it turns out that doing things atomically is difficult :)
It's not even always obvious what "atomic" means. But it's not impossible
to do everything atomically (at least not provably ;), and in fact many
people try, most commonly by using mutexes and such.
However, unless one is extremely disciplined, critical sections generally
get overlooked. It's often very difficult to find and fix these, and in
fact much of the time no one even notices they're broken until long after
they have been written, since many bugs in this area only surface under
particular conditions (ie, particular environments or on particular
hardware or under heavy load).
>Using Queue.Queue still puts me in a horribly inflexible
>"polling" scenario. Yea, I understand many of the reasons why we don't
>have "threadicide", and why it was even removed from java. What I don't
>understand is why we can't come up with something a bit better. Why
>couldn't a thread relinquish control when its safe to do so?
Of course, if you think about it, CPython does exactly this already. The
GIL ensures that, at least on the level of the interpreter itself, no
thread switching can occur while data structures are in an inconsistent
state. This works pretty well, since it means that almost anything you do
at the application level in a multithreaded application won't cause random
memory corruption or other fatal errors.
So why can't you use this to implement killable threads in CPython? As it
turns out, you can ;) Recent versions of CPython include the function
PyThreadState_SetAsyncExc. This function sets an exception in another
thread, which can be used as a primitive for killing other threads.
Why does everyone say killing threads is impossible, then? Well, for one
thing, the CPython developers don't trust you to use SetAsyncExc correctly,
so it's not exposed to Python programs :) You have to wrap it yourself
if you want to call it. For another thing, the granularity of exceptions
being raised is somewhat sketchy: an exception will not be raised while
a single bytecode operation is being executed. Exceptions set with this
function will only be raised /between/ the execution of two bytecode
operations.
But this is just what you described above. The thread is checking for
messages at certain intervals but only while all of its internal state
is consistent. The problem here is that the granularity of a bytecode
being executed is pretty variable. Some operations might take only a
microsecond. Other operations might take a week. This might be useful
in some specific contexts, but it's definitely not a general solution.
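For what it's worth, wrapping it yourself can be done from Python via
ctypes (a sketch, not an endorsement; ThreadKilled and the async_raise
helper name are mine, and this is CPython-specific):

```python
import ctypes
import threading

class ThreadKilled(Exception):
    pass

def async_raise(tid, exctype):
    # Ask CPython to raise exctype in the thread with identifier tid.
    # The exception is only delivered between two bytecode operations,
    # so a thread stuck inside a single opcode or a blocking C call
    # will never see it.
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(tid), ctypes.py_object(exctype))
    if res == 0:
        raise ValueError("invalid thread id")
    if res > 1:
        # We hit more than one thread state; undo the damage.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(
            ctypes.c_ulong(tid), None)
        raise SystemError("PyThreadState_SetAsyncExc failed")
```

A busy-looping thread will pick the exception up quickly; one blocked in
I/O will not, which is exactly the granularity problem described above.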
>While the
>interpreter is busy doing malloc()s a thread receives a control message,
>the thread waits until it knows its no longer in an atomic state and
>gives control to the message handler when it can. Its about setting up
>a large system that is controllable without the painstaking process of
>putting message polling loops all over the place. Main threads in a
>python program can setup a signal handler, accept signals, and that
>signal can happily go ahead and kill a python interpreter. Why can't
>this concept be taken farther and introduced into threading?
Some of your text in this paragraph is a little fuzzy, but I think I
get the general idea, and hopefully from what I've written above it is
clear both why what you describe above basically /is/ being done in
CPython and why it is not actually a general solution to this problem.
>
>There is no system that is completely interruptible, there will always
>be a state in which it is not safe to interrupt, but many systems work
>around this just fine with cautious programming.
Which systems work around this just fine? I don't think there are very
many at all. Pre-emptive threading is really hard to get right, even
though sometimes it /looks/ easy ;)
Jean-Paul
There is in fact some under-the-covers mechanism in CPython (i.e.
one you can call from C extensions but not from Python code) to
raise exceptions in other threads. I've forgotten the details.
There has been discussion at various times about how to expose
something like that to Python code, but it's been inconclusive. E.g.:
http://sf.net/tracker/?func=detail&atid=105470&aid=502236&group_id=5470
A polling loop is neither required nor helpful here.
[...]
> # Just pretend for the sake of argument that 'op' actually means
> # something and is a lengthy operation
> def func_to_thread():
>     os.system('op 1')
>     os.system('op 2')
>     os.system('op 3')
What good do you think killing that thread would do? The
process running 'op n' has no particular binding to the thread
that called os.system(). If 'op n' hangs, it stays hung.
The problem here is that os.system doesn't give you enough
control. It doesn't have a timeout and doesn't give you a
process ID or handle to the spawned process.
Running os.system() in multiple threads strikes me as
kind of whacked. Won't they all compete to read and write
stdin/stdout simultaneously?
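For comparison, a sketch of the kind of control os.system lacks, using
the subprocess module's timeout support (which postdates this thread;
run_with_timeout is a made-up helper):

```python
import subprocess

def run_with_timeout(argv, timeout):
    # Run a command, killing it if it exceeds `timeout` seconds.
    # Returns (returncode, timed_out).
    proc = subprocess.Popen(argv,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        proc.communicate(timeout=timeout)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        proc.kill()          # we hold a process handle, so we CAN kill it
        proc.communicate()   # reap the dead child
        return proc.returncode, True
```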
> # In order to make this killable with reasonable response time we have
> # to organize each of our ops into a function or something equally
> # annoying
>
> def op_1():
>     os.system('op 1')
>
> def op_2():
>     os.system('op 2')
>
> def op_3():
>     os.system('op 3')
>
> opList = [op_1, op_2, op_3]
> def to_thread():
>     for op in opList:
>         checkMessageQueue()
>         op()
Nonsense. If op() hangs, you never get to checkMessageQueue().
Now suppose op has a timeout. We could write
def opcheck(thing):
    result = op(thing)
    if result == there_was_a_timeout:
        raise some_timeout_exception
How is:
def func_to_thread():
    opcheck('op 1')
    opcheck('op 2')
    opcheck('op 3')
any less manageable than your version of func_to_thread?
> So with this whole "hey mr. nice thread, please die for me" concept gets
> ugly quickly in complex situations and doesn't scale well at all.
> Furthermore, say you have a complex systems where users can write
> pluggable modules. IF a module gets stuck inside of some screwed up
> loop and is unable to poll for messages there's no way to kill the
> module without killing the whole system. Any of you guys thought of a
> way around this scenario?
Threadicide would not solve the problems you actually have, and it
tends to create other problems. What is the condition that makes
you want to kill the thread? Make the victim thread respond to that
condition itself.
--
--Bryan
If the condition is a timeout, one way to notice it is with SIGALRM,
which raises an exception in the main thread. But then you need a way
to make something happen in the remote thread.
Raising an exception in your own thread is pretty trivial. SIGALRM does
no good whatsoever here. :)
Besides, CPython will only raise exceptions between opcodes. If a
misbehaving thread hangs inside an opcode, you'll never see the exception
from SIGALRM.
Jean-Paul
No; it's because killing a thread from another thread is fundamentally
sloppy.
> The process of creating a thread can
> be translated into something supplied by pretty much all operating
> systems: an Amiga task, posix thread, etc.
>
> But ending a thread is then also dependent upon the OS -- and not
> all OSs have a way to do that that doesn't run the risk of leaking
> memory, leaving things locked, etc. until the next reboot.
No operating system has a good way to do it, at least not for
the kind of threads Python offers.
> The procedure for M$ Windows to end a task basically comes down to
> "send the task a 'close window' event; if that doesn't work, escalate...
> until in the end it throw its hands up and says -- go ahead and leave
> memory in a mess, just stop running that thread".
The right procedure in MS Windows is the same as under POSIX:
let the thread terminate on its own.
> > module without killing the whole system. Any of you guys thought of a
> > way around this scenario?
>
> Ask Bill Gates... The problem is part of the OS.
Or learn how to use threads properly. Linux is starting to get good
threading. Win32 has had it for quite a while.
--
--Bryan
>
>> # In order to make this killable with reasonable response time we have
>> # to organize each of our ops into a function or something equally
>> # annoying
>>
>> def op_1():
>>     os.system('op 1')
>>
>> def op_2():
>>     os.system('op 2')
>>
>> def op_3():
>>     os.system('op 3')
>>
>> opList = [op_1, op_2, op_3]
>> def to_thread():
>>     for op in opList:
>>         checkMessageQueue()
>>         op()
>>
>
> Nonsense. If op() hangs, you never get to checkMessageQueue().
>
Yea, understood. At the same time, I can't use a timeout either; I
don't know how long op_1 or op_2 will take. This is why I want something
that is triggered by an event.
> Now suppose op has a timeout. We could write
>
> def opcheck(thing):
> result = op(thing)
> if result == there_was_a_timeout:
> raise some_timeout_exception
>
> How is:
>
> def func_to_thread():
> opcheck('op 1')
> opcheck('op 2')
> opcheck('op 3')
>
> any less managable than your version of func_to_thread?
>
>
Again, the problem I'm trying to solve doesn't work like this. I've
been working on a framework to be run across a large number of
distributed nodes (here's where you throw the "duh, use a
distributed technology" in my face). The thing is, I'm only writing the
framework; the framework will work with modules, lots of them, which
will be written by other people. It's going to be impossible to get
people to write hundreds of modules that constantly check for status
messages. So, if I want my thread to "give itself up" I have to tell it
to give up. In order to tell it to give up I need some mechanism to
check messages that is not going to piss off a large team of
programmers. At the same time, do I really want to rely on other people
to make things work? Not really; I'd much rather let my framework
handle all control and not leave that up to programmers.
So the problem is, I have something that linearly executes a large list
of Python functions of various sizes ranging from short to long. It's not
about killing the thread so much as how do I make the thread listen to
control messages without polling.
>> So with this whole "hey mr. nice thread, please die for me" concept gets
>> ugly quickly in complex situations and doesn't scale well at all.
>> Furthermore, say you have a complex systems where users can write
>> pluggable modules. IF a module gets stuck inside of some screwed up
>> loop and is unable to poll for messages there's no way to kill the
>> module without killing the whole system. Any of you guys thought of a
>> way around this scenario?
>>
>
> Threadicide would not solve the problems you actually have, and it
> tends to create other problems. What is the condition that makes
> you want to kill the thread? Make the victim thread respond to that
> condition itself.
>
>
I feel like this is something we've established multiple times. Yes, we
want the thread to kill itself. Alright, now that we agree on that,
what is the best way to do it? Right now people keep saying we must
send the thread a message. That's fine and I completely understand
that, but right now the only mechanism I see is some type of polling
loop (or diving into the C API to force exceptions). So far I've not
seen any other method. If you want to send a thread a control
message you must wait until that thread is able to check for a control
message. If something hangs in your thread you are totally screwed;
similarly, if your thread ends up in some excessively lengthy IO (IO
that could be interrupted or whatever) you have to wait for that IO to
finish before your thread can process any control messages.
>> Running os.system() in multiple threads strikes me as kind of whacked.
>> Won't they all compete to read and write stdin/stdout simultaneously?
>>
> Unfortunately this is due to the nature of the problem I am tasked with
> solving. I have a large computing farm, these os.system calls are often
> things like ssh that do work on locations remote from the initial python
> task.
[...]
> Again, the problem I'm trying to solve doesn't work like this. I've been
> working on a framework to be run across a large number of distributed
> nodes (here's where you throw out the "duh, use a distributed
> technology" in my face). The thing is, I'm only writing the framework,
> the framework will work with modules, lots of them, which will be
> written by other people. Its going to be impossible to get people to
> write hundreds of modules that constantly check for status messages.
Doesn't this sound like a case for using processes instead of threads?
Where you don't have control over the thread, you can use a process and get
the separation you need to be able to kill this task.
Alternatively you could possibly provide a base class for the threads that
handles the things you need every thread to handle. They'd not have to
write it then; they'd not even have to know too much about it.
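Such a base class might be sketched like this (a minimal sketch; the
names are illustrative, not from any framework mentioned here):

```python
import threading

class ControllableThread(threading.Thread):
    # Module authors supply a list of step callables; the base class
    # checks the stop flag between steps so they don't have to.
    def __init__(self, steps):
        threading.Thread.__init__(self)
        self._stop_event = threading.Event()
        self._steps = steps

    def stop(self):
        # Ask the thread to quit at the next step boundary.
        self._stop_event.set()

    def run(self):
        for step in self._steps:
            if self._stop_event.is_set():
                break
            step()
```

The granularity of responsiveness is one step, which is the same
trade-off as any cooperative scheme, but the module authors never see it.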
Gerhard
http://poshmodule.sf.net might be of interest.
Have you looked at Stackless yet?
You can do all kinds of crazy stuff, like infinitely suspending a
tasklet (Stackless' version of threads) and effectively killing it,
sending messages via channels that can be externally monitored, sending
arbitrary exceptions that are immediately raised in other tasklets,
writing your own pre-emptive schedulers, saving a tasklet to disk for
later examination, etc, etc.
> I'd be all for using processes but setting up communication between
> processes would be difficult wouldn't it? I mean, threads have shared
> memory so making sure all threads know the current system state is an
> easy thing.
I'm not sure about that. Sharing data between threads or processes is never
an easy thing, especially since you are saying you can't trust your module
coders to "play nice". If you can't trust them to terminate their threads
nicely when asked so, you also can't trust them to responsibly handle
shared memory. That's exactly the reason why I suggested processes.
> With processes wouldn't I have to setup some type of server/client
> design, where one process has the system state and then the other
> processes constantly probe the host when they need the current system
> state?
Anything else is bound to fail. You need to have safeguards around any
shared data. (A semaphore is a type of server/client thing...) At the very
least you need to prevent read access while it is updated; very rarely this
is an atomic action, so there are times where the system state is
inconsistent while it is being updated. (I don't think you can consider
many Python commands as atomic WRT threads, but I'm not sure about this.)
IMO, in the situation you are describing, it is an advantage that data is
not normally accessible -- this means that your module coders need to
access the data in the way you present it to them, and so you can control
that it is being accessed correctly.
Gerhard
I don't get what threading and Twisted would do for
you. The problem you actually have is that you sometimes
need to terminate these other processes running other programs.
Use spawn, fork/exec*, or maybe one of the popens.
> Again, the problem I'm trying to solve doesn't work like this. I've
> been working on a framework to be run across a large number of
> distributed nodes (here's where you throw out the "duh, use a
> distributed technology" in my face). The thing is, I'm only writing the
> framework, the framework will work with modules, lots of them, which
> will be written by other people. Its going to be impossible to get
> people to write hundreds of modules that constantly check for status
> messages. So, if I want my thread to "give itself up" I have to tell it
> to give up.
Threads have little to do with what you say you need.
[...]
> I feel like this is something we've established multiple times. Yes, we
> want the thread to kill itself. Alright, now that we agree on that,
> what is the best way to do that.
Wrong. In your examples, you want to kill other processes. You
can't run external programs such as ssh as Python threads. Ending
a Python thread has essentially nothing to do with it.
> Right now people keep saying we must send the thread a message.
Not me. I'm saying work the problem you actually have.
--
--Bryan
>
> Not me. I'm saying work the problem you actually have.
>
The problem I have is a large distributed system, that's the reality of
it. The short summary, I need to use and control 100+ machines in a
computing farm. They all need to share memory or to actively
communicate with each other via some other mechanism. Without giving
any other details, that's the problem I have to solve. Right now I'm
working with someone else's code. Without redesigning the system from
the ground up, I have to fix it.
Have you looked at POSH yet? http://poshmodule.sf.net
There's also an shm module that's older and maybe more reliable.
Or you might be able to just use mmap.
Thanks!
-carl
--
Carl J. Van Arsdall
cvana...@mvista.com
Distributed shared memory is a tough trick; only a few systems simulate
it.
> How does spawn, fork/exec allow me to meet that need?
I have no idea why you think threads or fork/exec will give you
distributed shared memory.
> I'll look into it, but I was under the impression having shared memory
> in this situation would be pretty hairy. For example, I could fork of a
> 50 child processes, but then I would have to setup some kind of
> communication mechanism between them where the server builds up a queue
> of requests from child processes and then services them in a FIFO
> fashion, does that sound about right?
That much is easy. What it has to with what you say you require
remains a mystery.
> > Threads have little to do with what you say you need.
> >
> > [...]
> >
> >> I feel like this is something we've established multiple times. Yes, we
> >> want the thread to kill itself. Alright, now that we agree on that,
> >> what is the best way to do that.
> >>
> >
> > Wrong. In your examples, you want to kill other processes. You
> > can't run external programs such as ssh as Python threads. Ending
> > a Python thread has essentially nothing to do with it.
> >
> There's more going on than ssh here. Since I want to run multiple
> processes to multiple devices at one time and still have mass shared
> memory I need to use threads.
No, you would need to use something that implements shared
memory across multiple devices. Threads are multiple lines of
execution in the same address space.
> There's a mass distributed system that
> needs to be controlled, that's the problem I'm trying to solve. You can
> think of each ssh as a lengthy IO process that each gets its own
> device. I use the threads to allow me to do IO to multiple devices at
> once, ssh just happens to be the IO. The combination of threads and ssh
> allowed us to have a *primitive* distributed system (and it works too,
> so I *can* run external programs in python threads).
No, you showed launching it from a Python thread using os.system().
It's not running in the thread; it's running in a separate process.
> I didn't say it
> was the best or the correct solution, but it works and its what I was
> handed when I was thrown into this project. I'm hoping in fifteen years
> or when I get an army of monkeys to fix it, it will change. I'm not
> worried about killing processes, that's easy, I could kill all the sshs
> or whatever else I want without batting an eye.
After launching it with os.system()? Can you show the code?
--
--Bryan
So, I have a distributed build system. The system is tasked with
building a fairly complex set of packages that form a product. The
system needs to build these packages for 50 architectures using cross
compilation, as well as support 5 different hosts. Say there are
also different versions of this with tweaks for various configurations,
so in the end I might be trying to build 200+ different things at once.
I have a computing farm of 40 machines to do this for me. That's the
high-level scenario without getting too detailed. There are also
subsystems that help us manage the machines and things; I don't want to
get into that, I'm going to try to focus on a scenario more abstract
than cluster/resource management stuff.
Alright, so manually running builds is going to be crazy and
unmanageable. So what the people who came before me did to manage this
scenario was to fork one thread per build. The threads invoke a series
of calls that look like
os.system("ssh <host> <command>")
or, for more complex operations, they would just spawn a process that ran
another Python script:
os.system("ssh <host> <script>")
The purpose behind all this was a couple of things:
* The thread constantly needed information about the state of the
system (for example we don't want to end up building the same
architecture twice)
* We wanted a centralized point of control for an entire build
* We needed to be able to use as many machines as possible from a
central location.
Python threads worked very well for this. os.system behaves a lot like
many other IO operations in Python: the interpreter gives up the
GIL. Each thread could run remote operations and we didn't really have
any problems. There wasn't much of a need to fork; all it would have
done is increase the amount of memory used by the system.
Alright, so this scheme that was first put in place kind of worked.
There were some problems; for example, when someone did something like
os.system("ssh <host> <script>") we had no good way of knowing what the
hell happened in the script. Now granted, they used shared files over
NFS mounts to do some of it, but I really hate that. It doesn't work
well, it's clunky, and it's difficult to manage. There were other
problems too, but I just wanted to give a sample.
Alright, so things aren't working, I come on board, I have a boss who
wants things done immediately. What we did was create what we called a
"Python Execution Framework". The purpose of the framework was to
mitigate a number of problems we had, as well as take the burden of
distribution away from the programmers by providing a few layers of
abstraction (I'm only going to focus on the distributed part of the
framework; the rest is irrelevant to the discussion). The framework
executes and threads modules (or lists of modules). Since we had
limited time, we designed the framework with a "distribution environment"
in mind but realized that if we shoot for the top right away it will
take years to get anything implemented.
Since we knew we eventually wanted a distributed system that could
execute framework modules entirely on remote machines, we carefully
designed and prepared the system for this. This involves some abstraction
and some simple mechanisms. However, right now each ssh call will be
executed from a thread (as they will be done concurrently, just like
before). The threads still need to know about the state of the system,
but we'd also like to be able to issue some type of control that is more
event-driven; this could be sending the thread a terminate message or
sending the thread a message regarding the completion of a dependency
(we use conditions and events to do this synchronization right now). We
hoped that in the case of a catastrophic event or a user 'kill' signal
the system could take control of all the threads (or at least
ask them to go away); this is what started the conversation in the first
place. We don't want to use a polling loop for these threads to check
for messages; we wanted to use something event-driven (I mistakenly used
the word interrupt in earlier posts, but I think it still illustrates my
point). It's not only important that the threads die, but that they die
with grace. There's lots of cleanup work that has to be done when
things exit or things end up in an indeterminate state.
So, I feel like I have a couple of options:
1) Try moving everything to a process-oriented configuration - we think
this would be bad from a resource standpoint, and it would make
things more difficult to move to a fully distributed system later, when
I get my army of code monkeys.
2) Suck it up and go straight for the distributed system now - managers
don't like this, but maybe it's easier than I think it's going to be, I
dunno.
3) See if we can find some other way of getting the threads to terminate.
4) Kill it and clean it up by hand or with helper scripts - we don't want
to do this either; it's one of the major things we're trying to get away
from.
Alright, that's still a fairly high-level description. After all that,
if threads are still stupid then I think I'll much more easily see it
but I hope this starts to clear up confused. I don't really need a
distributed shared memory environment, but right now I do need shared
memory and it needs to be used fairly efficiently. For a fully
distributed environment I was going to see what various technologies
offered to pass data around, I figured that they must have some
mechanism for doing it or at least accessing memory from a central
location (we're set up to do this now with threads, we just need to expand
the concept to allow nodes to do it remotely). Right now, based on what
I have to do I think threads are the right choice until I can look at a
better implementation (I hear Twisted is good at what I ultimately want
to do, but I don't know a thing about it).
Alright, if you read all that, thanks, and thanks for your input.
Whether or not I've agreed with anything, a few colleagues and I
definitely discuss each idea as it's passed to us. For that, thanks to
the python list!
-carl
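Since much of this thread circles around cooperative shutdown, here is a minimal sketch of the "hey mr nice thread, please stop" pattern using threading.Event (modern Python spelling; the names here are mine, not Carl's). Event.wait() sleeps in the OS until woken, so this is not a busy polling loop:

```python
import threading

def worker(stop, results):
    # Do one unit of work, then sleep up to 10 ms waiting for a stop
    # request. wait() returns True as soon as stop.set() is called.
    while not stop.wait(timeout=0.01):
        results.append("step %d" % len(results))  # stand-in for a build step
    # Graceful exit point: release locks, flush logs, etc.

stop = threading.Event()
results = []
t = threading.Thread(target=worker, args=(stop, results))
t.start()
stop.set()         # ask the thread nicely to finish
t.join(timeout=5)
print("thread alive:", t.is_alive())
```

This only works if the thread never blocks forever between wait() calls, which is exactly the limitation discussed above.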
Instead of using os.system, maybe you want to use one of the popens or
the subprocess module. For each ssh, you'd spawn off a process that
does the ssh and communicates back to the control process through a
set of file descriptors (Unix pipe endpoints or whatever). The
control process could use either threads or polling/select to talk to
the pipes and keep track of what the subprocesses were doing.
I don't think you need anything as complex as shared memory for this.
You're just writing a special purpose chat server.
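A sketch of that control-process pattern with subprocess and select (Unix-only as written; plain echo commands stand in for the real "ssh <host> <command>" lines, which are an assumption on my part). select() sleeps until one of the pipes is ready, so the controller is not busy-polling:

```python
import select
import subprocess

# Stand-ins for "ssh <host> <command>" so the sketch is self-contained.
cmds = [["echo", "worker-1 done"], ["echo", "worker-2 done"]]
procs = [subprocess.Popen(c, stdout=subprocess.PIPE) for c in cmds]

pending = {p.stdout.fileno(): p for p in procs}  # fd -> child process
results = []
while pending:
    # Sleep until at least one child pipe has data (or hits EOF).
    ready, _, _ = select.select(list(pending), [], [], 10.0)
    for fd in ready:
        p = pending.pop(fd)
        out = p.stdout.read().decode()  # child exits, so read() hits EOF
        p.wait()
        results.append((p.returncode, out.strip()))

print(sorted(results))
```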
Paul, have you used POSH? Does it work well? Any major
gotchas?
I looked at the paper... well, not all 200+ pages, but I checked
how they handle a couple parts that I thought hard and they
seem to have good ideas. I didn't find the SourceForge project
so promising. The status is alpha, the ToDo's are a little scary,
and the project looks stalled. Also it's *nix only.
--
--Bryan
> Alright, if you read all that, thanks, and thanks for your input. Whether
> or not I've agreed with anything, a few colleagues and I definitely
> discuss each idea as it's passed to us. For that, thanks to the python
> list!
I think you should spend a few hours and read up on realtime OS features
and multitasking programming techniques. Get a bit away from the bottom
level, forget about the specific features of your OS and your language and
try to come up with a set of requirements and a structure that fits them.
Regarding communicating with a thread (or process, that's about the same,
only the techniques vary), for example -- there are not that many options.
Either the thread/process polls a message queue or it goes to sleep once it
has done whatever it needed to do until something comes in through a queue
or until a semaphore gets set. What is better suited for you depends on
your requirements and overall structure. Neither seems to be too clear.
If you have threads that take too long and need to be killed, then I'd say
fix the code that runs there...
Gerhard
I haven't used it. I've been wanting to try. I've heard it works ok
in Linux but I've heard of problems with it under Solaris.
Now that I understand what the OP is trying to do, I think POSH is
overkill, and just using pipes or sockets is fine. If he really wants
to use shared memory, hmmm, there used to be an shm module at
http://mambo.peabody.jhu.edu/omr/omi/source/shm_source/shm.html
but that site now hangs (and it's not on archive.org), and Python's
built-in mmap module doesn't support any type of locks.
I downloaded the above shm module quite a while ago, so if I can find
it I might upload it to my own site. It was a straightforward
interface to the Sys V shm calls (also *nix-only, I guess). I guess
he also could use mmap with no locks, but with separate memory regions
for reading and writing in each subprocess, using polling loops. I
sort of remember Apache's mod_mmap doing something like that if it has
to.
To really go off the deep end, there are a few different MPI libraries
with Python interfaces.
> I looked at the paper... well, not all 200+ pages, but I checked
> how they handle a couple parts that I thought hard and they
> seem to have good ideas.
200 pages?? The paper I read was fairly short, and I looked at the
code (not too carefully) and it seemed fairly straightforward. Maybe
I missed something, or am not remembering; it's been a while.
> I didn't find the SourceForge project
> so promising. The status is alpha, the ToDo's are a little scary,
> and project looks stalled. Also it's *nix only.
Yeah, using it for anything serious would involve being willing to fix
problems with it as they came up. But I think the delicate parts of
it are parts that aren't that important, so I'd just avoid using
those.
> Also, threading's condition and event constructs are used a lot
> (i talk about it somewhere in that thing I wrote). They are easy to use
> and nice and ready for me, with a server wouldn't I have to have things
> poll/wait for messages?
How would a thread receive a message, unless it polls some kind of queue or
waits for a message from a queue or at a semaphore? You can't just "push" a
message into a thread; the thread has to "pick it up", one way or another.
Gerhard
-c
--
Carl J. Van Arsdall
cvana...@mvista.com
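Gerhard's two options -- poll, or sleep until something arrives -- look like this with the standard library queue (modern Python spelling of the old Queue module; the message strings are made up). The blocking get() is the "thread goes to sleep until a message comes in" case:

```python
import queue
import threading

inbox = queue.Queue()
handled = []

def worker():
    while True:
        msg = inbox.get()   # sleeps in the OS until a message arrives
        if msg == "terminate":
            break           # graceful shutdown point: clean up here
        handled.append(msg)

t = threading.Thread(target=worker)
t.start()
inbox.put("build step 1")
inbox.put("build step 2")
inbox.put("terminate")      # the "hey mr nice thread" message
t.join()
print(handled)
```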
>>> Also, threading's condition and event constructs are used a lot
>>> (i talk about it somewhere in that thing I wrote). They are easy to use
>>> and nice and ready for me, with a server wouldn't I have to have things
>>> poll/wait for messages?
>>
>> How would a thread receive a message, unless it polls some kind of queue or
>> waits for a message from a queue or at a semaphore? You can't just "push" a
>> message into a thread; the thread has to "pick it up", one way or another.
>
> Well, I guess I'm thinking of an event driven mechanism, kinda like
> setting up signal handlers. I don't necessarily know how it works under
> the hood, but I don't poll for a signal. I set up a handler; when the
> signal comes, if it comes, the handler gets thrown into action. That's
> what I'd be interested in doing with threads.
What you call an event handler is a routine that gets called from a message
queue polling routine. You said a few times that you don't want that.
The queue polling routine runs in the context of the thread. If any of the
actions in that thread takes too long, it will prevent the queue polling
routine from running, and therefore the event won't get handled. This is
exactly the scenario that you seem to want to avoid. Event handlers are not
anything inherently multitask or multithread; they are simple polling
mechanisms with an event queue. It just seems that they act preemptively,
when you can click on one button and another button becomes disabled :)
There are of course also semaphores. But they also have to either get
polled like GUI events, or the thread just goes to sleep until the
semaphore wakes it up. You need to understand this basic limitation: a
processor can only execute statements. Either it is doing other things,
in which case it must, by programming, check the queue -- this is polling.
Or it can suspend itself (the thread or process) and tell the OS (or the
thread handling mechanism) to wake it up when a message arrives in a queue
or a semaphore becomes active.
You need to look a bit under the hood, so to speak... That's why I said in
the other message that I think it would do you some good to read up a bit
on multitasking OS programming techniques in general. There are not that
many, in principle, but it helps to understand the basics.
Gerhard
I think he's refering to Unix signal handlers. These really are called
asynchronously. When the signal comes in, the system pushes some
registers on the stack, calls the signal handler, and when the signal
handler returns it pops the registers off the stack and resumes
execution where it left off, more or less. If the signal comes while the
process is in certain system calls, the call returns with a value or
errno setting that indicates it was interrupted by a signal.
Unix signals are an awkward low-level relic. They used to be the only
way to do non-blocking but non-polling I/O, but current systems offer
much better ways. Today the sensible things to do upon receiving a
signal are ignore it or terminate the process. My opinion, obviously.
--
--Bryan
8<----------------------------------------------------------------
| point). It's not only important that the threads die, but that they die
| with grace. There's lots of cleanup work that has to be done when
| things exit, or things end up in an indeterminate state.
|
| So, I feel like I have a couple of options,
|
| 1) try moving everything to a process oriented configuration - we think
| this would be bad, from a resource standpoint as well as it would make
| things more difficult to move to a fully distributed system later, when
| I get my army of code monkeys.
|
| 2) Suck it up and go straight for the distributed system now - managers
| don't like this, but maybe it's easier than I think it's going to be, I dunno
|
| 3) See if we can find some other way of getting the threads to terminate.
|
| 4) Kill it and clean it up by hand or helper scripts - we don't want to
| do this either; it's one of the major things we're trying to get away from.
8<-----------------------------------------------------------------------------
This may be a stupid suggestion - If I understand what you are doing, it's
essentially running a bunch of compilers with different options on various
machines around the place - so there is a fifth option - namely to do nothing -
let them finish and just throw the output away - i.e. just automate the
cleanup...
- Hendrik
I think Carl is using Linux, so the awful overhead of process creation
in Windows doesn't apply. Forking in Linux isn't that big a deal.
os.system() usually forks a shell, and the shell forks the actual
command, but even two forks per ssh is no big deal. The Apache web
server usually runs with a few hundred processes, etc. Carl, just how
many of these ssh's do you need active at once? If it's a few hundred
or less, I just wouldn't worry about these optimizations you're asking
about.
Ya' think? Looks like you have no particular need for shared
memory, in your small distributed system.
> I think this conversation is getting hairy and confusing so I'm
> going to try and paint a better picture of what's going on. Maybe this
> will help you understand exactly what's going on or at least what I'm
> trying to do, because I feel like we're just running in circles.
[...]
So step out of the circles already. You don't have a Python thread
problem. You don't have a process overhead problem.
[...]
> So, I have a distributed build system. [...]
Not a trivial problem, but let's not pretend we're pushing the
state of the art here.
Looks like the system you inherited already does some things
smartly: you have ssh set up so that a controller machine can
launch various build steps on a few dozen worker machines.
[...]
> The threads invoke a series
> of calls that look like
>
> os.system(ssh <host> <command>)
>
> or for more complex operations they would just spawn a process that ran
> another python script)
>
> os.system(ssh <host> <script>)
[...]
> Alright, so this scheme that was first put in place kind of worked.
> There were some problems, for example when someone did something like
> os.system(ssh <host> <script>) we had no good way of knowing what the
> hell happened in the script.
Yeah, that's one thing we've been telling you. The os.system()
function doesn't give you enough information nor enough control.
Use one of the alternatives we've suggested -- probably the
subprocess.Popen class.
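For contrast with os.system(), a minimal subprocess.Popen sketch that captures both the output and the exit status of the command (a local shell command stands in for the ssh invocation; the failing exit code simulates a remote script that went wrong):

```python
import subprocess

# "sh -c ..." stands in for ["ssh", host, script]; the command line
# here is an illustrative assumption, not the real build step.
proc = subprocess.Popen(
    ["sh", "-c", "echo step started; exit 3"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()
# Unlike os.system(), we now know what the hell happened in the script.
print(proc.returncode, out.decode().strip())
```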
[...]
> So, I feel like I have a couple of options,
>
> 1) try moving everything to a process oriented configuration - we think
> this would be bad, from a resource standpoint as well as it would make
> things more difficult to move to a fully distributed system later, when
> I get my army of code monkeys.
>
> 2) Suck it up and go straight for the distributed system now - managers
> don't like this, but maybe it's easier than I think it's going to be, I dunno
>
> 3) See if we can find some other way of getting the threads to terminate.
>
> 4) Kill it and clean it up by hand or helper scripts - we don't want to
> do this either; it's one of the major things we're trying to get away from.
The more you explain, the sillier that feeling looks -- that those
are your options. Focus on the problems you actually have. Track
what build steps worked as expected; log what useful information
you have about the ones that did not.
That "resource standpoint" thing doesn't really make sense. Those
os.system() calls launch *at least* one more process. Some
implementations will launch a process to run a shell, and the
shell will launch another process to run the named command. Even
so, efficiency on the controller machine is not a problem given
the scale you have described.
--
--Bryan
Actually it does in the C API, but it isn't exported to Python.
ctypes can fix that though.
> and its absence is not an oversight. Threads often have important
> business left to do, such as releasing locks on shared data; killing
> them at arbitrary times tends to leave the system in an inconsistent
> state.
Here is a demo of how to kill threads in python in a cross platform
way. It requires ctypes. Not sure I'd use the code in production but
it does work...
"""
How to kill a thread demo
"""
import threading
import time
import ctypes
class ThreadKilledError(Exception): pass
_PyThreadState_SetAsyncExc = ctypes.pythonapi.PyThreadState_SetAsyncExc
_c_ThreadKilledError = ctypes.py_object(ThreadKilledError)
def _do_stuff(t):
"""Busyish wait for t seconds. Just sleeping delays the exeptions in the example"""
start = time.time()
while time.time() - start < t:
time.sleep(0.01)
class KillableThread(threading.Thread):
"""
Show how to kill a thread
"""
def __init__(self, name="thread", *args, **kwargs):
threading.Thread.__init__(self, *args, **kwargs)
self.name = name
print "Starting %s" % self.name
def kill(self):
"""Kill this thread"""
print "Killing %s" % self.name
_PyThreadState_SetAsyncExc(self.id, _c_ThreadKilledError)
def run(self):
self.id = threading._get_ident()
while 1:
print "Thread %s running" % self.name
_do_stuff(1.0)
if __name__ == "__main__":
thread1 = KillableThread(name="thread1")
thread1.start()
_do_stuff(0.5)
thread2 = KillableThread(name="thread2")
thread2.start()
_do_stuff(2.0)
thread1.kill()
thread1.join()
_do_stuff(2.0)
thread2.kill()
thread2.join()
print "Done"
--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick
While you are correct that signals are not the solution to this problem,
the details of this post are mostly incorrect.
If a thread never performs any I/O operations, signal handlers will still
get invoked on the arrival of a signal.
Signal handlers have to run in some context, and that ends up being the
context of some thread. A signal handler is perfectly capable of exiting
the thread in which it is running. It is also perfectly capable of
terminating any subprocess that thread may be responsible for.
However, since sending signals to specific threads is difficult at best
and impossible at worst, combined with the fact that in Python you
/still/ cannot handle a signal until the interpreter is ready to let you
do so (a fact that seems to have been ignored in this thread repeatedly),
signals end up not being a solution to this problem.
Jean-Paul
No, it has a method for raising an exception. This isn't quite the
same as a method for killing a thread. Also, this has been mentioned
in this thread before. Unfortunately:
import os, threading, time, ctypes

class ThreadKilledError(Exception):
    pass

_PyThreadState_SetAsyncExc = ctypes.pythonapi.PyThreadState_SetAsyncExc
_c_ThreadKilledError = ctypes.py_object(ThreadKilledError)

def timeSleep():
    time.sleep(100)

def childSleep():
    os.system("sleep 100") # time.sleep(100)

def catchException():
    while 1:
        try:
            while 1:
                time.sleep(0.0)
        except Exception, e:
            print 'Not exiting because of', e

class KillableThread(threading.Thread):
    """
    Show how to kill a thread -- almost
    """
    def __init__(self, name="thread", *args, **kwargs):
        threading.Thread.__init__(self, *args, **kwargs)
        self.name = name
        print "Starting %s" % self.name

    def kill(self):
        """Kill this thread"""
        print "Killing %s" % self.name
        _PyThreadState_SetAsyncExc(self.id, _c_ThreadKilledError)

    def run(self):
        self.id = threading._get_ident()
        try:
            return threading.Thread.run(self)
        except Exception, e:
            print 'Exiting', e

def main():
    threads = []
    for f in timeSleep, childSleep, catchException:
        t = KillableThread(target=f)
        t.start()
        threads.append(t)
    time.sleep(1)
    for t in threads:
        print 'Killing', t
        t.kill()
    for t in threads:
        print 'Joining', t
        t.join()

if __name__ == '__main__':
    main()
Jean-Paul
For some insight into what you might need to do to monitor asynchronous
communications, take a look at the parallel/pprocess module, which I
wrote as a convenience for spawning processes using a thread
module-style API whilst providing explicit channels for interprocess
communication:
http://www.python.org/pypi/parallel
Better examples can presumably be found in any asynchronous
communications framework, I'm sure.
> I don't think you need anything as complex as shared memory for this.
> You're just writing a special purpose chat server.
Indeed. The questioner might want to look at the py.execnet software
that has been presented now at two consecutive EuroPython conferences
(at the very least):
http://indico.cern.ch/contributionDisplay.py?contribId=91&sessionId=41&confId=44
Whether this solves the questioner's problems remains to be seen, but
issues of handling SSH-based communications streams do seem to be
addressed.
Paul
Note that you see many of the same problems with signal handlers
(including only being able to call reentrant functions from them).
Most advanced Unix programming books say you should treat signal
handlers in a manner similar to what people are advocating for remote
thread stoppage in this thread: unless you're doing something trivial,
your signal handler should just set a global variable. Then your
process can check that variable in the main loop and take more complex
action if it's set.
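That "just set a global variable" advice translates to Python roughly as follows (Unix-only sketch; SIGUSR1 and the names are my choices). As noted elsewhere in this thread, the Python-level handler only runs when the interpreter is ready to let it:

```python
import os
import signal

shutdown_requested = False

def handler(signum, frame):
    # Do almost nothing in the handler: just record the request.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)   # simulate an external kill request

# The main loop checks the flag and performs the complex cleanup itself.
for step in range(1000):
    if shutdown_requested:
        print("cleaning up at step", step)
        break
```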
Actually I don't understand the need for SSH. This is traffic over a
LAN, right? Is all of the LAN traffic encrypted? That's unusual; SSH
is normally used to secure connections over the internet, but the
local network is usually trusted. Hopefully it's not wireless.
-c
--
Carl J. Van Arsdall
cvana...@mvista.com
Most places I've worked do use a lot of encryption on the LAN. Not
everything on the LAN is encrypted (e.g outgoing http connections) but
a lot of things are. Trusting the whole network is a bad idea, since
it allows a compromise of one machine to turn into a compromise of the
whole LAN.
Who are you and what have you done with the real Paul Rubin?
> This is traffic over a
> LAN, right? Is all of the LAN traffic encrypted? That's unusual; SSH
> is normally used to secure connections over the internet, but the
> local network is usually trusted. Hopefully it's not wireless.
I think not running telnet and rsh daemons is a good policy anyway.
--
--Bryan
| On Thu, 27 Jul 2006 08:48:37 -0400, Jean-Paul Calderone
| <exa...@divmod.com> declaimed the following in comp.lang.python:
|
| >
| > If a thread never performs any I/O operations, signal handlers will still
| > get invoked on the arrival of a signal.
| >
| I thought I'd qualified that scenario -- from the point of view of
| one OS I've programmed, where signals were only handled as part of the
| I/O system (or, in more general, as part of any /blocking/ system call
| from an application -- not even a <ctrl-c> would have an effect unless
| the application invoked a blocking call, or explicitly polled the signal
| bits of its task header)
| --
I have to support this - a processor is only doing one instruction at a time, if
it's not a multi-core device - and the only ways that the "operating system part"
of the system can get control back from the *user program part* of the system
are:
a) when the User makes an OS call (like blocking I/O, or any OS request)
b) when the user code is interrupted by some hardware thingy and the OS part
handles the interrupt.
So on a processor that does not have protected instructions - if an idiot writes
something to the following effect:
*instructions to disable interrupts*
followed by :
*instructions that go into an infinite loop AND that make no OS calls*
the whole bang shoot stops dead - Reset your machine...
Dennis - did your OS not have a ticker running?
- Hendrik
A common recovery mechanism in embedded systems is a watchdog timer,
which is a hardware device that must be poked by the software every
so often (e.g. by writing to some register). If too long an interval
goes by without a poke, the WDT hard-resets the cpu. Normally the
software would poke the WDT from its normal periodic timing routine.
A loop like you describe would stop the timing routine from running,
eventually resulting in a reset.
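A software analogue of that watchdog pattern is easy to sketch (the names here are mine; a real WDT would hard-reset the CPU instead of setting a flag):

```python
import time

class Watchdog:
    """Fires if poke() is not called at least every `interval` seconds."""
    def __init__(self, interval):
        self.interval = interval
        self.last_poke = time.monotonic()
        self.fired = False

    def poke(self):
        self.last_poke = time.monotonic()

    def check(self):
        # In hardware this would hard-reset the CPU; here we just flag it.
        if time.monotonic() - self.last_poke > self.interval:
            self.fired = True
        return self.fired

wdt = Watchdog(interval=0.05)
for _ in range(3):        # healthy loop: pokes arrive on time
    time.sleep(0.01)
    wdt.poke()
healthy = wdt.check()     # False: the timer was poked recently

time.sleep(0.1)           # simulate a hung loop: no pokes
print("healthy:", healthy, "fired:", wdt.check())
```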
I don't run any wireless networks, but given the apparently poor state
of wireless network security (as far as the actual implemented
standards in commercially available products are concerned), I'd want
to be using as much encryption as possible if I did.
Anyway, the py.execnet thing is presumably designed to work over the
Internet and over local networks, with the benefit of SSH being that it
applies well to both domains. Whether it's a better solution for the
questioner's problem than established alternatives such as PVM (which
I've never had the need to look into, even though it seems
interesting), various distributed schedulers or anything else out
there, I can't really say.
Paul
*grin* - Yes of course - if the WDT was enabled - it's something that I have not
seen on PC's yet...
- Hendrik
You could use ssh's port forwarding features and just open a normal
TCP connection to a local port that the local ssh server listens to.
Then the ssh server forwards the traffic through an encrypted tunnel
to the other machine. Your application doesn't have to know anything
about ssh.
In fact there's a VPN function (using tun/tap) in recent versions of
ssh that should make it even simpler, but I haven't tried it yet.
They are available for PC's, as plug-in cards, at least for the ISA
bus in the old days, and almost certainly for the PCI bus today.
That is cool, I was not aware of this - added to a long-running server it will
help to make the system more stable - a hardware solution to hard-to-find bugs
in software - (or even stuff like soft errors in hardware - speak to the
Avionics boys about Neutrons) do you know who sells them and what they are
called? -
Sorry if this is getting off topic on this thread... (in a way it is on topic -
because a reset will stop a thread every time...)
- Hendrik
| On Fri, 28 Jul 2006 08:27:18 +0200, "H J van Rooyen"
| <ma...@microcorp.co.za> declaimed the following in comp.lang.python:
|
| >
| > Dennis - did your OS not have a ticker running?
| >
| That ancient machine, while round-robin, multi-priority,
| pre-emptive, seemed still to only "deliver" signals on deliberate
| blocking calls -- perhaps to prevent potential corruption if the signal
| had been delivered(handled) in the middle of some multi-instruction
| sequence. The OS level would "see" the <ctrl-c>, and set the signal bit
| in the task header -- but without the blocking I/O (typically), the code
| to activate a registered signal handler would not be invoked. Operation
| was something like: submit I/O request, AND(signal bits, signal mask) --
| invoke handler if non-zero, block for I/O return [or return directly for
| asynchronous I/O request]
- Hah! - so it *could* have responded - it just chose not to - so it was
pre-emptive - but hey - what is different between modern OS's and what you are
describing? - it seems to me that there is just a lot of extra memory control,
as well as control over who is allowed to do what - in an effort to make things
more stable - and all this stuff just eats cycles and slows you down... (or
alternatively, makes the hardware more complex and expensive...)
But to get back to the OP's problem - basically the thread has to see some sort
of variable change, or receive a message (by examining something to see if there
is a message there) and then kill itself, or the OS must be told to stop giving
control back to the thread in question - which option will leave all the loose
ends in the thread loose...
So its either: "hey mr nice thread please stop" - or "hey Mr OS - kill that
thread ..." - now from the OS viewpoint - if the application implements some
threading itself - it may not even know that the thread exists - OS threads are
known variously as "Tasks" or "Processes" or "Running Programmes" - so using the
big guns on a thread may not be possible without killing the parent too...
So if you want to use the OS to kill the thread - it has to be a formal OS
thread - something started with a call to the OS, and not something that an
application implements by itself - and I am not familiar enough with Python
threading and dummy threading to pretend to know what is "under the hood" - but
I haven't seen an additional process appearing on my Linux box when I start a
thread - so it's either something that Python does on its own without
"registering" the new thread with Linux - or I haven't looked closely enough...
So if somebody else can take over here, we might convince the OP that "hey mr
nice thread" is the way to go, even in the case that the "thread" in question is
an OS Process - after all - there has to be inter-task communication in any
case - so the cleanest solution is to build the kill in right from scratch...
Why do I think of COBOL:
read master_file_record at end go to end_routine....
HTH - Hendrik
I usually try froogle.com to find stuff like that.
The intel 810 chipset (and all after that) has a built-in watchdog timer -
unfortunately on some motherboards it's disabled (I guess in the BIOS).
How do I know that?
Once I got Linux installed on a new machine.... and although the install
went without a problem, after the first boot the machine would reboot at
exactly 2 minutes.
After a bit of poking around I found that hotplug detected the WDT support
and loaded the driver for it (i8xx_tco), and it seems the WDT chip was set
to start ticking right away after the driver poked it.
--
damjan
Yikes! "some poking around" - with two minutes to do it in - must have scarred
you for life!
- Hendrik
> "Paul Rubin" <http://phr...@NOSPAM.invalid> Writes:
>
> | "H J van Rooyen" <ma...@microcorp.co.za> writes:
> | > *grin* - Yes of course - if the WDT was enabled - it's something that
> | > I have not seen on PC's yet...
> |
> | They are available for PC's, as plug-in cards, at least for the ISA
> | bus in the old days, and almost certainly for the PCI bus today.
>
> That is cool, I was not aware of this - added to a long running server it will
> help to make the system more stable - a hardware solution to hard to find bugs
> in Software - (or even stuff like soft errors in hardware - speak to the
> Avionics boys about Neutrons) do you know who sells them and what they are
> called? -
When you're talking about a bunch of (multiprocessing) machines on a
LAN, you can have a "watchdog machine" (or more than one, for
redundancy) periodically checking all others for signs of health -- and,
if needed, rebooting the sick machines via ssh (assuming the sickness is
in userland, of course -- to come back from a kernel panic _would_
require HW support)... so (in this setting) you _could_ do it in SW, and
save the $100+ per box that you'd have to spend at some shop such as
<http://www.pcwatchdog.com/> or the like...
Alex
Thanks - will check it out - seems a lot of money for 555 functionality
though....
Especially if, like me, you have to pay for it with Rand - I have started to call
the local currency Runt...
(Typical South African Knee Jerk Reaction - everything is too expensive here...
:- ) )
- Hendrik
> Thanks - will check it out - seems a lot of money for 555 functionality
> though....
>
> Especially if, like me, you have to pay for it with Rand - I have started
> to call the local currency Runt...
Depending on what you're up to, you can make such a thing yourself
relatively easily. There are various possibilities, both for the
reset/restart part and for the kick-the-watchdog part.
Since you're talking about a "555" you know at least /some/ electronics :)
Two 555s (or similar):
- One wired as a retriggerable monostable and hooked up to a control line
of a serial port. It needs to be triggered regularly in order to not
trigger the second timer.
- The other wired as a monostable and hooked up to a relay that gets
activated for a certain time when it gets triggered. That relay controls
the computer power line (if you want to stay outside the case) or the reset
switch (if you want to build it into your computer).
I don't do such things with 555s... I'm more a digital guy. There are many
options to do that, and all a lot cheaper than those boards, if you have
more time than money :)
Gerhard
Cheers!
-c
--
Carl J. Van Arsdall
cvana...@mvista.com
There's some pretty tricky issues with desktop-class PC hardware about
what to do if you need to reconfigure or reboot one remotely. Real
server hardware is better equipped for this but costs a lot more.
I remember something called "PC-Weasel" which was an ISA-bus plug-in
card that was basically a VGA card with an ethernet port. That let
you see the bootup screens remotely, adjust the cmos settings, etc. I
remember trying without success to find something like that for the
PCI bus. Without something like that, all you can really do if a PC
used as a server gets wedged is remote-reset or power-cycle it; even that of
course takes special hardware, but many colo places are already set up
for that.
| Yea, there are other free solutions you might want to check out, I've
| been looking at ganglia and nagios. These require constant
| communication with a server, however they are customizable in that you
| can have the server take action on various events.
|
| Cheers!
|
| -c
Thanks - will have a look - Hendrik
| On 2006-08-03 06:07:31, H J van Rooyen wrote:
|
| > Thanks - will check it out - seems a lot of money for 555 functionality
| > though....
| >
| > Especially if, like me, you have to pay for it with Rand - I have started
| > to call the local currency Runt...
|
| Depending on what you're up to, you can make such a thing yourself
| relatively easily. There are various possibilities, both for the
| reset/restart part and for the kick-the-watchdog part.
|
| Since you're talking about a "555" you know at least /some/ electronics :)
*grin* You could say that - original degree was Physics and Maths ...
| Two 555s (or similar):
| - One wired as a retriggerable monostable and hooked up to a control line
| of a serial port. It needs to be triggered regularly in order to not
| trigger the second timer.
| - The other wired as a monostable and hooked up to a relay that gets
| activated for a certain time when it gets triggered. That relay controls
| the computer power line (if you want to stay outside the case) or the reset
| switch (if you want to build it into your computer).
|
| I don't do such things with 555s... I'm more a digital guy. There are many
| options to do that, and all a lot cheaper than those boards, if you have
| more time than money :)
Like wise - some 25 years of amongst other things designing hardware and
programming 8051 and DSP type processors in assembler...
The 555 came to mind because it has been around forever - and as someone once
said (Steve Ciarcia?) -
"My favourite programming language is solder"... - a dumb state machine
implemented in hardware beats a processor every time when it comes to
reliability - it's just a tad inflexible...
The next step above the 555 is a PIC... then you can steal power from the RS-232
line - and it's a small step from "PIC" to "PIG"...
Although this is getting a bit off topic on a language group...
;-) Hendrik
> The next step above the 555 is a PIC... then you can steal power from the
> RS-232 line - and it's a small step from "PIC" to "PIG"...
I see... you obviously know what to do, if you want to :)
But I'm not sure such a device alone is of much help in a typical server. I
think it's probably just as common that only one service hangs. To make it
useful, the trigger process has to be carefully designed, so that it
actually has a chance of failing when you need it to fail. This probably
requires either code changes to the various services (so that they each
trigger their own watchdog) or some supervisor program that only triggers
the watchdog if it receives responses from all relevant services.
Gerhard
This is true - it's trivial to just kill the whole machine like this, but it's
kind of like using a sledgehammer to crack a nut - and as you so rightly point
out - if the process that tickles the watchdog to make it happy is not (very)
tightly coupled to the thing you want to monitor - then it may not work at all -
especially if interrupts are involved - in fact something like a state machine
that looks for alternate occurrences of (at least) two things is required - the
interrupt gives it a kick and sets a flag, the application sees the flag and
gives it the alternate kick and clears the flag, and so on, with the internal
tasks in the machine "passing the ball" in this (or some other) way - that way
you are (relatively) sure the thing is still running... but it needs careful
design or it will either kill the machine for no good reason (when something
like disk accesses slow the external (user) processes down), or it will fail
to fire if it is something that is driven from a call-back - the app may be
crazy, but the OS may still be doing call-backs and timing stuff faithfully -
you can't be too careful...
- Hendrik