python/signal question


Daniel Povey

Jan 2, 2017, 12:14:34 AM
to kaldi-developers
I have a question.
If a Python script creates subprocesses using the 'subprocess' module,
and then the script terminates abnormally, e.g. via abort() or
some kind of signal, does the signal get propagated somehow to the
child processes? And if so, what factors influence this?
We're going to change queue.pl so that it kills its jobs when it is killed,
so there will be a time when this kind of thing matters.

Dan

Vassil Panayotov

Jan 2, 2017, 3:21:12 AM
to kaldi-de...@googlegroups.com
I'm not very knowledgeable about this either, but it seems this can be achieved using a combination of the 'atexit' and 'signal' modules (http://stackoverflow.com/questions/2546276/python-process-wont-call-atexit). Apparently this should be enough to handle e.g. SIGTERM and SIGABRT, but I think there is no way to catch SIGKILL (e.g. the "kill -9 <pid>" command).
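
A minimal sketch of the atexit-plus-signal combination described above, assuming a POSIX system. The names `cleanup`, `exit_on_signal` and `install_handlers` are hypothetical, and the body of `cleanup` is only a placeholder for whatever would terminate the script's children.

```python
import atexit
import signal
import sys


def cleanup():
    # Placeholder: a real script would terminate its child processes here.
    print("cleanup ran", file=sys.stderr)


def exit_on_signal(signum, frame):
    # sys.exit() raises SystemExit, so the interpreter shuts down normally
    # and the functions registered with atexit get a chance to run.
    # 128 + signum follows the usual shell convention for "killed by signal".
    sys.exit(128 + signum)


def install_handlers(signums=(signal.SIGTERM,)):
    # Must be called from the main thread; Python only allows signal
    # handlers to be installed there.
    atexit.register(cleanup)
    for signum in signums:
        signal.signal(signum, exit_on_signal)
```

Without such a handler, the default action for SIGTERM ends the process immediately and the atexit functions are skipped. SIGKILL cannot be handled this way at all.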

Vassil


--
You received this message because you are subscribed to the Google Groups "kaldi-developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-developers+unsubscribe@googlegroups.com.
To post to this group, send email to kaldi-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/kaldi-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-developers/CAEWAuySXLxZPvaPRDBNWTo3BGseM0%3Dt26B0JDkpz0X6YcQHOyw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Vassil Panayotov

Jan 2, 2017, 4:24:15 AM
to kaldi-de...@googlegroups.com
Actually, after some more "research", it seems that there could be less complicated ways to propagate signals to children. People talk about using process groups (http://stackoverflow.com/a/4791612) and other methods such as "prctl" (http://evans.io/legacy/posts/killing-child-processes-on-parent-exit-prctl/).
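
As a rough illustration of the process-group idea (not taken from either linked page), the sketch below starts a child in its own process group and later signals the whole group. It assumes a POSIX system; `start_in_new_group` and `kill_group` are hypothetical helper names.

```python
import os
import signal
import subprocess


def start_in_new_group(cmd):
    # start_new_session=True makes the child call setsid(), so it becomes
    # the leader of a new session and process group whose ID is its PID.
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, signum=signal.SIGTERM):
    # Signal every process in the child's group, i.e. the child and any
    # descendants that have not moved themselves to another group.
    os.killpg(os.getpgid(proc.pid), signum)
    return proc.wait()
```

One trade-off of this design: a child in its own session no longer receives the SIGINT that a terminal sends on ctrl-c, so the parent has to forward the signal itself, for example by calling `kill_group` from a signal handler.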

Ho Yin Chan

Jan 2, 2017, 6:45:42 AM
to kaldi-de...@googlegroups.com

Vesely Karel

Jan 2, 2017, 11:19:04 AM
to kaldi-de...@googlegroups.com

Hi,
this is an interesting problem. I created a toy example to play with and noticed this behavior:

  • When 'queue.pl' runs a Python script which forks a child process, the child process is cleaned up properly after a 'qdel'. That's good news!
  • When I run the same script locally in a shell and manually kill the parent process, the child process continues to live after the parent. This is a problem!
To fix this, we cannot use 'atexit': the registered functions are NOT EXECUTED when the parent process is killed by an external signal (I tested this).

What works well is to define a new signal handler function, which kills the processes according to the parent process group-ID (see the attached script).
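
The attached script is not reproduced in this thread, so the following is only a sketch of the idea as described, not Karel's actual code: a signal handler that forwards the signal to the parent's whole process group. It assumes a POSIX system, and the names `kill_process_group` and `install_group_kill` are hypothetical.

```python
import os
import signal


def kill_process_group(signum, frame):
    # Ignore the signal in this process first, so that the group-wide
    # kill below does not re-enter this handler.
    signal.signal(signum, signal.SIG_IGN)
    # Children started with subprocess stay in the parent's process group
    # unless they are explicitly moved, so this reaches them too.
    os.killpg(os.getpgrp(), signum)
    # Exit without running further Python code; 128 + signum is the usual
    # shell convention for "terminated by signal".
    os._exit(128 + signum)


def install_group_kill(signums=(signal.SIGINT, signal.SIGTERM, signal.SIGHUP)):
    # Handlers can only be installed from the main thread.
    for signum in signums:
        signal.signal(signum, kill_process_group)
```

Note that this signals everything in the group. If the Python script was launched from a non-interactive shell script, it usually shares that script's process group, so the caller and its other jobs would be signalled as well.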

Does it solve the problem, Dan? (In your case, are the python child processes cleaned properly when run in the cluster?)

Best regards,
Karel.
run_example.py

Daniel Povey

Jan 2, 2017, 4:07:58 PM
to kaldi-developers
> When I use 'queue.pl' that runs a python script which forks a child process.
> After a 'qdel', the child process is cleaned properly. That's good news!
> When I run the same script locally in shell and manually kill the parent
> process, the child process continues to live after the parent. This is a
> problem!

GridEngine has a special mechanism to kill all children of a qsubbed
job or qlogin session; by default it uses group-ids in some complex
way that I don't fully understand, but I do know that if you have any
users with user-ids in this range:
qconf -sconf | grep gid_range
gid_range 50000-51000
it will cause hard-to-debug problems on the queue (happened to me).

But I wasn't concerned about jobs on the cluster, I was concerned
about scripts like steps/nnet3/train.py, and what happens when you
ctrl-c into them and the like.

> To fix this, we cannot use 'atexit', the registered functions are NOT
> EXECUTED when the parent process is killed by an external signal (I tested
> this).
>
> What works well is to define a new signal handler function, which kills the
> processes according to the parent process group-ID (see the attached
> script).
>
> Does it solve the problem, Dan? (In your case, are the python child
> processes cleaned properly when run in the cluster?)

I think we'll have to use a mechanism like this. And possibly also an
atexit function depending on how we exit the program when we detect
errors in jobs launched from background threads. [might call
os._exit].
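
Whether an atexit function runs does depend on how the program exits: os._exit() ends the process without calling atexit handlers, whereas sys.exit() goes through normal interpreter shutdown. The sketch below shows the difference by running a tiny script in a fresh interpreter; `SCRIPT` and `run_and_report` are hypothetical names used only for this illustration.

```python
import subprocess
import sys

# Tiny script run in a separate interpreter; {exit_call} is filled in below.
SCRIPT = """
import atexit, os, sys
atexit.register(lambda: print("atexit ran", flush=True))
{exit_call}
"""


def run_and_report(exit_call):
    # Returns (exit status, stdout) of a fresh interpreter that registers an
    # atexit hook and then exits via the given statement.
    result = subprocess.run(
        [sys.executable, "-c", SCRIPT.format(exit_call=exit_call)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stdout.strip()
```

Here `run_and_report("sys.exit(3)")` returns `(3, "atexit ran")`, while `run_and_report("os._exit(3)")` returns `(3, "")`. So if an error path calls os._exit, any cleanup of child processes would have to be called explicitly before it.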

Anyway, Vimal will probably do this when he has a chance.

Dan

ondrej platek

Jan 2, 2017, 5:37:20 PM
to kaldi-de...@googlegroups.com
@Dan: AFAIK one can use trap to capture signals and recursively send the same signal to the process's children.
The problem is that one has to register the "callbacks" for each signal number separately.

The second problem is that SIGKILL (9) cannot be trapped (the same is true of SIGSTOP). Other signals, e.g. SIGTERM (15), can be trapped, so use them.
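
In Python the counterpart of the shell's trap is signal.signal(), and it also has to be called once per signal number. A small sketch, assuming a POSIX system; `register_for_signals` is a hypothetical helper name.

```python
import signal


def register_for_signals(handler, signums):
    # Try to install `handler` for each signal; return the signals that were
    # accepted and those the OS refused.
    installed, refused = [], []
    for signum in signums:
        try:
            signal.signal(signum, handler)
            installed.append(signum)
        except OSError:
            # Raised for signals that cannot be caught, e.g. SIGKILL, SIGSTOP.
            refused.append(signum)
    return installed, refused
```

Called with `[signal.SIGTERM, signal.SIGKILL]`, this installs the handler for SIGTERM and reports SIGKILL as refused, which matches the point above that "kill -9" cannot be intercepted.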






--
Ondřej Plátek