Limiting time of multicore run and related cleanup


Pavel

Apr 2, 2015, 3:15:33 PM
to julia...@googlegroups.com
What would be a good way to limit the total runtime of a multicore process managed by pmap?

I have pmap processing a collection of optimization runs (with fminbox), and most of the time everything runs smoothly. Occasionally, however, 1-2 of the (say) 8 CPUs take too long to complete one optimization, and fminbox with conjugate gradient has no way to limit run time, as recently discussed:
http://julia-programming-language.2336112.n4.nabble.com/fminbox-getting-quot-stuck-quot-td12163.html

To deal with this in a crude way, at the moment I call Julia from a shell (bash) script with timeout:

    timeout 600 julia -p 8 juliacode.jl

When doing this, is there anything that helps find and stop zombie processes (if any) after timeout forces a multicore pmap run to terminate? Is there anything within Julia related to how the processes are spawned? Are there alternatives to the shell timeout? I know NLopt has a time limit option, but it is implemented in the underlying C library rather than in Julia.
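
As a small refinement of the shell approach, here is a sketch that distinguishes a timed-out run from a normal one. It relies on GNU `timeout` exiting with status 124 when it had to stop the command, and on its `-k` option, which sends SIGKILL if the command is still alive some seconds after the initial SIGTERM. The function name `run_with_limit` is hypothetical, and `sleep` stands in for the Julia call.

```shell
#!/usr/bin/env bash
# run_with_limit LIMIT CMD...: run CMD under GNU timeout (assumed available).
# Prints whether the command timed out and returns timeout's exit status.
run_with_limit() {
    local limit=$1
    shift
    # -k 5: send SIGKILL if CMD is still alive 5 s after the SIGTERM.
    timeout -k 5 "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s"
    else
        echo "finished with status $status"
    fi
    return "$status"
}

# Stand-in for: run_with_limit 600 julia -p 8 juliacode.jl
run_with_limit 1 sleep 3 || echo "exit status: $?"
```

If SIGKILL was actually needed, GNU `timeout` reports 137 (128+9) instead of 124, so a script that cares about that case would have to check for both values.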

Pavel

Apr 29, 2015, 11:38:05 PM
to julia...@googlegroups.com
Here is my current bash script (still using timeout, for lack of alternative suggestions):

    timeout 600 julia -p $(nproc) juliacode.jl >>results.log 2>&1
    killall -9 -v julia >>cleanup.log 2>&1

Does that seem reasonable? Perhaps Linux experts can think of scenarios where this would not be sufficient to clean up runaway or non-responding processes?
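
One scenario worth considering: `killall -9 julia` matches by process name, so it would also kill unrelated Julia sessions of the same user. A possible narrower cleanup, sketched below, is to start the job in its own session with `setsid` and signal only that process group afterwards. The sketch assumes a non-interactive script (so that `setsid` does not fork and the group ID equals `$!`), and it assumes the workers stay in the process group of the process that launched them. The function name `cleanup_group` is hypothetical, and a small `sh` command stands in for a Julia master that exits while two workers keep running.

```shell
#!/usr/bin/env bash
# cleanup_group PGID: SIGKILL whatever is left in process group PGID.
# Returns 0 if the group still had members, 1 if nothing was found.
cleanup_group() {
    local pgid=$1
    # Signal 0 only tests whether the group exists; nothing is delivered.
    if kill -s 0 -- "-$pgid" 2>/dev/null; then
        kill -s KILL -- "-$pgid"
        return 0
    fi
    return 1
}

# Stand-in for: setsid timeout 600 julia -p "$(nproc)" juliacode.jl &
# The leader exits at once and leaves two "workers" behind in its group.
setsid sh -c 'sleep 30 & sleep 30 & exit 0' &
job=$!
wait "$job"      # returns once the leader (not the leftovers) has exited

if cleanup_group "$job"; then
    result="killed"
    echo "killed leftover processes in group $job"
else
    result="clean"
    echo "nothing left in group $job"
fi
```

Any other `julia` processes on the machine are left alone, since only the one process group is signalled.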

Amit Murthy

Apr 29, 2015, 11:48:15 PM
to julia...@googlegroups.com
Your solution seems reasonable enough.

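Another solution: you could schedule a task in your Julia code that interrupts the workers after a timeout:

    # pmap_not_complete is a placeholder for a flag your code clears once pmap returns
    @schedule begin
        sleep(600)                  # time limit in seconds
        if pmap_not_complete
            interrupt(workers())    # interrupt all worker processes
        end
    end

Start this task before executing the pmap.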

Note that this will work only for additional processes created on the local machine. For SSH workers, `interrupt` is a message sent to the remote workers, which will be unable to process it if the main thread is computation bound.  

Pavel

Apr 30, 2015, 12:32:22 AM
to julia...@googlegroups.com
The task option is interesting. Let's say there are 8 CPU cores. Julia's nprocs() returns 9 when started with `julia -p 8`, which is to be expected. All 8 cores are 100% loaded during the pmap call. Would `interrupt(workers())` leave one running?

Amit Murthy

Apr 30, 2015, 2:38:23 AM
to julia...@googlegroups.com
`interrupt(workers())` is the equivalent of sending a SIGINT to the workers. The tasks which are consuming 100% CPU are interrupted and they terminate with an InterruptException.

All processes are still in a running state after this.

Amit Murthy

Apr 30, 2015, 2:48:09 AM
to julia...@googlegroups.com
Correction: `interrupt` works for local workers as well as SSH ones. I had said otherwise above.