On Fri, Nov 14, 2014 at 10:05 AM, Peter Kourzanov
<peter.k...@gmail.com> wrote:
> What I am doing right now is the following sequence to start N workers using
> a single ssh session and a single LSF job:
>
> 1. LSF.addprocs(["queue"], dir="julia")
> 2. ssh from the client to an LSF access machine (and run a "bjulia LSF"
>    script)
> 3. submit an LSF job to start Julia (which executes a "run" bash script)
> 4. "run" determines the number of CPUs on the scheduled host and starts N
>    worker processes
> 5. Julia on the client waits for N to be printed (done by "run", read from
>    the ssh stream) and then collects the conn_info printed by each worker
>    to the same stream
> 6. the connections for the client's Julia processes are made directly,
>    bypassing the original ssh stream (tunnel=false)
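A minimal sketch of the "run" script's protocol described above (the names and exact output format here are my assumptions, not the actual gist):

```shell
#!/bin/sh
# Hypothetical sketch of "run": print the worker count first, then
# start one worker per CPU on the scheduled host; each worker writes
# its conn_info to the same stdout stream that the client-side Julia
# process is reading.
N=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)  # CPUs on this host
echo "$N"                         # the count the client waits for
i=0
while [ "$i" -lt "$N" ]; do
    # the real script would do: julia --worker &   (prints host:port)
    echo "worker $i: conn_info would appear here"
    i=$((i + 1))
done
```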
>
> What would make sense is to integrate step (4) into julia's --worker
> handling (triggering on the -p N --worker combination)
In many cases, the number of hardware cores on the system is not the
same as the number of Julia workers one wants to start. With
hyperthreading, there's an obvious factor of two one may want to use,
and with queuing systems such as LSF, other jobs may be scheduled on
the same node. In the case of LSF, there's an environment variable set
that specifies how many processes LSF says should be running on the
node.
Given that this is LSF-specific, moving this into Julia may be
difficult, because Julia would not know a priori which variable to
look at. Maybe a Julia command line such as `julia -p
"$LSF_NUM_PROCS"` would make sense? This option would automatically
be added by the "bjulia LSF" script.
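A minimal sketch of such a wrapper, assuming the count lives in an environment variable (LSB_DJOB_NUMPROC is what recent LSF versions appear to export; both the variable name and the fallback are assumptions here):

```shell
#!/bin/sh
# Hypothetical "bjulia LSF" wrapper sketch: take the slot count that
# LSF exports for this job and hand it to julia via -p, defaulting to
# 1 when run outside LSF. LSB_DJOB_NUMPROC is an assumed name.
NPROCS="${LSB_DJOB_NUMPROC:-1}"
echo "would exec: julia -p $NPROCS --worker"
# the real wrapper would end with: exec julia -p "$NPROCS" --worker
```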
-erik
> and change base/multi.jl to do (5)
> Then there would be no need to have a "run" script and all ClusterManagers
> get the benefit of more parallelization (at no additional cost of extra ssh
> sessions, LSF jobs etc.)
>
> On Friday, November 14, 2014 2:58:48 PM UTC+1, Amit Murthy wrote:
>>
>> Is each ssh session counted as a job? Or is it each ssh connection.
>>
>> If it is the latter, then use of ssh flags to multiplex sessions over the
>> same socket connection, something like `-o ControlMaster=auto -o
>> ControlPath=[path] -o ControlPersist=5`, should also work.
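For reference, a sketch of those multiplexing options (ControlPersist needs OpenSSH >= 5.6; the socket path below is just an example, not a required location):

```shell
# OpenSSH connection multiplexing sketch: the first ssh to a host
# becomes the master, and later sessions to the same host reuse its
# Unix socket instead of opening a new TCP + auth handshake.
SSH_OPTS="-o ControlMaster=auto -o ControlPath=$HOME/.ssh/mux-%r@%h:%p -o ControlPersist=5"
# each of these would then share one underlying connection:
#   ssh $SSH_OPTS user@node julia --worker
#   ssh $SSH_OPTS user@node julia --worker
echo "$SSH_OPTS"
```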
>>
>> I'll take a look at the gist in a while.
>>
>>
>> On Fri, Nov 14, 2014 at 7:11 PM, Peter Kourzanov <peter.k...@gmail.com>
>> wrote:
>>>
>>> It's exactly what I am arguing for. For the most efficient implementation,
>>> you don't want to spawn N ssh (or LSF, or SGE) sessions to start N workers
>>> on 1 host. Some of the cluster systems severely limit the number of jobs
>>> anyway.
>>>
>>> It requires extensions to --worker handling, and to the code that
>>> processes worker registrations (see the patch to multi.jl in the Gist at
>>> the beginning of this thread).
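The client side of that stream handling can be sketched like this (pure illustration in shell; the real change would live in the worker-registration code in base/multi.jl, and the names and line format here are assumptions):

```shell
# Hypothetical sketch of the client-side protocol: read the worker
# count from the single ssh stream, then collect one conn_info line
# (host:port) per worker.
parse_stream() {
    read -r n                      # first line: number of workers
    i=0
    while [ "$i" -lt "$n" ]; do
        read -r conn_info          # one host:port line per worker
        echo "register worker: $conn_info"
        i=$((i + 1))
    done
}
# feed it a fake two-worker stream:
printf '2\nhost1:9009\nhost1:9010\n' | parse_stream
```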
>>>
>>> Shall I create a PR for this?
>>
>>
>
--
Erik Schnetter <schn...@cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/