--
You received this message because you are subscribed to the Google Groups "pmix" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pmix+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pmix/C60F84D3-BCFE-4E7F-8F67-E23E828C7450%40pmix.org.
Hi Josh,
I have no objection to alternate launchers if, as you mention,
the vendor is willing to support and maintain them. The only
downside is that it will fall to the community to remove them once
they are no longer maintained.
Thanks,
Mike
In some environments I have encountered, the native launcher is the only available mechanism within the allocation, meaning that ssh is not functional there. I would say that it is not common, but it happens.
If the question is one of support for the non-ssh launchers, then I would suggest trying to tackle that somewhat differently. The owner of the component should maintain it and test it if they want it in the default distribution. If it breaks, then it is their responsibility to fix it. If they neglect it, then it may be removed.
We (IBM) would need the plm framework to stay in place so we can have custom launchers (which are indeed faster than tree spawn ssh at some scales). I worry that if you all remove support for non-ssh launchers then this framework will go away in the process of simplifying the code. That would certainly make it more difficult to support PRRTE in those environments moving forward.
Could you make the ssh launcher the highest priority component? Then users would have to 'opt-in' to using the native launcher.
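To make the opt-in idea concrete, here is a rough sketch of priority-based selection. It is only an illustration and not PRRTE's actual plm or MCA code: the `Launcher` type, the `select_launcher` function, the priority values, and the `requested` argument are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Launcher:
    """Hypothetical stand-in for a launcher component (not a PRRTE type)."""
    name: str
    priority: int                   # higher value wins by default
    available: Callable[[], bool]   # can this launcher work in the allocation?


def select_launcher(components: List[Launcher],
                    requested: Optional[str] = None) -> Launcher:
    """Pick a launcher: an explicit request wins, otherwise highest priority."""
    usable = [c for c in components if c.available()]
    if requested is not None:
        # The user opted in to a specific launcher by name.
        for c in usable:
            if c.name == requested:
                return c
        raise LookupError(f"requested launcher {requested!r} is not usable")
    if not usable:
        raise LookupError("no usable launcher found")
    return max(usable, key=lambda c: c.priority)


# Assumed priorities: ssh is ranked highest, the native launcher is opt-in.
components = [
    Launcher("ssh", priority=100, available=lambda: True),
    Launcher("native", priority=10, available=lambda: True),
]

print(select_launcher(components).name)                      # default choice
print(select_launcher(components, requested="native").name)  # explicit opt-in
```

With this arrangement the default is ssh, a user can still ask for the native launcher by name, and in an allocation where ssh reports itself unavailable the selection falls through to the native launcher without any user action.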
On 3/25/22 08:37, Ralph Castain wrote:
> After encountering yet another Slurm-induced breakage of the launch system, I find myself wondering about a better long-term solution than continuing to chase the various launch environments. Over the years, we have often encountered problems where the RMs make a change that breaks our integration, often not detected for long periods of time until user complaints reach us. This leads to a complex web of "if-else" clauses as we try to navigate what works for which versions of the RM.
Has this been filed with SchedMD, or brought up in a PMIx ticket? I'm
still trying to learn how to best follow along on these problems.
> One launch method (ssh/rsh), however, always works. In the dim dark past, there were launch time benefits to using the "native" launcher - but that has not been true for quite a long time now. The ssh tree spawn is generally just as fast as the host environment.
It's not documented, but a best-practice suggestion for sites we work
with is to avoid SSH-based communication between cluster nodes. I'm not
saying our attitude is correct, just saying it's a divergent approach,
and one that contradicts your stance here.
Cloud-bursted environments are also likely to eschew SSH-based support,
although, again, that's up to the integration scripting.
Also - there are advantages to 'srun' based launches, most notably in
how the statistics for the job steps are captured and stored. If you
rely on SSH-based launches, you won't be able to distinguish between
successive application launches within the job, and all use will be
aggregated and accounted against the external step.