Remove all non-ssh launchers?

Ralph Castain

Mar 25, 2022, 10:37:48 AM
to 'Thomas Naughton' via pmix
After encountering yet another Slurm-induced breakage of the launch system, I find myself wondering about a better long-term solution than continuing to chase the various launch environments. Over the years, we have often encountered problems where the RMs make a change that breaks our integration, often not detected for long periods of time until user complaints reach us. This leads to a complex web of "if-else" clauses as we try to navigate what works for which versions of the RM.

One launch method (ssh/rsh), however, always works. In the dim dark past, there were launch time benefits to using the "native" launcher - but that has not been true for quite a long time now. The ssh tree spawn is generally just as fast as the host environment.

So I'm wondering - should we just remove all these other launchers and always use ssh/rsh? It would _greatly_ simplify the code and code maintenance, and I can't see an immediate downside. I realize that some installations have constraints on ssh between various nodes of a cluster, but that (a) is an installation-specific issue, and (b) might be accommodated by adjusting the ssh agent (which the installation is free to specify).

Any thoughts?
Ralph

Michael Karo

Mar 25, 2022, 11:04:12 AM
to pm...@googlegroups.com

Hi Ralph,

I recently made some changes to pbs_tmrsh to support what you are
proposing (though I have yet to commit them). Tree launch has yet to be
addressed, but I was able to launch my applications when it was
disabled. While I was leading the ALPS team, I always ensured the design
and implementation were agnostic to the underlying RM. I still believe that
was the correct direction.

Short answer... I support your proposal.

Thanks,

Mike

Josh Hursey

Mar 25, 2022, 11:10:38 AM
to pm...@googlegroups.com
There are some environments that I have encountered where the native launcher is the only available mechanism within the allocation. Meaning that ssh is not functional within the allocation. I would say that it is not common, but it happens.

If the question is one of support for the non-ssh launchers, then I would suggest trying to tackle that somewhat differently. The owner of the component should maintain it and test it if they want it in the default distribution. If it breaks then it is their responsibility to fix it. If they neglect it then it may be removed.

We (IBM) would need the plm framework to stay in place so we can have custom launchers (which are indeed faster than tree spawn ssh at some scales). I worry that if you all remove support for non-ssh launchers then this framework will go away in the process of simplifying the code. That would certainly make it more difficult to support PRRTE in those environments moving forward.

Could you make the ssh launcher the highest priority component? Then users would have to 'opt-in' to using the native launcher.
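
For anyone not steeped in the MCA machinery, the selection is priority-driven - every launcher component reports a priority and the framework picks the highest one that can run in the current environment. Here is a toy sketch of the idea (not the actual PRRTE plm API, and the names and priority values are made up):

#include <stdio.h>

/* Toy model of MCA-style launcher selection, NOT the actual PRRTE plm
 * component API. Each component reports a priority; the framework keeps
 * the highest-priority component that says it is usable in the current
 * environment. Bumping ssh above the native launchers would make the
 * natives opt-in. Names and priority values are illustrative. */
typedef struct {
    const char *name;
    int (*query)(int *priority);   /* returns 0 if usable in this environment */
} launcher_component_t;

static int ssh_query(int *priority)   { *priority = 60; return 0; }  /* hypothetical default winner */
static int slurm_query(int *priority) { *priority = 40; return 0; }  /* would need an explicit boost to win */

int main(void)
{
    launcher_component_t components[] = { { "ssh", ssh_query }, { "slurm", slurm_query } };
    const char *best = "none";
    int best_pri = -1;

    for (size_t i = 0; i < sizeof(components) / sizeof(components[0]); i++) {
        int pri = 0;
        if (0 == components[i].query(&pri) && pri > best_pri) {
            best_pri = pri;
            best = components[i].name;
        }
    }
    printf("selected launcher: %s (priority %d)\n", best, best_pri);
    return 0;
}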


--
Josh Hursey
IBM Spectrum MPI Developer

Michael Karo

Mar 25, 2022, 11:15:33 AM
to pm...@googlegroups.com


Hi Josh,

I have no objection to alternate launchers if, as you mention, the vendor is willing to support and maintain them. The only downside is that it will fall to the community to remove them once they are no longer maintained.

Thanks,

Mike

Thomas Naughton

Mar 25, 2022, 11:33:59 AM
to pm...@googlegroups.com
Hi,

I see where you are coming from, but I think I like the approach of keeping
the framework for customization but giving highest priority to SSH unless
someone explicitly requests otherwise.

I also recognize that systems which do not already have ssh-launch
support, while rare these days, are exactly the types of systems where
users often benefit from having an "overlay" runtime option.

As for pruning, maybe things get taken out of release branches if they are
not tested, and that removes the problem for releases. Eventually this
raises the question of when to remove them from the main branch, but maybe
that's a CI question for the main branch?

My $0.02,
--tjn

_________________________________________________________________________
Thomas Naughton naug...@ornl.gov
Research Staff (865) 576-4184

Tim Wickberg

Mar 25, 2022, 1:20:00 PM
to pm...@googlegroups.com
On 3/25/22 08:37, Ralph Castain wrote:
> After encountering yet another Slurm-induced breakage of the launch system, I find myself wondering about a better long-term solution than continuing to chase the various launch environments. Over the years, we have often encountered problems where the RMs make a change that breaks our integration, often not detected for long periods of time until user complaints reach us. This leads to a complex web of "if-else" clauses as we try to navigate what works for which versions of the RM.

Has this been filed with SchedMD, or brought up in a PMIx ticket? I'm
still trying to learn how to best follow along on these problems.

> One launch method (ssh/rsh), however, always works. In the dim dark past, there were launch time benefits to using the "native" launcher - but that has not been true for quite a long time now. The ssh tree spawn is generally just as fast as the host environment.

It's not documented, but a best-practice suggestion for sites we work
with is to avoid SSH-based communication between cluster nodes. I'm not
saying our attitude is correct, just saying it's a divergent approach,
and one that contradicts your stance here.

Cloud-bursted environments are also likely to eschew SSH-based support,
although, again, that's up to the integration scripting.

Also - there are advantages to 'srun' based launches, most notably in
how the statistics for the job steps are captured and stored. If you
rely on SSH-based launches, you won't be able to distinguish between
successive application launches within the job, and all use will be
aggregated and accounted against the external step.

> So I'm wondering - should we just remove all these other launchers and always use ssh/rsh?? It would _greatly_ simplify the code and code maintenance, and I can't seen an immediate downside. I realize that some installations have constraints on ssh between various nodes of a cluster, but that (a) is an installation-specific issue, and (b) might be accommodated by adjusting the ssh agent (which the installation is free to specify).

I'd rather find ways to work together to avoid this breakage, or at
least get some degree of common CI infrastructure testing this and
better reporting when issues do occur.

Our QA group has been expanding our own test suite, but we do still
lack anything testing PMIx or various MPI flavors, and that is something
I am expecting to address longer-term.

- Tim

Ralph Castain

Mar 28, 2022, 3:54:50 AM
to 'Thomas Naughton' via pmix
I have filed an issue on this so people not on the mailing list are aware of
the proposed change: https://github.com/openpmix/prrte/issues/1308

Per the added comment:

We would retain the support for reading allocations, but would remove the
native-based launch mechanisms - thus, we would only support two methods
for starting the DVM:

* use ssh for launching the PRRTE daemons. Eliminates many of the problems
  we have had over the years (whether it be scaling limitations or changing cmd
  lines or whatever). Essentially matches what MPICH does.

* bootstrap of daemons that start up with the OS. Supports systems using
  PRRTE as their RTE.

If I don't hear something back by April 9, I'll assume people don't have a
problem with this and start removing the PLM components.

Ralph

Ralph Castain

Mar 28, 2022, 1:19:00 PM
to pmix
On Friday, March 25, 2022 at 8:10:38 AM UTC-7 Josh Hursey wrote:
There are some environments that I have encountered where the native launcher is the only available mechanism within the allocation. Meaning that ssh is not functional within the allocation. I would say that it is not common, but it happens.


Yeah, I've encountered some of those myself - it's a valid point. Even though there are workarounds, it does make things less automatic.

If the question is of support for the non-ssh launchers, then I would suggest trying to tackle that somewhat differently. The owner of the component should maintain it and test it if they want it in the default distribution. If it breaks then it is their responsibility to fix it. If they neglect it then it may be removed.

The problem is that we have no visibility into the status of the components. Frankly, I have no idea if the LSF component works (for example). The only feedback we receive is when someone complains, and that usually happens after a release has been out long enough for systems to upgrade and hit it.

Given that we don't have universal support from the vendors of the target systems, that makes things a tad difficult. It usually means I have to react, typically under some pressure as we are in a "system is broken" mode by then. Hence the motivation behind the proposal.

I suppose one possibility would be to ask the vendors to set up either a CI, a nightly regression, or at least a release candidate test that gets reported back to the community. Not sure how acceptable that would be, and it would undoubtedly take some effort to implement.


We (IBM) would need the plm framework to stay in place so we can have custom launchers (which are indeed faster than tree spawn ssh at some scales). I worry that if you all remove support for non-ssh launchers then this framework will go away in the process of simplifying the code. That would certainly make it more difficult to support PRRTE in those environments moving foward.

Could you make the ssh launcher the highest priority component? Then users would have to 'opt-in' to using the native launcher.

We could, though that also has its downside as noted by others on the mailing list. Still, there is no perfect solution.

Ralph Castain

Mar 28, 2022, 1:39:34 PM
to pmix
On Friday, March 25, 2022 at 10:20:00 AM UTC-7 Tim Wickberg wrote:
On 3/25/22 08:37, Ralph Castain wrote:
> After encountering yet another Slurm-induced breakage of the launch system, I find myself wondering about a better long-term solution than continuing to chase the various launch environments. Over the years, we have often encountered problems where the RMs make a change that breaks our integration, often not detected for long periods of time until user complaints reach us. This leads to a complex web of "if-else" clauses as we try to navigate what works for which versions of the RM.

Has this been filed with SchedMD, or brought up in a PMIx ticket? I'm
still trying to learn how to best follow along on these problems.


First, let me be clear that I wasn't picking on Slurm - we hit this on every environment. I just happened to be working on one related to Slurm when I decided it was time to ask this question.

No, I haven't filed this with SchedMD. We have raised some of the issues there in the past, but the response has been somewhat cool to our situation - and I actually do grok the reasons, even if it does cause us problems. Most of the problems get filed either on the PMIx or PRRTE repos (people aren't necessarily clear which one does what, so it bounces around), and can also show up on the MPI repositories (though we are trying to have those communities at least refile them to PRRTE). Some appear on this mailing list, though that is becoming less common as we push them towards the repos.

The issue we run into (and again, this isn't a Slurm-specific situation) is that vendors change their command lines and/or the meaning of environment variables, adding/subtracting the latter at times, to meet their own needs. Unfortunately, when they do that, they sometimes break our integration. After all, we have to create (using the Slurm example) an "srun" cmd line and then fork/exec it. If the cmd line syntax changes (e.g., the exact name of an option, or the argument it takes), then we are hosed. Same for envars.
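
To make that concrete, here is roughly what the launch boils down to - a stripped-down illustration rather than the actual plm_slurm code, with placeholder option values:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    /* Assemble an srun cmd line that starts one PRRTE daemon per node.
     * If a future release renames or re-interprets any of these options,
     * the launch fails until someone notices and we adapt. Values here
     * are placeholders. */
    char *srun_argv[] = {
        "srun",
        "--ntasks-per-node=1",   /* one daemon per node */
        "--nodes=4",             /* placeholder node count */
        "prted",                 /* the PRRTE daemon (plus its own args in reality) */
        NULL
    };

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (0 == pid) {
        execvp(srun_argv[0], srun_argv);
        perror("execvp");        /* only reached if srun could not be started */
        _exit(1);
    }

    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}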

Examples from the last few years:

* the "--cpu-bind" option changed its spelling
* the meaning of the SLURM_CPU_BIND option changed to set verbosity instead of the actual binding policy - which now is done with SLURM_CPU_BIND_TYPE
* add cmd line option to force addition of all cpus on the node to the PRRTE daemon

Further complicating the problem is that there is no way to detect which method we should use for the given Slurm installation. Best we can do is search the releases to find when it happened and then add configure logic to detect that version and pepper the code with "#if" clauses. Maintenance headache.
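
The end result looks something like the sketch below - the macro name and the version threshold are invented purely for illustration, since there is no official way for the code to probe the installed version:

#include <stdio.h>

/* DETECTED_SLURM_VERSION is a hypothetical macro that configure would have
 * to define after probing the installed release - the running system gives
 * us no reliable way to ask. The threshold below is invented purely for
 * illustration. */
#ifndef DETECTED_SLURM_VERSION
#define DETECTED_SLURM_VERSION 2011   /* pretend configure found 20.11 */
#endif

int main(void)
{
#if DETECTED_SLURM_VERSION >= 2011
    const char *bind_opt = "--cpu-bind=none";   /* newer spelling */
#else
    const char *bind_opt = "--cpu_bind=none";   /* older spelling */
#endif
    printf("adding %s to the srun cmd line\n", bind_opt);
    return 0;
}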

> One launch method (ssh/rsh), however, always works. In the dim dark past, there were launch time benefits to using the "native" launcher - but that has not been true for quite a long time now. The ssh tree spawn is generally just as fast as the host environment.

It's not documented, but a best-practice suggestion for sites we work
with is to avoid SSH-based communication between cluster nodes. I'm not
saying our attitude is correct, just saying it's a divergent approach,
and one that contradicts your stance here.

Cloud-bursted environments are also likely to eschew SSH-based support,
although, again, that's up the integration scripting.

Also - there are advantages to 'srun' based launches, most notably in
how the statistics for the job steps are captured and stored. If you
rely on SSH-based launches, you won't be able to distinguish between
successive application launches within the job, and all use will be
aggregated and accounted against the external step.

This is the most commonly cited issue with ssh-based launches, although I'm not sure how well that pertains to PRRTE. In the case of PRRTE, the user starts the DVM once and then runs as many applications within it as they like. Thus, the daemons are persistent for the life of the allocation, which means that the RM itself has zero visibility into the different executions regardless of how the daemons were started.

That said, it is true that using "srun" means that Slurm does know about the daemons, and therefore it has the ability to at least report an aggregate number for resource utilization. It is a valid point. However, there must be some accounting method to handle the case where a user ssh-launches an application since the system cannot preclude it (unless set up to do so), otherwise it would be a too-obvious way of bypassing charges. So how do you handle that scenario?

Ralph Castain

unread,
Mar 28, 2022, 1:43:09 PM3/28/22
to pmix
Like I said in a prior response, the problem really is that we have no visibility into the status of these launchers. Right now, so far as I know, the only launchers that are regularly tested against recent environments (i.e., not a stone age version) are ssh and ALPS, since we have CI utilizing each of those.

There may well be more testing going on behind the scenes, but we (as a community) have no way of knowing if that is happening, nor the status of the results. Hence, I have no idea how to "prune" a release branch until after it has already been released - and the complaints come rolling in.

Tim Wickberg

Mar 28, 2022, 2:34:05 PM
to pm...@googlegroups.com
I know these are just meant as examples, but I still feel compelled to
correct them:

> Examples from the last few years:
>
> * the "--cpu-bind" option changed its spelling

It did change the preferred spelling, but we still support --cpu_bind as
well, and will do so into the indefinite future.

> * the meaning of the SLURM_CPU_BIND option changed to set verbosity
> instead of the actual binding policy - which now is done with
> SLURM_CPU_BIND_TYPE

There was a change to the _output_ environment variables at some point,
but the input variable remains unchanged. Looking at the prrte code,
setting this variable explicitly in launch_daemons() in
src/mca/plm/slurm/plm_slurm_module.c isn't accomplishing anything.

Looking at e0810859d7b1, it's unclear why that change was made, and it
doesn't seem related to the rest of the changes pushed there.

>  * add cmd line option to force addition of all cpus on the node to the
> PRRTE daemon

That's fair. That was a tough set of calls to make to correct some
unfortunate design decisions, and likely the source of most of the
frustration we've been fielding from PMIx and OpenMPI recently.

> That said, it is true that using "srun" means that Slurm does know about
> the daemons, and therefore it has the ability to at least report an
> aggregate number for resource utilization. It is a valid point. However,
> there must be some accounting method to handle the case where a user
> ssh-launches an application since the system cannot preclude it (unless
> setup to do so), otherwise it would be a too-obvious way of bypassing
> charges. So how do you handle that scenario?

The "external" job step tracks these in Slurm, assuming you've setup
pam_slurm_adopt on the compute node correctly.
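
For anyone following along, that usually amounts to a line in the sshd
PAM stack on the compute nodes, roughly like the fragment below - the
exact stack layout and ordering varies by distribution, so treat this
as a sketch rather than a recipe:

# /etc/pam.d/sshd on the compute node (placement is illustrative)
account    required     pam_slurm_adopt.so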

- Tim

Ralph Castain

Mar 28, 2022, 2:43:34 PM
to 'Thomas Naughton' via pmix


> On Mar 28, 2022, at 11:34 AM, Tim Wickberg <t...@schedmd.com> wrote:
>
> I know these are just meant as examples, but I still feel compelled to correct them:
>
>> Examples from the last few years:
>> * the "--cpu-bind" option changed its spelling
>
> It did change the preferred spelling, but we still support --cpu_bind as well, and will do so into the indefinite future.

Ummm...we had consistent reports of error messages about "cpu_bind" being an unknown option and aborting the launch, and we verified that ourselves. I don't know if you modified this on a subsequent release or not, so perhaps it is no longer an issue. Might just be another example of us chasing the ball.

>
>> * the meaning of the SLURM_CPU_BIND option changed to set verbosity instead of the actual binding policy - which now is done with SLURM_CPU_BIND_TYPE
>
> There was a change to the _output_ environment variables at some point, but the input variable remains unchanged. Looking at the prrte code, setting this variable explicitly in launch_daemons() in
> src/mca/plm/slurm/plm_slurm_module.c isn't accomplishing anything.
>
> Looking at e0810859d7b1, it's unclear why that change was made,

Well, working with the latest release, we (a) get verbose warnings about cpu binding and (b) wind up with the daemons being bound if we don't set the binding to "none" using the "bind_type" envar. Sadly, I cannot tell which version to use, so we either have binding issues with earlier Slurm versions, or we wind up with verbose messages due to setting both envars. If you have a backwards compatible solution, we'd welcome hearing of it :-)

> and it doesn't seem related to the rest of the changes pushed there.

Sometimes I fix more than one thing at a time as I fight thru a user-reported problem to find the root cause. In this case, I started seeing the warnings and problems while trying to fix a problem a user had reported in a Slurm environment.

>
>> * add cmd line option to force addition of all cpus on the node to the PRRTE daemon
>
> That's fair. That was a tough set of calls to make to correct some unfortunate design decisions, and likely the source of most of the frustration we've been fielding from PMIx and OpenMPI recently.
>> That said, it is true that using "srun" means that Slurm does know about the daemons, and therefore it has the ability to at least report an aggregate number for resource utilization. It is a valid point. However, there must be some accounting method to handle the case where a user ssh-launches an application since the system cannot preclude it (unless setup to do so), otherwise it would be a too-obvious way of bypassing charges. So how do you handle that scenario?
>
> The "external" job step tracks these in Slurm, assuming you've setup pam_slurm_adopt on the compute node correctly.

I assume people generally would do that, or else it would leave a rather gaping hole. Still, I'm okay with leaving the native launcher integration - just hoping we can find some better way of supporting it.

>
> - Tim
>

Josh Hursey

Mar 30, 2022, 12:30:20 PM
to pm...@googlegroups.com
Re: LSF testing.

We (IBM) do some LSF testing internally with PRRTE on a relatively regular basis, but this is not exposed to the community, which I can see might be a problem when validating changes. I'm working with our team to get an LSF environment set up that we can use to support either the current IBM CI runners or a new runner dedicated just to LSF.


Ralph Castain

Apr 6, 2022, 7:53:57 PM
to 'Thomas Naughton' via pmix
Thanks for the update! I think the solution to this probably lies in a combination of some policy as well as some automated testing. Perhaps we could adopt a policy of removing support for PLMs in release branches if we don't have some indication of viability? Wouldn't require a lot - a simple verification statement from the related vendor, or some testing/CI reports would do - we'd also need some idea of what environment versions this covered so we could either protect against it or at least include that in our "README".

Here's the status so far as I know at this time:

* LSF - Josh reports it is regularly tested and working.
* Torque/PBS - last indication I had is that there _might_ be a problem of some type, but nothing concrete. I am glad to hear that Mike is continuing to work on the updated version!
* Slurm - I haven't heard any complaints, and we do have the recent contribution from Tim, so this is likely working. Would be good to get some confirmation prior to release.
* Gridengine - no idea if it works or not, haven't heard anything in a very long time and no tests are available
* ALPS - the CI died a while ago due to a Jenkins problem. Otherwise, no confirmation of its status is available
* ssh - we get CI-based confirmation of this on a continuous basis, though we don't necessarily verify scaling behavior

If we don't get verification, we have a couple of options. Obviously, we could just remove it and fall back to ssh in those environments. Another option would be to retain the support, but disable it unless specifically requested (either at configure or runtime). The "disable" option would allow people to experiment with it and use it should it work for them, so that's a plus. Negative is that they will file bug reports and "demand" we fix it if it doesn't work (based on painful years of experience), so we are right back in the "unsupported" hole again. I'm not sure how viable that is.

I think we are some ways off from a release as we have a number of problems that need to be addressed, so this doesn't have to be resolved right away. Still, if folks can begin thinking about how to do the required verification and what the policy should look like, it would be much appreciated.

Ralph

