SSH connection sharing


Jeremy Cohen

Mar 4, 2016, 5:07:01 AM
to saga-users
I'm submitting jobs to a remote cluster via SAGA-Python but also want to undertake some file operations while the jobs are running. These operations take place fairly frequently while jobs are running and I'd like to keep the number of SSH connections to a minimum and also avoid constantly creating and closing SSH connections for each file operation.

I know that there have been various previous discussions on sharing SSH connections but I wondered if someone could provide some more detailed information about how SAGA-Python actually works with SSH connections? I saw information in an earlier thread (https://groups.google.com/d/msg/saga-users/_Ldjbo6dElg/xPguk-f4dhsJ) stating that using a single Service instance should result in connections being re-used but new channels are created for data transfers. However, I couldn't find any more detailed description on how SAGA-Python handles the multiple connections that it creates. 

For example, I create a new SSH context and associated Session:

import saga
ctx = saga.Context("ssh")
ctx.user_id = 'myuser'
ctx.user_key = '/path/to/my/key'
s = saga.Session(default=False)
s.add_context(ctx)

Now I create a service instance pointing to a remote server:

svc = saga.job.Service('ssh://myserver.remote/',session=s)

I see three SSH connections created to the remote node. If I then create a Directory instance:

dir1 = saga.filesystem.Directory('sftp://myserver.remote/tmp/', session=s)

...I see a fourth SSH connection created…

On creating a further two directory instances, similar to the above, no further SSH connections are initiated. When I then call close() on each of the directory instances, the fourth SSH connection seems to remain.

When I created a saga.job.Service instance pointing to ssh://localhost/, I see four SSH connections initiated and an SFTP connection too.

Any more detailed explanation of the way the multiple connections are used and how one might go about making most efficient use of them would be great.

Many thanks,

Jeremy

Andre Merzky

Mar 11, 2016, 8:22:34 AM
to saga-...@googlegroups.com
Hi Jeremy,

Well, I should use the occasion to answer this one as well. The
reason why I stalled on that answer, though, is that the whole
connection caching is somewhat of a hack, and should be considered
black magic. We are not proud of it. It is a corollary of the
saga.utils.pty layer, which I am sure will haunt me further in the
future, and which is in desperate need of a conceptual overhaul -- if
only we could convince ourselves to invest the time to do so...

With this disclaimer (which I can't make any stronger I guess), see below.


On Fri, Mar 4, 2016 at 11:07 AM, Jeremy Cohen
<jeremy...@imperial.ac.uk> wrote:
> I'm submitting jobs to a remote cluster via SAGA-Python but also want to
> undertake some file operations while the jobs are running. These operations
> take place fairly frequently while jobs are running and I'd like to keep the
> number of SSH connections to a minimum and also avoid constantly creating
> and closing SSH connections for each file operation.

Full ack on the use case.

> I know that there have been various previous discussions on sharing SSH
> connections but I wondered if someone could provide some more detailed
> information about how SAGA-Python actually works with SSH connections? I saw
> information in an earlier thread
> (https://groups.google.com/d/msg/saga-users/_Ldjbo6dElg/xPguk-f4dhsJ)
> stating that using a single Service instance should result in connections
> being re-used but new channels are created for data transfers. However, I
> couldn't find any more detailed description on how SAGA-Python handles the
> multiple connections that it creates.

That information is somewhat outdated, I'm afraid...

>
> For example, I create a new SSH context and associated Session:
>
> import saga
> ctx = saga.Context("ssh")
> ctx.user_id = 'myuser'
> ctx.user_key = '/path/to/my/key'
> s = saga.Session(default=False)
> s.add_context(ctx)
>
> Now I create a service instance pointing to a remote server:
>
> svc = saga.job.Service('ssh://myserver.remote/',session=s)
>
> I see three SSH connections created to the remote node. If I then create a
> Directory instance:
>
> dir1 = saga.filesystem.Directory('sftp://myserver.remote/tmp/', session=s)
>
> ...I see a fourth SSH connection created…
>
> On creating a further two directory instances, similar to the above, no
> further SSH connections are initiated. When I then call close() on each of
> the directory instances, the fourth SSH connection seems to remain.

We basically manage a pool of connections, via a radical.utils.lease_manager:

https://github.com/radical-cybertools/saga-python/blob/devel/src/saga/session.py#L121
https://github.com/radical-cybertools/radical.utils/blob/devel/src/radical/utils/lease_manager.py

On the adaptor layer, when we need a new shell connection, we ask the
lease manager for one. Example:
https://github.com/radical-cybertools/saga-python/blob/devel/src/saga/adaptors/shell/shell_file.py#L249

The lease manager (LM) will check if a shell for that target host
exists and is free to use (i.e. is not used by any other adaptor). If
that is the case, the shell is locked for use by the adaptor until
the lease is returned. If no shell exists, or none is free, AND if the
max pool size is not reached, the LM will instantiate a new connection
on the fly, add it to the pool, and hand out a lease.
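
To make that concrete, here is a rough sketch of the leasing pattern
described above. This is illustrative only: the class and method names
below are made up for clarity and do not match the actual
radical.utils lease_manager API.

import threading

class ShellLeasePool(object):
    # Minimal connection pool with leases, in the spirit of what the
    # LM does for pty shell connections.

    def __init__(self, max_pool_size, connect):
        self._max     = max_pool_size
        self._connect = connect        # callable that opens a new shell
        self._free    = []             # idle shells, available for lease
        self._leased  = set()          # shells currently in use
        self._cond    = threading.Condition()

    def lease(self):
        with self._cond:
            while True:
                if self._free:                       # reuse an idle shell
                    shell = self._free.pop()
                elif len(self._leased) < self._max:  # pool not full: open new
                    shell = self._connect()
                else:
                    self._cond.wait()                # block until a lease returns
                    continue
                self._leased.add(shell)
                return shell

    def release(self, shell):
        with self._cond:
            self._leased.discard(shell)
            self._free.append(shell)
            self._cond.notify()

An adaptor-side operation would then roughly do shell = pool.lease(),
use the shell, and call pool.release(shell) in a finally clause --
which is why short and finite lease times matter.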

There is always a master channel alive per resource (although that
is, in the current configuration, not strictly necessary), so that
adds one connection to the total number of channels. That master is
not in the pool and cannot be leased.

Now, that mechanism is not used in all places. Specifically, we
skipped places where we (lazily) assumed that the lease would be held
for a long time, or used just once, etc., and it would not be
worthwhile to use the LM. In order to reduce the number of created
channels for your case, we would need to check where we are not using
the LM yet, and change that. That often implies some (small)
structural code changes,
to make sure that:
- the time of channel lease is short (and finite!)
- there is no assumption on the state of the channel

As to the latter point: the channel represents a remote shell, which
has a PWD, env settings, etc. While much of that shell is abstracted
away by the PTY layer, not everything is. Specifically, the instance
using a leased shell needs to make sure that the PWD points to the
expected location.

Hmm, I hope the above clarifies somewhat what is going on under the
hood. Specifically, it should explain the behavior of the Directory
instances: they create new channels in the pool, which are not freed
after close(), but remain available for reuse by other instances.

So, let's see what you make of it :)

Best, Andre.



> When I created a saga.job.Service instance pointing to ssh://localhost/, I
> see four SSH connections initiated and an SFTP connection too.
>
> Any more detailed explanation of the way the multiple connections are used
> and how one might go about making most efficient use of them would be great.
>
> Many thanks,
>
> Jeremy
>



--
99 little bugs in the code.
99 little bugs in the code.
Take one down, patch it around.

127 little bugs in the code...

Jeremy Cohen

Mar 14, 2016, 6:08:13 PM
to saga-users
Many thanks for the explanation, Andre.

This information makes sense and I think I now have a better understanding of some issues I've been having with SSH connections blocking or failing to be created. I have a service (running under the gevent WSGI server) that is making use of various SAGA file/directory operations and running jobs. I had wondered whether there was perhaps an issue with using the library from a gevent environment but I think the issues are actually related to how I'm managing the SAGA sessions and SSH connections.

Initially I was finding that after sequentially creating a number of saga.filesystem.File or saga.filesystem.Directory objects (approximately 10, passing the same Session object to each), I would be unable to create any further File or Directory instances; a request to create one would just hang indefinitely. This was my mistake: I was not closing the File or Directory objects after use, and rectifying this resolved the problem.
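
For reference, the pattern I've switched to is roughly the following (the path and the copy call are placeholders, re-using the session 's' from my earlier example):

import saga

d = saga.filesystem.Directory('sftp://myserver.remote/tmp/', session=s)
try:
    d.copy('data.txt', 'file://localhost/tmp/')   # some file operation
finally:
    d.close()                                     # release the channel for reuse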

However, I also observed that if I create a separate session object for each transfer, after creating around 9 or 10 session/file objects, further connections fail (I get a "Could not detect shell prompt (timeout)" error).

In a more complete example, a pipeline of a file transfer, a job run and then another file transfer, things work as expected if I implement my own session and service cache that stores a session and service object when it is first created and then re-uses the same session or service objects (as required) across multiple runs of the pipeline. However, if I create new session and service objects for each step in the pipeline, the process eventually blocks or gives a "Could not detect shell prompt (timeout)" error.

I'm assuming these issues are a result of me creating and passing in my own session object(s). I understand that when not specifying a session object as a parameter to a service, file or directory object, saga-python is using a default session. Presumably this is being cached internally and re-used by SSH connections?

Based on the details of how the lease manager operates, I wasn't clear whether it would take care of managing the connections regardless of whether I'm providing a Session object or using the default session. I have ended up creating a simple cache that stores the session/service objects I create, keyed by host name. If I get a request to undertake an operation on a new host, I create and store a session object for that host; otherwise the session is retrieved from the cache. Is this effectively what the lease manager is handling internally for default sessions?
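
The cache itself is roughly the following (simplified; the user and key handling are placeholders for what my service actually does):

import saga

_sessions = {}   # host name -> saga.Session
_services = {}   # host name -> saga.job.Service

def get_session(host, user, key):
    # One Session per host, so that its connection pool is shared by
    # all file/job operations against that host.
    if host not in _sessions:
        ctx = saga.Context('ssh')
        ctx.user_id  = user
        ctx.user_key = key
        s = saga.Session(default=False)
        s.add_context(ctx)
        _sessions[host] = s
    return _sessions[host]

def get_service(host, user, key):
    if host not in _services:
        _services[host] = saga.job.Service('ssh://%s/' % host,
                                           session=get_session(host, user, key))
    return _services[host]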

(I'm using my own session objects to avoid the rejected logins that occur when a user has many keys present.)

I wonder if you could provide any general guidelines on best practices for working with session and service objects to ensure that connections are utilised correctly/efficiently?

Many thanks,

Jeremy

Andre Merzky

Mar 14, 2016, 7:19:22 PM
to saga-...@googlegroups.com
Hi Jeremy,

> I'm assuming these issues are a result of me creating and passing in my own
> session object(s). I understand that when not specifying a session object as
> a parameter to a service, file or directory object, saga-python is using a
> default session. Presumably this is being cached internally and re-used by
> SSH connections?

Yes, indeed: the default SAGA session is a singleton and re-used.
Also, the lease manager is actually tied to the session lifetime,
so what you observe makes sense: creating new sessions will create new
pools of ssh leases, which will eventually exhaust the number of
channels your system will allow.

> (I'm using my own session objects to avoid rejected logins caused when a
> user has many keys present.)
>
> I wonder if you could provide any general guidelines on best practices for
> working with session and service objects to ensure that connections are
> utilised correctly/efficiently?

I can see the issue of context handling -- I think this actually
points to some flaws in how the API is designed: we expected people to
use a very small set of credentials which are easy to distinguish per
remote operation. That is, however, usually not the case, and SAGA
does not handle many contexts very well (and we also cannot really
think of an effective algorithm which does automatic context
selection).

The solution we ourselves are using is to configure the ssh access
outside of SAGA, via ~/.ssh/config, where we ask our users to specify
the respective keys to be used:

host *.futuregrid.org
    pubkeyauthentication = yes
    identityfile = ~/.ssh/id_rsa_fg
    user = merzky

host *.xsede.org
    identityfile = ~/.ssh/id_rsa
    user = tg803521

host 144.76.72.175 radical
    identityfile = ~/.ssh/id_rsa_tb
    hostname = 144.76.72.175
    user = ubuntu

Once this is set up, we can use (and re-use) a single session with no
explicit contexts attached, and the key management is completely left
to the ssh layer. Another option we sometimes employ is the use of an
ssh agent, which also externalizes the key management (for some use
cases at least).
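
In other words, with the keys handled by ~/.ssh/config, the
application side can shrink to something like the sketch below
(hostnames are placeholders; no Context or Session is created, so the
default session singleton and its connection pool are shared by all
objects):

import saga

# ssh picks the right key via ~/.ssh/config; the default session is
# used implicitly and re-used across all SAGA objects.
svc = saga.job.Service('ssh://myserver.remote/')
d   = saga.filesystem.Directory('sftp://myserver.remote/tmp/')
try:
    # submit jobs via svc, move files via d ...
    pass
finally:
    d.close()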

One way to get things working with individual sessions, at least to
some extent, would be to allow explicit session destruction, which
should close the channels in the session's lease manager, and thus
free the resources for the next session. That should be possible to
implement, and would not violate the API spec. But that does not help
for those cases where you want to use concurrent sessions for the
reasons you described.

Another option would be to iterate the API in that context (pun!). As
usual we would be hesitant to do so -- but then again, maintaining a
broken state is also not helpful :P

Either way, please let us know if using an ssh config would be
acceptable in your specific case, before we go down the rabbit
hole...

Thanks, Andre.

Jeremy Cohen

Mar 15, 2016, 6:21:18 AM
to saga-...@googlegroups.com
Hi Andre,

>> I'm assuming these issues are a result of me creating and passing in my own
>> session object(s). I understand that when not specifying a session object as
>> a parameter to a service, file or directory object, saga-python is using a
>> default session. Presumably this is being cached internally and re-used by
>> SSH connections?
>
> Yes, indeed: the default SAGA session is a singleton and re-used.
> Also, the lease manager is actually tight to the session lifetime,
> so what you observe makes sense: creating new sessions will create new
> pools of ssh leases, which will eventually exhaust the number of
> channels your system will allow.

Thanks for clarifying this.

I can possibly make use of the SSH config file; I've also experimented with using per-user SSH agents, which might be an option. It sounds like providing session destruction may be complex so, for now, I'll see how things go based on the information you've provided here.

Thanks,
Jeremy