globus-connect-server hangs

Simon Leary

Aug 9, 2022, 2:26:21 PM
to Discuss
Hello,
On a fresh install of Globus Connect Server, after configuring my endpoint and logging in, `globus-connect-server endpoint show` does nothing. This was a problem for me last night; when I came back today it worked. But then I had to change my setup and start over, and now it's doing the same thing again.

The only thing in globus-connect-server.log is a couple of errors from hours ago that I already resolved. If I run globus-connect-server with no args, it comes right back with the help screen.

Does anyone have experience with globus-connect-server just making a blinking cursor forever?

Simon

Jason Alt

Aug 9, 2022, 2:41:40 PM
to Simon Leary, Discuss
A couple of things you can try:

- Restarting the GCS Manager process and HTTPD. This is not normally an issue but sometimes on update they are unresponsive. Assuming you are on EL7:
   - sudo systemctl restart gcs_manager.service
   - sudo systemctl restart httpd.service (the name of the service varies by OS)

- It could be DNS or networking in which case you may be able to bypass the issue using `--use-explicit-host localhost`:
   - globus-connect-server --use-explicit-host localhost endpoint show

- This issue can also arise if you have other nodes defined for the endpoint that are not available. If that is the case, you can use the previous bullet (`--use-explicit-host`) to deactivate those nodes.

If you have any sort of persistent issue, feel free to open a support ticket at sup...@globus.org.

Jason

Stephen Rosen

Aug 9, 2022, 2:45:23 PM
to Simon Leary, Discuss
Hi Simon,

How long are the hangs? My guess would be that what's happening is that the command is failing to connect and then doing retries with exponential backoff.

I'm mostly wondering if it gives you an error after a period of several minutes, or if you might be giving up -- perhaps after a very long wait -- and killing the command with Ctrl+C or similar. If the failure itself takes a long time (e.g. a connect timeout of 60s), then that could compound with the retry logic to produce a very slow failure.

The backoff and sleep behavior is not the underlying issue, but it may be obscuring the "real" error. I'll look into the worst case scenario here a little to see what we'd expect to be the maximum time to failure and to confirm that the command has the retry behavior which I believe it does.

Best,
Stephen

Simon Leary

Aug 9, 2022, 2:50:27 PM
to Discuss, jaso...@globus.org, Discuss, Simon Leary
re: jasonalt
explicit host did the trick.
How can I isolate if it's a network issue or a node issue?
In `node list` I see one node and it's 'active'. But I can't ping the public IP associated with it, that could certainly be the problem.

Simon

Simon Leary

Aug 9, 2022, 2:51:02 PM
to Discuss, sir...@globus.org, Discuss, Simon Leary
re: sirosen:
I will be more patient and set a timer and get back to you.

Simon

Jason Alt

Aug 9, 2022, 3:01:14 PM
to Simon Leary, Discuss
The fact that you can't ping the public IP associated with the node is certainly an issue. Make sure that the public IP address listed is the IP address of the actual node. If it is wrong, you can update it with `globus-connect-server --use-explicit-host localhost node update --ip-address <real_ip_address>`. If the IP address is the real address of the node, then it is likely a local firewall issue. Additionally, if you are behind some type of corporate NAT firewall, things can get even more complex, and we can help describe the fix if that is the case.

Jason

Simon Leary

Aug 9, 2022, 3:06:17 PM
to Discuss, jaso...@globus.org, Discuss, Simon Leary
re: jasonalt
It's not the real IP. I have a public IP defined in the Palo Alto and a NAT rule to translate the destination to this private IP. This rule allows traffic from our untrusted public net as well as from inside, and it seems that foreign hosts and all hosts in my network can establish a `telnet globus-head 443` connection: every host except globus-head itself.

Simon

Jason Alt

Aug 9, 2022, 3:50:01 PM
to Simon Leary, Discuss
That sounds as though the Palo Alto's hairpin routing config needs some adjustment. It is interesting that it works for other nodes on the internal subnet, just not the local node. It would be a useful data point to see if other nodes on the private subnet also cannot contact themselves via their public IP. Unfortunately, resolving this in the Palo Alto is beyond my expertise; you'll need to talk to its administrator.
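
To collect that data point on each node, a small TCP probe can stand in for the `telnet globus-head 443` test. This is only a sketch: the function name `can_connect` and the 5-second default timeout are illustrative choices, not part of GCS.

```python
import socket


def can_connect(host: str, port: int = 443, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage, run on each node against its own public address:
#   print(can_connect("<public_ip_of_this_node>"))
```

A node that returns False for its own public IP while other hosts get True for the same address would be consistent with the hairpin explanation.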

Jason

Jason Alt

Aug 9, 2022, 4:24:02 PM
to Simon Leary, Discuss
A couple more details:

- Until the firewall is updated, you can set `GCS_CLI_EXPLICIT_HOST=localhost` in your shell environment to avoid always typing out `--use-explicit-host localhost`
- Transfers that use this endpoint as both the source and the destination will fail until this is resolved in the firewall config

Jason

Simon Leary

Aug 9, 2022, 5:22:15 PM
to Discuss, jaso...@globus.org, Discuss, Simon Leary
re: sirosen
root@globus-head2:/opt/globus# date; ./bin/globus-connect-server endpoint show; date
Tue 09 Aug 2022 08:55:34 PM UTC
GlobusConnectionTimeoutError: ConnectTimeoutError on request
Tue 09 Aug 2022 09:07:54 PM UTC
root@globus-head2:/opt/globus#
 

Simon

Stephen Rosen

Aug 9, 2022, 6:54:55 PM
to Simon Leary, Discuss, jaso...@globus.org
Thanks for letting that run and sharing the result!
Roughly 12 minutes (12m20s by your timestamps) seems long for this case if it's just a failure to connect and there are no other issues. I took a look at how many retries are enabled by default (in globus-sdk, which is used by GCS), and what kind of times to expect. The defaults should make 6 connection attempts (1 + 5 retries) with a 60s connection timeout.
We'd therefore expect a failure within 7 minutes or so, significantly shorter than your case (but still pretty slow).
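
As a rough check on that estimate, the arithmetic can be sketched as below. The 6 attempts and 60s timeout come from the numbers above; the sleep schedule (1s, doubling, capped at 10s) is a made-up placeholder for illustration and is not taken from the SDK.

```python
def worst_case_seconds(
    attempts: int = 6,
    connect_timeout_s: float = 60.0,
    base_sleep_s: float = 1.0,   # assumption: illustrative backoff start
    max_sleep_s: float = 10.0,   # assumption: illustrative backoff cap
) -> float:
    """Upper bound on time to failure if every attempt hits the connect timeout."""
    # Each attempt waits out the full connect timeout.
    waiting = attempts * connect_timeout_s
    # One sleep between consecutive attempts: doubling each time, capped.
    sleeping = sum(
        min(base_sleep_s * 2 ** i, max_sleep_s) for i in range(attempts - 1)
    )
    return waiting + sleeping


print(worst_case_seconds())                     # 385.0 seconds, about 6.4 minutes
print(worst_case_seconds(connect_timeout_s=5))  # 55.0 seconds
```

Under these assumptions the connect timeouts dominate (360 of 385 seconds), and the observed 740 seconds is nearly double the bound.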

One thing that should work for cutting down on one source of delays: you can set GLOBUS_SDK_HTTP_TIMEOUT=5 or a similar small value (values are in seconds) to reduce the wait time when attempting to connect. (Reference: https://globus-sdk-python.readthedocs.io/en/stable/config.html#environment-variables )

The time suggests that there's some slow step other than the failure to connect.
I'm curious if cranking the HTTP timeout variable down does or does not have an impact. I would not be completely surprised if reducing that value has no effect or only a limited effect, which would indicate that there's some other explanation for the delay.
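
One way to run that comparison is to time the same command under different values of the variable. The helper below is a sketch; `time_command` and its 30-minute safety limit are hypothetical names and values chosen for illustration.

```python
import os
import subprocess
import time


def time_command(cmd, env_overrides=None, limit_s=1800.0):
    """Run cmd with extra environment variables; return (elapsed_seconds, return_code)."""
    env = dict(os.environ)
    env.update(env_overrides or {})
    start = time.monotonic()
    # check=False: a failing command is an expected outcome here, not an error.
    proc = subprocess.run(cmd, env=env, timeout=limit_s, check=False)
    return time.monotonic() - start, proc.returncode


# Hypothetical usage on the GCS node:
#   for value in ("60", "5"):
#       elapsed, rc = time_command(
#           ["globus-connect-server", "endpoint", "show"],
#           {"GLOBUS_SDK_HTTP_TIMEOUT": value},
#       )
#       print(value, round(elapsed), rc)
```

If the elapsed time barely changes between the two runs, that would point to a slow step other than the connect timeout.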