Dead connection detection issue?

115 zobrazení
Preskočiť na prvú neprečítanú správu

Zaar Hai

neprečítané,
4. 1. 2019, 8:19:184. 1. 2019
komu: Autobahn
Good day and happy new year to everyone.

I have a very simple WAMP program with on_join doing the following:

    @component.on_join
    async def on_join(session, details):
        logger.info("Entering on_join")
        options = PublishOptions(exclude_me=False, acknowledge=True)
        logger.info("Subscribed")
        await session.subscribe(printer, "news")
        while True:
            logger.info("Sleeping")
            await asyncio.sleep(60)
            logger.info("Publishing")
            await session.publish("news", "Fake news", options=options)


My crossbar runs behind load balancer (GCP) that destroys WebSocket connections after 30 seconds, so I would expect Autobahn to detect connection termination and reconnect - actually, as per my LB configuration the program should never reach the last line.
This is what happens, but only once. Is this a bug here? I'm using autobahn 18.12.1 and crossbar docker pypy3-18.9.2.

Here is the detailed flow:

On the first minute I get:

2019-01-05T00:00:44.912: autobahn.asyncio.component.Component INFO  ## connecting once using transport type "websocket" over endpoint "tcp" 
2019-01-05T00:00:46.075: main            INFO  ## Entering on_join       
2019-01-05T00:00:46.076: main            INFO  ## Subscribed      
2019-01-05T00:00:46.321: main            INFO  ## Sleeping      
2019-01-05T00:01:15.640: autobahn.asyncio.component.Component INFO  ## session leaving 'wamp.close.transport_lost'  
2019-01-05T00:01:15.642: autobahn.asyncio.component.Component ERROR ## Connection failed: TransportLost: failed to complete connection  
2019-01-05T00:01:17.806: autobahn.asyncio.component.Component INFO  ## connecting once using transport type "websocket" over endpoint "tcp" 
2019-01-05T00:01:19.173: main            INFO  ## Entering on_join     
2019-01-05T00:01:19.174: main            INFO  ## Subscribed     
2019-01-05T00:01:19.393: main            INFO  ## Sleeping     

Note the expected 30 second between "Sleeping" and session leaving. And then connection is re-established and my on_join() is re-entered as the last log line shows - all good for the first time, however...

However then the program misbehaves - the asyncio.sleep(60) line is somehow  interrupted after ~20-30 seconds and program goes on to the next line to publish message where it hits autobahn.wamp.exception.TransportLost:

2019-01-05T00:01:19.173: main            INFO  ## Entering on_join         
2019-01-05T00:01:19.174: main            INFO  ## Subscribed         
2019-01-05T00:01:19.393: main            INFO  ## Sleeping         
2019-01-05T00:01:46.329: main            INFO  ## Publishing                 # Why does it break from sleep? And why after just 27 seconds?
2019-01-05T00:01:46.331: autobahn.wamp.protocol.BaseSession ERROR ## While notifying 'join': Traceback (most recent call last):
  File "/home/.../lib/python3.7/site-packages/txaio/aio.py", line 514, in done
    res = f.result()
  File "/tmp/repro.py", line 33, in on_join
    await session.publish("news", "Fake news", options=options)
  File "/home/.../lib/python3.7/site-packages/autobahn/wamp/protocol.py", line 1278, in publish
    raise exception.TransportLost()
autobahn.wamp.exception.TransportLost
2019-01-05T00:01:46.332: autobahn.wamp.protocol.BaseSession ERROR ## While firing onJoin: Traceback (most recent call last):
  File "/home/.../lib/python3.7/site-packages/txaio/aio.py", line 514, in done
    res = f.result()
  File "/home/.../lib/python3.7/site-packages/txaio/aio.py", line 514, in done
    res = f.result()
  File "/home/.../lib/python3.7/site-packages/txaio/aio.py", line 514, in done
    res = f.result()
  File "/tmp/repro.py", line 33, in on_join
    await session.publish("news", "Fake news", options=options)
  File "/home/.../lib/python3.7/site-packages/autobahn/wamp/protocol.py", line 1278, in publish
    raise exception.TransportLost()
autobahn.wamp.exception.TransportLost
2019-01-05T00:01:46.333: autobahn.wamp.protocol.BaseSession ERROR ## While notifying 'ready': Traceback (most recent call last):
  File "/home/.../lib/python3.7/site-packages/txaio/aio.py", line 514, in done
    res = f.result()
  File "/home/.../lib/python3.7/site-packages/txaio/aio.py", line 514, in done
    res = f.result()
  File "/home/.../lib/python3.7/site-packages/txaio/aio.py", line 514, in done
    res = f.result()
  [Previous line repeated 2 more times]
  File "/tmp/repro.py", line 33, in on_join
    await session.publish("news", "Fake news", options=options)
  File "/home/.../lib/python3.7/site-packages/autobahn/wamp/protocol.py", line 1278, in publish
    raise exception.TransportLost()
autobahn.wamp.exception.TransportLost
2019-01-05T00:01:48.665: autobahn.asyncio.component.Component INFO  ## session leaving 'wamp.close.transport_lost'  
2019-01-05T00:01:48.666: autobahn.asyncio.component.Component ERROR ## Connection failed: TransportLost: failed to complete connection  
2019-01-05T00:01:51.736: autobahn.asyncio.component.Component INFO  ## connecting once using transport type "websocket" over endpoint "tcp" 
2019-01-05T00:01:52.877: main            INFO  ## Entering on_join        5
2019-01-05T00:01:52.878: main            INFO  ## Subscribed        
2019-01-05T00:01:53.094: main            INFO  ## Sleeping     
2019-01-05T00:02:19.402: main            INFO  ## Publishing        # Another break from sleep. Same tracebacks.

And this how it continues to run. Sometimes the errors order is a bit different:

2019-01-05T00:07:15.221: main            INFO  ## Entering on_join    
2019-01-05T00:07:15.222: main            INFO  ## Subscribed  
2019-01-05T00:07:15.528: main            INFO  ## Sleeping  
2019-01-05T00:07:44.712: autobahn.asyncio.component.Component INFO  ## session leaving 'wamp.close.transport_lost' 
2019-01-05T00:07:44.714: autobahn.asyncio.component.Component ERROR ## Connection failed: TransportLost: failed to complete connection 
2019-01-05T00:08:15.561: main            INFO  ## Publishing         
2019-01-05T00:08:15.563: autobahn.wamp.protocol.BaseSession ERROR ## While notifying 'join': Traceback (most recent call last):
... (Same tracebacks as above)

tobias.o...@gmail.com

neprečítané,
11. 1. 2019, 3:36:2611. 1. 2019
komu: Autobahn
yeah, network intermediaries (like load balancers, NATs, reverse proxies, ..) will often time out idle websocket connections. idle = not sending/receiving anything (not a single octet) for some amount of time.

try adding these settings to your crossbar node: https://crossbar.io/docs/WebSocket-Options/#production-settings

(the websocket "auto_ping_*" parameters)

that config will make crossbar do regular heartbeating (every 10s), and that should prevent the timeout of the connection

---

note: more details come into consideration when on mobile networks. eg above settings will prevent timeout there too, but the mobile endpoint might go into sleep states still, which will introduce a delay eg receiving pubsub events. if you want to keep the mobile endpoint in "snappy state", you might need to increase the
auto_ping_size

in my experiments, I need a couple of kb/s to keep the endpoint in snappy state. which of course has the tradeoff of energy consumption.

anyways: this is now really advanced stuff .. if you care about this (too), check out:

* https://lists.w3.org/Archives/Public/ietf-http-wg/2012JanMar/1083.html

also: crossbar has a quite advanced websocket ping/pong mechanism


reason: it is non-trivial to do that on 100k's of connections. and mozilla being a user of autobahn websocket for notifications, we had to solve that;) they run north 100m connections. anyways, this is advanced stuff, hope it doesn't confuse too much

---

note: this is server (crossbar) initiated ping/pong, should be enough for your case. however, for quick detection of loss of connection (TCP over WAN), more is needed: client initiated ping pong as well. we have that too (in autobahn python), but the status of the feature depends:

autobahn python runs on any combination of: Twisted vs asyncio and "old API" (ApplicationRunner) vs "new API" (Component based)

App Runner on Twisted has client initiated ping/pong with currently hard-coded (but practical/production) parameter settings: https://github.com/crossbario/autobahn-python/blob/master/autobahn/twisted/wamp.py#L282

the other 3 variants: lagging behind (I think). I mostly deal with the twisted/apprunner still these days (as that is what crossbar uses internally .. historically that variant was the first)

I figure you want asyncio + component: which is definitely "the way to go" forward for a WAMP only Py3 app component (no need for Twisted when you dont need all the bells and whistles that Twisted has vs asyncio)

Zaar Hai

neprečítané,
11. 1. 2019, 4:35:1711. 1. 2019
komu: Autobahn
Hi Tobias,

Thanks for all the details.

I already had production-settings in place, with ping pong enabled as you advise. The thing with GCP Load Balancer is that they kill websockets connections regularly regardless of activity (yes, sounds weird, but it is true). Every 30s by default - it's configurable but for now I prefer to keep it short to ease debugging.

The problem that I experience is a bit different (IMHO) - autobahn detects broken connection on the background, but only for the first time. Something breaks after reconnect - on 2nd/3rd/etc. iteration asyncio.sleep(60) just miraculously finishes early after just 20-30 seconds (the lines are highlighted in original post).
I happens very consistently - first one works fine, but other don't.
Do you see a problem here?

Again, I'll sum it up in human words:

What happens on first connection (good):
  1. on_join is paused on asyncio.sleep(60)
  2. network issue detected
  3. on_join gets rightfully cancelled
  4. network connection re-established
  5. on_join called to start over
After the above reconnect, the following happens (strange)
  1. on_join is paused on asyncio.sleep(60)
  2. asyncio.sleep() finishes prematurely 
  3. session.publish is called (next line) and raises since connection is dead
  4. connection is reestablished
  5. on_join is rerun
While it works, logs noise aside, I think there is a bug here with on_join cancellation. Without looking at the code, I can speculate that you cancel it properly first time, but not the 2nd/3rd/etc. time - somehow causing it to continue. Does it make sense to you?


Client side pings are surely interesting and looking at your link, I see that asyncio/wamp has similar hard-coded transport settings: https://github.com/crossbario/autobahn-python/blob/master/autobahn/asyncio/wamp.py#L223. Is there a quick way to check whether those are enabled indeed (on client)?

Tobias Oberstein

neprečítané,
11. 1. 2019, 5:03:2011. 1. 2019
komu: autob...@googlegroups.com, Zaar Hai

ok, just tested:

crossbar with ping/pong settings: https://gist.github.com/oberstet/da507a03c26743f7738ebeb59ece5f7d

autobahn (twisted/apprunner): https://gist.github.com/oberstet/396325ebfbd4e08d1542dd278c5fb679

I have stopped and restarted crossbar twice while leaving the client run, and it automatically reconnects correctly (and also starts the ping/png again)

in the client log, you'll notice the

2019-01-11T10:48:23+0100 WebSocketProtocol.onPing(payload=<4 bytes>)

lines. this is the client responding to crossbar ping requests

the lines

2019-01-11T10:48:24+0100 Auto ping/pong: sending ping auto-ping/pong
2019-01-11T10:48:24+0100 Expecting ping in 5.0 seconds for auto-ping/pong
2019-01-11T10:48:24+0100 Auto ping/pong: received pending pong for auto-ping/pong
2019-01-11T10:48:24+0100 WebSocketProtocol.onPong(payload=<4 bytes>)

are the client initiating the ping, and receiving the crossbar response.

---

above is using autobahn/twisted/apprunner: pip install autobahn[twisted]

(yeah, I know you want asyncio, but this is to track the issues to their root cause first - I positively know above works;)

---

could you maybe try that in your setup and dump logs here?

also: are you using TLS (secure websocket), and if so terminating TLS at GCP or at crossbar?

could you test the latter? wss and tls termination at crossbar?

reason: otherwise the ping/pong messages might not even be forwarded by GCP!! a websocket intermediary is allowed to do that (only for websocket ping/pong messages, not regular websocket messages). it can completely drop ping/pongs - or only forward some - or create its own

this could be one issue.

when you terminate TLS at crossbar, and consequently use GCP only as a layer 4 LB, GCP simply cannot look inside and should be sad (stop its weird interference)

if you terminate TLS at GCP, and use GCP at layer 7, the next thing I would try is doing a manual ping/pong using regular websocket messages. gcp is NOT allowed to drop those;) that would suck nevertheless, but we could think about what we can to mitigate.

in any case: if you don't have a hard reason for L7 LB, go for L4 - that is what we do and recommend usually (also for "security reasons": you are in control of your TLS keys/certs, and your traffic is kept confidental right into your VMs).

anyways: above tests will reveal more information .. pls let me know the results, and I can comment again


Am 11.01.19 um 10:35 schrieb Zaar Hai:
--
You received this message because you are subscribed to the Google Groups "Autobahn" group.
To unsubscribe from this group and stop receiving emails from it, send an email to autobahnws+...@googlegroups.com.
To post to this group, send email to autob...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/autobahnws/88db33a0-d465-45c4-bf6a-0138b1c09d30%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tobias Oberstein

neprečítané,
11. 1. 2019, 5:08:2611. 1. 2019
komu: autob...@googlegroups.com, Zaar Hai

one more hint: to enable an insane amount of logging on the crossbar side:

./crossbarfx edge start --loglevel trace

here you see the incoming client initiated pings, replies, and also the crossbar initiated pings, and receive replies:

2019-01-11T11:06:15+0100 [Router      19176 crossbar.router.service.RouterServiceAgent] RouterServiceAgent.publish("wamp.session.on_join") -> "wamp.session.on_join" on "realm1"
2019-01-11T11:06:18+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: sending ping auto-ping/pong
2019-01-11T11:06:18+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Expecting ping in 2.0 seconds for auto-ping/pong
2019-01-11T11:06:18+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: received pending pong for auto-ping/pong
2019-01-11T11:06:18+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] WebSocketProtocol.onPong(payload=<4 bytes>)
2019-01-11T11:06:21+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: sending ping auto-ping/pong
2019-01-11T11:06:21+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Expecting ping in 2.0 seconds for auto-ping/pong
2019-01-11T11:06:21+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: received pending pong for auto-ping/pong
2019-01-11T11:06:21+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] WebSocketProtocol.onPong(payload=<4 bytes>)
2019-01-11T11:06:24+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: sending ping auto-ping/pong
2019-01-11T11:06:24+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Expecting ping in 2.0 seconds for auto-ping/pong
2019-01-11T11:06:24+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: received pending pong for auto-ping/pong
2019-01-11T11:06:24+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] WebSocketProtocol.onPong(payload=<4 bytes>)
2019-01-11T11:06:25+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] WebSocketProtocol.onPing(payload=<4 bytes>)
2019-01-11T11:06:27+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: sending ping auto-ping/pong
2019-01-11T11:06:27+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Expecting ping in 2.0 seconds for auto-ping/pong
2019-01-11T11:06:27+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: received pending pong for auto-ping/pong
2019-01-11T11:06:27+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] WebSocketProtocol.onPong(payload=<4 bytes>)
2019-01-11T11:06:30+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: sending ping auto-ping/pong
2019-01-11T11:06:30+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Expecting ping in 2.0 seconds for auto-ping/pong
2019-01-11T11:06:30+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] Auto ping/pong: received pending pong for auto-ping/pong
2019-01-11T11:06:30+0100 [Router      19176 crossbar.router.protocol.WampWebSocketServerProtocol] WebSocketProtocol.onPong(payload=<4 bytes>)



Am 11.01.19 um 11:03 schrieb Tobias Oberstein:

Zaar Hai

neprečítané,
14. 1. 2019, 8:22:3014. 1. 2019
komu: Autobahn
Hi Tobias,

Sorry for the delay - our last discussion ran deep into the night here in Down Under.

I did the tests as you suggested (L7 LB). Outcomes:
  • Both asyncio+component and twisted+component behave (wrongly) the same - as I originally described
  • GCP LB forwards ping/pong correctly. I see it both on client and crossbar server
All the code and debug-level logs are here: https://gist.github.com/haizaar/05bf4fa1bc3948668fe0039e01fcd846

Here is relevant excerpt from twisted version:

2019-01-14T23:52:35+1100 openHandshakeTimeoutCall.cancel
2019-01-14T23:52:36+1100 Entering on_join
2019-01-14T23:52:36+1100 Subscribed
2019-01-14T23:52:36+1100 Sleeping              <- Didn't continue beyond sleeping - GOOD!
2019-01-14T23:52:45+1100 WebSocketProtocol.onPing(payload=<4 bytes>)
2019-01-14T23:52:55+1100 WebSocketProtocol.onPing(payload=<4 bytes>)
2019-01-14T23:53:05+1100 WebSocketProtocol.onPing(payload=<4 bytes>)
2019-01-14T23:53:05+1100 Connection to/from tcp4:ip.is.hidden:443 was lost in a non-clean fashion: Connection to the other side was lost in a non-clean fashion: Connection lost.
2019-01-14T23:53:05+1100 _connectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
2019-01-14T23:53:05+1100 WAMP-over-WebSocket transport lost: wasClean=False, code=1006, reason="connection was closed uncleanly (peer dropped the TCP connection without previous WebSocket closing handshake)"
2019-01-14T23:53:05+1100 session leaving 'wamp.close.transport_lost'
2019-01-14T23:53:05+1100 component failed: ApplicationError: ApplicationError(error=<wamp.close.transport_lost>, args=[], kwargs={}, enc_algo=None)
2019-01-14T23:53:05+1100 Traceback (most recent call last):
Failure: autobahn.wamp.exception.ApplicationError: ApplicationError(error=<wamp.close.transport_lost>, args=[], kwargs={}, enc_algo=None)

2019-01-14T23:53:05+1100 wamp.close.transport_lost: 
2019-01-14T23:53:05+1100 session on_disconnect: was_clean=False
2019-01-14T23:53:05+1100 Stopping factory <autobahn.twisted.websocket.WampWebSocketClientFactory object at 0x7f77224e09b0>
2019-01-14T23:53:05+1100 Entering re-connect loop
2019-01-14T23:53:05+1100 trying transport 0 using connect delay 2.3187428648299617
2019-01-14T23:53:07+1100 connecting once using transport type "websocket" over endpoint "tcp"
2019-01-14T23:53:07+1100 Starting factory <autobahn.twisted.websocket.WampWebSocketClientFactory object at 0x7f77224f20b8>
...
2019-01-14T23:53:08+1100 openHandshakeTimeoutCall.cancel
2019-01-14T23:53:08+1100 Entering on_join
2019-01-14T23:53:08+1100 Subscribed
2019-01-14T23:53:09+1100 Sleeping
2019-01-14T23:53:18+1100 WebSocketProtocol.onPing(payload=<4 bytes>)
2019-01-14T23:53:28+1100 WebSocketProtocol.onPing(payload=<4 bytes>)
2019-01-14T23:53:36+1100 Publishing            <- Continued beyond sleeping
2019-01-14T23:53:36+1100 While notifying 'join': Traceback (most recent call last):
  File "/home/.../lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)

Zaar Hai

neprečítané,
14. 1. 2019, 8:26:3814. 1. 2019
komu: Autobahn
P.S. Can't easily test with L4 LB unfortunately.
Today's K8s tooling made it all too convenient - K8s provisions GCP LB, then kube-lego takes care to provision LetsEncrypt certificates and updates the LB with them regularly. Couple of lines of k8s manifest and you have your HTTPS load balancer.
Btw, do you know of any attempts to integrate Crossbar with LetsEncrypt (e.g. through their certbot)?

Thank you,
Zaar

Zaar Hai

neprečítané,
14. 1. 2019, 18:38:5714. 1. 2019
komu: Autobahn


On Tuesday, 15 January 2019 00:22:30 UTC+11, Zaar Hai wrote:
Odpovedať všetkým
Odpovedať autorovi
Poslať ďalej
0 nových správ