Missed on-disconnect callbacks

Jason Beyers

Mar 11, 2014, 2:40:10 PM

to sna...@googlegroups.com

Hi,

I've been building a test automation framework on top of snakemq, and have been using it for production testing (exciting stuff). It has worked very well, but I'm encountering an issue with on-disconnect callbacks. If I leave a connection open for a long time (at least an hour), neither the server (listener) nor the client runs its registered on-disconnect callback when one side exits. Disconnects are recognized just fine if I don't let much time pass between the initial connection or last client loop and the exit.

Perhaps the issue lies with how I'm running my client:

My server (listener) continuously runs the loop method of its link object, and does this forever. Meanwhile, my client runs its loop method initially to establish a connection, and then holds off running the loop until much later (multiple days). After this goes on for a while, the server seems to lose its ability to catch client disconnects when I forcibly stop the client myself. In any case, my client does run the cleanup() method on its link object when it exits.

Is my connection timing out, because I'm not running the link loop method often enough on the client?

I see three different parameters that I probably need to use & understand:

snakemq.messaging.Messaging.keepalive_interval

snakemq.messaging.Messaging.keepalive_wait

snakemq.link.Link.add_connector((host, port), reconnect_interval)

I've tried a few different combinations of the above three parameters and I'm still experiencing the issue. Perhaps there is a lot of room for optimization here? I'm wondering if there is something else I should try before I sink a lot more time into tweaking those parameters.

Thanks,

-Jason

David Široký

Mar 11, 2014, 4:07:09 PM

to sna...@googlegroups.com

Hi Jason!

The keepalive feature was created for situations where the operating system on the opposite host is unable to send a proper TCP shutdown sequence, e.g. somebody breaks the ethernet cable or the host is suddenly turned off. At least MS Windows has a very large time window for its own TCP keepalive ping/pong (in the range of hours). I needed an almost immediate response to such situations, and the keepalive in snakemq solves that.

Set e.g. keepalive_interval=3 and keepalive_wait=1 and it will send a ping message every 3 seconds. Then it waits at most 1 second for the response. If the wait times out, the connection is declared closed.

If you run the client loop only at very long intervals, the keepalive will not work because the client cannot respond to pings in time.
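The ping/pong timing can be modeled in a few lines of plain Python. This is only a rough sketch, not snakemq's actual implementation: the function names `connection_alive` and `first_timeout` are hypothetical, and the model assumes zero network latency and a peer that answers a pending ping the moment its loop next runs.

```python
def connection_alive(ping_sent_at, pong_received_at, keepalive_wait):
    """Return True if the pong arrived within keepalive_wait seconds of the ping."""
    if pong_received_at is None:
        # No pong at all counts as a timeout.
        return False
    return (pong_received_at - ping_sent_at) <= keepalive_wait


def first_timeout(peer_loop_times, keepalive_interval, keepalive_wait, horizon):
    """Simulate pings sent every keepalive_interval seconds up to `horizon`.

    peer_loop_times: sorted times (seconds) at which the peer runs its loop
    and is therefore able to answer a pending ping (modeling assumption).
    Returns the time at which the connection would be declared closed,
    or None if every ping is answered in time.
    """
    ping_time = keepalive_interval
    while ping_time <= horizon:
        # The peer can only reply at its first loop run at or after the ping.
        reply_time = next((t for t in peer_loop_times if t >= ping_time), None)
        if not connection_alive(ping_time, reply_time, keepalive_wait):
            return ping_time + keepalive_wait
        ping_time += keepalive_interval
    return None


# A peer that loops every half second versus one that loops once an hour.
responsive_peer = [i * 0.5 for i in range(41)]  # 0.0, 0.5, ..., 20.0
idle_peer = [0, 3600]

result_responsive = first_timeout(responsive_peer, 3, 1, 20)
result_idle = first_timeout(idle_peer, 3, 1, 20)
```

With keepalive_interval=3 and keepalive_wait=1, the responsive peer never times out in this model, while the idle peer is declared closed at t=4: the first ping goes out at 3 seconds and the 1 second wait expires long before the peer's next loop run.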

What operating systems do you have on both sides? I'll try to reproduce it.

David

Jason Beyers

Mar 13, 2014, 4:29:20 PM

to sna...@googlegroups.com

Hey David,

Thanks for the response! I've been running both sides on Linux, Ubuntu 12.04 LTS 64-bit. It's also worth noting that I'm seeing the issue on Amazon EC2, where intermittent connection problems are the norm. I've noticed two types of intermittent disconnects in this environment: those that both sides see, and those that only the client notices (the kind we're discussing here).

The keepalive_interval/keepalive_wait settings make good sense to me now, but could you explain the reconnect_interval parameter of add_connector? Does this parameter cause the client to attempt a reconnect every X seconds once the connection has closed (either normally or because of a timed-out ping/pong), assuming the client link's loop method is being called regularly?

Thanks,

-Jason

David Široký

Mar 14, 2014, 7:23:13 AM

to sna...@googlegroups.com

I left a few connections open for 2 hours and then killed the processes on one side. The OS sent the correct TCP FIN/ACK sequence and the opposite side registered the connection closing. My only guess is that EC2 is unfriendly to idle, long-lived TCP connections. Keepalive should help.

I suggest using Wireshark to capture the raw TCP traffic, especially around the time the connection is expected to be severed. Then it might be clearer what is happening.

reconnect_interval works exactly as you wrote.
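As a rough illustration of that behavior, the sketch below models when reconnect attempts would happen. It is plain Python, not snakemq code: the function `reconnect_attempts` is hypothetical, every attempt is assumed to fail (the peer stays down), and counting the interval from the previous attempt is a modeling assumption rather than something confirmed in this thread.

```python
def reconnect_attempts(disconnected_at, reconnect_interval, loop_times):
    """Return the times at which a reconnect would be attempted.

    An attempt becomes due reconnect_interval seconds after the disconnect
    (or after the previous failed attempt), but it can only be carried out
    when the link loop is actually running.
    """
    attempts = []
    next_due = disconnected_at + reconnect_interval
    for now in loop_times:
        if now >= next_due:
            attempts.append(now)
            # Assumption: the attempt fails, so schedule the next one.
            next_due = now + reconnect_interval
    return attempts


# Loop called once per second versus a loop called only at t=0 and t=100.
regular = reconnect_attempts(10, 5, range(0, 31))
sparse = reconnect_attempts(10, 5, [0, 100])
```

With the loop running every second, a disconnect at t=10 and reconnect_interval=5 give attempts at 15, 20, 25 and 30. With the sparse loop, the only attempt happens at t=100, which is why the loop has to be called regularly for the interval to mean anything.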

David
