I'm trying to interface with a legacy TCP server whose protocol has no "ack" or "ping" message. I have two goroutines, a reader and a writer. The writer waits on a []byte channel and writes each message to the TCP conn; the reader Read()s from the TCP conn and processes the received data.
It works great until the connection goes down (unplugged cable, rebooted router, etc.). I have enabled TCP keepalive (SO_KEEPALIVE), which does in fact detect the disconnect on the Read() side, although it takes a while. I can tune this using the tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes sysctl settings, or the equivalent SOL_TCP socket options.
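For reference, keepalive can also be enabled per connection from Go rather than through the system-wide sysctls. This is only a minimal sketch: enableKeepAlive and demoKeepAlive are names made up for illustration, the local listener is a stand-in for the legacy server, and how the period maps onto the individual idle/interval/probe settings depends on the platform and Go version.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// enableKeepAlive (hypothetical helper) turns on TCP keepalive for a single
// connection and sets the keepalive period, leaving sysctl defaults alone.
func enableKeepAlive(conn *net.TCPConn, period time.Duration) error {
	if err := conn.SetKeepAlive(true); err != nil {
		return err
	}
	return conn.SetKeepAlivePeriod(period)
}

// demoKeepAlive dials a throwaway local listener (a stand-in for the
// legacy server) and applies the keepalive settings to the connection.
func demoKeepAlive() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer c.Close()

	return enableKeepAlive(c.(*net.TCPConn), 10*time.Second)
}

func main() {
	fmt.Println("keepalive error:", demoKeepAlive())
}
```

This only shortens detection on the Read() side; it does nothing for a Write() that has already been accepted by the kernel.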
The client's Write() goroutine is the part I'm having trouble with. I found that once the connection goes down, a Write() of a []byte to the TCPConn does not return an error (presumably the kernel is buffering the packets). I tried calling SetWriteDeadline() with a 30s deadline, and that solved the issue on the Mac (although it takes 2x the interval to detect it). On Linux, however, I'm not detecting any errors on the Write() side, even with SetWriteDeadline().
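To make the writer side concrete, here is a minimal sketch of a write loop that arms a fresh deadline before every Write(). The names writeLoop, isTimeout and demoWriteTimeout are illustrative only, and net.Pipe is used as a stand-in for a stalled peer. Note that net.Pipe has no kernel send buffer, so the deadline fires in this demo; on a real socket the Write() can succeed into the send buffer, which is the Linux behaviour described above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// writeLoop drains out and writes each message to conn, arming a fresh
// write deadline before every Write. It returns the first write error,
// or nil once out is closed.
func writeLoop(conn net.Conn, out <-chan []byte, timeout time.Duration) error {
	for msg := range out {
		if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
			return err
		}
		if _, err := conn.Write(msg); err != nil {
			return err
		}
	}
	return nil
}

// isTimeout reports whether err is a network timeout.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// demoWriteTimeout uses net.Pipe as a stand-in for a stalled peer:
// nobody reads from the other end, so the Write cannot complete.
func demoWriteTimeout() error {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	out := make(chan []byte, 1)
	out <- []byte("hello")
	close(out)
	return writeLoop(client, out, 50*time.Millisecond)
}

func main() {
	err := demoWriteTimeout()
	fmt.Println("write error:", err, "timeout:", isTimeout(err))
}
```

Returning the error from the loop lets the caller close the connection and restart both goroutines in one place.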
I'm assuming that's because Go is able to write the data to the socket successfully and the kernel is buffering the packet. With a bit of Googling, I found some code on golang-nuts that invokes a custom setsockopt() on a TCPConn, and I'm using it to try to set the TCP_USER_TIMEOUT socket option (which has been around for a while).
Here's the code I use to set the option:
// #include <linux/tcp.h>
import "C"

import (
	"net"
	"os"
	"syscall"
	"time"
)
...
// net_set_timeout sets the Linux-only TCP_USER_TIMEOUT option on conn.
func net_set_timeout(conn *net.TCPConn, timeout time.Duration) error {
	// We need to use the File object to get at the fd. File() returns a
	// duplicate descriptor, but the option applies to the underlying socket.
	f, err := conn.File()
	if err != nil {
		return err
	}
	defer f.Close()
	// TCP_USER_TIMEOUT takes its value in milliseconds
	msecs := int(timeout / time.Millisecond)
	fd := int(f.Fd())
	return os.NewSyscallError("setsockopt", syscall.SetsockoptInt(fd, syscall.SOL_TCP, C.TCP_USER_TIMEOUT, msecs))
}
It still doesn't detect the error. I *think* that's because TCPConn.Write() has already completed and my writer goroutine is back waiting on the output channel. Again, eventually the keepalive timer fires and the reader side generates an error.
I did notice that SO_SNDTIMEO isn't exposed as an option, although I suspect it'll have the same problem as the TCP_USER_TIMEOUT option. The other thought is to have another goroutine that polls the socket with getsockopt(SO_ERROR) and then calls Close() on the TCPConn (which in turn brings down the reader & writer goroutines).
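A rough sketch of that polling idea, assuming a Unix-like system where syscall.GetsockoptInt is available. The names socketError, watch and demoHealthy are made up for illustration, and I haven't confirmed that SO_ERROR becomes non-zero any sooner than the keepalive timer fires.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
	"time"
)

// socketError reads (and thereby clears) the pending SO_ERROR value on conn.
// A zero errno means the kernel has not recorded an error on the socket.
func socketError(conn *net.TCPConn) (syscall.Errno, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var val int
	var sockErr error
	ctlErr := raw.Control(func(fd uintptr) {
		val, sockErr = syscall.GetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_ERROR)
	})
	if ctlErr != nil {
		return 0, ctlErr
	}
	return syscall.Errno(val), sockErr
}

// watch polls SO_ERROR every interval and closes conn on the first failure
// or non-zero value, which unblocks any pending Read or Write on conn.
func watch(conn *net.TCPConn, interval time.Duration, done <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-done:
			return
		case <-t.C:
			if errno, err := socketError(conn); err != nil || errno != 0 {
				conn.Close()
				return
			}
		}
	}
}

// demoHealthy dials a throwaway local listener (a stand-in for the legacy
// server) and reports whether a fresh connection has no pending error.
func demoHealthy() bool {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false
	}
	defer ln.Close()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false
	}
	defer c.Close()

	errno, err := socketError(c.(*net.TCPConn))
	return err == nil && errno == 0
}

func main() {
	fmt.Println("healthy:", demoHealthy())
}
```

In the real client, watch would run as a third goroutine next to the reader and writer, with done closed when the connection is torn down for any other reason.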
I really don't want to rely on the SO_KEEPALIVE option for this (or mess with the sysctl values); it's there to keep the connection alive and fresh in router tables. What I want is to "fail fast" when there's a connectivity issue at the moment I try to Write(), at which point I restart the connection (and associated goroutines) and carry on.
What's the correct "Go" way to handle this?
Thanks!
-W.