Question on EOF errors and replica set

Davis Ford

Dec 8, 2016, 8:06:36 PM
to mgo-users
Hi Gustavo,

I'm going to take another stab at this.  My end goal is to make sure my system has as little downtime as possible, and that my database driver can and will recover quickly when a primary steps down or is detected as unreachable.

I've read the very useful and informative threads here: 


All good info.  I also built a local test with a 3-node mongo 2.6.5 replica set (thanks, Docker; we are on the older 2.6.5 and trying to upgrade) so I could try a few things out: https://gist.github.com/davisford/db624ebf7e07060a0ad000183b652e8c

I spun up a 3-node replica set and ran that binary, which just loops a couple of goroutines: one continuously pings and the other continuously writes.  Then I do an rs.stepDown() or kill the primary and observe the behavior.
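For reference, the loop is roughly this (a simplified sketch of what's in the gist; the host addresses and db/collection names here are just placeholders):

package main

import (
	"log"
	"time"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// dial the local 3-node replica set (placeholder addresses)
	session, err := mgo.Dial("localhost:27017,localhost:27018,localhost:27019")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
	session.SetMode(mgo.Strong, true)

	// one goroutine pings on a copied session
	go func() {
		for range time.Tick(time.Second) {
			s := session.Copy()
			if err := s.Ping(); err != nil {
				log.Printf("error pinging server: %#v", err)
			}
			s.Close()
		}
	}()

	// the main goroutine writes on a copied session
	for range time.Tick(time.Second) {
		s := session.Copy()
		if err := s.DB("test").C("docs").Insert(bson.M{"ts": time.Now()}); err != nil {
			log.Printf("error writing to collection: %#v", err)
		}
		s.Close()
	}
}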

What I observed is that in this simple test environment I cannot replicate what I see in production (the EOF errors).

When I kill or step down the primary, calls to both .Ping() and .Insert() block until the new primary is elected, and then they continue.  This is great, but unfortunately it's not what I see in production.

In production, I have a very busy database, and when the replica set is re-configured, the session state becomes unusable and simply returns EOF.  Restarting the app server fixes it immediately.

I even tried killing two servers in the local 3-node setup to see if it would eventually time out, and eventually the driver just returns the error:

error writing to collection: &errors.errorString{s:"no reachable servers"}
error pinging server: &errors.errorString{s:"no reachable servers"}

...which makes sense, since even though there is one more available server, it can never be elected primary (with two of three nodes down there is no majority).  In short, I can't seem to reproduce the EOF errors I see in production.

In the production code, we follow the recommended pattern (a rough sketch follows the list):

a) there's a global session created from Dial, set to Strong mode
b) on web requests, we copy the session, defer close, and then execute the query
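
In other words, roughly this (a sketch, not the real code; the URI, db/collection names, and handler are placeholders):

package main

import (
	"log"
	"net/http"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// global session, created once at startup and set to Strong mode
var globalSession *mgo.Session

func handler(w http.ResponseWriter, r *http.Request) {
	// per-request: copy the session, defer close, run the query
	s := globalSession.Copy()
	defer s.Close()

	var result bson.M
	if err := s.DB("app").C("things").Find(bson.M{}).One(&result); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	var err error
	globalSession, err = mgo.Dial("mongodb://replica-set-uri") // placeholder URI
	if err != nil {
		log.Fatal(err)
	}
	defer globalSession.Close()
	globalSession.SetMode(mgo.Strong, true)

	http.HandleFunc("/things", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}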

Based on my understanding of how it works, with this pattern of copying the session, if a replica set reconfiguration happens or the primary gets disconnected, the driver manages the new socket connection, and a session.Copy() ought to recover on its own once the set is back up.  But that is not what I see in practice.  In practice, I see several minutes of EOF errors, and a hard app reset fixes the issue immediately.

Please note this isn't something I can test in production without making a lot of people upset because of a service interruption.

What I'm trying to accomplish is a way to recover from the EOF at runtime without requiring a hard app reset.  I was hoping my local 3-node test would allow me to reproduce the environment and failure scenarios and come up with a robust way to handle it, but I'm not able to reproduce the issue.
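One idea I've been toying with, for what it's worth, is to catch the EOF at the call site, Refresh() the copied session, and retry once, roughly like this (an untested sketch; the package, function, and db/collection names are made up):

package mongoutil

import (
	"io"

	"gopkg.in/mgo.v2"
)

// insertWithRetry retries a single insert once, after refreshing the session,
// if the first attempt fails with io.EOF (i.e. a dead socket).
func insertWithRetry(s *mgo.Session, doc interface{}) error {
	err := s.DB("app").C("things").Insert(doc)
	if err == io.EOF {
		// Refresh puts back any reserved sockets so the session acquires
		// a fresh one on the next operation.
		s.Refresh()
		err = s.DB("app").C("things").Insert(doc)
	}
	return err
}

But I don't know whether Refresh() on the copied session is the right (or sufficient) thing to do here when the whole app seems to be stuck on dead sockets for minutes at a time.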

Any additional advice / suggestions you could offer in this respect?  Thanks in advance!