question re: Ping and replica set failover

Davis Ford

Dec 8, 2016, 12:04:06 PM
to mgo-users
Gustavo,

With a replica set, when we have failover, it is my understanding that a Session.Refresh() will dynamically fix it.  I'd like to try to avoid having to check for EOF everywhere in our code base that might do a query.

My question is simple: if I put session.Ping() in an endless goroutine at a reasonable interval, will that allow me to detect the EOF error?  For example,

session, err := mgo.Dial(url)
if err != nil {
    return nil, err
}

session.SetMode(mgo.Strong, true)

go func() {
    for {
        time.Sleep(interval)
        copiedSes := session.Copy()
        if err := copiedSes.Ping(); err != nil {
            session.Refresh()
        }
        // Close explicitly: a defer inside this endless loop would never run.
        copiedSes.Close()
    }
}()


Would something like this work to allow me to detect a primary failover and continue with the current session?

My concern is that the command "ping" does not do a write, so it may succeed anyway, whereas a query that attempts to do a write will fail with an EOF, thereby subverting the whole process.  When a failover happens, if the primary isn't elected yet, reads can still happen from the secondary or primary, but writes will return an error until the system has stabilized.

Gustavo Niemeyer

Dec 8, 2016, 12:10:27 PM
to mgo-...@googlegroups.com
Hi Davis,

That's usually a bad idea. The proper way to handle errors is to take care of the ones you do want to handle locally, because you know it's safe to do so in that exact context, and then bail out when you hit something that is unknown. It's not just EOF. It might be a timeout, or a lack of file descriptors, or a bad disk, or so many other things that go wrong in the real world. So, when you do see an error, usually the best thing to do is to fall back to a safe place, and then retry the operation as a whole. This will also better handle the case you describe, with primary failovers. These are harsh scenarios that can result in data loss and rollbacks, so it's best not to second-guess what the state of the database is when you're doing a sequence of reads/writes that make assumptions.

This advice is also not mgo specific, actually. Returning to a safe place on unknown errors, with proper cleanups and termination, is generally a good idea no matter what.




--
You received this message because you are subscribed to the Google Groups "mgo-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mgo-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Davis Ford

Dec 8, 2016, 12:16:29 PM
to mgo-users
I get that, but I'm trying to handle one specific failure that can occur, which the system is supposed to recover from, which is a replica set failover.

Currently, I have had a failover and it leaves the mgo session in an unusable state, and I have to restart the whole server.

So, to my point, what's the recommended way to deal with this? It isn't working the way I have it.


Davis Ford

Dec 8, 2016, 12:25:29 PM
to mgo-users
Let me try to be more specific.

So, we have bootstrap code somewhere that creates the global session:

session, err := mgo.Dial(url)

and we hold on to that.

When requests come in from the network, they'll Clone or Copy the session, and execute.  This morning, the replica set reconfigured itself because I executed the command:

rs.remove('url-of-a-secondary')

This caused all mgo sessions to start returning the error EOF.  This should be recoverable, but it was not.  

I could try doing a clonedSession.Refresh() or copiedSession.Refresh() right there in the local context, but I have no idea if that would fix it or if that's the recommended way.  Instead, is the recommended way to do session.Refresh() on the parent/root session?

Also, I get there can be several reasons for a failure, intermittent network issues, too many file descriptors, etc.  But this issue that happened this morning *should* be recoverable, so I want to do whatever I can to make it robust.  

I don't like checking for Error types in golang any more than the next guy, but when some errors are recoverable, and you can figure that out and take some action, then it makes sense to do so.

So, would it make sense to check only for EOF as the main clue that the replica set is reconfiguring itself / or has reconfigured, and try to refresh the session at that point?  

Should it be refreshed on the global root session, or would it work with any copied/cloned session?

The idea behind using Ping is so I don't have to go into hundreds and hundreds of lines of code to add an error check on a non-nil error to determine if a session refresh is necessary, but perhaps I can do it in a single spot.

Does that make more sense as a question?

Gustavo Niemeyer

Dec 9, 2016, 8:48:53 AM
to mgo-...@googlegroups.com

The EOF is recoverable. The Refresh needs to be done on the session that has failed, if you intend to keep using it. Or, per the other thread, Close it and use a fresh new session copied from a master one. That's enforced precisely because in most cases we want to fail the operation and start over, instead of pretending the database is in the state we expect.

The driver has no way to tell if an EOF is a replica set recovering itself or not, because the server won't forewarn about it. It will simply shut the connection.

For that reason, we also cannot automatically recover from it in the sense of simply ignoring the error, and I'd recommend not trying to ignore that error on your end either. Requested operations may or may not have gone through when you get an EOF, and the database may be in a pretty different state (another server, data rollback, etc).


