Hi all,
I'd like some clarity on removing a replica set primary, given some behavior I saw when I tried to do so recently. Here are the steps I took on MongoDB 1.8.x:
1. Ask the primary to step down (run on the primary): db.adminCommand( { replSetStepDown : 60 } )
2. Wait for a secondary to take over as primary
3. Remove the old primary from the replica set (run on the new primary): rs.remove("hostname:27018")
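Step 2 can be scripted instead of watched by hand. Below is a minimal sketch, assuming the shell is connected to a member that stays up and that `rs.status()` returns a `members` array whose entries carry `name` ("host:port") and `stateStr` fields. `findPrimary` and `waitForNewPrimary` are hypothetical helper names, not shell built-ins; the status source and the pause function are passed in so the logic can be checked without a live replica set.

```javascript
// Hypothetical helper: return the name ("host:port") of the member that
// the given rs.status() document reports as PRIMARY, or null if none is.
function findPrimary(status) {
  var members = status.members || [];
  for (var i = 0; i < members.length; i++) {
    if (members[i].stateStr === "PRIMARY") {
      return members[i].name;
    }
  }
  return null;
}

// Hypothetical helper: poll getStatus() until a member other than
// oldPrimary is PRIMARY, pausing intervalMs between attempts.
// Returns the new primary's name, or null after maxTries attempts,
// in which case the old primary should not be removed yet.
function waitForNewPrimary(getStatus, pause, oldPrimary, maxTries, intervalMs) {
  for (var i = 0; i < maxTries; i++) {
    var primary = findPrimary(getStatus());
    if (primary !== null && primary !== oldPrimary) {
      return primary;
    }
    pause(intervalMs);
  }
  return null;
}
```

In the mongo shell this could be called as `waitForNewPrimary(function () { return rs.status(); }, sleep, "hostname:27018", 30, 1000)`, using the shell's built-in `sleep(ms)`; only if it returns a name would you connect to that member and run `rs.remove("hostname:27018")`, since reconfiguration has to happen on the primary.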
At this point I started seeing "WriteBackListener exception : socket exception" errors in my mongos log. I assume these came from mongos trying to connect to the old host, which was by then down. Eventually they stopped, so I brought everything back up, and it ran fine until I tried a slave_ok read. Whether I issued the read from the Ruby driver or directly from a mongo shell, it failed with a "not master or secondary, can't read" error. I tried flushing the router config, but that didn't seem to help. At that point I reconfigured all clients *not* to use slave_ok and to point at the primary, and all operations worked fine.
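When that error shows up, one quick check is which members the replica set itself currently considers readable. The error text suggests the read landed on a mongod that was in neither PRIMARY nor SECONDARY state, though that is an inference, not a confirmed diagnosis. A minimal sketch, assuming the same `rs.status()` shape (`members[].name`, `members[].stateStr`); `classifyForSlaveOk` is a hypothetical helper name.

```javascript
// Hypothetical helper: split the members of an rs.status() document into
// those in a state that accepts slaveOk reads (PRIMARY or SECONDARY) and
// those in any other state (e.g. RECOVERING, STARTUP2), tagged with it.
function classifyForSlaveOk(status) {
  var readable = [];
  var notReadable = [];
  (status.members || []).forEach(function (m) {
    if (m.stateStr === "PRIMARY" || m.stateStr === "SECONDARY") {
      readable.push(m.name);
    } else {
      notReadable.push(m.name + " (" + m.stateStr + ")");
    }
  });
  return { readable: readable, notReadable: notReadable };
}
```

From a shell connected to any member, `printjson(classifyForSlaveOk(rs.status()))` prints both lists, which can then be compared with the hosts the failing reads were going to.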
When I came back to the problem several hours later, slave_ok queries suddenly worked without a problem.
Can somebody explain what's happening here? We process real-time data, so being unable to use slave_ok queries for an indeterminate period after removing a replica set node is a little unsettling.
Thanks,
Damon