@Ronaldo: I just noticed the extension to your e-mail address, it's a
small world. I remember your company in the "YouPlusPlus" days.
I think I have a better idea of what you're looking to do :)
> Agree, but I keep thinking if something like:
> - disable slave_ok
> - recover the files on a slave for example, so not a huge deal that is
> affecting that member.
> - make that the primary
> - make the rest of the replica members re-sync that one database only.
> - enable slave_ok
I actually like this plan; it was my original thought too. The only
difficulty I could find was step #2 (recovering the files on one
member). I would just be worried that "recovering" the files could
cause that machine to simply be hammered.
If you have three shards, at least you get to spread the writes out.
But if you're trying to write in lots of data manually, you're
probably going to max out the disk I/O on those three boxes.
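One way to soften the disk I/O hit is to feed the data back in small
batches with a pause between them. This is only a rough sketch, not
anything official: `throttled_load`, `batches` and the `write_batch`
callable are names I made up, and `write_batch` stands in for whatever
bulk-insert call your driver provides. The batch size and pause are
guesses you would have to tune against your own boxes.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly short, batch.
        yield batch


def throttled_load(
    docs: Iterable[dict],
    write_batch: Callable[[List[dict]], None],  # hypothetical: your driver's bulk insert
    batch_size: int = 1000,
    pause_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write docs in batches, pausing after each one. Returns docs written."""
    written = 0
    for batch in batches(docs, batch_size):
        write_batch(batch)
        written += len(batch)
        sleep(pause_seconds)  # give the disk a breather between batches
    return written
```

Passing `sleep` in as a parameter is just so you can dry-run the loop
without actually waiting.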
I'm mostly just worried about co-ordination in that case. You
definitely have a lot of potential spots where something could go
sideways.
If you're looking for "roll-back" functionality, I'm seeing a couple
of other options. (not official 10gen responses, just my best guesses)
Option #1: --slavedelay
---------
If you're worried about having a roll-back window, you can simply put
a --slavedelay on one of the servers.
Here's an article detailing exactly that (Kenny Gorman from
Shutterfly):
http://www.kennygorman.com/wordpress/?p=699
In some ways, this reduces capacity (if you're using this to write),
but it also provides you an emergency buffer.
This solution is nice because you don't actually lose the data on all
nodes. In theory this means that you don't have to wait for a giant
copy operation.
You should also be able to minimize data loss by replaying the oplog
up to the point of failure.
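To make the delayed-member idea a bit more concrete, here is a small
sketch. It assumes a replica set whose member config accepts the
`slaveDelay`, `priority` and `hidden` fields (check this against your
server version), and it uses plain integers for the oplog `ts` values
purely for illustration. The host name and both function names are
made up.

```python
from typing import Dict, Iterable, List


def delayed_member(member_id: int, host: str, delay_seconds: int) -> Dict:
    """Build a replica set member document for a delayed secondary.

    Assumption: the field names below match your server version.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must not be negative")
    return {
        "_id": member_id,
        "host": host,
        "priority": 0,      # a delayed member should never become primary
        "hidden": True,     # keep clients from reading stale data from it
        "slaveDelay": delay_seconds,
    }


def entries_to_replay(
    oplog: Iterable[Dict], last_applied_ts: int, failure_ts: int
) -> List[Dict]:
    """Pick the oplog entries after what the delayed member has already
    applied and strictly before the bad operation."""
    return [e for e in oplog if last_applied_ts < e["ts"] < failure_ts]


# Example: a one-hour rollback window on a hypothetical third member.
member = delayed_member(2, "db3.example.net:27017", 3600)
```

The roll-back window is only as long as the delay, so the bad write
has to be spotted (and the delayed member stopped) before it catches
up.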
Option #2: drop / re-import for small sets
----------
Obviously, this depends on the size of the DB and your available
throughput.
But if you have a small DB, the lowest-downtime option may simply be a
full restore. If the DB is only a couple of gigabytes, this is probably
much quicker than playing around trying to get the data back in. It's
low complexity and it works within the current system.
This is probably best for the "non-transactional" data. Stuff that
gets entered once and then changed rarely.
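For a back-of-the-envelope feel for "size vs. throughput", something
like the following works. The function name and the 50 MiB/s figure
are invented for the example, and the estimate ignores index rebuilds
and network hiccups, so treat it as a lower bound at best.

```python
def estimate_restore_seconds(
    db_size_bytes: float, throughput_bytes_per_second: float
) -> float:
    """Naive restore time: data size divided by sustained throughput."""
    if throughput_bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return db_size_bytes / throughput_bytes_per_second


GIB = 1024 ** 3
MIB = 1024 ** 2

# A 2 GiB database at an assumed 50 MiB/s comes out at roughly 41 seconds.
seconds = estimate_restore_seconds(2 * GIB, 50 * MIB)
```

If the number that comes out is smaller than the time you would spend
co-ordinating a selective recovery, the full restore is probably the
better bet.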
Config DB:
----------
The only big caveat I have with any of these methods is fixing the
config DBs. If you use option #1 or your idea, you'll need to ensure
that the config DB is somehow still correct.
Right now, I don't know of an easy, automated way to do this. I
suspect that it involves some manual juggling (a selective restore
from backup).