Thanks, Bernie, that's a good suggestion, but that's not really
reducing the number of servers that we're running, it's just combining
two of them onto single HW.
Since doing that would take just as much in terms of RAM/CPU as doing
them on separate HW, it seems functionally equivalent to just running
fewer shards.
I'm not a mongo-dev, but it would seem like that doesn't really solve
the problem, it just masks it.
To take that answer to the next step, I could reduce server count by
running multiple shards on a single disk as well. ;)
We're still in RAID 0+1 land.
The RAID5 system would store a parity block, that allows the other
blocks to be reconstructed, using the remaining data, and XOR.
While I'm sure you know how RAID works, it might be helpful to quickly
review the exact mechanism-
http://www.scottklarr.com/topic/23/how-raid-5-really-works/
So in this system, with 5 shards, we'd use six servers.
Each shard would store 1/5th of the data, plus a parity block.
When any one of the systems went down, we could reconstruct it, by
using the data remaining on the other 5.
You can increase redundancy, by increasing the number of parity
blocks.
RAID6 uses the same system, but two parity blocks, to increase the
overall reliability of the system, and ensure it can handle two
failures at the same time.
If Mongo were to support such a system, companies could deploy
dramatically fewer servers, while maintaining a very high level of
reliability and failover.