In a sharded environment, the total database storage is spread across the shards, so if you need to store 100TB of data, you need enough shards to hold the data plus the oplog and indexes. Since you will be sharding your data, you also need replica sets so that if a shard fails, you do not lose access to that chunk of the data. The recommended minimum in production is 3 replica servers per set, although you can get by with two data-bearing members plus a third member that only votes (an arbiter).
You will need (# of shards * # of replicas per shard) servers. Depending on your performance and data-locality needs, these can be VMs, or hosted in a cloud environment if filling racks of servers locally is not a good option. You also have to make sure the servers have enough memory to hold your indexes without constantly swapping. How many shards you need depends on how much you want to distribute the write load and on the maximum storage you can get on a single server. If you will be running lots of map/reduce jobs that write their results back into collections, or you will be adding data often and want to spread out the write load, then more shards will help with that.
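Setting up the sharding itself is only a few commands through mongos. Here is a minimal sketch using pymongo; the host, database, collection, and shard key are placeholders, so pick a key that matches your own write pattern:

    # A minimal sketch, assuming pymongo and a mongos router at
    # mongos.example.com; database, collection, and key are placeholders.
    from pymongo import MongoClient

    # Always talk to a mongos router, never to a shard directly.
    client = MongoClient("mongos.example.com", 27017)

    # Enable sharding on the database, then shard the collection on a key
    # that spreads writes out; a bad key funnels every write to one shard.
    client.admin.command("enableSharding", "analytics")
    client.admin.command("shardCollection", "analytics.events",
                         key={"userId": 1, "ts": 1})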
For example, for $15k you can get an R720 from Dell with 12x3TB drives in RAID10, 96GB of RAM and a 6-core processor. The box will have 18TB available (unformatted), and the oplog will take 5% of the formatted free space (~0.9TB). If you can fit 10TB of data on it (a total guess; it depends on the number and size of your indexes), you need 10 of them to fit 100TB. So with 3 replicas per set, you would need 10 shards * 3 replicas = 30 servers to hold that much data. An alternative would be to use fewer 1U servers with more RAM and attach external disk arrays, allowing more disk space per shard than a single server can hold.
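To make the arithmetic concrete, here is a back-of-the-envelope sizing script; every per-box number is a rough guess taken from the example above, so substitute your own measurements:

    # Back-of-the-envelope cluster sizing; all per-box numbers are guesses.
    import math

    total_data_tb = 100.0                 # data you need to store
    raw_disk_tb = 18.0                    # 12 x 3TB drives in RAID10
    oplog_tb = raw_disk_tb * 0.05         # oplog defaults to 5% of free disk
    usable_per_box_tb = 10.0              # after oplog, indexes, headroom
    replicas_per_shard = 3

    shards = int(math.ceil(total_data_tb / usable_per_box_tb))   # -> 10
    servers = shards * replicas_per_shard                        # -> 30
    print("%d shards, %d servers total" % (shards, servers))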
If that solution does not have the write performance you need, you can put smaller (and faster) disks in each server and add more shards, say 4TB of data per server and 25 shards. If your data set grows later, you can always add more shards to increase the total disk available.
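Growing the cluster later is just another addShard command against a mongos. A sketch, assuming pymongo; the replica-set name and hosts are placeholders:

    from pymongo import MongoClient

    client = MongoClient("mongos.example.com", 27017)
    # Register a new replica set as an additional shard; the balancer then
    # migrates chunks onto it in the background.
    client.admin.command(
        "addShard",
        "rs10/newhost1.example.com:27017,newhost2.example.com:27017")
    print(client.admin.command("listShards"))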
If you need more read performance, you can add more replicas to each set to spread out the read load. If you don't want to spend $500K-$1M on hardware and manage multiple racks of physical servers, then a hosted or cloud solution will be a better way to go. You can really scale as far as you want; cost will be the limiting factor.
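On the client side you then have to opt into secondary reads. A sketch using pymongo read preferences (hostname and names are placeholders; keep in mind secondaries can serve slightly stale data):

    from pymongo import MongoClient

    # secondaryPreferred sends queries to a secondary when one is
    # available, falling back to the primary otherwise.
    client = MongoClient("mongos.example.com", 27017,
                         readPreference="secondaryPreferred")
    print(client.analytics.events.find_one({"userId": 42}))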
You can look at the docs about deploying to EC2:
http://www.mongodb.org/display/DOCS/Amazon+EC2+Quickstart
You might also try setting up a subset of your data in EC2 to get an idea of sizing and performance before you drop a ton of money on building out a large cluster to fit the whole set.
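If you do load a known fraction of the data into such a test cluster, you can extrapolate storage from the collection stats. A sketch, assuming pymongo; the names and the 1% fraction are placeholders:

    from pymongo import MongoClient

    sample_fraction = 0.01   # you loaded 1% of the full data set
    client = MongoClient("mongos.example.com", 27017)
    stats = client.analytics.command("collStats", "events")

    # Scale the measured data and index sizes up to the full data set.
    data_tb = stats["size"] / 1e12
    index_tb = stats["totalIndexSize"] / 1e12
    print("projected data: %.1f TB, indexes: %.1f TB"
          % (data_tb / sample_fraction, index_tb / sample_fraction))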
On Monday, January 21, 2013 11:40:39 PM UTC-8, Ayberk Cansever wrote:
Hi guys,
In a big company, we have ~100TB of data to be analyzed and we want to use MongoDB. We should analyze the data with aggregation. The data is so big that we are confused about the architectural design. What kind of architecture do you suggest? How many servers? Disk size per server? Do you have any idea?
Ayberk