In a sharded environment, the total database storage is spread across the shards, so if you need to store 100TB of data, you need enough shards to hold the data plus the oplog and indexes. Since you will be sharding your data, you also need replica sets so that if a shard fails, you do not lose access to that chunk of the data. The recommended minimum in production is 3 replica servers per set, although you can get by with two data-bearing members plus a third member that only votes (an arbiter).
You will need (# of shards * # of replicas per shard) servers. Depending on your performance and data-locality needs, these can be VMs, or hosted in a cloud environment if filling racks of servers locally is not a good option. You also have to make sure the servers have enough memory to hold your indexes without constantly swapping. How many shards you need depends on how much you want to distribute the write load and on the maximum storage you can get on a single server. If you will be running lots of map/reduce jobs that write their results back into collections, or you will be adding data often and want to spread out the write load, then more shards will help with that.
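Setting up the sharding itself is only a few commands through mongos. Here is a minimal sketch using pymongo; the host, database, collection, and shard key are placeholders, so pick a key that matches your own write pattern:

    # A minimal sketch, assuming pymongo and a mongos router at
    # mongos.example.com; database, collection, and key are placeholders.
    from pymongo import MongoClient

    # Always talk to a mongos router, never to a shard directly.
    client = MongoClient("mongos.example.com", 27017)

    # Enable sharding on the database, then shard the collection on a key
    # that spreads writes out; a bad key funnels every write to one shard.
    client.admin.command("enableSharding", "analytics")
    client.admin.command("shardCollection", "analytics.events",
                         key={"userId": 1, "ts": 1})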
For example, for $15k you can get an R720 from Dell with 12x3TB drives in RAID10, 96GB of RAM and a 6-core processor. The box will have 18TB available (unformatted), and the oplog will take 5% of the formatted free space (~0.9TB). If you can fit 10TB of data on it (a total guess; it depends on the number and size of your indexes), you need 10 of them to fit 100TB. So with 3 replicas per set, you would need 10 shards * 3 replicas = 30 servers to hold that much data. An alternative would be to use fewer 1U servers with more RAM and attach external disk arrays, allowing more disk space per shard than a single server can hold.
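To make the arithmetic concrete, here is a back-of-the-envelope sizing script; every per-box number is a rough guess taken from the example above, so substitute your own measurements:

    # Back-of-the-envelope cluster sizing; all per-box numbers are guesses.
    import math

    total_data_tb = 100.0                 # data you need to store
    raw_disk_tb = 18.0                    # 12 x 3TB drives in RAID10
    oplog_tb = raw_disk_tb * 0.05         # oplog defaults to 5% of free disk
    usable_per_box_tb = 10.0              # after oplog, indexes, headroom
    replicas_per_shard = 3

    shards = int(math.ceil(total_data_tb / usable_per_box_tb))   # -> 10
    servers = shards * replicas_per_shard                        # -> 30
    print("%d shards, %d servers total" % (shards, servers))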
If that solution does not have the write performance you need, you can put smaller (and faster) disks in each server and add more shards, say 4TB of data per server and 25 shards. If your data set grows later, you can always add more shards to increase the total disk available.
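Growing the cluster later is just another addShard command against a mongos. A sketch, assuming pymongo; the replica-set name and hosts are placeholders:

    from pymongo import MongoClient

    client = MongoClient("mongos.example.com", 27017)
    # Register a new replica set as an additional shard; the balancer then
    # migrates chunks onto it in the background.
    client.admin.command(
        "addShard",
        "rs10/newhost1.example.com:27017,newhost2.example.com:27017")
    print(client.admin.command("listShards"))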
If you need more read performance, you can add more replicas to each set to spread out the read load. If you don't want to spend $500K-$1M on hardware and manage multiple racks of physical servers, then a hosted or cloud solution will be a better way to go. You can really scale as far as you want; cost will be the limiting factor.
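On the client side you then have to opt into secondary reads. A sketch using pymongo read preferences (hostname and names are placeholders; keep in mind secondaries can serve slightly stale data):

    from pymongo import MongoClient

    # secondaryPreferred sends queries to a secondary when one is
    # available, falling back to the primary otherwise.
    client = MongoClient("mongos.example.com", 27017,
                         readPreference="secondaryPreferred")
    print(client.analytics.events.find_one({"userId": 42}))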
You can look at the docs about deploying to EC2:
http://www.mongodb.org/display/DOCS/Amazon+EC2+Quickstart
You might also try setting up a subset of your data in EC2 to get an idea of sizing and performance before you drop a ton of money on building out a large cluster to fit the whole set.
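If you do load a known fraction of the data into such a test cluster, you can extrapolate storage from the collection stats. A sketch, assuming pymongo; the names and the 1% fraction are placeholders:

    from pymongo import MongoClient

    sample_fraction = 0.01   # you loaded 1% of the full data set
    client = MongoClient("mongos.example.com", 27017)
    stats = client.analytics.command("collStats", "events")

    # Scale the measured data and index sizes up to the full data set.
    data_tb = stats["size"] / 1e12
    index_tb = stats["totalIndexSize"] / 1e12
    print("projected data: %.1f TB, indexes: %.1f TB"
          % (data_tb / sample_fraction, index_tb / sample_fraction))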
On Monday, January 21, 2013 11:40:39 PM UTC-8, Ayberk Cansever wrote:
Hi guys,
In a big company, we have ~100TB of data to be analyzed and we want to use MongoDB. We should analyze the data with aggregation. The data is so big that we are confused about the architectural design. What kind of architecture do you suggest? How many servers? Disk size per server? Do you have any idea?
Ayberk