A 36 Trillion Objects Database: Feasible?


Femi TAIWO

Apr 16, 2012, 5:38:19 AM
to mongodb-user
I have been playing around with MongoDB for a couple of days now, looking
for the best way to implement a datastore that can hold up to 36 trillion
records and be queried frequently.

Has anyone succeeded in scaling to contain this much data?

Here's the real scenario of the problem I'm trying to solve.

I have about 6 million users in the database.

I need to store the results of the comparisons done between these
users. It is a 1-to-1 comparison, which works out to 6 million x 6
million results (hence my estimate of approximately 36 trillion
records).
Each result object has 5 attributes of plain data: two are 20-character
alphanumeric strings, while the other three are integers below 1000.

These results need to be calculated ahead of time, not on the fly,
because queries will be run often to check, for example, for objects
with a score value between 400 and 500.
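
To make that concrete, here is a rough sketch in Python/PyMongo of the kind of document and query I mean (the field names user_a, user_b, score and so on are placeholders, not my real schema):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    coll = client.matching.comparisons

    # One precomputed comparison result per document (roughly 100 bytes each).
    coll.insert_one({
        "user_a": "A1B2C3D4E5F6G7H8I9J0",   # 20-char alphanumeric
        "user_b": "K1L2M3N4O5P6Q7R8S9T0",   # 20-char alphanumeric
        "score": 472,                        # integer below 1000
        "metric_b": 118,
        "metric_c": 644,
    })

    # Typical query: all comparisons with a score between 400 and 500.
    for doc in coll.find({"score": {"$gte": 400, "$lte": 500}}):
        print(doc)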

I have been running the calculations for a few days too and currently
have over 66 million results/objects in the collection.

I have indexed the attributes used in querying the store, and
performance is fantastic. But 66 million is a small number compared to
the estimated final figure, and generating the full result set will
take a fairly long time. I have a Dell R410 server (64GB RAM,
quad-core) to work with, and storage is not a limitation.
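
For the record, the indexes are along these lines (again a sketch with the same placeholder field names):

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient().matching.comparisons

    # Index the attributes used in the range queries and in pair lookups.
    coll.create_index([("score", ASCENDING)])
    coll.create_index([("user_a", ASCENDING), ("user_b", ASCENDING)])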

The data, once calculated, will be stored and never changed, so it's
like a WORM CD :) write once, read many.

Am I making sense, and am I headed in the right direction? Is MongoDB
the best choice for this?

Thanks!

Femi

Brandon Diamond

Apr 16, 2012, 11:42:08 AM
to mongodb-user
Hi Femi,

This is indeed quite possible with MongoDB. The only practical limit
is memory: if you're indexing trillions of documents (each of which is
roughly 100 bytes, per your description), that's roughly 3.6 petabytes
of data. You'll therefore definitely need to shard, and to do so in
such a way that each individual shard's working set fits in primary
memory; otherwise, your queries will hit disk and be slow.

Provided that this data can be effectively sharded and that you are
okay with running a sharded cluster, MongoDB will be more than capable
of efficiently serving this data.
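
As a minimal sketch of what the setup might look like (run against a mongos router; the database, collection, and shard key below are placeholders, and the right shard key really depends on how your data and queries are distributed):

    from pymongo import MongoClient

    # Connect to the mongos router, not to an individual shard.
    client = MongoClient("mongos-host", 27017)

    # Enable sharding on the database, then shard the collection on a chosen key.
    client.admin.command("enableSharding", "matching")
    client.admin.command(
        "shardCollection",
        "matching.comparisons",
        key={"user_a": 1, "user_b": 1},
    )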

Thanks,
- Brandon

Max Schireson

Apr 16, 2012, 12:13:08 PM
to mongod...@googlegroups.com

While it may not be impossible, it may be impractical on a single machine (which seems to be what you're aiming for).

I think Linux limits you to 128TB of virtual address space per process; that means you will need at least 30 mongod processes managing different shards, and if you use journaling, twice that many.

In practice I have not seen people running individual shards that big (think single-digit TB as a practical maximum, so that at least the frequently used parts of the indexes can stay in memory), which probably means hundreds of servers.

You'll also need petabytes of storage - say 10 PB raw with some redundancy (RAID), assuming 100 bytes per record. That will be at least 3,000 spindles, which will likely fill 10 racks or so even if you can handle high power density. In practice you might attach, say, 10 drives to each of a few hundred servers, but either way you're talking about a pretty big data center.
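
Rough back-of-envelope version of those numbers, in case it's useful (the redundancy factor and per-drive capacity are assumptions):

    # Back-of-envelope sizing; redundancy factor and drive size are assumptions.
    records = 36e12
    bytes_per_record = 100

    raw_data = records * bytes_per_record   # ~3.6e15 bytes, i.e. roughly 3.6 PB
    addr_space = 128 * 2**40                # ~128 TB virtual address space per process
    min_mongods = raw_data / addr_space     # ~26 processes just to map the data
                                            # (journaling roughly doubles that)

    stored = raw_data * 3                   # ~10 PB raw, assuming ~3x RAID/redundancy overhead
    drive_size = 3 * 2**40                  # assumed ~3 TB per spindle
    spindles = stored / drive_size          # ~3,300 drives

    print(round(min_mongods), round(spindles))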

I think there are likely much more efficient data structures for this than 36 trillion documents. Happy to think about suggestions if you supply more details.

-- Max


Femi TAIWO

Apr 16, 2012, 2:07:23 PM
to mongod...@googlegroups.com
@Brandon & Max

Thanks for responding to my questions and putting the numbers in perspective. I thought I was onto something cheap earlier :)


@Max 

Thanks! And you are right, there must be more efficient structures than 36 trillion records in this case. I will definitely use shards to manage this data, but I only have access to several of those servers (max 30 x 8TB), not hundreds, so I certainly have storage limitations relative to the projected data size.
I am currently designing rules that will HOPEFULLY reduce the number of comparisons to a tiny fraction of my original projection.


@Brandon
Poking around sharded clusters, I have started studying and trying them out - I can see I have a lot of research to do.
I may also have to look for a way to archive rarely visited data so that I don't keep querying the entire collection for these documents.
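
Something like this is roughly what I have in mind for the archiving (a sketch only; the collection names and the last_accessed field are assumptions, and at real scale this would have to be batched):

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    db = MongoClient().matching
    cutoff = datetime.utcnow() - timedelta(days=90)     # "rarely visited" threshold (assumed)
    cold_filter = {"last_accessed": {"$lt": cutoff}}    # hypothetical access-tracking field

    # Copy cold documents into an archive collection, then remove them
    # from the hot collection so routine queries scan less data.
    cold_docs = list(db.comparisons.find(cold_filter))
    if cold_docs:
        db.comparisons_archive.insert_many(cold_docs)
        db.comparisons.delete_many(cold_filter)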

Once again, thanks for responding to my questions. I will keep sharing my thoughts and observations over the next few days as I run tests and work on the design.

Femi.
--
Regards,

Femi TAIWO
