700 million row collection


j.e.frank

Nov 4, 2011, 12:19:37 PM
to mongodb-user
I'm trying MongoDB for an application that MySQL didn't seem capable
of handling. I have been doing some research but I want to see if
anyone here has recommendations for performance tuning. Here are the
questions that I have, and then I will supply more detail about the
problem I'm trying to solve.

-- How can I get the best performance for inserting/updating around
700 million rows? It needs to be queryable as well, so it needs an
index, but perhaps I can use a temp collection to insert and then swap
them after indexing.

-- Are there easy ways to partition data in MongoDB without using
sharding in a cluster?

I receive a data file (something like CSV) once a week that has about
700 million rows in it. The natural key of the data is composed of
two columns, let's call them "product_id" and "store_id". I need to
update my database with the information in these 700 million rows,
either by inserting new data or updating existing data. In the
meantime, I need to support a service that queries this data, mainly
by product_id.

In MySQL, my thought was to insert all records to a table with no
index, and then create the index and rename the table. The insert
finished in about 30 hours, but I killed the "create index" statement
after it had been running for about a day.

I could improve the situation by creating separate tables and doing
manual partitioning, e.g. by ranges of product_id, so that is one
avenue I may explore. But I thought this was a great opportunity to
try out a "big data" database, and MongoDB seemed like a good choice.

I was a bit surprised that at least in my naive implementation in
MongoDB, the insert took longer, about 36 hours, again on a collection
with no index. I could probably speed that up if I turned journaling
off (I am using MongoDB 2.0.1), and I am not doing any bulk inserts
yet, just inserting each row one at a time. I'm pretty sure that I am
using the Java driver correctly so that it doesn't have to wait for
confirmation from the server when it does the insert, but I could be
wrong. I'm currently running the "ensureIndex" command after I've
finished inserting, and based on the progress in the mongod log, it
will take around an hour and a half to two hours.
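
For reference, the "not waiting for confirmation" setup I mean looks
roughly like this with the Java driver (just a sketch; 'collection'
stands in for my real DBCollection and the values are made up):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.WriteConcern;

// Fire-and-forget writes: insert() returns without a getLastError round trip.
collection.setWriteConcern(WriteConcern.NONE);
collection.insert(new BasicDBObject("product_id", 1234L)
        .append("store_id", 42L));   // example row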

I have looked at MongoDB's sharding, and I'd prefer to stay away from
setting up a cluster or any of that complexity. I could definitely
try the same manual partition idea that would work on MySQL, but if
I'm going to do that, I am not sure there remains a reason to use
MongoDB for this case.

NPSF3000

Nov 5, 2011, 12:25:56 PM
to mongodb-user
Newb here.

Sounds slow - in a simple VM I exceed 20k inserts/s, which
extrapolates to <10 hours for your volume.

First things I'd look at are:

Are you maxing out your client? Try multi-threading [be careful, too
many threads will reduce performance].

Are you maxing out your storage? Check with Iometer etc.

How's your memory? Mongo IIRC always indexes your _id
[barring capped collections] - 700Mn docs make for a big index - if
that starts getting swapped to disk, performance will suffer.

Once again, I'm a newb!

Best wishes in finding and resolving your problem.

Massimo Lusetti

Nov 5, 2011, 2:15:50 PM
to mongod...@googlegroups.com
On Fri, Nov 4, 2011 at 5:19 PM, j.e.frank <j.e....@gmail.com> wrote:

> I'm trying MongoDB for an application that MySQL didn't seem capable

[..]


> I have looked at MongoDB's sharding, and I'd prefer to stay away from
> setting up a cluster or any of that complexity.  I could definitely
> try the same manual partition idea that would work on MySQL, but if
> I'm going to do that, I am not sure there remains a reason to use
> MongoDB for this case.

My own experience with a collection of about 300 million rows is the
same as yours... plain mongo without sharding simply doesn't cut it.

You can search the archives for my threads about that... I would be
happy to hear other opinions on this, but I think the only things you
can do are sharding (even if I didn't try it myself) or some sort of
map-reduce.

BTW I ended up with PostgreSQL, which handles the data load nicely,
even though I still have the prototype in MongoDB.

Cheers
--
Massimo
http://meridio.blogspot.com

Massimo Lusetti

Nov 5, 2011, 2:18:50 PM
to mongod...@googlegroups.com
On Sat, Nov 5, 2011 at 5:25 PM, NPSF3000 <npsf...@googlemail.com> wrote:

> Are you maxing your client?  Try multi-threading [be careful, too many
> threads will reduce performance].

I don't think multi-threading would help here at all...

> Are you maxing your storage - Iometer etc.
>
> How's your memory?  Mongo IIRC always indexes your ID
>  [barring capped collections] - 700Mn docs could create a big index -
> if this started being swapped...

Yep, that's the problem... the way indexes are built, they need to
stay in memory for fast lookups, otherwise the penalty is too high -
so use shards or do some map-reduce before storing...

NPSF3000

Nov 5, 2011, 3:00:17 PM
to mongodb-user
> I don't think multi-thread could help here at all...

It depends on the bottleneck - if he can only process X amount of data
client side, then the DB is irrelevant beyond that.

Running a two-core VM on my machine, I found that running two insert
threads gave a 50-100% performance boost [this VM also runs the
mongod itself]. I was inserting tens of millions of documents with two
or three fields apiece.
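
For what it's worth, my two-thread test looked roughly like this (just
a sketch; the pre-split 'chunks' of documents and the shared
DBCollection are stand-ins for whatever your loader already has):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.WriteConcern;

// Insert pre-split batches from a few worker threads. DBCollection is
// safe to share between threads (the driver pools connections).
static void parallelInsert(final DBCollection collection, List<List<DBObject>> chunks) {
    ExecutorService pool = Executors.newFixedThreadPool(2);   // tune to your cores
    for (final List<DBObject> chunk : chunks) {
        pool.submit(new Runnable() {
            public void run() {
                collection.insert(chunk, WriteConcern.NONE);
            }
        });
    }
    pool.shutdown();   // wait for completion before building any indexes
}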

Again I'm a newb... but given my performance running on fairly crap
hardware I'd want to isolate the problem before calling it a day. If
with 32 gigs of RAM, a quad-core CPU and SSDs [available @ <$10 a day]
you can't get 700Mn done in a day, 10gen needs to review their
architecture.


Daniel W

Nov 5, 2011, 3:13:09 PM
to mongodb-user
Bulk insert should help.

Martin Poirier

Nov 5, 2011, 3:57:26 PM
to mongod...@googlegroups.com
Just another simple comment. I'm starting to use mongo for a new project at work. We don't have to insert 700 million rows in a batch, so we are starting with a simple database configuration (one replica set for HA). Eventually we will move to sharding for sure. I'm pretty sure mongo is a good solution for what you want to do (I don't know if it's the best...), but one thing is sure: you cannot expect to get all the power of the mongo system if you don't want to use the features that enable it. In your case you really need to use shards to distribute writes among different machines.

You are willing to do manual partitioning of your data from within your application logic... you are probably like me, a developer more than a sysadmin, so you always try to find a solution with your coding skills. But sharding in mongo does exactly that; it's just a question of configuring the shards, and it's a lot less complicated than you might think (unless you've already read everything about it...).

So let's just say: don't expect a machine gun to kill anything if you don't want to put bullets in it. (Sorry for the gun-related comparison, that was the best one that came to mind, even if I'm a really peaceful guy!)

Martin



Antoine Girbal

Nov 6, 2011, 6:19:43 PM
to mongodb-user
If you want to make the initial import much faster you can try:
- turn off journaling, for a gain of about 30% (or put the journal on
a fast drive). Note that once the data is inserted you should really
turn journaling back on.
- use WriteConcern.NONE so that the db doesn't have to acknowledge
writes.
- use bulk inserts. For example, insert 100 records at a time; this
will reduce processing time.

Besides that, you should try to find out where the bottleneck is.
I would look at CPU, load, and iostat on both server and client.
Sometimes the bottleneck is not where you think... if disk is the
bottleneck then the db won't make much difference.
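
A rough sketch of the last two points with the Java driver (assuming
the CSV rows are already parsed; collection and field names are just
examples):

import java.util.ArrayList;
import java.util.List;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.WriteConcern;

// Insert parsed CSV rows 100 at a time with no write acknowledgement.
static void bulkLoad(DBCollection collection, Iterable<String[]> rows) {
    List<DBObject> batch = new ArrayList<DBObject>(100);
    for (String[] row : rows) {
        batch.add(new BasicDBObject("product_id", Long.parseLong(row[0]))
                .append("store_id", Long.parseLong(row[1])));
        if (batch.size() == 100) {
            collection.insert(batch, WriteConcern.NONE);
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        collection.insert(batch, WriteConcern.NONE);   // flush the last partial batch
    }
}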


Massimo Lusetti

Nov 7, 2011, 3:28:01 AM
to mongod...@googlegroups.com
On Mon, Nov 7, 2011 at 12:19 AM, Antoine Girbal <ant...@10gen.com> wrote:

> If you want to make the initial import much faster you can try:
> [..]

You're right, but the fact is that due to how mongo handles indexes,
with 700 million entries in the index it tries to keep them in memory
(at least if you use it to query), and that can lead to the OS
swapping memory pages to disk.
That for sure kills the performance.

I need to keep 300 million (and counting) SHA2-256 hashes in an index
to check for uniqueness, and I cannot get reliable (performance-wise)
use of this in Mongo.

j.e.frank

Nov 7, 2011, 11:52:13 AM
to mongodb-user
Is there any way to turn off indexing for the _id field, in
expectation that I will index on something else later?


Antoine Girbal

Nov 7, 2011, 1:51:50 PM
to mongodb-user
To remove the index on _id, either:
- use the "autoIndexId: false" option when creating the collection
- drop the index after creation, before inserting docs

Note that virtually every model needs a way to uniquely identify a
document.
If you have such a field, you should store it under _id; it can be of
any type really, even a subobject.
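
Roughly, with the Java driver (collection and field names are just
examples):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;

// Create the collection without the automatic _id index, bulk load it,
// then build the index you actually need afterwards.
static DBCollection createForBulkLoad(DB db) {
    return db.createCollection("products",
            new BasicDBObject("autoIndexId", false));
}
// ... after the load:
// collection.ensureIndex(new BasicDBObject("product_id", 1),
//         new BasicDBObject("unique", true));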

Antoine Girbal

Nov 7, 2011, 1:58:00 PM
to mongodb-user
MongoDB does not try to force the index into memory.
RAM is managed by the OS in a pure LRU fashion, so that it contains
the most recently accessed pages.
If you constantly need to check random values for uniqueness, then
with any db you will need the full index in RAM for good performance.
What can make the difference is how compact the index is, so that it
takes less RAM:
- make sure you store values as small as possible, for example use a
binary field instead of a string representation.
- do not clutter RAM with other indexes (if the hash is unique, use it
as _id so that you don't have an extra index).
- reindex once in a while to compact the index storage.
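
For the first two points, something like this with the Java driver (the
driver maps a byte[] to BSON binary; the method name is made up):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

// Store the raw 32-byte digest as _id instead of a 64-character hex
// string, and rely on the built-in _id index for the uniqueness check -
// no extra index, and roughly half the key size.
static void insertHash(DBCollection collection, byte[] payload)
        throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("SHA-256").digest(payload);
    collection.insert(new BasicDBObject("_id", digest));
}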



j.e.frank

Nov 9, 2011, 3:06:42 PM
to mongodb-user
I've changed the way that I want to represent the data to try to
improve performance. Previously, I had a unique index on product_id
and store_id. All I really need to know is what stores a given
product is available in. So now I am using product_id as the ID, and
a "bitfield" where each bit represents yes or no for a particular
store_id. This will only scale to 64 possible stores, since I'm using
a 64-bit number to store the data, which is not great but is good
enough for my purposes. This reduces the number of documents from
around 700 million to around 30 million. However, the access pattern
when populating the data (from a 700 million row CSV file) now has to
look up the record first, do some "bitwise OR" logic, and then write
the record back. (I tried using the $bit operator to do the OR logic
without loading the record into the middle tier, but couldn't figure
out how to combine the "upsert" with the $bit logic -- it worked fine
if there was already a record in the collection, but it wouldn't
create one).
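
Concretely, each CSV row currently turns into roughly this round trip
(a sketch with the Java driver; "stores" is just what I'm calling the
bitfield here):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

// One CSV row = (productId, storeBit): read the current bitfield, OR in
// the new store's bit, and write the document back as an upsert.
static void mergeStore(DBCollection collection, long productId, int storeBit) {
    DBObject query = new BasicDBObject("_id", productId);
    DBObject existing = collection.findOne(query);
    long bits = (existing == null) ? 0L : ((Number) existing.get("stores")).longValue();
    bits |= (1L << storeBit);
    collection.update(query,
            new BasicDBObject("$set", new BasicDBObject("stores", bits)),
            true,    // upsert: create the document if missing
            false);  // multi
}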

I have not been successful in tuning my MongoDB to get any of my
experiments to have acceptable performance for this dataset. Where
could I find some tips for (e.g.) increasing the RAM that MongoDB
uses, or other tuning parameters that could make a difference? As an
aside, I tried out Kyoto Cabinet and was able to load all the data in
about 2 hours. I recognize that KC is a much narrower tool, but I may
have to go in that direction if I can't get MongoDB to perform better.

Antoine Girbal

Nov 9, 2011, 3:26:55 PM
to mongodb-user
I don't understand why you need to look up a record first, but that
will likely kill performance.
If you are doing pure inserts, you don't need much RAM in theory,
unless you have many indexes and are accessing them randomly.
Could you provide an example document and the indexes you have on it?
Could you provide the output of db.col.stats() after the inserts?
Also, for your earlier attempts, could you provide the time taken for
different settings (journaling on/off, write safety, etc.)?

To understand the bottleneck it may be useful to try the fastest
possible mode:
- turn off journaling
- create the collection and drop the index on _id (or use autoIndexId:
false). Make sure there is no index in db.col.stats()
- run bulk inserts (100 docs) from the client, with write safety off.

What driver do you use? I suggest using a fast driver like the Java one.

Antoine Girbal

Nov 9, 2011, 4:24:03 PM
to mongodb-user
Quick note: if you have an index on productId, and it's an 8-byte
mostly unique value, then the index is 5.6GB at minimum if you just
index on that (700 million entries x 8 bytes).
Counting index overhead it will probably be more like 10GB; the index
tree in mongo is usually about 50% empty.
Unless you use always-increasing values (like ObjectId), in which case
it's closer to 90% filled.
You can actually look at the index size after the insert is finished.
Now, in order to get good performance on inserts with random product
ids, you need the full index in RAM (for any db), so I would say at
least 16GB of RAM.
If instead you use an ObjectId as _id, or you pre-sort your products
by productId, or you don't have an index at all, then the RAM
requirements for fast inserts are much lower.

j.e.frank

Nov 9, 2011, 5:09:49 PM
to mongodb-user
Only in this new data layout do I need to look up a record and then
update it. I need to look up the record first so that I can perform
the bitwise OR on the field that keeps track of which stores the given
product is in. For example, let's say the product with ID 1234 is in
stores X and Y, and that I have mapped these stores to bit positions 0
and 3 respectively in the bitfield. The first record in my data file
(not the database) is the combination of 1234,X. So I look up the
record in the DB with key 1234, find that there is none, and insert it
with bitfield value 1 (i.e. a 1 in the 0th bit position, corresponding
to "yes" in store X). The next record in the datafile has 1234,Y.
This means that I need my bitfield now to have 1's in positions 0 and
3, which makes the value (as a number) 9. So I need to load up the
value and do bitwise OR before I can update it, since many rows in the
datafile map to the same record in the database and they all modify
the bitfield. As I said, I tried to do this with the $bit modifier,
which I could get to work for existing records but not for records
that needed to be created.

I will have to do some more experiments in the other data layout,
where I don't use a bitfield, because that one should be able to do
lightning-fast inserts if I turn off journaling and indexing. I'm
just concerned about then doing an "ensureIndex" on that collection
taking forever.

Martin Poirier

Nov 9, 2011, 7:51:10 PM
to mongod...@googlegroups.com
If you want to support an unlimited number of stores, you could store it like this:

{
  product_id: (int),
  store_ids: [(int), ...]
  ...
}

From your client you just keep track of whether you have inserted the product, and then do a $addToSet: { store_ids: (int) } for each store you get.

Of course you'll probably need to add an index on store_ids if you are looking up by store.

You could also have two collections: one just containing the stores, each store document holding all the product_ids that refer to the products it carries.

No limit on stores, and you'll probably end up with a lot less index data, since with your design you would probably have had to put an index on store_id anyway.
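
Roughly, with the Java driver (names are just examples):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

// One upsert per (product, store) pair; $addToSet creates the array if
// needed and keeps store_ids free of duplicates.
static void addStore(DBCollection collection, long productId, int storeId) {
    collection.update(
            new BasicDBObject("_id", productId),
            new BasicDBObject("$addToSet", new BasicDBObject("store_ids", storeId)),
            true,    // upsert
            false);  // multi
}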

Martin

2011/11/9 j.e.frank <j.e....@gmail.com>

Jason Nethercott

Nov 9, 2011, 10:05:07 PM
to mongodb-user
If you can, order your 700 million row source set by product. Then in
your load app, cycle through the source records while the product
stays the same to build your bitset. Write that record to Mongo, and
then do the next product. Then you are down to only 30m inserts... no
finds required.
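
Roughly (a sketch, assuming the feed really is sorted by product and
you already have the store-to-bit map in memory):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

// Assumes lines look like "productId,storeId" and the file is sorted by
// productId. storeBit maps each store id to its 0-63 bit position.
static void loadSorted(String file, DBCollection collection, Map<String, Integer> storeBit)
        throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(file));
    long currentProduct = -1;
    long bits = 0L;
    String line;
    while ((line = in.readLine()) != null) {
        String[] parts = line.split(",");
        long productId = Long.parseLong(parts[0]);
        if (productId != currentProduct && currentProduct != -1) {
            // product changed: flush the accumulated bitfield
            collection.insert(new BasicDBObject("_id", currentProduct).append("stores", bits));
            bits = 0L;
        }
        currentProduct = productId;
        bits |= (1L << storeBit.get(parts[1]));
    }
    if (currentProduct != -1) {
        collection.insert(new BasicDBObject("_id", currentProduct).append("stores", bits));
    }
    in.close();
}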

To make it even faster, split the work: do the file reading in the
main thread, and then spawn threads to convert each product's list of
stores into a bitset and do the Mongo insert in those threads.

Just some ideas... I work on high-volume, complex ETLs all the time.
Threads are your friend. Minimize/cache reads where possible too... I
assume your store-to-bit mapping is an in-memory hash already?

j.e.frank

Nov 10, 2011, 10:07:42 AM
to mongodb-user
Unfortunately the data feed does not come ordered by product, and I
don't have control of it. It is ordered by store. I could possibly
do some kind of pre-work to transform the data feed but my thinking
was that if I'm going to do all the work to sort by product, why not
figure out each product's list of stores while I'm at it. It's hard
to see how it would be more efficient to make 2 steps, but perhaps I
am mistaken.

Yes, I have the mapping from store_id to bitmask as an in-memory data
structure. That part is working great. The problem is just
interacting with the database in an efficient way.

Jason Nethercott

Nov 10, 2011, 12:17:55 PM
to mongodb-user
The key is to keep the persistence layer (database) out of it until
the end. Here is the algorithm I would use... and it would only be a
couple dozen lines of code, but require a large JVM footprint...
around 85mb if my math is right. And, of course the speed of this
process is directly related to the core speed of the processor this is
running on. I.E. don't run it on your old XT. :)

- Assign a 0-63 number to each store.
- Assign a 0-30 million(?) number to each product.
- Create a linked hashset of 64 bitsets (one per store).
- Cycle through your 700 million row source file:
  - get the number for the product;
  - get the number for the store, and grab its bitset;
  - set the product-number bit in that store's bitset.
- When finished, loop through your 30 million products:
  - loop through your 64 store bitsets;
  - start with your integer at zero, and logical-OR in the bit for
    each of the 64 stores that has the bit set for that product;
  - save the result to the database.

The tricky part will be the number for each product. If you have auto-
inc type values for each product then you are good... otherwise it
might be tough.
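
In Java it would look something like this (a sketch; the product/store
numbering maps and the DBCollection are the parts you have to supply):

import java.util.BitSet;
import java.util.Map;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

// Pass 1: one BitSet per store, one bit per product number.
// Pass 2: fold the 64 per-store bitsets into one 64-bit value per product.
static void load(Iterable<String[]> rows,            // streamed (productId, storeId) pairs
                 Map<String, Integer> productNo,     // product id -> 0..~30M
                 Map<String, Integer> storeNo,       // store id -> 0..63
                 int numProducts,
                 DBCollection collection) {
    BitSet[] perStore = new BitSet[64];
    for (int s = 0; s < 64; s++) {
        perStore[s] = new BitSet(numProducts);
    }
    for (String[] row : rows) {
        perStore[storeNo.get(row[1])].set(productNo.get(row[0]));
    }
    for (int p = 0; p < numProducts; p++) {
        long field = 0L;
        for (int s = 0; s < 64; s++) {
            if (perStore[s].get(p)) {
                field |= (1L << s);
            }
        }
        if (field != 0L) {
            // _id here is the product number; map back to the real product id as needed
            collection.insert(new BasicDBObject("_id", p).append("stores", field));
        }
    }
}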

NPSF3000

Nov 6, 2011, 2:29:20 AM
to mongodb-user
Just found an interesting link:

http://www.royans.net/arch/mongodb-migration-from-mysql-at-wordnik/

5Bn rows, non-trivial data, 32GB server, used bulk inserts...
SUSTAINED 100,000 inserts/s - under a day to do the import.

So yeah... it's doable.
