MongoDB for a write-intensive application (100K concurrent inserts)


Chris Howells

Feb 17, 2011, 4:58:23 AM2/17/11
to mongodb-user
Hi,

I need to do 100k concurrent inserts, from about forty clients, for a
period of a few hours. After that period has finished, the data will
not be added to or updated again. Another, separate job will run some
time later and act in a similar manner, inserting data into another
collection. Each insert is around 30KB of data, which MongoDB seems to
turn into around 80KB of data on disk. Ideally the forty clients would
each insert data into the same collection.

80KB x 100K works out at some 800MB a second of writes, so we need to
spread the write load over multiple machines, which means sharding.

I've read the wiki, "Scaling MongoDB", last week's thread "Scaling
mongo with tons of writes" [1] and the associated bug report [2], and
I'm now thoroughly confused about what Mongo can currently achieve
with concurrent write scaling of a collection. If I had, say, 10 Mongo
servers to shard over, and 40 clients hitting those servers with 100K
concurrent inserts of 80KB of data, to the same collection, would
Mongo currently be able to cope? (mongos and config server
configuration TBD.)

[1] - http://groups.google.com/group/mongodb-user/browse_thread/thread/e8eb68de4b4c55e5?hl=en#
[2] - http://jira.mongodb.org/browse/SERVER-939

Many thanks.

An addendum: I wrote a test application using the C++ libraries and
tested it with Mongo 1.6.5 on a spare server that I had sitting around
(PowerEdge R710, dual quad-core 2.4GHz Xeon, 32GB RAM, 6 x 146GB 15K
SAS disks in RAID 0 on an LSI RAID controller). Doing 100k inserts
takes around 7-15 seconds, even if I've pre-allocated the data files
(non-capped collection). iostat shows that I'm getting around
500MB/sec on writes, which actually doesn't seem all that fast for six
15K SAS disks in RAID 0, but I haven't done any tuning other than
having Mongo write to ext4 or XFS mounted with 'noatime' and
'nodiratime' on an out-of-the-box CentOS 5.5 64-bit install. Clearly a
single moderately specced server isn't fast enough, even with a disk
configuration (RAID 0) that I wouldn't actually use in production.

Nat

Feb 17, 2011, 8:50:51 AM2/17/11
to mongodb-user
- What kind of network card do you use? At 500MB/s, even 10Gbps
networking would struggle to sustain that speed.
- You probably don't need that many clients to insert data. You should
try to insert in batches instead. It will use less bandwidth and CPU
to insert the same amount of data.
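
For illustration only, here is a minimal sketch of what client-side
batching could look like with the legacy C++ driver, assuming your
driver version has the vector-insert overload; the mongos address and
the batch size of 100 are made-up values, and the field name
"ResultID" is borrowed from later in this thread.

#include <vector>
#include "client/dbclient.h"   // header path varies by driver version
using namespace mongo;

// Flush a batch of documents in one wire message instead of one message per document.
void flushBatch(DBClientConnection &c, std::vector<BSONObj> &batch) {
    if (batch.empty()) return;
    c.insert("chris.test", batch);
    batch.clear();
}

int main() {
    DBClientConnection c;
    c.connect("localhost:30000");          // hypothetical mongos address

    std::vector<BSONObj> batch;
    for (int i = 0; i < 100000; ++i) {
        BSONObjBuilder b;
        b.genOID();
        b.append("ResultID", i);
        batch.push_back(b.obj());
        if (batch.size() >= 100)           // flush every 100 documents (arbitrary)
            flushBatch(c, batch);
    }
    flushBatch(c, batch);                  // flush any remainder
    return 0;
}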

On Feb 17, 5:58 pm, Chris Howells <chris+mo...@chrishowells.co.uk>
wrote:
> Hi,
>
> I need to do 100k concurrent inserts, from about forty clients, for a
> period of a few hours. After the few hours period has finished, that
> data will not be inserted to or updated again. Another disparate job
> will run some time later and act in a similar manner, inserting data
> into another collection. Each insert is around 30KB of data, which
> MongoDB seems to turn into around 80KB of data on disk.  Ideally the
> forty clients would each insert data into the same collection.
>
> 80KB X 100K is some 800MB a second, so we need to spread the write
> load over multiple machines, so we need sharding.
>
> I've read the Wiki, "Scaling MongoDB", the thread from last week
> "Scaling mongo with tons of writes" [1] and the associated bug report
> ([2]) and I'm thoroughly confused now about what Mongo can currently
> achieve with concurrent write scaling of a collection. If I had say 10
> Mongo servers to do sharding over, and 40 clients hitting those
> servers with 100K concurrent inserts of 80KB of data, to the same
> collection, could Mongo currently be able to cope? (mongos and config
> server configuration TBD).
>
> [1] - http://groups.google.com/group/mongodb-user/browse_thread/thread/e8eb...
> [2] - http://jira.mongodb.org/browse/SERVER-939

Chris Howells

Feb 17, 2011, 9:21:06 AM2/17/11
to mongodb-user
Hi Nat,

On Feb 17, 1:50 pm, Nat <nat.lu...@gmail.com> wrote:
> - What kind of network card do you use? With 500mb/s, even 10GBps is
> quite difficult to reach that speed.

We'll be using standard gigabit ethernet, probably with 802.3ad LACP
(link aggregation).

We will have 40 machines generating the data. For 800MB/sec that's
only (800/40) = 20MB per second each.
If we have, say, 10 MongoDB servers, that's (800/10) = 80MB per second
each.

I'm not concerned about this aspect, just about how MongoDB copes with
sharding writes to collections.

> - You probably don't need that many clients to insert data. You should
> try to insert them in batch instead. It will use less bandwidth and
> cpu to insert the same amount of data.

We need 40 clients because they are processing data, something which
is CPU bound on the clients, and generating results which need to be
stored in a database. Batch inserting could be done, but would
considerably increase application complexity.

I'm reading through MongoDB: The Definitive Guide, which talks about
sharding a collection on page 149: "Now when we start adding data, it
will automatically distribute itself across our shards". But bug
SERVER-939 says "If you create a new database, the most available
shard at that time will be picked to host that entire database".

Therefore, I'm confused, as this seems to be a direct contradiction.

Thanks.

Nat

Feb 17, 2011, 9:42:39 AM2/17/11
to mongodb-user


On Feb 17, 10:21 pm, Chris Howells <chris+mo...@chrishowells.co.uk>
wrote:
> Hi Nat,
>
> On Feb 17, 1:50 pm, Nat <nat.lu...@gmail.com> wrote:
>
> > - What kind of network card do you use? With 500mb/s, even 10GBps is
> > quite difficult to reach that speed.
>
> We'll be using standard gigabit ethernet, probably with 802.3ad LACP
> (line aggregation).
>
> We will have 40 machines generating the data. For 800MB/sec that's
> only (800/40) = 20MB per second each.
> If we have say 10 mongodb servers that's (800/10) = 80MB per second
> each.
You might not be able to reach this speed unless you put them on a
separate network. It's also worth mentioning that MongoDB's overhead
is usually not that high; 30KB -> 80KB seems to be on the high side.
What kind of data do you keep? A lot of array data, probably?

> I'm not concerned about this aspect, just about how MongoDB copes with
> sharding writes to collections.
>
> > - You probably don't need that many clients to insert data. You should
> > try to insert them in batch instead. It will use less bandwidth and
> > cpu to insert the same amount of data.
>
> We need 40 clients because they are processing data, something which
> is CPU bound on the clients, and generating results which need to be
> stored in a database. Batch inserting could be, but will considerably
> increase application complexity.
>
> I'm reading through MongoDB The Definitive Guide which talks about
> sharding a collection on page 149: "Now when we start adding data, it
> will automatically distribute itself across our shards". But bug
> SERVER-939 says "If you create a new database, the most available
> shard at that time will be picked to host that entire database".

This JIRA issue is about distributing small collections across shards.
Currently (1.6.x) only large collections can benefit from sharding.
Check out the more detailed discussion at http://groups.google.com/group/mongodb-user/browse_thread/thread/c530153c9c176c8f

Chris Howells

Feb 17, 2011, 10:38:37 AM2/17/11
to mongodb-user
Hi Nat,

On Feb 17, 2:42 pm, Nat <nat.lu...@gmail.com> wrote:
> You might not be able to reach this speed unless you put them in
> separated network.

The switches that we use have a 32Gbps backplane. On a 48-port switch
this should be enough to have 32 ports constantly maxed out at 1Gbps
each, so I don't think it will be too much of a problem.

> It's also worth mentioning that the overhead is not
> that high. 30k -> 80k seems to be on a high side. What kind of data do
> you keep? A lot of array data probably?

I'm sorry, I think I've miscalculated the 30KB.

Each document has about 15 attributes and values, a mixture of text
and ints. On top of that, each document has 60 sub-documents of the
same shape. The parent object contains about 20KB of data in addition.

It's probably easier to show with some code; here's a snippet of the
C++ application.

http://chrishowells.co.uk/sample.txt

(Where 'request' is about 450 bytes of data, 'response' about 450
bytes of data, and 'data' about 20KB of data. Note that I've shortened
the attribute names down to just the first character; they're longer
in reality.)

So actually 80KB sounds reasonable.
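
Since the linked sample.txt isn't reproduced in the thread, here is a
rough, hypothetical sketch of how a document of that shape might be
built with the C++ driver; the field names ("r", "s", "n", "data",
"subs") and the exact sub-document layout are guesses, not the real
schema.

#include <string>
#include "client/dbclient.h"   // header path varies by driver version
using namespace mongo;

BSONObj buildResult(const std::string &request,    // ~450 bytes
                    const std::string &response,   // ~450 bytes
                    const std::string &data) {     // ~20KB
    BSONArrayBuilder subs;
    for (int i = 0; i < 60; ++i) {                 // the ~60 sub-documents described above
        subs.append(BSON("r" << request << "s" << response << "n" << i));
    }
    BSONObjBuilder doc;
    doc.genOID();
    doc.append("data", data);                      // ~20KB on the parent object
    doc.appendArray("subs", subs.arr());
    // 60 x ~900 bytes of sub-documents plus the ~20KB parent payload, plus
    // BSON field-name overhead, lands in the same ballpark as 80KB on disk.
    return doc.obj();
}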

> This JIRA issue is about distributing small collections across shards.
> Currently (1.6.x) only large collections can benefit from sharding.
> Check out the more detailed discussion at http://groups.google.com/group/mongodb-user/browse_thread/thread/c530...

Ah right, thank you, I think that makes sense.

I'll see if I can get something working with sharding writes to a
large collection, then.

Many thanks.

Scott Hernandez

Feb 17, 2011, 11:50:17 AM2/17/11
to mongod...@googlegroups.com

It is a little confusing. The real issue here is the time-frame you
are talking about. At first there is only one chunk, so all the data
will be written there; a chunk only ever resides on a single shard
(server). Then, over time, and during the inserts, the other shards
will get data distributed to them and inserts will be spread out
better, without having to rebalance (rebalancing being when existing
data is moved between shards).

You can make sure writes initially get distributed better by
pre-splitting chunks and making sure those chunks are assigned to
different shards before you ever insert data. That way, when the first
data comes in, it does not all go to a single chunk on a single shard
(server).

http://www.mongodb.org/display/DOCS/Splitting+Chunks#Pre-Splitting
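
As a hedged sketch only: the pre-split and chunk moves can also be
scripted through the C++ driver's runCommand instead of the mongo
shell. The namespace, shard names and key values below mirror the ones
used later in this thread; everything else is an assumption, and the
connection is assumed to point at a mongos.

#include "client/dbclient.h"   // header path varies by driver version
using namespace mongo;

// Pre-split chris.test on the "shard" field and alternate the resulting
// chunks between two shards, before any data is inserted.
void preSplit(DBClientConnection &c) {
    BSONObj info;
    for (int k = 1; k <= 9; ++k) {
        c.runCommand("admin", BSON("split" << "chris.test"
                                           << "middle" << BSON("shard" << k)), info);
        const char *target = (k % 2) ? "shard0000" : "shard0001";
        c.runCommand("admin", BSON("moveChunk" << "chris.test"
                                               << "find" << BSON("shard" << k)
                                               << "to" << target), info);
    }
}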

All of this is dependent on how you choose your shard key.
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key


Chris Howells

Feb 18, 2011, 8:43:30 AM2/18/11
to mongodb-user
Hi Scott,

On Feb 17, 4:50 pm, Scott Hernandez <scotthernan...@gmail.com> wrote:
> You can make sure writes initially get distributed better by
> pre-splitting chunks and making sure those chunks are assigned to
> different shards before you ever insert data. That way when the first
> data come it not start with a single chunk going to a single shard
> (server).

Many thanks for the reply. I understand a lot better now. So, I've
been testing on a PowerEdge R710 (32GB RAM, 8 x 2.4GHz cores) with 2 x
mongod, 3 x config servers and 2 x mongos. /data is XFS, mounted
noatime,nodiratime, on a 6 x 15K SAS RAID 0 array.

I have a C++ stress test app which I run twice, each instance
connecting to one mongos. I am sharding on a key which has 10 possible
values (0-9), with 5 chunks on each shard. But I am seeing it take
over 10 seconds to do 100K inserts, which is around 10X slower than
writing just to a single mongod on the same box. This seems wrong.
Should it be like this, or am I doing something wrong? There is barely
any disk IO going on, and each process is maxing out at only about 60%
CPU - and there are enough cores for each process to have one each.

mongod --dbpath /data/config1/ --port 20000
mongod --dbpath /data/config2/ --port 20001
mongod --dbpath /data/config3/ --port 20002
mongos --port 30000 --configdb localhost:20000,localhost:20001,localhost:20002
mongos --port 30001 --configdb localhost:20000,localhost:20001,localhost:20002
mongod --dbpath /data/shard1/ --port 10000
mongod --dbpath /data/shard2/ --port 10001
mongo localhost:30000/admin

The C++ app does this.

const char *data =
"mongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotestmongotest";
for (int i = 0; i < 10000; ++i) {
    for (int j = 0; j < 10; ++j) {        // 10000 x 10 = 100K inserts
        BSONObjBuilder page;
        page.genOID();
        page.append("ResultID", 100);
        page.append("StartDateTime", 33);
        page.append("shard", j);          // shard key: only 10 distinct values (0-9)
        page.append("data", data);        // the repeated "mongotest" filler payload
        BSONObj pageO = page.obj();
        c.insert("chris.test", pageO);
    }
}

(complete source at http://chrishowells.co.uk/people.txt)

[root@man-dt-01850 ~]# time ./people

real 0m11.599s
user 0m0.328s
sys 0m0.153s




I set up sharding like this:


> db.runCommand({addshard : "localhost:10000", allowLocal: true})
{ "shardAdded" : "shard0000", "ok" : 1 }
> db.runCommand({addshard : "localhost:10001", allowLocal: true})
{ "shardAdded" : "shard0001", "ok" : 1 }
> use admin;
switched to db admin
> db.runCommand({"enablesharding" : "chris"})
{ "ok" : 1 }
> db.runCommand({"shardcollection" : "chris.test", "key" : {"shard" : 1}})
{ "collectionsharded" : "chris.test", "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 1 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 2 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 3 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 4 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 5 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 6 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 7 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 8 }})
{ "ok" : 1 }
> db.runCommand({ split : "chris.test" , middle : { shard : 9 }})
{ "ok" : 1 }
> use config;
switched to db config
> db.shards.find()
{ "_id" : "shard0000", "host" : "localhost:10000" }
{ "_id" : "shard0001", "host" : "localhost:10001" }
> db.chunks.find()
{ "_id" : "chris.test-shard_MinKey", "lastmod" : { "t" : 2000, "i" : 0 }, "ns" : "chris.test", "min" : { "shard" : { $minKey : 1 } }, "max" : { "shard" : 1 }, "shard" : "shard0001" }
{ "_id" : "chris.test-shard_1.0", "lastmod" : { "t" : 3000, "i" : 0 }, "ns" : "chris.test", "min" : { "shard" : 1 }, "max" : { "shard" : 2 }, "shard" : "shard0001" }
{ "_id" : "chris.test-shard_2.0", "lastmod" : { "t" : 4000, "i" : 0 }, "ns" : "chris.test", "min" : { "shard" : 2 }, "max" : { "shard" : 3 }, "shard" : "shard0001" }
{ "_id" : "chris.test-shard_3.0", "lastmod" : { "t" : 5000, "i" : 0 }, "ns" : "chris.test", "min" : { "shard" : 3 }, "max" : { "shard" : 4 }, "shard" : "shard0001" }
{ "_id" : "chris.test-shard_4.0", "lastmod" : { "t" : 6000, "i" : 0 }, "ns" : "chris.test", "min" : { "shard" : 4 }, "max" : { "shard" : 5 }, "shard" : "shard0001" }
{ "_id" : "chris.test-shard_5.0", "lastmod" : { "t" : 6000, "i" : 1 }, "ns" : "chris.test", "min" : { "shard" : 5 }, "max" : { "shard" : 6 }, "shard" : "shard0000" }
{ "_id" : "chris.test-shard_6.0", "lastmod" : { "t" : 1000, "i" : 13 }, "ns" : "chris.test", "min" : { "shard" : 6 }, "max" : { "shard" : 7 }, "shard" : "shard0000" }
{ "_id" : "chris.test-shard_7.0", "lastmod" : { "t" : 2000, "i" : 2 }, "ns" : "chris.test", "min" : { "shard" : 7 }, "max" : { "shard" : 8 }, "shard" : "shard0000" }
{ "_id" : "chris.test-shard_8.0", "lastmod" : { "t" : 4000, "i" : 2 }, "ns" : "chris.test", "min" : { "shard" : 8 }, "max" : { "shard" : 9 }, "shard" : "shard0000" }
{ "_id" : "chris.test-shard_9.0", "lastmod" : { "t" : 4000, "i" : 3 }, "ns" : "chris.test", "min" : { "shard" : 9 }, "max" : { "shard" : { $maxKey : 1 } }, "shard" : "shard0000" }

I see messages like this from the mongods:

Fri Feb 18 13:03:09 [conn5] request split points lookup for chunk chris.test { : 8.0 } -->> { : 9.0 }
Fri Feb 18 13:03:09 [conn5] warning: chunk is larger than 33554432 bytes because of key { shard: 8 }
Fri Feb 18 13:03:09 [conn5] Finding the split vector for chris.test over { shard: 1.0 } keyCount: 8924 numSplits: 0 took 229 ms.
Fri Feb 18 13:03:09 [conn5] query admin.$cmd ntoreturn:1 command: { splitVector: "chris.test", keyPattern: { shard: 1.0 }, min: { shard: 8.0 }, max: { shard: 9.0 }, maxChunkSizeBytes: 33554432, maxSplitPoints: 2, maxChunkObjects: 250000 } reslen:69 229ms
Fri Feb 18 13:03:10 [conn5] request split points lookup for chunk chris.test { : 5.0 } -->> { : 6.0 }
Fri Feb 18 13:03:11 [conn5] warning: chunk is larger than 33554432 bytes because of key { shard: 5 }
Fri Feb 18 13:03:11 [conn5] Finding the split vector for chris.test over { shard: 1.0 } keyCount: 8924 numSplits: 0 took 231 ms.
Fri Feb 18 13:03:11 [conn5] query admin.$cmd ntoreturn:1 command: { splitVector: "chris.test", keyPattern: { shard: 1.0 }, min: { shard: 5.0 }, max: { shard: 6.0 }, maxChunkSizeBytes: 33554432, maxSplitPoints: 2, maxChunkObjects: 250000 } reslen:69 231ms


After a few runs:

> use chris
switched to db chris
> db.test.find().count()
3800000
> db.printShardingStatus()
--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
    { "_id" : "shard0000", "host" : "localhost:10000" }
    { "_id" : "shard0001", "host" : "localhost:10001" }
databases:
    { "_id" : "admin", "partitioned" : false, "primary" : "config" }
    { "_id" : "chris", "partitioned" : true, "primary" : "shard0000" }
        chris.test chunks:
            shard0001    5
            shard0000    5
            { "shard" : { $minKey : 1 } } -->> { "shard" : 1 } on : shard0001 { "t" : 2000, "i" : 0 }
            { "shard" : 1 } -->> { "shard" : 2 } on : shard0001 { "t" : 3000, "i" : 0 }
            { "shard" : 2 } -->> { "shard" : 3 } on : shard0001 { "t" : 4000, "i" : 0 }
            { "shard" : 3 } -->> { "shard" : 4 } on : shard0001 { "t" : 5000, "i" : 0 }
            { "shard" : 4 } -->> { "shard" : 5 } on : shard0001 { "t" : 6000, "i" : 0 }
            { "shard" : 5 } -->> { "shard" : 6 } on : shard0000 { "t" : 6000, "i" : 1 }
            { "shard" : 6 } -->> { "shard" : 7 } on : shard0000 { "t" : 1000, "i" : 13 }
            { "shard" : 7 } -->> { "shard" : 8 } on : shard0000 { "t" : 2000, "i" : 2 }
            { "shard" : 8 } -->> { "shard" : 9 } on : shard0000 { "t" : 4000, "i" : 2 }
            { "shard" : 9 } -->> { "shard" : { $maxKey : 1 } } on : shard0000 { "t" : 4000, "i" : 3 }


Thanks for any clues.

Nat

Feb 18, 2011, 9:03:05 AM2/18/11
to mongodb-user
If your chunks can grow larger than the specified chunk size, you need
to choose a more unique shard key. Otherwise the balancer will cause a
constant load, since it cannot split the data but will keep trying to do so.

On Feb 18, 9:43 pm, Chris Howells <chris+mo...@chrishowells.co.uk>
wrote:
> (complete source at http://chrishowells.co.uk/people.txt)
> Thanks ...
>

Chris Howells

Feb 18, 2011, 10:52:55 AM2/18/11
to mongodb-user
Hi,

On Feb 18, 2:03 pm, Nat <nat.lu...@gmail.com> wrote:
> If your chunks can grow larger than the specified chunk size, you need
> to choose a more unique shard key. Otherwise the balancer will cause a
> constant load, since it cannot split the data but will keep trying to do so.

OK, thanks. So what I'm trying to do is pre-split some chunks and
manually move them to the desired shards so that I can stress test
this hardware.

I'm using my own app to stress test it, so I can generate data in
whatever way is going to shard well. I've changed it to shard on a
value which has 1000 possible values (rather than 10), and I'm trying
to pre-split it into chunks of 100 as below.

Should this work better?

Should I increase the specified chunk size? I am expecting a
collection to be about 40GB in total. If I have 10 mongodb servers
that'll be about 4GB each, all data inserted into the same collection.

db.runCommand({ split : "chris.test" , middle : { shard : 100 }})
db.runCommand({ split : "chris.test" , middle : { shard : 200 }})
db.runCommand({ split : "chris.test" , middle : { shard : 300 }})
db.runCommand({ split : "chris.test" , middle : { shard : 400 }})
db.runCommand({ split : "chris.test" , middle : { shard : 500 }})
db.runCommand({ split : "chris.test" , middle : { shard : 600 }})
db.runCommand({ split : "chris.test" , middle : { shard : 700 }})
db.runCommand({ split : "chris.test" , middle : { shard : 800 }})
db.runCommand({ split : "chris.test" , middle : { shard : 900 }})

db.runCommand({moveChunk: "chris.test", find: { "shard" : 100 }, to : "shard0000"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 200 }, to : "shard0001"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 300 }, to : "shard0000"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 400 }, to : "shard0001"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 500 }, to : "shard0000"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 600 }, to : "shard0001"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 700 }, to : "shard0000"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 800 }, to : "shard0001"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 900 }, to : "shard0000"})
db.runCommand({moveChunk: "chris.test", find: { "shard" : 1000 }, to : "shard0001"})

Here's what I've modified the C++ app to:

for (int i = 0; i < 1000; ++i)
{
    int k = 0;
    for (int j = 0; j < 1000; ++j)
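
The snippet above is cut off in the archived message; as a guess only,
based on the earlier version of the loop, the rest presumably looks
something like the following (what the local variable k was for isn't
shown, so it is ignored here):

for (int i = 0; i < 1000; ++i)
{
    for (int j = 0; j < 1000; ++j)
    {
        BSONObjBuilder page;
        page.genOID();
        page.append("ResultID", 100);
        page.append("StartDateTime", 33);
        page.append("shard", j);          // now 1000 distinct shard key values (0-999)
        page.append("data", data);
        BSONObj pageO = page.obj();
        c.insert("chris.test", pageO);
    }
}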

Jayesh

Feb 18, 2011, 11:12:41 AM2/18/11
to mongodb-user
Hi Chris,

Given what you have described, this should be quite easy to do - and
even with a single server, without sharding!

However, let me ask a few more questions:

- Is the question whether MongoDB can do it at all, or whether it can
do it within some constraints (e.g. time, space, throughput, response
time, etc.)?
- I am specifically interested in the time and/or throughput
constraints - any expansion on that would help.
- Is it safe to assume that you have access to a sufficient amount of
decently spec'ed hardware (e.g. servers, storage and network like the
ones described in your post)?
- Is it safe to assume that you have a "unique/primary key", or can
generate one on the client side as part of your data insert?

Assume that you want to insert the 100k documents - each of 80KB
(let's say 100KB for ease of computation!).

This brings your *total* data volume to 100,000 documents x 100,000
bytes = 10,000,000,000 bytes = 10 GB.

If you want a fast, turbo-charged response time for your clients, I
would architect the data store solution by inserting a "memcached"
layer in between and making the data load a two-step process. So
here's how your "architecture" would look:

client ------ memcached ------ mongodb

In the first step, I would have the clients insert the data record
into memcached and insert only the memcached "key" into MongoDB. This
will give you near-wire-speed throughput all the way through,
especially with the amount of RAM that you have on your server.

In the next step, I would siphon off the data records from memcached
into MongoDB by sequentially reading through the keys stored in the
first step.
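
A minimal sketch of step one of this two-step load, assuming
libmemcached and the legacy mongo C++ driver are available; the
collection name "chris.pending", the key scheme, and the
addresses/ports are all made up for illustration.

#include <string>
#include <libmemcached/memcached.h>
#include "client/dbclient.h"   // header path varies by driver version
using namespace mongo;

int main() {
    // Connect to memcached and to MongoDB (via mongos).
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "localhost", 11211);

    DBClientConnection c;
    c.connect("localhost:30000");

    // Step 1: put the large payload in memcached under a generated key,
    // and record only that key in MongoDB.
    std::string payload(80 * 1024, 'x');       // stand-in for the ~80KB record
    std::string cacheKey = "result:12345";     // hypothetical key scheme

    if (memcached_set(memc, cacheKey.c_str(), cacheKey.size(),
                      payload.c_str(), payload.size(),
                      (time_t)0, (uint32_t)0) == MEMCACHED_SUCCESS) {
        BSONObjBuilder b;
        b.genOID();
        b.append("cache_key", cacheKey);
        c.insert("chris.pending", b.obj());
    }

    // Step 2 would run later: walk chris.pending, fetch each payload from
    // memcached with memcached_get, and insert the full document into the
    // final collection.
    memcached_free(memc);
    return 0;
}

An 80KB value also fits comfortably under memcached's default 1MB item
size limit.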

I have been doing this kind of stuff for quite a while, on virtual
machines with much smaller hardware than what you have. My throughput
has been quite decent/acceptable, although my record size is quite
small. While your mileage may vary, I expect decent
performance/throughput. Also, all my work has been in PHP.

Finally, if you create a reasonable number of well-designed indexes
(i.e. not too many, not too big, and applicable to your queries), your
read performance should also be pretty good. In fact, I would read
from memcached if you will be retrieving data based on the "key" only.

Even if you read from MongoDB, your performance should be good because
you can most probably fit ALL your data and indexes in memory!

Hope that helps!


-- Jayesh


On Feb 17, 3:58 am, Chris Howells <chris+mo...@chrishowells.co.uk>
wrote:
> Hi,
>
> I need to do 100k concurrent inserts, from about forty clients, for a
> period of a few hours. After the few hours period has finished, that
> data will not be inserted to or updated again. Another disparate job
> will run some time later and act in a similar manner, inserting data
> into another collection. Each insert is around 30KB of data, which
> MongoDB seems to turn into around 80KB of data on disk.  Ideally the
> forty clients would each insert data into the same collection.
>
> 80KB X 100K is some 800MB a second, so we need to spread the write
> load over multiple machines, so we need sharding.
>
> I've read the Wiki, "Scaling MongoDB", the thread from last week
> "Scaling mongo with tons of writes" [1] and the associated bug report
> ([2]) and I'm thoroughly confused now about what Mongo can currently
> achieve with concurrent write scaling of a collection. If I had say 10
> Mongo servers to do sharding over, and 40 clients hitting those
> servers with 100K concurrent inserts of 80KB of data, to the same
> collection, could Mongo currently be able to cope? (mongos and config
> server configuration TBD).
>
> [1] - http://groups.google.com/group/mongodb-user/browse_thread/thread/e8eb...
> [2] - http://jira.mongodb.org/browse/SERVER-939

Eliot Horowitz

Feb 19, 2011, 12:06:04 AM2/19/11
to mongod...@googlegroups.com
I think your plan looks good.
