sharding and high speed writes

John May

unread,

Aug 2, 2010, 5:09:46 PM8/2/10

to mongodb-user

Hello --

I'm evaluating mongodb for an application where I need to store data
from a live network connection at very high rates. I expect to query
only a small fraction of this data, and as I fill up the available
storage, I will evict the oldest entries to make room for new ones.
The mongodb capped collection looks very well suited to my needs. In
fact, with a single server instance I can ingest at about 50 MB/second
(to a memory mapped file; presumably this will be slower once the data
starts spilling out of main memory).
Unfortunately, this isn't quite fast enough, so I'm looking at
sharding. Even more unfortunately, sharding and capped collections are
not compatible. Too bad, because not only do I lose the high
performance that comes from preallocating storage, but I also lose the
automatic eviction of old data. So my first question is, Will it ever
be possible to shard capped collections?
Now, when I tried to shard over four servers using a non-capped
collection, I didn't see any performance gain compared to running a
single-server instance. Both run at about 25 MB/second. Is this
expected behavior? I have the shards on for separate cluster nodes,
each writing to a local disk. There is a single configuration server
on another node, and the routing process is on yet another node with
the client that is collecting the data. Is that a reasonable
arrangement? I suppose I could do the sharding myself and split my
feed over N distinct instance of mongodb, each writing to a capped
collection. But then I would have to implement the distributed query
infrastructure myself. Is there another way to get the write
performance I need, which is roughly 100 MB/second? I can throw more
nodes and disks at the problem, if necessary.

Eliot Horowitz

unread,

Aug 2, 2010, 5:22:20 PM8/2/10

to mongod...@googlegroups.com

What's the bottleneck sharded/not sharded?
Can you run "iostat -x 2"

Also - you seemed to say you had 4 shards, each going at 25MB/s
totaling 50, so either I missed something or you mis typed.

> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
>
>

John May

unread,

Aug 2, 2010, 6:11:41 PM8/2/10

to mongodb-user

OK, iostat pointed out something I suspected but hadn't actually
verified: only one shard is being written at a time. That's because
I'm sharding on "_id" and probably the automatically generated keys
are adjacent. So that explains the lack of scaling. One solution I've
used for this kind of problem in the past is to generate a hash on the
timestamp and use that as a key to distribute the records more evenly
across the shards. I'll try that...

Just to clarify my earlier message, I was describing three separate
experiments:

Capped collection, no sharding: 50 MB/sec
Non-capped collection, no sharding (single-server instance of
mongodb): 25 MB/sec
Non-capped collection, 4 shards: 25 MB/sec total (actually a little
slower...)

But even if I can get this to scale, I'm still wondering about the
possibility of a capped, sharded collection.

Eliot Horowitz

unread,

Aug 2, 2010, 10:58:35 PM8/2/10

to mongod...@googlegroups.com

Ah ok - makes sense.
And yes - for high write throughput you should use a non-sequential shard key.

About a sharded capped collection - the semantics of this are very confusing.
The way capped collections work is that they are a sequential based on insert.
With sharding, lets say you had a capped collecion on each shard, you
could end up with hot spots where data was really old on some, new on
others, etc...

What semantic make sense to you?

John May

unread,

Aug 3, 2010, 8:09:17 PM8/3/10

to mongodb-user

For a capped sharded collection, the first thing I'd think of would be
round-robin distribution of successive entries over the shards. This
would optimize load balance across the shards. If keys are generated
sequentially, it should be easy to figure out where a given item
resides without having to query them all. But there may be cases with
user-selected keys where you can't infer the placement during a query.
Also, increasing the size of such a collection would be hard, but this
is a "capped" collection, after all.

On another front, I still haven't been able to get standard sharding
to scale properly on ingestion (with version 1.5.6). My hashed keys
(64-bit integers) appear to be well-distributed, but the vast majority
(though not all) of my chunks end up going to shard0002. Any idea why
this would happen?

Eliot Horowitz

unread,

Aug 3, 2010, 11:43:20 PM8/3/10

to mongod...@googlegroups.com

How much data have you put in?
Default chunk size is 200mb, so until you hit 2gb or so won't see much
distribution.
You can change the chunk size with --chunkSize on mongos command line

Mathias Stearn

unread,

Aug 4, 2010, 2:00:53 AM8/4/10

to mongod...@googlegroups.com

could you attach or pastebin the output of the printShardingStatus() function?

Markus Gattol

unread,

Aug 5, 2010, 1:20:39 PM8/5/10

to mongod...@googlegroups.com

[skipping a lot of lines ...]

John> On another front, I still haven't been able to get standard
John> sharding to scale properly on ingestion (with version 1.5.6). My
John> hashed keys (64-bit integers) appear to be well-distributed, but
John> the vast majority (though not all) of my chunks end up going to
John> shard0002. Any idea why this would happen?

Mentioned already but here is a summary of why this might be:

http://www.markus-gattol.name/ws/mongodb.html#i_want_mongodb_start_sharding_with_less_50_mib

Reply all

Reply to author

Forward