sharding tweets

braver

unread,

Nov 27, 2009, 1:51:59 AM11/27/09

to mongodb-user

I'd like to load Twitter data in such a way that I can scan it on my
32-core box quickly. As I understand, a single mongod is single-
threaded now, so I'd have to "shard" by loading chunks of the
collection into separate instances. Do the collections and containing
databases have to be called the same? How does the Java driver help
with sharding?

Ideally, what I need the help of the driver with is opening, for a
collection sharded across N instances, the N cursors, each on its own
thread, so I can scan the collection N times faster.

In the multi-threaded version, I'd like to do it automatically without
sharding, as follows:

-- mongod computes the N equidistant starting points for N chunks of a
collection, and
-- returns N cursors, each on its own thread in the Java driver

Possible? How do folks do it?
-- Alexy

Dwight Merriman

unread,

Nov 27, 2009, 10:27:57 AM11/27/09

to mongod...@googlegroups.com

a few thoughts

(1) you might want to just implement the problem first with a single client and server, to see if you are happy with the basic approach regardless of performance
(2) concurrency is coming; that said, it won't necessarily mean faster. it's possible disk i/o or main memory bandwidth are database performance limiters rather than cpu computation.
(3) you can shard as a way to max out all cores of the db. the database will look like one logical db and collection (or collections) so you don't do anything special generally
(4) what you describe below sounds a bit like a manual version of map/reduce. perhaps use Hadoop instead? Or the mongodb map / reduce (mongodb map/reduce uses javascript for function execution, so there is some limits to its speed, although it scales well)

--

You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

braver

unread,

Nov 27, 2009, 11:21:01 PM11/27/09

to mongodb-user

Dwight -- you make excellent points! (1) When I read my collection
straight from Clojure, it takes about 40 minutes to process 100
million tweets. When doing it, the java process serving Clojure goes
from 100% to 200-700% CPU, while mongod stays at about 7%. (2) The
latter is probably not in any way indicative of the disk IO
bottleneck... (3) You mean I can use sharding as it exists, readily
for this? (4) Indeed, would like to think in terms of map-reduce, but
would rather have the code in my languages of choice.

-- Alexy

Aníbal Rojas

unread,

Nov 28, 2009, 8:14:19 AM11/28/09

to mongod...@googlegroups.com

Brave,

Regarding (4): We did a lot of testing on MongoDB's MapReduce and
Server Side JavaScipt execution, against external MapReduce / Parallel
processing.

The second wins in terms of performance, and manageability. Period.

I prefer to see the DB bursting data to MapReduce processes that I
can control in terms of concurrency, priority and distribution, rather
to see MongoDB trying to solve a heavy weight MapReduce for minutes.

But in the end it depends on your problem, as usual.

--
Aníbal

> --
>
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.
>
>
>

--
Sent from my mobile device

--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas

Dwight Merriman

unread,

Nov 28, 2009, 8:36:54 AM11/28/09

to mongod...@googlegroups.com

(3) yes - but if db is at 7% utilization, i doubt that will help any as sounds like it isn't the bottneck

-- Alexy

Reply all

Reply to author

Forward