A bit of query optimisation help!

JacobB

Feb 9, 2012, 5:17:49 PM
to mongodb-user
Hi, so I'm currently trying to test out MongoDB's query functionality.
My test data is a collection containing 8 million documents. There are
about five hundred different types of documents: they all contain the
same ten basic properties/attributes, but each different type has some
additional ones. The smallest adds one or two extra properties/attributes,
but the largest adds about 1000. On average, they each add
about 10 extra. The overall size of the data is 10 gigs on disk.

Now, 286,000 of these documents contain information about the user's
user agent. So, for some analytics, maybe one thing I want to do is
find out a count for each of the different client browsers- this will
involve firstly discovering what all the browsers are, and then for
each of them creating a count of the number of distinct users using
each. For this I couldn't figure out how (or whether it is possible) to
write something like this as a single Mongo query in Java. But I came
up with the following Java code:

// Start with every document that has a "Client Browser" field.
BasicDBObject query = new BasicDBObject();
query.put("Client Browser", new BasicDBObject("$exists", true));
List<String> userAgents = auditMongoCollection.distinct("Client Browser", query);
for (int i = 0; i < userAgents.size(); i++) {
    // Narrow the query to one browser and count its distinct users.
    query.put("Client Browser", userAgents.get(i));
    List<String> browserUsers = auditMongoCollection.distinct("userID", query);
    System.out.println(userAgents.get(i) + ": " + browserUsers.size());
}

The hardware I'm using is somewhat modest, with only 8 gigs of RAM, so
the entire dataset will not fit. But 8 gigs seems enough to me for
storing the indexes at least. Without indexes, this was a pretty slow
query, taking 203 seconds. Putting an index on client browser takes
this to sixteen seconds, which doesn't seem bad to me at all.

But I have two big questions with this:
1. Is there any way I can optimise this algorithm further?
2. Is it possible to do more work in a single Mongo query?
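
To make question 2 concrete, the result being asked for can be sketched in memory as a single pass that groups distinct user IDs by browser, instead of one distinct call per browser. This is only an illustration in plain Java over a list of maps; it does not use the MongoDB driver, and the class name BrowserUserCounts and the method countDistinctUsers are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory illustration; this does not talk to MongoDB.
class BrowserUserCounts {

    // One pass over the documents: collect the distinct userIDs per browser,
    // then report the size of each set.
    static Map<String, Integer> countDistinctUsers(List<Map<String, Object>> docs) {
        Map<String, Set<Object>> usersByBrowser = new HashMap<>();
        for (Map<String, Object> doc : docs) {
            Object browser = doc.get("Client Browser");
            Object user = doc.get("userID");
            if (browser == null || user == null) {
                continue; // skip documents without a browser or a user
            }
            usersByBrowser
                .computeIfAbsent(browser.toString(), k -> new HashSet<>())
                .add(user);
        }
        // TreeMap so the browsers come out in alphabetical order.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<Object>> e : usersByBrowser.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(Map.of("userID", "u1", "Client Browser", "Firefox"));
        docs.add(Map.of("userID", "u1", "Client Browser", "Firefox"));
        docs.add(Map.of("userID", "u2", "Client Browser", "Firefox"));
        docs.add(Map.of("userID", "u3", "Client Browser", "Chrome"));
        docs.add(Map.of("userID", "u4")); // no browser recorded
        countDistinctUsers(docs)
            .forEach((browser, n) -> System.out.println(browser + ": " + n));
    }
}
```

With the sample data in main, this prints "Chrome: 1" and then "Firefox: 2". Whether the server can do the equivalent grouping in one query is exactly the open question here.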

Eliot Horowitz

Feb 12, 2012, 1:26:32 AM
to mongod...@googlegroups.com
Do you know which part is slow?
How many different browsers are there?

Raxit Sheth

Feb 12, 2012, 1:59:18 AM
to mongod...@googlegroups.com
So what you are trying to do is analyze some property/value in an existing data set (and not modify it)?

Do you want it for real-time processing or batch processing?
What is an acceptable timeline?


JacobB

Feb 12, 2012, 8:27:22 PM
to mongodb-user
So just to explain, I'm just testing at the moment. So I'm not seeing
whether or not I can do this for a real-time application. Though given
this particular query, I would assume this is the sort of thing you
might want for analytics!
There are only 12 different browser types in the dataset, so it's not
the loop which is slowing things down. Actually I was curious myself,
so I threw in some more timers... and what would you know? The vast
majority of the time is spent here:
List<String> userAgents = auditMongoCollection.distinct("Client Browser", query);
The loop actually finishes within 300 milliseconds. So it's just this
one distinct call which is making it waaaay slow. Why is this?

Eliot Horowitz

Feb 12, 2012, 9:42:10 PM
to mongod...@googlegroups.com
That distinct isn't optimal yet.
See: https://jira.mongodb.org/browse/SERVER-2094

You might just want to keep a separate collection of just the unique browser ids.
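
The idea of keeping the unique browser ids separately can be sketched as follows. This is a plain-Java, in-memory stand-in and not MongoDB driver code: the class AuditStore and its methods are hypothetical, with a list playing the role of the audit collection and a set playing the role of the separate collection. The point is only that the set is updated at insert time, so listing the browsers later does not need a scan of the big collection.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the two collections; not driver code.
class AuditStore {
    // Plays the role of the large audit collection.
    private final List<Map<String, Object>> auditDocs = new ArrayList<>();
    // Plays the role of the small "unique browser ids" collection.
    private final Set<String> uniqueBrowsers = new TreeSet<>();

    // Every insert also records the browser, if the document has one.
    void insert(Map<String, Object> doc) {
        auditDocs.add(doc);
        Object browser = doc.get("Client Browser");
        if (browser != null) {
            uniqueBrowsers.add(browser.toString()); // duplicates are ignored by the set
        }
    }

    // Cheap lookup: no pass over auditDocs is needed.
    Set<String> browsers() {
        return Collections.unmodifiableSet(uniqueBrowsers);
    }

    int documentCount() {
        return auditDocs.size();
    }

    public static void main(String[] args) {
        AuditStore store = new AuditStore();
        store.insert(Map.of("userID", "u1", "Client Browser", "Firefox"));
        store.insert(Map.of("userID", "u2", "Client Browser", "Chrome"));
        store.insert(Map.of("userID", "u3", "Client Browser", "Firefox"));
        store.insert(Map.of("userID", "u4")); // no browser recorded
        System.out.println(store.browsers());
    }
}
```

The trade-off is a little extra work on every write in exchange for not running the slow distinct over the whole collection at read time.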
