aggregation $sort index

810 views
Skip to first unread message

Sam Halliday

unread,
Dec 23, 2012, 3:37:09 PM12/23/12
to mongod...@googlegroups.com
Hi all,

I'd like to be able to use $project to add the sorted index of documents into an aggregation pipeline.

e.g. imagine creating a leaderboard of scores for some game, I'd like to be able to sort the documents, add the ranking in the table, and then take the top 10 or bottom 10.

(Actually, I'd like to be able to take the 5 above and below a certain value, but I suspect this is too complex for a single Mongo aggregation query)

Regards,
Sam

Sam Halliday

unread,
Jan 3, 2013, 6:29:23 AM1/3/13
to mongod...@googlegroups.com
Hi all,

I've not received any responses to this – so I assume it is not possible.

Instead, as a workaround, is there any way to use MapReduce to add a field to all documents in a collection which indicates their sorted position? (I could then run this periodically).

Sam Halliday

unread,
Jan 3, 2013, 6:36:22 AM1/3/13
to mongod...@googlegroups.com
Created RFE: https://jira.mongodb.org/browse/SERVER-8065

Workaround suggestions welcome!

Sam Millman

unread,
Jan 3, 2013, 6:39:14 AM1/3/13
to mongod...@googlegroups.com
I don't quite understand:


e.g. imagine creating a leaderboard of scores for some game, I'd like to be able to sort the documents, add the ranking in the table, and then take the top 10 or bottom 10.

If you sort by something other than score how will you know what is the top 10 and bottom 10? Or do you mean you wanna take the first top ten (vice versa too) sorted by another attribute?


--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

Samuel Halliday

unread,
Jan 3, 2013, 6:56:10 AM1/3/13
to mongod...@googlegroups.com
Sam,

The sorting into "position" happens by another attribute – the "score". I already have that.

The kind of query I want to do is "top 10" and "5 above / below" some particular document. When there are 10,000+ documents, doing this outside MongoDB is intractable.

--
Sam

Sam Millman

unread,
Jan 3, 2013, 7:11:47 AM1/3/13
to mongod...@googlegroups.com
So you lready have a score attribute which is maintained?

Then can't you do this via normal querying without using things like MR or the aggregationm framework?

i.e.:

find({_id: {$gt: the_last_document_i_want_5_above}}).sort({_id: 1}).limit(5) ?

Sam Millman

unread,
Jan 3, 2013, 7:15:38 AM1/3/13
to mongod...@googlegroups.com
Instead of sort() on _id you would sort on score of course

Sam Halliday

unread,
Jan 3, 2013, 7:22:24 AM1/3/13
to mongod...@googlegroups.com, mongod...@googlegroups.com
Sam,

That's not what I am asking: I can assure you I know how to write a limited database query.

Rather than repeat my question, I would urge you to reread the thread and RFE. I need the position field to be calculated.

Kind regards,
Sam Halliday

-- 
Sent from my iPhone

Stephen Steneker

unread,
Jan 3, 2013, 7:32:58 AM1/3/13
to mongod...@googlegroups.com
The sorting into "position" happens by another attribute – the "score". I already have that. 

The kind of query I want to do is "top 10" and "5 above / below" some particular document. When there are 10,000+ documents, doing this outside MongoDB is intractable.

Hi Sam,

If I understand correctly, you want to show "nearby" rankings for a given user's rank.  For example, if a user was currently 250th you might be finding the 5 users ranking above and below that position (so ranking 245-255).

A much better approach would be periodically calculating the overall rankings so you can do range queries based on the already calculated ranking values.

Frequently trying to aggregate all the user rankings to find a range for each user would be very resource intensive (and impractical for realtime leaderboard views).

Leaderboards on popular sites often update only a few times a day (or even once a day, as with StackOverflow :).

Cheers,
Stephen 

Sam Millman

unread,
Jan 3, 2013, 7:33:19 AM1/3/13
to mongod...@googlegroups.com
I read wrong, sorry. I, for some reason, just read the score part not the position part.

There is no easy way to do this realtime. Most people do a recursive update of the collection iterating through a cursor sorted by score (since score actually defines position) to assign position once every maybe hour or day (as is the case for sites like stackoverflow).

An MR could take an input query of sorted score and then just emit a "var i" into a position collection with a value of user_id:

{
    _id: 5, // position
    user_id: OjbectId()
}

You can then query this collection to get your user_ids and then do a quick $in query to get the associated users to the positions. It is two cursors but meh

Sam Halliday

unread,
Jan 3, 2013, 7:40:52 AM1/3/13
to mongod...@googlegroups.com
Thanks Sam, I'll look into passing a sorted query to the MR, I didn't realise that was possible.


Kind regards,
Sam Halliday

-- 
Sent from my iPhone

Sam Halliday

unread,
Jan 3, 2013, 7:44:47 AM1/3/13
to mongod...@googlegroups.com
A much better approach would be periodically calculating the overall rankings so you can do range queries based on the already calculated ranking values.

I believe this is exactly what I was asking for as a workaround ;-) Sam 2 has suggested passing a sorted query to MR so I'll try that.

I believe it is feasible to do the sorting and extraction in an aggregation (Ice profiled and it works well). There is just no way to add the position field, which is necessary for me. The real problem is slightly more involved that I have described here, so using MR is something of a hack.


Kind regards,
Sam Halliday

-- 
Sent from my iPhone
--

Stephen Steneker

unread,
Jan 3, 2013, 8:03:32 AM1/3/13
to mongod...@googlegroups.com
On Thursday, January 3, 2013 11:40:52 PM UTC+11, Sam Halliday wrote:
Thanks Sam, I'll look into passing a sorted query to the MR, I didn't realise that was possible.

Hi Sam,

I'm not sure you need a Map/Reduce for this.  Are there extra calculations being done?

Since you already have the score calculated for each user, I think you would just do a find() which returns results in the desired ranking order and iterate the results to update the ranking for each user.

Some example JavaScript code (mongo shell):

// Calculate rankings based on score (descending)
var userRank = 1;
db.users.find().sort({score: -1}).forEach(
function(user) {
db.users.update({_id:user._id}, {$set: { ranking: userRank }});
userRank++;
}
)

// Top 10 users
db.users.find().sort({ranking: 1}).limit(10);

// Bottom 10 users
db.users.find().sort({ranking: -1}).limit(10);

// Find users +/- from my ranking in the leaderboard
var myRanking = 10; var nearbyRanking = 5;
db.users.find({ $and: [
{ ranking: { $gte: myRanking - nearbyRanking }},
{ ranking: { $lte: myRanking + nearbyRanking }},
]}).sort({ranking:1})

Cheers,
Stephen 

Samuel Halliday

unread,
Jan 3, 2013, 8:20:09 AM1/3/13
to mongod...@googlegroups.com
Stephen,

I'd like to avoid MR, but it is the best out of the alternatives.

Pulling all the objects into the mongo shell is suboptimal is it involves network traffic. Using exec suffers the same lock problems as MapReduce, plus requires additional deployment setup.

These are all hacky workarounds – the optimal approach would be to be able to obtain the index of a sorted document in an aggregation, hence the RFE. An aggregation is optimal since it all stays in the DB, there is no global lock and there is no rank vs score lag (visible to users).

--
Sam

Sam Millman

unread,
Jan 3, 2013, 8:25:09 AM1/3/13
to mongod...@googlegroups.com
Map reduce and the shell should have similar bandwidth measurements provided your shell is on the master server. Since an MR runs on each shard and then returns a result it still has to pipe all that to the master. So the cost of picking each out in console will be similar to the cost of picking them out in an MR.

Sam Millman

unread,
Jan 3, 2013, 9:10:56 AM1/3/13
to mongod...@googlegroups.com
I should add though that what I mentioned was bandwidth not working set or cost of paging in those documents on that server. I am unsure which method is more performant actually in this scenario and I would recommend trying both out

Samuel Halliday

unread,
Jan 3, 2013, 9:18:08 AM1/3/13
to mongod...@googlegroups.com
In my application, MP is the preference over script or exec. (the others lose too many points on network I/O and deployment setup).

--
Sam
Reply all
Reply to author
Forward
0 new messages