Return every Nth record


Justin

Oct 3, 2011, 10:03:35 PM
to mongodb-user
I have a somewhat large dataset that I need to work with, but
depending on the time range, the data may be far more than is
needed, or more than the client can handle.

Is there a query that would return every Nth record, where N increases
in value as the time range (number of records) increases?

Right now I'm returning all the records in the query, and processing
as they come in using a date check, but it's CPU intensive, and seems
like more work than it should be.

Thanks for any insight!

Karl Seguin

Oct 3, 2011, 10:21:25 PM
to mongod...@googlegroups.com
If I understand, you want to pull a sample from your collection. If there are 1,000 records, you'd want a random selection of 100 records (N = 10); if there are 100,000 records, you'd also want a random selection of 100 records (N = 1000)?

I think the Random Attribute approach from the cookbook is the best place to get started.

Once you figure out what N is, which should be something like

var numberOfSamples = 100;
var n = db.x.count() / numberOfSamples;

you can do something like:

var rand = Math.random();
var found = db.x.find({random: {$gte: rand}}).limit(numberOfSamples).toArray();
if (found.length < numberOfSamples) {
    var alsoFound = db.x.find({random: {$lte: rand}}).limit(numberOfSamples - found.length).toArray();
    found = found.concat(alsoFound); // merge alsoFound into found
}
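To make the whole pattern concrete, here is an in-memory sketch in plain JavaScript, with an array standing in for the collection (the function name `sampleByRandomAttribute` is illustrative; in a real deployment each document would store its `random` value at insert time, with an index on that field):

```javascript
// Sketch of the random-attribute sampling pattern against a plain array.
// Each record stores a precomputed random value; we take records at or
// above a random pivot, wrapping to the low end if we come up short.
function sampleByRandomAttribute(records, numberOfSamples) {
  var rand = Math.random();
  var found = records
    .filter(function (r) { return r.random >= rand; })
    .slice(0, numberOfSamples);
  if (found.length < numberOfSamples) {
    var alsoFound = records
      .filter(function (r) { return r.random < rand; })
      .slice(0, numberOfSamples - found.length);
    found = found.concat(alsoFound); // the merge step from the snippet above
  }
  return found;
}

// Example: 1000 records, each with a stored random attribute.
var records = [];
for (var i = 0; i < 1000; i++) {
  records.push({ _id: i, random: Math.random() });
}
var sample = sampleByRandomAttribute(records, 100);
```

Because the two filters partition on `>= rand` and `< rand`, the merged result never contains duplicates, and as long as the collection holds at least `numberOfSamples` records the sample always comes back full.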

MiddleTommy

Oct 4, 2011, 9:27:13 AM
to mongodb-user
I would try an async approach. Depending on your programming language
this may or may not be easy: use limit(1).skip(N) in an async for
loop.
From the shell I don't know of a way to get the results you seek.

Mathias Stearn

Oct 4, 2011, 12:26:19 PM
to mongod...@googlegroups.com
@MiddleTommy: don't do that! It will have to scan all records many times. Since skip() must find and skip over each record, using a high skip value will be very slow. In particular, using that to fetch every 10th item will be O(n^2) in the size of the collection.


I think there are two good ways to do this. The most direct is to assign an incrementing value to every object, then use the $mod operator[1] to pick every Nth record. This requires a full table scan, so I don't suggest it for high values of N, although it will work just fine for smallish values. It also won't work as well if you will be deleting records and expect to delete an unequal number from each mod-N bucket.
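The $mod idea can be sketched in plain JavaScript against an in-memory array (in MongoDB itself the query would be along the lines of db.x.find({seq: {$mod: [n, 0]}}), with seq assigned from a counter at insert time; the names `seq` and `everyNth` here are illustrative):

```javascript
// Sketch of the $mod approach: give every record an incrementing
// sequence number, then keep the ones where seq % n === 0.
function everyNth(records, n) {
  return records.filter(function (r) { return r.seq % n === 0; });
}

// Example: 1000 records with sequence numbers 0..999, keep every 10th.
var records = [];
for (var i = 0; i < 1000; i++) {
  records.push({ seq: i, value: 'record-' + i });
}
var sampled = everyNth(records, 10);
```

This is deterministic (the same N always returns the same records), which is the trade-off against the random approaches: predictable output, but a full scan and sensitivity to uneven deletes.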

Another solution, similar to Karl's, relies on randomness to give a probabilistically correct result: assign every object a random value from 0 to 100,000, then select a range of numbers where 1,000 numbers equals 1% of the total (so if you want N = 1000, select a range of 100, such as [0,100] or [12342,12442]). You can index the random-value field, which makes this work well for large values of N. It will not work as well for huge datasets (larger than RAM), since objects will be pulled from random places on disk, which is much slower than sequential access. For smaller datasets, you will want to use a max value well under 100,000, say db.foo.count()/10 or /100.
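That random-range idea, sketched in plain JavaScript with an array standing in for the collection (the 0 to 100,000 bound follows the description above; the field name `rand` and function name `sampleByRange` are illustrative, and a real deployment would index the random field):

```javascript
// Sketch of the random-range approach: each record gets a random
// integer in [0, 100000); selecting a sub-range of width 100000/n
// yields roughly a 1/n sample of the collection.
var MAX = 100000;

function sampleByRange(records, n) {
  var width = Math.floor(MAX / n);
  var lo = Math.floor(Math.random() * (MAX - width));
  var hi = lo + width;
  return records.filter(function (r) { return r.rand >= lo && r.rand < hi; });
}

// Example: 100,000 records, sampled at roughly 1-in-1000.
var records = [];
for (var i = 0; i < 100000; i++) {
  records.push({ _id: i, rand: Math.floor(Math.random() * MAX) });
}
var rangeSample = sampleByRange(records, 1000);
```

Unlike the $mod approach, the sample size here is only approximate (it fluctuates around records.length / n), but the range predicate can use an index instead of scanning everything.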

Justin

Oct 4, 2011, 8:42:52 PM
to mongodb-user
These are great suggestions, and ones I hadn't considered. Thank you!

It looks like there is already a feature request to have this
functionality added (as a few DBs support it in one form or another):
https://jira.mongodb.org/browse/SERVER-2397