document sorting with $skip and $limit


John Myers

Feb 3, 2015, 2:00:29 PM
to mongod...@googlegroups.com
I have small sensors that collect information and they all simultaneously insert data into a single database/collection.

I batch process this data using another program, right now I set my batch size limit to 50,000 documents and use a combination of $skip and $limit to pick up where I left off (from the previous batch).

I need to return my batches sorted by 'start_time'. When I sort or use $orderby, AFAIK the entire collection gets sorted first and then the query results are returned. Is that correct?

Example for running my batch queries:

batch # : query
0 : find({yadda yadda yadda}).skip(0).limit(50000) ~ let's say this returns 300 documents, then I pick back up at 300
1 : find({yadda yadda yadda}).skip(300).limit(50000) ~ let's say this returns 8000 documents, then I pick back up at 8300
2 : find({yadda yadda yadda}).skip(8300).limit(50000) ... and so on

Is this the best way to batch process? 

Is there a better way to sort each batch (or do some kind of insertion sort) when inserting?

Asya Kamsky

Feb 3, 2015, 2:05:13 PM
to mongodb-user
John,

The best way to sort would probably be by an indexed field with a
boundary condition.

For example, if you let MongoDB generate the _id field, it's already
indexed and it's monotonically increasing (since the first four bytes
represent timestamp).

You can now query for all records with
find({_id: {$gt: ObjectId("last_idvalueprocessed")}}).sort({_id: 1}).limit(50000)

When you process the batch of 50K, you make a note of the highest
(last) _id value and use it in the next query to fetch the following batch.
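A minimal sketch of that loop (plain JavaScript, simulating the collection with an in-memory array; in a real deployment each lookup would be the find/sort/limit query above, and the collection and batch size here are just illustrative):

```javascript
// Simulated "collection": in a real deployment each lookup would be
// db.coll.find({_id: {$gt: lastId}}).sort({_id: 1}).limit(batchSize)
const docs = Array.from({length: 10}, (_, i) => ({_id: i}));

function findGt(lastId, limit) {
  const matched = docs.filter(d => lastId === null || d._id > lastId);
  return matched.slice(0, limit);
}

// Range-based pagination: note the highest _id of each batch and use it
// as the lower bound of the next query, instead of skip().
function* fetchBatches(batchSize) {
  let lastId = null;
  while (true) {
    const batch = findGt(lastId, batchSize);
    if (batch.length === 0) break;
    yield batch;
    lastId = batch[batch.length - 1]._id; // where we left off
  }
}
```

Unlike skip(), each query only has to seek to the boundary value in the index, so the cost per batch stays flat as you work deeper into the collection.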

Asya

Asya Kamsky

Feb 3, 2015, 2:06:15 PM
to mongodb-user
Oh, by the way, sorting on start_time isn't working for you because
you don't have an index on it; if you did, there would be no need
for mongod to do an in-memory sort before returning results.
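For example (mongo shell; the collection name sensordata is hypothetical), a single ascending index is enough:

```javascript
// sorting by start_time can then walk this index
// instead of sorting documents in memory
db.sensordata.createIndex({ start_time: 1 })
```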

Asya

John Myers

Feb 3, 2015, 2:15:31 PM
to mongod...@googlegroups.com
Thanks! My 'start_time' field does not necessarily correspond to the default OID timestamp, so I cannot use the OID (sorting by OID and by start_time may yield different orders). However, as you mentioned, if I place an index on the 'start_time' field (which is just a Double holding a UNIX timestamp) and modify the example like this:

find({start_time: {$gt: last_start_time_processed}}).sort({start_time: 1}).limit(50000)

Would this achieve the same thing?

Asya Kamsky

Feb 3, 2015, 3:50:01 PM
to mongodb-user

Yep, your sample would achieve exactly what you want.

Just make sure you cannot have duplicate start_time values. Otherwise, it's possible to miss some records (which wouldn't be the case with ObjectId(), unless your clocks were not synchronized; conceivably that could be a problem with both fields).
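To make the pitfall concrete, here is a small simulation (plain JavaScript standing in for the query; assumes a batch size small enough that a boundary falls inside a run of equal timestamps):

```javascript
// Five documents, three of which share start_time 2.0
const docs = [
  {start_time: 1.0, n: 0},
  {start_time: 2.0, n: 1},
  {start_time: 2.0, n: 2},
  {start_time: 2.0, n: 3},
  {start_time: 3.0, n: 4},
];

// Strict $gt on a non-unique key, as in the query above
function nextBatch(lastTs, limit) {
  return docs
    .filter(d => lastTs === null || d.start_time > lastTs)
    .slice(0, limit);
}

const seen = [];
let lastTs = null;
for (let batch = nextBatch(lastTs, 2); batch.length > 0; batch = nextBatch(lastTs, 2)) {
  seen.push(...batch);
  lastTs = batch[batch.length - 1].start_time;
}
// With a batch size of 2, the first batch ends in the middle of the 2.0 run,
// so the remaining 2.0 documents fail the strict $gt and are silently skipped.
```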

Asya

John Myers

Feb 3, 2015, 4:13:39 PM
to mongod...@googlegroups.com
Ah, that makes sense.

For a batch that comes back, is there a quick way to access the last record (so I can update last_start_time_processed before starting work on the batch) without advancing the cursor? Right now I have multiple threads that pull batches out and work on them simultaneously, kind of like:

lock mutex
fetch_batch()
update_batch_left_off()
unlock mutex
dispatch_batch_for_work()

It seems I may be locked into an iterative approach with this?

Asya Kamsky

Feb 4, 2015, 2:36:24 AM
to mongodb-user
It's hard to create a perfect solution to this - what if a batch
processing thread dies before it's done?

Asya

John Myers

Feb 4, 2015, 11:39:08 AM
to mongod...@googlegroups.com
What about a sparse indexed field, such as analyzed: true, that I attach to each document on insert?

Then within the batch processing, I just $unset the field on each document as it's touched/analyzed; if the thread dies, the documents that didn't get looked at still have the indexed field set.
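A sketch of that flow in the mongo shell (the collection name sensordata and the variable someId are hypothetical):

```javascript
// sparse index: only documents that still carry the flag appear in it
db.sensordata.createIndex({ analyzed: 1 }, { sparse: true })

// each sensor inserts with the flag set
db.sensordata.insert({ start_time: 1422975629.0, analyzed: true })

// as each document is processed, drop the flag so it leaves the sparse index;
// documents left behind by a dead thread keep analyzed: true
db.sensordata.update({ _id: someId }, { $unset: { analyzed: "" } })
```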

John Myers

Feb 4, 2015, 1:23:24 PM
to mongod...@googlegroups.com
Although I suppose that won't work if I'm marking the last_start_time value. The likelihood of a batch crashing is very low given the program I'm running, and if a thread does crash, there's an acceptable amount of loss that can be taken in the batch processing program. The problem is still being able to rapidly determine the last document in a returned set, so I can pull out the OID or indexed field to mark as the one I left off at.
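One way to sketch this (assuming the driver lets you pull a whole batch into an array, as most do): read the batch into an array once, take the key of its last element, then hand the same array to the workers.

```javascript
// Materialize the batch first (a driver cursor is consumed as you iterate),
// then the boundary key is simply the last element of the array.
function fetchAndMark(cursorLike) {
  const batch = [...cursorLike];               // single pass over the cursor
  const lastId = batch.length ? batch[batch.length - 1]._id : null;
  return { lastId, batch };                    // record the boundary, then dispatch the array
}
```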



Asya Kamsky

Feb 4, 2015, 7:04:38 PM
to mongodb-user
That's a possibility - but why would you want to index this field?

John Myers

Feb 4, 2015, 7:38:11 PM
to mongod...@googlegroups.com
I was referencing the indexed field that contains the timestamp, but having thought about it, I think I can actually just use the OID to grab the new batch.