Fetching all documents from mongo for batch processing


nare...@helpshift.com

Dec 10, 2016, 4:37:59 AM
to mongodb-user
Hi,

I have a collection called issues that has a field called message_ids. There is another collection in the same db that stores these messages as separate documents.
I have around 23 million issues, which in total have 126 million messages. I go through the issues collection sorted by created_at and fetch the messages for each issue along the way.

After about 8 million issues and all their related messages, I see a sudden spike in the number of pages in memory. I am not sure what is happening here. 

I have a couple of questions so that I can understand what I am doing wrong:

1. What happens when I supply the `limit` field to a query? Does it make mongo perform better? I have been fetching messages with queries that have a limit. So far I have read that limit only affects the result set that is served in batches by the mongod server, so I am thinking that it doesn't matter what the limit is: we always get documents in batches that cannot exceed 16 MB in size. My question is whether there is any real benefit in using `limit` solely to restrict the amount of data I fetch from mongo in one query. I eventually have to get everything even if I use a limit, so for that I have range queries on a timestamp field (see the sketch after these questions). I have made sure that I am hitting the right compound indexes.

2. How can I efficiently query documents given that I need to fetch all of them and process them in some way? How can I avoid page faults when I know that I am going to query for every document in the next 12 hours? 

3. How is the result set kept on the mongod server side? How is the working set derived from the result set? Is it possible to simply flush the entire working set to disk once and start afresh, so that page faults only cause fresh fetches and no flushes to disk?
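
For reference, here is a simplified sketch of the kind of loop I run from the mongo shell (the messages collection name and the limit value are placeholders; it assumes created_at values are unique, otherwise a tiebreaker such as _id would be needed):

    var lastSeen = ISODate("1970-01-01T00:00:00Z");
    while (true) {
        var issues = db.issues.find({ created_at: { $gt: lastSeen } })
                              .sort({ created_at: 1 })
                              .limit(1000)     // example batch size
                              .toArray();
        if (issues.length === 0) break;
        issues.forEach(function (issue) {
            // fetch the messages referenced by this issue
            var messages = db.messages.find({ _id: { $in: issue.message_ids } }).toArray();
            // ... process the issue and its messages ...
            lastSeen = issue.created_at;   // resume point for the next batch
        });
    }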

I am using mongod version 2.6.

Thanks,
NJ

Tom Li

Dec 19, 2016, 12:18:31 AM
to mongodb-user

Hi Narendra,

Could you provide the following relevant information:

- Are you running a single mongod server or a cluster? A replica set, or with sharding?

- What are the specs of your mongod server(s)?

- What is the size (storage) of the ‘issues’ collection and of the other collection that holds all the messages?

- What is your query, and what is its explain() result? Please see cursor.explain() and Indexes: covered queries. A sample call follows this list.

- How do you fetch the documents from the collection? Through one of the MongoDB drivers (if so, which one?) or via the mongo shell?
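
For example, something like the following from the mongo shell, with placeholder values:

    db.issues.find({ created_at: { $gt: ISODate("2016-01-01T00:00:00Z") } })
             .sort({ created_at: 1 })
             .limit(1000)
             .explain()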

To answer your questions:

  1. limit() is used to maximize performance and prevent MongoDB from returning more results than are required for processing. Please refer to the cursor.limit() documentation.
  2. If you really have to query all of the documents, then make sure you have enough memory for your working set. You can minimise page faults if you have enough memory to hold your working set. Please see workingSet for more info. If this is a scheduled job that runs every 12 hours, then depending on the use case you can pre-process the collection to avoid churning the entire collection every 12 hours. For example, see the pre-aggregation workflow sample; a minimal sketch follows this list.
  3. I recommend having a look at the following doc for more details: Memory Diagnostics
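
To illustrate the pre-aggregation idea in point 2, a minimal sketch (collection and field names here are hypothetical): keep one summary document per day and update it incrementally with upserts, so each run only touches new data instead of re-reading the whole collection.

    db.issue_daily_stats.update(
        { day: "2016-12-10" },                          // one summary doc per day
        { $inc: { issue_count: 1, message_count: 5 } }, // increment counters
        { upsert: true }                                // create the doc if missing
    );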

Please note that MongoDB v2.6 reached its end of life in October 2016. The latest MongoDB version is 3.4.0, which has a lot of improvements over your running version 2.6. Please refer to the Release Notes for MongoDB 3.4. You may want to determine whether upgrading is applicable to your use case. However, before making any major changes to your deployment, please ensure that all data is backed up and all procedures are thoroughly tested.

Regards,

Tom

