Slow aggregation pipeline queries


BSR

Feb 28, 2017, 4:10:15 PM
to mongodb-user
We have a pipeline query whose first stage is a $match on a field with a unique index.
For aggregation pipeline queries that match a large number of documents, we're exceeding the aggregation framework's memory limit (100 MB), resulting in disk writes to the _tmp directory, which slows the process down.
Is there any way to pull only parts of each document during the $match stage? Instead of loading all fields of all matching documents into memory and passing them to the next stage of the pipeline (a $project), we would like to pull only the few required fields of each matching document into memory.
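In other words, the goal is something like the following sketch (field names here are placeholders, not the real schema): a $project placed immediately after the $match, so that later stages only carry the fields the report actually needs.

```javascript
// Sketch only: trim documents right after the indexed $match so that
// subsequent stages operate on much smaller documents.
const pipeline = [
  { $match: { indexedField: { $in: ["valueA", "valueB"] } } },
  // Drop everything except the fields used downstream
  { $project: { fieldA: 1, "nested.fieldB": 1, _id: 0 } }
  // ...remaining stages now see only the projected fields
];
// In the mongo shell: db.collection.aggregate(pipeline)
```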

Rhys Campbell

Mar 1, 2017, 4:30:49 AM
to mongodb-user
Perhaps you can do something with $sort & $limit.


Otherwise, please provide the query and a sample document.
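For instance (field name and limit are illustrative only), an indexed $sort followed by a $limit early in the pipeline caps how many documents the later stages have to process:

```javascript
// Illustrative only: an early $sort that an index can satisfy, plus a
// $limit, so later stages see at most 1000 documents.
const pipeline = [
  { $sort: { createdAt: -1 } },  // can be satisfied by an index on { createdAt: -1 }
  { $limit: 1000 }
  // ...rest of the pipeline
];
```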

Ivan Grigolon

Mar 4, 2017, 3:16:48 AM
to mongodb-user

Hi,
As Rhys pointed out, you can optimize the pipeline query to avoid reaching the 100 MB limit.


Also, you can add indexes to speed up queries and aggregation pipelines. However, some stages of the aggregation pipeline cannot use an index; for example, $group cannot use any index, while $match can. You can find more information in the Pipeline Operators and Indexes section of the MongoDB documentation.
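As a minimal sketch (collection and field names assumed for illustration): only the leading stages of a pipeline can use an index, and once a $group reshapes the stream, index use stops.

```javascript
// A leading $match can be satisfied by an index on { status: 1 };
// the $group that follows cannot use any index.
const pipeline = [
  { $match: { status: "active" } },                    // index applies here
  { $group: { _id: "$status", count: { $sum: 1 } } }   // no index possible
];
// In the mongo shell: db.collection.aggregate(pipeline)
```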


Could you please provide some sample documents and the query that is hitting the 100 MB limit? Also, please provide the indexes in use, via db.collection.getIndexes().


It would also help if you could provide any error logs you have.


Best Regards,
Ivan

BSR

Mar 5, 2017, 3:23:00 AM
to mongodb-user
Hi Ivan,
Thanks for the response. While I work on gathering the information you asked for, I'd like to describe our problem in more detail.

Our primary issue is generating a report that requires processing 80,000 documents with an average size of 3 KB.
We are using an aggregation pipeline query to generate this report. Initially we got an error saying we hit the 100 MB limit while generating it,
so we added the allowDiskUse option. The report then generated successfully, but it takes around 1-2 minutes. (For other reports using fewer than 20,000 documents, generation time is within 5 seconds.)
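For reference, this is roughly how we run it (the collection name here is a placeholder):

```javascript
// Once a stage exceeds the 100 MB limit, allowDiskUse lets it spill
// to the _tmp directory under the dbPath instead of erroring out.
const options = { allowDiskUse: true };
// In the mongo shell: db.records.aggregate(pipeline, options)
```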

A sample of the aggregation pipeline query we are using:

[
  { "$match" : { "rec.lrec.Result.Control" : { "$in" : [ "4TXdsd", "4dsds" ] } } },
  { "$project" : {
      "purpose1likStatusList" : { "$ifNull" : [ "$rec.lrec.Currentlik.purpose1.StatusList", "" ] },
      "purpose2likStatusList" : { "$ifNull" : [ "$rec.lrec.Currentlik.purpose2.StatusList", "" ] },
      "latestrec" : "$$CURRENT"
  } },
  { "$project" : {
      "likFields" : { "$cond" : [
          { "$ne" : [ "$purpose1likStatusList", "" ] },
          { "likType" : "$latestrec.rec.lrec.Currentlik.purpose1.Type",
            "likStatus" : "$latestrec.rec.lrec.Currentlik.purpose1.StatusList" },
          { "$cond" : [
              { "$ne" : [ "$purpose2likStatusList", "" ] },
              { "likType" : "$latestrec.rec.lrec.Currentlik.purpose2.Type",
                "likStatus" : "$latestrec.rec.lrec.Currentlik.purpose2.StatusList" },
              null
          ] }
      ] },
      "latestrec" : "$latestrec",
      "_id" : 0
  } },
  { "$unwind" : "$latestrec.rec.lrec.EventList" },
  { "$match" : { "latestrec.rec.lrec.EventList.Common.Subtype" : "VOL" } },
  { "$unwind" : "$latestrec.rec.lrec.EventList.DescriptionList" },
  { "$project" : {
      "dScore" : "$latestrec.rec.lrec.Result.ComScoreInt",
      "dId" : "$latestrec.rec.lrec.Criteria.DmrDId",
      "likType" : "$likFields.likType",
      "likStatus" : { "$arrayElemAt" : [ "$likFields.likStatus", 0 ] },
      "expirationDateTicks" : "$likFields.expirationDateTicks",
      "recReturnDateTicks" : "$latestrec.rec.lrec.Result.ReturnedDate.Ticks",
      "recOrigDateTicks" : "$latestrec.rec.lrec.Result.recOrigDate.Ticks",
      "cit" : "$latestrec.rec.lrec.EventList.DescriptionList.AdrLargeDescription",
      "comPoints" : "$latestrec.rec.lrec.EventList.DescriptionList.ComPoints",
      "comCode" : "$latestrec.rec.lrec.EventList.DescriptionList.comCode",
      "volDate" : "$latestrec.rec.lrec.EventList.Vol.CovDate.Ticks",
      "repId" : { "$literal" : "58b7sdasafa" }
  } }
]

I ran the profiler multiple times while this report was being generated, but I didn't capture any queries taking more than 3 seconds. However, the application only displays the report 1-2 minutes after the pipeline query hits MongoDB.

Our assumptions while troubleshooting the issue were:

1) When the aggregation pipeline query is executed, the $match on Result.Control (which has a unique index) runs first, and the query pulls all 80,000 documents, with all of their fields, into memory (via an index scan) to pass to the next stage of the pipeline.
2) The $match stage of the pipeline takes less than 3 seconds, which is what we are recording in the profiler.
3) The major delay comes after the $match stage, i.e., in processing the 80,000 documents through the subsequent $project, $unwind, etc. stages to produce the required output.
4) As I understand it, when the 100 MB memory limit is reached, MongoDB writes temporary files to the /data/_tmp directory and uses those files in the later stages of the pipeline. But when I check the _tmp directory while the report is being generated, I don't see any files being created there, which I don't understand.

Questions:

1) Are any of our assumptions wrong?
2) Do you see any issues with the aggregation pipeline query above? Any suggested improvements?
3) Any suggested methods for tuning the aggregation pipeline query?
4) To generate the final report we need only about 30% of the fields in each document. But because the pipeline starts with a $match and the $project only comes afterwards, we assume that all fields of all matching documents are loaded into memory, hitting the 100 MB limit before the $project can reduce the document size for the later stages.
How can we avoid this?
5) Is there a more secure way of sharing our information with you?

I really appreciate your effort in helping us resolve our issues.

Thanks
sreddy 

BSR

Mar 5, 2017, 5:10:08 PM
to mongodb-user
There are two indexes:
1) Result.Control -- unique index, as mentioned in assumption #1 in my previous post
2) dId

Ivan Grigolon

Mar 9, 2017, 11:39:56 PM
to mongodb-user

Hi,


I have reformatted the query to make it easier to read.

The first thing I noticed is the $unwind operations. Depending on the number of elements in the array you are unwinding, the number of documents can grow significantly. For example, depending on how selective the initial $match is, if the array has only 2 elements you could end up with 160,000 (i.e. 2 * 80,000) non-indexed documents, which will subsequently be unwound again in the second $unwind operation.


Another observation: if possible, you should move the second $match before the $unwind stage, to filter down the number of documents further.
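Using the field names from the pipeline above, the reordering might look like this (a sketch, not a tested rewrite):

```javascript
// Matching on the array field *before* $unwind discards documents
// that contain no "VOL" event at all, so fewer documents get unwound.
const reordered = [
  { $match: { "latestrec.rec.lrec.EventList.Common.Subtype": "VOL" } },
  { $unwind: "$latestrec.rec.lrec.EventList" },
  // A second $match is still needed after $unwind to drop the
  // non-"VOL" elements of documents that matched above.
  { $match: { "latestrec.rec.lrec.EventList.Common.Subtype": "VOL" } }
];
```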


I would also recommend benchmarking performance in isolation, layer by layer. For example, time the aggregation execution via the mongo shell and exclude your application's report display to get a more accurate measurement. Without knowing your application stack and how it processes the result returned by the aggregation pipeline, displaying a large number of results may itself cause performance issues.
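A small helper along these lines can time the pipeline in the shell (a sketch; the collection name in the comment is a placeholder), using itcount() to force the full result to be iterated:

```javascript
// Times a full aggregation run; itcount() iterates every result, so
// the elapsed time covers the whole pipeline, not just the first batch.
function timeAggregation(runAggregation) {
  const start = Date.now();
  const docs = runAggregation();
  return { docs: docs, ms: Date.now() - start };
}
// In the mongo shell:
//   timeAggregation(() => db.records.aggregate(pipeline, { allowDiskUse: true }).itcount())
```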


It is worth mentioning that MongoDB has a flexible schema design, and you should take advantage of this by designing your data schema for the benefit of your application, i.e., model your document schema to match your application's use case.


If you are interested in a private and confidential level of support, you could consider alternative support channels.


Best Regards,
Ivan

Asya Kamsky

Mar 10, 2017, 2:19:17 AM
to mongodb-user
1. What version of MongoDB are you using?

2. What is the output when you run explain on this aggregation pipeline?

3. If you only need a subset of fields from the document, why are you preserving the entire document in the first $project stage?

4. You said you have a unique index on Result.Control. Where are the 80,000 documents coming from? You are only matching on two possible values.
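For question 2, one way to produce the explain output (the collection name here is a placeholder):

```javascript
// Passing { explain: true } makes aggregate() return the query plan
// instead of results, showing whether the leading $match used the index.
const explainOptions = { explain: true };
// In the mongo shell: db.records.aggregate(pipeline, explainOptions)
```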

Asya


--
Asya Kamsky
Lead Product Manager
MongoDB
Download MongoDB - mongodb.org/downloads
We're Hiring! - https://www.mongodb.com/careers