$graphLookup scalability


Ryan Hodges

May 4, 2018, 8:40:50 PM
to mongodb-user
Hi,

Our team is deciding whether MongoDB might be a good fit for our application.  One of the use cases is querying for the predecessor commits of a given commit in a version-control system, taking into account merges, branch creations, and so on.  We have a commits collection in which each document stores the parent commits of that commit; every commit except the very first has at least one parent.  MongoDB's $graphLookup stage works nicely here.  However, we've hit two limitations.  Maybe with your expertise you can help us out:

One is that if we start at, say, the 50,000th commit and try to walk back the commit tree, we exceed the 16 MB document size limit:

db.commits.aggregate([
    { $match: { "_id": "5aeadb26b2ac4334dc138e5b@polaris_dev/50000" } },
    { $graphLookup: { "from": "commits", "startWith": "$parent", "connectFromField": "parent", "connectToField": "_id", "as": "predecessors" } }
]).pretty()
assert: command failed: {
        "ok" : 0,
        "errmsg" : "BSONObj size: 16852277 (0x1012535) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"5aeadb26b2ac4334dc138e5b@polaris_dev/50000\"",
        "code" : 10334,
        "codeName" : "Location10334"
} : aggregate failed
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:16:14
assert.commandWorked@src/mongo/shell/assert.js:403:5
DB.prototype._runAggregate@src/mongo/shell/db.js:260:9
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1212:12

Maybe there's a way to squeeze more into the document if we store just the commit IDs in the 'predecessors' array rather than the whole commit record.  I'm not sure how to do that yet; I'm still refamiliarizing myself with MongoDB.  However, is there a way for $graphLookup to return its results so that they don't all have to fit in a single document?

Another issue is that if $graphLookup traverses enough records, it exceeds the 100 MB pipeline memory limit:

db.commits.aggregate([
    { $match: { "_id": "5aeadb26b2ac4334dc138e8d@polaris_dev/2554999" } },
    { $graphLookup: { "from": "commits", "startWith": "$parent", "connectFromField": "parent", "connectToField": "_id", "as": "predecessors" } }
]).pretty()
assert: command failed: {
        "ok" : 0,
        "errmsg" : "$graphLookup reached maximum memory consumption",
        "code" : 40099,
        "codeName" : "Location40099"
} : aggregate failed
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:16:14
assert.commandWorked@src/mongo/shell/assert.js:403:5
DB.prototype._runAggregate@src/mongo/shell/db.js:260:9
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1212:12

It looks like there is an allowDiskUse option, but it doesn't apply to $graphLookup.  I also see this issue:


Because of these issues I might start leaning towards an Oracle solution.  Apparently Oracle SQL provides a 'CONNECT BY' clause that can traverse a graph much like $graphLookup.  However, my team wanted me to open an official support case first, in case there are tricks to getting MongoDB to work.  If there are, then we won't have to start over with our design.

Thanks,
Ryan

Rhys Campbell

May 6, 2018, 8:13:54 AM
to mongodb-user
Using $project you can include and exclude fields - https://docs.mongodb.com/manual/reference/operator/aggregation/project/#pipe._S_project
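To make the "IDs only" idea concrete, here is one possible shape (a sketch, untested against your data set; field names taken from your pipeline): follow $graphLookup with a $project whose $map replaces each predecessor document with just its _id. During pipeline processing documents may exceed 16 MB; the BSON size limit applies to the documents returned to the client, so thinning the array before it leaves the pipeline may avoid the size error (the 100 MB $graphLookup memory cap still applies).

```javascript
// Sketch: $graphLookup followed by a $project that keeps only the
// predecessors' _id values. Field names follow the pipeline above.
const pipeline = [
  { $match: { _id: "5aeadb26b2ac4334dc138e5b@polaris_dev/50000" } },
  {
    $graphLookup: {
      from: "commits",
      startWith: "$parent",
      connectFromField: "parent",
      connectToField: "_id",
      as: "predecessors",
    },
  },
  {
    // Replace each predecessor document with just its _id string.
    $project: {
      predecessors: {
        $map: { input: "$predecessors", as: "p", in: "$$p._id" },
      },
    },
  },
];
// Usage: db.commits.aggregate(pipeline)
```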

The obvious thing to do here would be to batch your requests rather than doing everything in a single query.
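To make "batch your requests" concrete, here's a minimal sketch (untested against real data; the in-memory Map stands in for the commits collection, and the _id/parent field names come from your pipeline): keep a frontier of unvisited parent IDs and issue one bounded $in query per level. The visited set also handles merge commits, so no commit is fetched twice.

```javascript
// Sketch of client-side batching for a commit-ancestry walk.
// `fetchByIds` stands in for one round trip such as
//   db.commits.find({ _id: { $in: ids } }, { parent: 1 }).toArray()
function predecessorIds(fetchByIds, startParents) {
  const seen = new Set();           // commits already visited (handles merges)
  let frontier = [...startParents];
  while (frontier.length > 0) {
    const batch = fetchByIds(frontier); // one bounded query per level
    const next = new Set();
    for (const commit of batch) {
      seen.add(commit._id);
      for (const p of commit.parent || []) {
        if (!seen.has(p)) next.add(p);
      }
    }
    frontier = [...next];
  }
  return [...seen];
}

// In-memory stand-in for the commits collection, including a merge commit.
const commits = new Map([
  ["a", { _id: "a", parent: [] }],
  ["b", { _id: "b", parent: ["a"] }],
  ["c", { _id: "c", parent: ["a"] }],
  ["d", { _id: "d", parent: ["b", "c"] }], // merge of b and c
]);
const fetch = ids => ids.map(id => commits.get(id)).filter(Boolean);

const result = predecessorIds(fetch, commits.get("d").parent);
// result contains a, b, c — every predecessor of the merge commit d
```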

Ryan Hodges

May 6, 2018, 11:39:29 AM
to mongodb-user
Thanks for the tip on $project.  I started reading about it after I posted the question.  The obvious solution is to perform batch requests, but it is not obvious how to do that.  I know there's a maxDepth option, but there are a couple of problems:

1. The results of the lookup are not added to the array in any guaranteed order.  So I would need a way to figure out the last commit visited and resume from there in the next batch request.

2. What complicates matters is when $graphLookup traverses multiple parents.  In that case the next batch request needs not only to finish the parent path it stopped on but also to traverse any other branch paths.

The obvious solution is not very obvious, and it's also not ideal.  The ideal solution would be a query I can iterate over with a cursor.  I recognize that MongoDB does not specialize in graphs, but it is good at a whole lot of other things, so I was hoping there might be a way to get this to work.
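For what it's worth, the batched walk can at least be wrapped so it feels cursor-like.  A sketch, assuming the same _id/parent schema (the generator and the in-memory stand-in are illustrative, not a MongoDB API): a generator yields one ancestry level per iteration, so no single document ever has to hold the whole predecessor graph.

```javascript
// Cursor-like interface over a batched walk: yields one ancestry
// level per iteration. `fetchByIds` stands in for one round trip
// such as db.commits.find({ _id: { $in: ids } }).toArray().
function* predecessorLevels(fetchByIds, startParents) {
  const seen = new Set(startParents);
  let frontier = [...startParents];
  while (frontier.length > 0) {
    const batch = fetchByIds(frontier);
    yield batch; // caller consumes one level, then we fetch the next
    const next = [];
    for (const commit of batch) {
      for (const p of commit.parent || []) {
        if (!seen.has(p)) { seen.add(p); next.push(p); }
      }
    }
    frontier = next;
  }
}

// In-memory stand-in for the commits collection, with a merge commit.
const commits = new Map([
  ["a", { _id: "a", parent: [] }],
  ["b", { _id: "b", parent: ["a"] }],
  ["c", { _id: "c", parent: ["a"] }],
  ["d", { _id: "d", parent: ["b", "c"] }], // merge of b and c
]);
const fetch = ids => ids.map(id => commits.get(id)).filter(Boolean);

const levels = [...predecessorLevels(fetch, commits.get("d").parent)]
  .map(level => level.map(c => c._id).sort());
// levels: one array of commit IDs per ancestry depth
```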

Regards,
Ryan

ty...@microstrategy.com

Dec 14, 2018, 4:39:33 PM
to mongodb-user
Hi Rhys, I've hit the same problem Ryan describes as his second issue.  I think $project can be used elsewhere in the aggregation, but it cannot be used within $graphLookup.

Suppose each document is huge because of many other fields that are irrelevant to (and should be ignored by) the current $graphLookup search; the search then exceeds the maximum memory usage.  Could $graphLookup support an option that accepts a field selection (very similar to $project) applied at each stage of the breadth-first search?
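As far as I can tell no such option exists today ($graphLookup's restrictSearchWithMatch filters which documents are traversed, but does not project fields).  A client-side batched walk can get the same effect by passing a projection with each level's query, so the heavy fields never leave the server.  A sketch, with hypothetical commit IDs:

```javascript
// Per-level query for a client-side walk: only _id and parent come
// back, regardless of how large the stored commit documents are.
// The frontier IDs here are hypothetical placeholders.
const frontier = ["commitA", "commitB"];
const filter = { _id: { $in: frontier } };
const projection = { _id: 1, parent: 1 }; // drop all heavy fields per hop
// Usage: db.commits.find(filter, projection).toArray()
```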