Aggregate Pipeline addToSet on nested array of records returning a set of arrays


UltimateCodeWarrior

Nov 14, 2018, 1:04:13 AM
to mongodb-user
 Goal:  For the fastest possible queries with C#/.NET and MongoDB v4.0, perform multiple distincts in a single aggregation-pipeline query on documents with a nested array of records, and return flattened-out sets.

Current problem:   It works great for first-level document fields, but produces unexpected results when running the distinct on the nested array records.
 
Example Document:

{  
  "title" : "Movie A",
  "released": 1984,
  "minutes" : 90,
  "category": "SciFi",
  "cost": 24000000,
  "gross": 48000000,
  "rated": "PG",
  "score": 3.4,
  "cast" : [
     {
     "name": "Buzz",
     "gender":"M",
     "age": 28,
     "country": "USA",
     "role": "Leading",
     "award" : "Golden Globe",
     "pay": 500000
     },
     {
     "name": "Sally",
     "gender":"F",
     "age": 21,
     "country": "Austrailia",
     "role": "Supporting",
     "award" : "Academy",
     "pay": 300000
     }
  ]
}


How to Build with Mongo Client:

use <your database name>;

db.createCollection("movies");

db.movies.insertOne({ "title" : "Movie A",  "released": 1984,  "minutes": 90,  "category": "SciFi",  "cost": 24000000,  "gross": 48000000,  "rated": "PG",  "score": 3.4,  "cast" : [ { "name": "Buzz",  "gender":"M", "age": 28, "country": "USA", "role": "Leading", "award" : "Golden Globe", "pay": 500000 }, { "name": "Sally", "gender":"F", "age": 21, "country": "Austrailia", "role": "Supporting",  "award": "Academy",  "pay": 300000  } ]});

db.movies.insertOne({ "title" : "Movie B",  "released": 1986,  "minutes": 120, "category": "Horror",  "cost": 28000000,  "gross": 40000000,  "rated": "R",  "score": 4.4,  "cast" : [ { "name": "Ryan",  "gender":"M", "age": 55, "country": "England", "role": "Leading", "award": "Oscar Nomination", "pay": 100000 }, { "name": "Sally", "gender":"F", "age": 23, "country": "Austrailia", "role": "Supporting",  "award": "Academy",  "pay": 300000  } ]});

db.movies.insertOne({ "title" : "Movie C",  "released": 1988,  "minutes": 106, "category": "Drama", "cost": 38000000,  "gross": 45000000,  "rated": "PG13",  "score": 2.4,  "cast" : [ { "name": "Pat",  "gender":"F", "age": 42, "country": "Ireland", "role": "Supporting", "award": "Irish", "pay": 350000 }, { "name": "Gene", "gender":"F", "age": 44, "country": "France", "role": "Leading",  "award": "Independent",  "pay": 280000  } ]});


C#/.NET code with MongoDB v4.0 and driver version 2.5:

// Legacy (1.x-style) API surface, as still available alongside the 2.x driver.
MongoDatabase mdb = Mongo.Controller.Instance.getDB();
var collection = mdb.GetCollection("movies");

// { $match: { released: { $gte: 1984, $lte: 1988 } } }
var match = new BsonDocument
{
    { "$match", new BsonDocument
        {
            { "released", new BsonDocument
                {
                    { "$gte", 1984 },
                    { "$lte", 1988 }
                }
            }
        }
    }
};

// { $group: { _id: 0, <field>Distinct: { $addToSet: "$<field>" }, ... } }
var group = new BsonDocument
{
    { "$group", new BsonDocument
        {
            { "_id", 0 },
            { "categoryDistinct", new BsonDocument { { "$addToSet", "$category" } } },
            { "ratedDistinct",    new BsonDocument { { "$addToSet", "$rated" } } },
            { "countryDistinct",  new BsonDocument { { "$addToSet", "$cast.country" } } },
            { "awardDistinct",    new BsonDocument { { "$addToSet", "$cast.award" } } }
        }
    }
};

var pipeline = new[] { match, group };
var args = new AggregateArgs { Pipeline = pipeline, AllowDiskUse = true };

var results = collection.Aggregate(args).ToList();

foreach (var obj in results)
{
    Console.WriteLine(obj.ToString());
}


Output:

 { "_id" : 0, "categoryDistinct" : ["Drama", "Horror", "SciFi"], "ratedDistinct" : ["PG13", "R", "PG"], "countryDistinct" : [["Ireland", "France"], ["England", "Austrailia"], ["USA", "Austrailia"]], "awardDistinct" : [["Irish", "Independent"], ["Oscar Nomination", "Academy"], ["Golden Globe", "Academy"]] }


 Notes:   categoryDistinct and ratedDistinct come back as I would expect, but countryDistinct and awardDistinct do not; they come back as nested sub-arrays.


 Desired output:  countryDistinct should have been ["Austrailia", "England", "France", "Ireland", "USA"]
                   awardDistinct should have been ["Academy", "Golden Globe", "Independent", "Irish", "Oscar Nomination"]
                  (I know I didn't specify a sort anywhere, so if you can give a pointer that will not impact speed, that would be greatly appreciated.)


 Any pointers on how to do this efficiently -- speed-consciously -- would be greatly appreciated.

 Thanks in Advance!


Wan Bachtiar

Nov 14, 2018, 7:56:33 PM
to mongodb-user

categoryDistinct and ratedDistinct are coming back as I would expect, but the countryDistinct and awardDistinct are not, they are coming back in a nested sub array record form.

Hi,

This is because cast is an array, so when the $group stage accesses cast.country and cast.award, each of those expressions evaluates to an array of the respective field's values for that document. Adding those array values to a set therefore returns an array of arrays.
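What $group sees can be modeled in a few lines of plain JavaScript (a simplified sketch of the pipeline semantics using a subset of the sample data, not the server implementation):

```javascript
// Simplified model of what $addToSet receives per document (not server code).
const docs = [
  { category: "SciFi",  cast: [{ country: "USA" },     { country: "Austrailia" }] },
  { category: "Horror", cast: [{ country: "England" }, { country: "Austrailia" }] },
];

// "$category" resolves to one scalar per document...
const categoryDistinct = [...new Set(docs.map(d => d.category))];

// ...but "$cast.country" resolves to an ARRAY per document,
// so each element added to the set is itself an array.
const countryDistinct = docs.map(d => d.cast.map(c => c.country));

console.log(categoryDistinct); // [ 'SciFi', 'Horror' ]
console.log(countryDistinct);  // [ [ 'USA', 'Austrailia' ], [ 'England', 'Austrailia' ] ]
```

Path expressions like "$cast.country" fan out over the array elements, which is why each set member is itself an array until an $unwind flattens the documents first.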

You can try to insert an $unwind stage after the $match and before the $group stage. For example:

db.movies.aggregate([
    {"$match": {"released": {"$gte": 1984, "$lte": 1988}}}, 
    {"$unwind": "$cast"}, 
    {"$group": {
        "_id": null, 
        "categoryDistinct": {"$addToSet":"$category"},
        "ratedDistinct": {"$addToSet":"$rated"},
        "countryDistinct": {"$addToSet":"$cast.country"},
        "awardDistinct": {"$addToSet":"$cast.award"}
        }
    }
])

I know I didn’t specify a sort anywhere so if you can give a pointer that will not impact speed, that would be greatly appreciated

Similarly, you can insert a $sort stage after the $unwind and before the $group to sort the documents before the grouping stage. Any extra operation will impact speed, although to varying, often tolerable degrees. Based on your use case and requirements, you can then make adjustments, e.g. scale up, change the document schema, etc.
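In plain JavaScript terms, $unwind followed by $addToSet (plus a final sort) amounts to flatten, de-duplicate, and sort. This is a sketch of the semantics only, not how the server executes it:

```javascript
// Cast awards from the three sample movies, in document order.
const docs = [
  { cast: [{ award: "Golden Globe" },     { award: "Academy" }] },
  { cast: [{ award: "Oscar Nomination" }, { award: "Academy" }] },
  { cast: [{ award: "Irish" },            { award: "Independent" }] },
];

// $unwind: one element per cast member; $addToSet: de-duplicate;
// then a final sort for stable, readable output.
const awardDistinct = [...new Set(docs.flatMap(d => d.cast.map(c => c.award)))].sort();

console.log(awardDistinct);
// [ 'Academy', 'Golden Globe', 'Independent', 'Irish', 'Oscar Nomination' ]
```

Note that $addToSet itself guarantees no ordering, which is why the explicit sort (whether as a $sort stage or client-side) is needed to get the alphabetical output asked for above.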

You may also find Aggregation Pipeline Optimization a useful reference.

Regards,
Wan.

UltimateCodeWarrior

Nov 15, 2018, 5:15:24 AM
to mongodb-user
Thanks Wan!

   I had this posted in a few places, and what you wrote was the consensus. Your example was much more complete, though; I appreciate it, and I picked up a few pointers from your post, namely setting _id to null, as well as constructing the query in JavaScript so that I can test it in the mongo command line.

   It does indeed work the way I needed it to, but it wasn't any faster than what I was trying to optimize when dealing with over 100,000 records. When I ran a separate db.collection.distinct() on each of these fields, it was actually faster to run multiple distinct queries than the aggregation pipeline, even though I know db.collection.distinct() is a limited, special-purpose aggregation pipeline function. I also tried launching threads in C# (first 2, then 4, then 8), with the year-based $match split into equal time chunks across those threads, to try to speed up the aggregation pipeline query -- and it seemed to have the reverse effect and slowed the query down. The single-threaded, back-to-back synchronous distincts seem much faster, at least from C#, where I was hoping to save the overhead of issuing multiple queries to MongoDB. Maybe it has its own thread pool built in, and all I am doing is negating that effect when I launch my own threads. It was disheartening, after all the effort to learn the more advanced aggregation pipeline approach, to end up with no tangible benefit.

I wish there were a guide to Mongo operations to avoid where possible ($group, $sort, $unwind), since they are CPU/RAM expensive. What would be cool is if the founding MongoDB developers could take a data set like the one I uploaded and show GOOD, BETTER, and BEST ways of finding common things like distincts, sums, differences, and ratios in more complex documents, complete with custom indexes and hints, and show timing information at each step of the way. That would save a lot of frustration and headache. Knowing when to use one operation over another is just as important as knowing how to use it, and that's the essence of what I am trying to figure out.

Robert Cochran

Nov 15, 2018, 6:18:08 AM
to mongodb-user
Hi,

How much physical memory is installed on your MongoDB server device? I am guessing it must be quite low on memory. Adding more memory would help you with queries. 

Thanks

Bob

Robert Cochran

Nov 15, 2018, 8:22:14 PM
to mongodb-user
Hi!

I hope you can assess the memory capacity of the host machine that your MongoDB server is running on. You should be able to do very quick aggregation queries on 100,000 records. If it takes a lot of time, perhaps the underlying code is slow (.NET Core is a long way from executing with the speed of Node.js, for example) and/or your machine is resource-starved. The more memory (RAM) you can give the physical hardware, the better MongoDB will perform. Try to jack your memory up to 16 or 32 GB. Also try to find other processes that are eating your memory resources, and shut them down if you or the system don't need them.

I played with your data a little. Thank you for posting it. The "movie" documents are probably from a MongoDB University course, right? Perhaps you are taking an M1xx series class? If so, that is fantastic. It is great to learn MongoDB the database. One thing the online classes will cover in detail is indexing. You can greatly speed up a query with good indexes. 

I had a little fun by creating indexes on the "released" key:

MongoDB Enterprise > show collections
ucwmovies

MongoDB Enterprise > db.ucwmovies.createIndex( { "released" : 1 } )
{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}

MongoDB Enterprise > db.ucwmovies.createIndex( { "released" : -1 } )
{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 2,
    "numIndexesAfter" : 3,
    "ok" : 1
}


Wan's query does a $match on that key. $match can make use of an index if you place the $match stage early in the aggregation pipeline, as it is here. 

It will take a little while for each index to be built over 100,000 documents. Also, indexing isn't a substitute for giving the server machine an adequate amount of physical memory. Snap in a couple more sticks of memory if your machine can take it and you can afford it. Then index the keys that you think should be indexed. In my case, indexing 100,000 documents took only a few seconds; indexing millions of documents is something I'd notice more.

Thanks so much

Bob


UltimateCodeWarrior

Nov 15, 2018, 9:52:45 PM
to mongodb-user
Hey Bob!

Thanks for the pointers, I welcome any and all suggestions.   

  I am not used to this Google Groups UI; it's a little bizarre how to post a reply, so I hope this gets to the right place.

   So the machine I am using right now is a laptop with 24 GB of RAM, Windows 10, a 4-core i7 processor, and an SSD. I am running C# Visual Studio, a C# web server, WAMP, a MongoDB instance, Chrome (20 tabs), Notepad++, Examine 64, Eclipse, Slack, Sublime Text, and a couple of command prompt windows. The production-level solution will be deployed on a 16 GB desktop machine with an 8-core i7 processor and an SSD; it won't have quite so much stuff running.

   I'm not part of that MongoDB University program, but I'm glad to know it exists. I will have to check into that course; it's always good to pick up some job-saving pointers.

    As far as the technology stack is concerned, it wasn't my first choice; it's just an unlikely mix / odd couple that I was handed and need to contend with at the moment. I have noted your suggestion of Node.js, and should it become imperative that we squeeze more speed out of the machine, then perhaps a Node.js rewrite is in order; I am more than game for that.

         I feel I am suffering from being on the bleeding edge, because many of the examples on the web (Stack Overflow) for C# and Mongo are from around 2012, and between the .NET API, the driver API, the driver version, and the Mongo changes, it's hard to find examples that aren't severely outdated. I had to cobble and patch many examples together to get to the example I posted. I don't believe I have seen a single C#/Mongo 4 book on Amazon. If you know of a good book, or one in development, I will gladly volunteer to be an editor or give feedback. :)

  I think I borrowed the idea of the movie database from another post that probably stuck in my subconscious. (A teacher-student-grade scenario might hit a nerve with some talent, so I chose the movie scenario.) I figured I would stick with something most people understand intuitively, rather than making them do mental hopscotch to understand the data before they could go on to understand the code and the issue. I wanted to put in enough substance to pose further questions without having to radically change the data model.

  So, a few things I have read in my travels on MongoDB:
 
   #1 - A user is limited in the number of indexes per collection; I think it was somewhere around 64. This was a drawback for me, because if the app has dynamic queries that let the user search by various criteria (nine ways from Sunday), it's possible to exceed the index limit the more $match parameters the user throws into the mix. This happened in my case, so I was looking for a way to compress many queries into one super aggregation pipeline, so that I could ultimately reduce the number of indexes. I got partially there by getting the aggregation pipeline to work, but I didn't quite get the performance results I wanted, even with indexing and threading. There's a good chance I have missed something.

   #2 - The length of an index name is limited too, I think to around 128 characters, so you have to figure out a more cryptic way of naming indexes. It can't be a 1:1 mangling of all of your query parameters, like the auto-namer produces if you don't explicitly supply a name yourself; otherwise you will run out of space and it will throw an exception.

  #3 - The documentation also cautions MongoDB users against having an excessive number of indexes, because it will slow down inserts, as it has to rebuild all of those indexes that were specified at the very beginning. So, without being on the team that engineered the engine, it's hard to know where the 'sweet spot' is. Is less more, or is more better? I didn't find a good example in C# of how to use "hint" to force an index either; the search continues...


  A couple of questions on your indexes:

     #1 - You are creating both an ascending and a descending index on the year. I thought I read that MongoDB was smart enough to read an index forward or backward, so why would it be necessary to create two?

     #2 - When you created the indexes, it looked like you used the mongo command-line tool, and it was probably a blocking/synchronous operation that waited until completion before you got the prompt back? My situation is slightly different, as I won't require my users to run through a series of mongo commands to create indexes; I programmatically create the indexes at the startup of the C# web server. Sometimes I add them after the data has been inserted, to test whether they actually speed up the query. I don't think the driver gives me any feedback to know when the index is truly ready to go; creation of the index itself seems asynchronous, but I have no way of knowing when the data has finished indexing. What I usually do is shut down the C# server after the indexes have been programmatically created, run db.movies.reIndex() in the mongo shell and let that finish, then exit the shell, shut down MongoDB, and restart it to make sure everything is running optimally before starting any query profiling/timing. Is there a better/faster way to do this?


So, with roughly 100k documents and indexes on the released year, running distincts on the fields separately took around 6 seconds.
When I tried C# threading, splitting the released years into 2, 4, or 8 equal chunks and joining all the threads, it took around 8 seconds, so no real gains there. It could be that the data was not spread as evenly across the release-year bins as one would imagine; the data could have been lopsided, so some threads would do all the heavy lifting while others had nothing. Maybe if I had threaded each distinct and run them asynchronously, I would have gotten better performance from threading.
When I tried the aggregation pipeline mentioned earlier, I was getting 8-11.5 seconds. Things just kept getting worse, so I threw my hands up and decided to appeal to a higher power and post to the various groups for some collaboration.

  There are a couple of caveats I would like to point out.

   #1 - The data I have posted is similar in structure and element types only, not in element names or content. I think my average document size is about 2,500 bytes, according to Mongo's stats for the collection.
   #2 - There are virtually zero "skipped" documents in my data set when examining by release date; in fact, most of the time I'm querying across all documents, and the release date doesn't cut things down anyhow, because I get the min and max release dates earlier and use those in the query, or I just omit it altogether.

So what I have decided to do (for now), in order to give the appearance of faster queries, is to pre-run them periodically in a background thread on the C# server and cache the results. When the web service is called, it sends back the cached results. This prevents users from bogging down Mongo by clicking an asynchronous JavaScript query button 10 times because they are impatient that the result is taking a long time, meanwhile inadvertently flat-lining the Mongo server by inundating it with overlapping, cascading, resource-choking queries that stretch the timeline for results out even further. It's a vicious cycle.
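That pre-run-and-cache pattern can be sketched as follows (plain JavaScript; runDistinctQuery and handleRequest are hypothetical names, and the real version would query MongoDB and refresh on a timer):

```javascript
// Hypothetical stand-in for the expensive distinct/aggregation query.
function runDistinctQuery(field) {
  return ["Academy", "Golden Globe"]; // pretend this hit MongoDB
}

// Background refresher: pre-run the expensive queries periodically
// and serve every web request from the cache.
const cache = new Map();

function refreshCache() {
  for (const field of ["award", "country"]) {
    cache.set(field, runDistinctQuery(field));
  }
}

refreshCache();                          // warm the cache at startup
// setInterval(refreshCache, 60 * 1000); // then refresh on a timer

// The web service handler only ever reads the cache, so impatient
// clients cannot pile overlapping queries onto the database.
function handleRequest(field) {
  return cache.get(field) ?? [];
}

console.log(handleRequest("award")); // [ 'Academy', 'Golden Globe' ]
```

The trade-off is staleness: results can be up to one refresh interval old, which is usually acceptable for slow-changing distinct lists.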
  
 
All the Best!


Wan Bachtiar

Nov 25, 2018, 11:02:29 PM
to mongodb-user

When I ran separate db.collection.distinct() on each of these fields, it was actually faster to run multiple distinct queries rather than the aggregation pipeline. I know that db.collection.distinct() is a limited / special purpose aggregation pipeline function

Hi,

Data in MongoDB has a flexible schema, which lets you focus on your application design and let the database design conform to the benefit of the application. If your application requires specific queries, it can be beneficial to alter the document schema to boost the performance of those queries.

See Data Modelling Introduction and Data Model Examples and Patterns for more information and examples.

User is limited to the amount of indexes, I think it was somewhere around 64, some power of 2. This was a drawback to me because if the app has dynamic queries where it allows the user to search by various criteria (9-ways-from-Sunday)

As you mentioned in the thread, an excessive number of indexes may affect write performance. This is because for every write operation those indexes have to be updated as well (not rebuilt entirely).

Depending on your application's use case, you can always ensure the application adds an indexed field to limit the scope of the query and simplify query operations. See also Indexing Strategies.

You are creating an ascending and descending index on the year. I thought I read that mongodb was smart enough to know read an index forward or backward

Yes, you don’t need both ascending and descending indexes on a Single Field Index, because MongoDB can traverse the index in either direction. This may not be the case with Compound Indexes, though.

When you created the indexes, it looked like you used the mongo command line tool and it was probably a blocking/synchronous operation that waited until completion before you got the command line prompt back?

There are two types of index creation: foreground and background. The default in the mongo shell is to create the index in the foreground (blocking). You have the option to specify a background build so the operation does not block other database activities. See also db.collection.createIndex().

So with roughly 100k documents, with indexes on the released year, running distincts on the fields separately seemed to be around 6 seconds.

I would suggest performing cursor.explain() on the queries to analyse further what’s causing the slowness. See also Explain Results.

Also, you could perform some monitoring on the server to identify bottlenecks. See also MongoDB Monitoring Tools.

Regards,
Wan.
