Aggregation Issue

128 views
Skip to first unread message

Michael Wheater

unread,
Jul 8, 2014, 1:35:58 PM7/8/14
to mongod...@googlegroups.com
So, I'm stuck on an issue with aggregation.

I have a set of data in a collection that looks like this (I'll call this the data collection)

db.data.insert({ o_id: 1, value: 5 })
db.data.insert({ o_id: 2, value: 6 })
db.data.insert({ o_id: 3, value: 9 })
db.data.insert({ o_id: 4, value: 7 })

o_id corresponds to the id in another collection that looks like this (I'll call this the category collection)

db.category.insert({ _id: 1, color: "blue" })
db.category.insert({ _id: 2, color: "blue" })
db.category.insert({ _id: 3, color: "yellow" })
db.category.insert({ _id: 4, color: "red" })

My end goal is aggregate the value field based on color.

{
    _id : blue
    value : 11
},
{
    _id: yellow
    value: 9
},
{
    _id: red
    value: 7
}


I know this is going to take at least two queries, one to look up the color collection and one to do the aggregate. I'm mostly concerned with the aggregate call.

I know I can do it with a big long conditional but I think this will get out of hand quickly (color will probably have 100ish entries, data will have millions)

db.data.aggregate([
    {
        $project: {
            "value" : '$value',
            "color" : {
                $cond: [
                    { $eq : ['$o_id',1]},
                    'blue',
                    {
                        $cond: [
                            { $eq : ['$o_id',2]},
                            'blue',
                            {
                                $cond: [
                                    { $eq : ['$o_id',3]},
                                    'yellow',
                                    {
                                        $cond: [
                                            { $eq : ['$o_id',4]},
                                            'red',
                                            'error'
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        }
    },
    {
        $group: {
            _id: '$color',
            value: { '$sum' : '$value' }
        }
    }
 ])

I was hoping there could be another way, one that would be more maintainable, perhaps using $let or $map or something.






Asya Kamsky

unread,
Jul 8, 2014, 2:42:46 PM7/8/14
to mongodb-user
I was hoping there could be another way, one that would be more maintainable

Storing the color along with o_id in the data collection would make it simple to aggregate.   Is there a compelling reason not to denormalize this field into the data documents?

Asya



--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.
 
For other MongoDB technical support options, see: http://www.mongodb.org/about/support/.
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user...@googlegroups.com.
To post to this group, send email to mongod...@googlegroups.com.
Visit this group at http://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/70744806-8236-4fa2-9255-26108043933d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Wheater

unread,
Jul 8, 2014, 3:00:38 PM7/8/14
to mongod...@googlegroups.com
I don't want to denormalize due to the large amounts of data duplication.

I tried to keep the example simple but the category collection will have a large amount of fields other than color that I would like to use in a similar way. To store all the fields of the category with the individual data collection records is a large amount of data duplication. Also if category changes color (or any other field) that means large amounts of queries to update the data collection.

I don't expect the category collection to go above 100 records. The data collection will easily be above 100 million records.

Asya Kamsky

unread,
Jul 8, 2014, 3:06:08 PM7/8/14
to mongodb-user
One possible solution is to go the reverse direction than you are currently going.

First aggregate on o_id and then map o_id values to category "strings" (re-aggregating by string now).
I'm assuming the result set will be considerably smaller than pre-grouped data going in...

Asya



Nefiga

unread,
Jul 10, 2014, 12:20:24 AM7/10/14
to mongod...@googlegroups.com

I'm having this exact problem. It would be awesome if there was some way to map values to a key:value store in a project stage in aggregation.

Something like

{ some_storage: [ 1, 1, 2, 1, 5, 2 ] }

$project
: {
  new_field
: some_storage[$old_field]
}

Nefiga

unread,
Jul 10, 2014, 12:46:26 AM7/10/14
to mongod...@googlegroups.com

Nefiga

unread,
Jul 10, 2014, 1:31:08 PM7/10/14
to mongod...@googlegroups.com
The stackoverflow solution in the link above works nicely, but it would be even better if somehow there was an operation to get a scalar/string (value) out of an array in the $project phase. Then there wouldn't be a need to do the unwind at the very least.

Any thoughts, Asya?

Asya Kamsky

unread,
Jul 12, 2014, 2:48:37 AM7/12/14
to mongodb-user
I'm not quite sure what the exact syntax you're proposing would be.

Can you give an example (with some made-up operator that would do what you want)?

Asya



Nefiga

unread,
Jul 13, 2014, 2:06:06 PM7/13/14
to mongod...@googlegroups.com
The reasoning behind this is to basically do application joins inside of mongo so that the joined information can be used inside of the aggregation operation without having to deal with multiple aggregation queries using the $out operator and garbage collection of completed aggregation queries etc.

OK, here is a scenario.

db.employee.drop();
db
.employee.insert({'_id' : 1, 'name' : 'Bob', 'department_id' : 1});
db
.employee.insert({'_id' : 2, 'name' : 'Ted', 'department_id' : 1});
db
.employee.insert({'_id' : 3, 'name' : 'Joe', 'department_id' : 2});
db
.employee.insert({'_id' : 4, 'name' : 'Abe', 'department_id' : 2});


db
.department.drop();
db
.department.insert({'_id' : 1, 'name' : 'Accounting', 'company_id' : 5});
db
.department.insert({'_id' : 2, 'name' : 'Accounting', 'company_id' : 6});


db
.company.drop();
db
.company.insert({'_id' : 5, 'name' : 'Acme Company'});
db
.company.insert({'_id' : 6, 'name' : 'Beta Company'});

Let's say I want to count the number of employees per company. What I want is to aggregate against the employee collection, but use the company_id in a $group stage. If we are "making up" syntax to make this possible, I suppose such syntax could be:

var company_id_array = [5, 6];


db
.employee.aggregate([
   
{$project: {'name' : '$name', 'company_id' : '$company_id_array[$department_id]' }}
]);


//would return
{ "_id" : 1, "name" : "Bob", "company_id" : 5 }
{ "_id" : 2, "name" : "Ted", "company_id" : 5 }
{ "_id" : 3, "name" : "Joe", "company_id" : 6 }
{ "_id" : 4, "name" : "Abe", "company_id" : 6 }

although the syntax $company_id_array[$department_id] is obviously not valid and seems problematic to interpretation if it was allowed. Perhaps a made up operator would be more explicit/mongo-esque. We can call it "$selectIndex", and I suppose it could fit into the mongo array aggregation operators (along with "$size"):

{$project: {'name' : '$name', 'company_id' : {$company_id_array : {$selectIndex : $department_id}}}

Really what I want to do is (using the made up syntax above):

var company_id_array = [5, 6];
db
.employee.aggregate([
   
{$project: {'name' : '$name', 'company_id' : {'$company_id_array' : {$selectIndex : '$department_id'}} }},
   
{$group: {'_id : '$company_id', 'count' : {'$sum' : 1}}}
]);


Another possibility (that would work better with a single aggregate wrapper so it is not necessary to use the var company_id_array = []; block) could be:

db.employee.aggregate([
{$project: {'name' : '$name',
'company_id' : {
'$let': {
vars
: { 'company_id_array': [5,6], 'department_id': '$department_id'  },
in: {'$$company_id_array : {$selectIndex : '$$department_id'}}
}}
}}
]);

If $selectIndex or something similar were to be created, it could probably be even more useful if it could do something like:

$company_id_array = [{'_id' : 1, 'name' : 'Acme Company'}, {'_id' : 2, 'name' : 'Beta Company'}];

//example project
{$project: {'name' : '$name', 'company_id' : {$company_id_array : {$selectIndex : '$department_id.name'}}}}


Although what I am trying to accomplish at the moment does not require anything like that, I could see how something like that could be very useful and basically allow simple application level joins inside of mongo within a single aggregation statement where the joined collection is sufficiently small to basically "put into an array" so to speak to pass along to the aggregation pipeline. Personally I would use this all the time. I have many use cases where I am doing some simple denormalization in collections that contain tens to hundreds of millions of rows with collections that are in the 10 - 1000 range.

Of course, maybe there's another simple way to do what I am asking about using the existing 2.6 syntax and I have just not come up with it yet. For the time being, the only solution I can think of (that still allows me to use mongo to do the aggregation) is to use the $out parameter and the perform a second aggregation.

What do you think?

Nefiga

unread,
Aug 13, 2014, 8:20:46 PM8/13/14
to mongod...@googlegroups.com
I would LOVE LOVE LOVE a reply to this from anyone. I keep running into issues where if there was just a way to get an array to scalar it would save so much time and processing. Now I have an issue where I have a single item in an array that I've obtained through this method described here: http://stackoverflow.com/questions/23995955/mongodb-aggregation-pipeline-is-slow

but now i have to unwind over 50 million elements just to peek into an array which is already length 1, if I could just merely look at it

Anyway, if anyone has any ideas, let me know.

Thanks!
On Friday, July 11, 2014 11:48:37 PM UTC-7, Asya Kamsky wrote:<blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;
...

Asya Kamsky

unread,
Aug 14, 2014, 10:01:36 PM8/14/14
to mongodb-user
The problem with what you are asking for, is you created a fully
normalized relational schema and now you want MongoDB to do joins for
you on the server side (btw, which server? if we're sharded, there are
many servers) - and this is simply not how you are expected to
maximize the benefit of a document data store.

Having said that though - it seems to me that you can already do what
you were trying to do by scripting a mapping from department_ids to
company_ids... I show a few tricks like that on my "MongoDB tricks"
blog...

With your example (but I added a department/company mapping to show it
works for non-exact mapping) you do something like this (using shell):

var mapping=db.department.aggregate({$project:{_id:0,department_id:"$_id",company_id:1}}).toArray();
cond=mapping[mapping.length-1].company_id;
for (i=mapping.length-2;i>=0;i--)
cond={$cond:{if:{$eq:["$department_id",mapping[i].department_id]},then:mapping[i].company_id,else:cond}}
db.employee.aggregate({$project:{name:1, company_id:cond}})
{ "_id" : 1, "name" : "Bob", "company_id" : 5 }
{ "_id" : 2, "name" : "Ted", "company_id" : 5 }
{ "_id" : 3, "name" : "Joe", "company_id" : 6 }
{ "_id" : 4, "name" : "Abe", "company_id" : 6 }


Asya
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user"
> group.
>
> For other MongoDB technical support options, see:
> http://www.mongodb.org/about/support/.
> ---
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mongodb-user...@googlegroups.com.
> To post to this group, send email to mongod...@googlegroups.com.
> Visit this group at http://groups.google.com/group/mongodb-user.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/mongodb-user/ffb010c5-0457-4f5d-82c4-c33564ba5c05%40googlegroups.com.

Nefiga

unread,
Sep 5, 2014, 3:13:18 PM9/5/14
to mongod...@googlegroups.com
Thanks for the reply, Asya. I know it was awhile back but I never noticed until now.

Yes we had previously read your blog for tips, it is definitely very helpful. We ended up coming up with more or less the same solution as you proposed.

One thing that would make this even easier is support for $in when using the $cond in the aggregation (see https://jira.mongodb.org/browse/SERVER-6146).

I have upvoted for this issue as it greatly reduces the amount of nested $cond required to make this work.

Thanks for the help!

Asya Kamsky

unread,
Sep 6, 2014, 2:16:14 AM9/6/14
to mongodb-user
Reply all
Reply to author
Forward
0 new messages