Wide to long transformation in aggregation

43 views
Skip to first unread message

Boyd Skelton

unread,
Apr 20, 2016, 9:48:54 AM4/20/16
to mongodb-user
I am working with a dataset that looks like this:

asset, start_date, end_date, ship_date
"A", 2015-01-01, 2015-02-01, 2015-03-01
"B", 2015-02-01, 2015-02-15, 2015-03-01

Ultimately, I would like to aggregate the data to produce the following:

[{
  _id : {
    year: 2015,
    month: 1
  }, 
  start:1,
  end: 0,
  ship: 0
},
{
  _id : {
    year: 2015,
    month: 2
  }, 
  start: 1,
  end: 2,
  ship: 0
},
{
  _id : {
    year: 2015,
    month: 2
  }, 
  start: 0,
  end: 0,
  ship: 2
}]

My thought is to use $project, $addToSet, and $unwind; but is there a better way?

Thank you.

Kevin Adistambha

unread,
Apr 29, 2016, 1:30:52 AM4/29/16
to mongodb-user

Hi Boyd

asset, start_date, end_date, ship_date

“A”, 2015-01-01, 2015-02-01, 2015-03-01

“B”, 2015-02-01, 2015-02-15, 2015-03-01

Could you provide an actual document that you’re trying to aggregate? The example you gave looks like an SQL table and does not accurately show the datatype associated with it.

For example, the date aggregation operator does not work with string types. It requires an ISODate type to be able to function.

_id : {
year: 2015,
month: 1
},
start:1,
end: 0,
ship: 0
},

Could you elaborate on the meanings of the field names and values, e.g. what does “start”, “end”, and “ship” signifies? How are they connected to the example you give (which only has “A” and “B”)

Best regards,
Kevin

Boyd Skelton

unread,
May 4, 2016, 11:16:09 AM5/4/16
to mongodb-user
Kevin,

My apologies, I was trying to simplify the data:
[
{
  "order_no" : "A",
  "ship_dt": ISODate("2009-05-15T00:00:00Z"),
  "order_dt": ISODate("2008-01-28T00:00:00Z"),
  "start_dt": ISODate("2008-04-02T00:00:00Z"),
  "stop_dt": ISODate("2008-04-18T00:00:00Z")
},
{
  "order_no" : "B",
  "ship_dt": ISODate("2009-05-15T00:00:00Z"),
  "order_dt": ISODate("2008-02-28T00:00:00Z"),
  "start_dt": ISODate("2008-04-15T00:00:00Z"),
  "stop_dt": ISODate("2008-05-01T00:00:00Z")
}
]

Kevin Adistambha

unread,
May 5, 2016, 3:02:32 AM5/5/16
to mongodb-user

Hi Boyd,

I am assuming that the numbers in your example aggregation represent counts of activities (i.e. ship, order, start, and stop) that occurred in each Year-Month combination. Please let me know if this assumption is incorrect.

Using your example data:

{
  "order_no": "A",
  "ship_dt": ISODate("2009-05-15T00:00:00Z"),
  "order_dt": ISODate("2008-01-28T00:00:00Z"),
  "start_dt": ISODate("2008-04-02T00:00:00Z"),
  "stop_dt": ISODate("2008-04-18T00:00:00Z"
)
}
{
  
"order_no": "B",
  "ship_dt": ISODate("2009-05-15T00:00:00Z"),
  "order_dt": ISODate("2008-02-28T00:00:00Z"),
  "start_dt": ISODate("2008-04-15T00:00:00Z"),
  "stop_dt": ISODate("2008-05-01T00:00:00Z")
}

the aggregation query you need is:

db.orders.aggregate(

  // Project the year and month of each order, conserving the activity type, into an array
  {$project: {
    activities: [
      {year: {$year: "$ship_dt"}, month: {$month: "$ship_dt"}, type: {$literal: "ship"}},
      {year: {$year: "$order_dt"}, month: {$month: "$order_dt"}, type: {$literal: "order"}},
      {year: {$year: "$start_dt"}, month: {$month: "$start_dt"}, type: {$literal: "start"}},
      {year: {$year: "$stop_dt"}, month: {$month: "$stop_dt"}, type: {$literal: "stop"}}
    ]
  }},

  // Unwind the activities array
  {$unwind: "$activities"},

  // For each distinct year-month, push each activity type into an array
  {$group: {
    _id: {year: "$activities.year", month: "$activities.month"},
    type: {$push: "$activities.type"}
  }},

  // For each distinct year-month, put the count of each activity into separate fields
  {$project: {
    _id: "$_id",
    order: {$size: {$filter: {input: "$type", as: "x", cond: {$eq: ["$$x", "order"]} }} },
    start: {$size: {$filter: {input: "$type", as: "x", cond: {$eq: ["$$x", "start"]} }} },
    stop:  {$size: {$filter: {input: "$type", as: "x", cond: {$eq: ["$$x", "stop"]} }} },
    ship:  {$size: {$filter: {input: "$type", as: "x", cond: {$eq: ["$$x", "ship"]} }} }
  }},

  // Sort by year-month
  {$sort: {"_id.year": 1, "_id.month": 1}}
)

Running the aggregation query with your example data resulted in:

{
  "_id": {
    "year": 2008,
    "month": 1
  },
  "order": 1,
  "start": 0,
  "stop": 0,
  "ship": 0
},
{
  "_id": {
    "year": 2008,
    "month": 2
  },
  "order": 1,
  "start": 0,
  "stop": 0,
  "ship": 0
},
{
  "_id": {
    "year": 2008,
    "month": 4
  },
  "order": 0,
  "start": 2,
  "stop": 1,
  "ship": 0
},
{
  "_id": {
    "year": 2008,
    "month": 5
  },
  "order": 0,
  "start": 0,
  "stop": 1,
  "ship": 0
},
{
  "_id": {
    "year": 2009,
    "month": 5
  },
  "order": 0,
  "start": 0,
  "stop": 0,
  "ship": 2
}

Please note that this aggregation query assumes that the whole collection will be used, which may not be very performant in a very large collection. If your use case permits, I would recommend to add a $match stage as the first stage (e.g. before the $project stage) to reduce the amount of data that enters the pipeline.

If you find that you need to do this aggregation frequently, you might find that using a Pre-Aggregated Reports beneficial. An advantage of using pre-aggregated report is that it allows you to generate up-to-the-minute reports without the overhead of running a resource-intensive aggregation query, which could interfere with your day-to-day use of the database. The reports can then be retrieved using a single find() command. The tradeoff is that your application would need to update two collections at once for each data point insertion (i.e. an insert for the original collection, and another for the pre-aggregated report).

Best regards,
Kevin

Boyd Skelton

unread,
May 5, 2016, 12:44:15 PM5/5/16
to mongodb-user
Thank you for the great explanation. 

I noticed the $project to array is a 3.2+ feature, but I cannot find an example that is 2.6 compatible. Is there another route, or would denormalization be better?

Kevin Adistambha

unread,
May 5, 2016, 11:37:25 PM5/5/16
to mongodb-user

Hi Boyd,

I noticed the $project to array is a 3.2+ feature, but I cannot find an example that is 2.6 compatible. Is there another route, or would denormalization be better?

It’s a bit different using 2.6 due to the limited operators available in the aggregation framework, but it can be done. This is tested on 2.6.11:

db.orders.aggregate(

  // Project the year and month of each order, conserving the activity type, into an array with single element
  {$group: {
    _id: "$order_no",
    order: {$push: 
      {year: {$year: "$order_dt"}, month: {$month: "$order_dt"}, type: {$literal: "order"}}
    },
    start: {$push:
      {year: {$year: "$start_dt"}, month: {$month: "$start_dt"}, type: {$literal: "start"}}
    },
    stop: {$push:
      {year: {$year: "$stop_dt"}, month: {$month: "$stop_dt"}, type: {$literal: "stop"}}
    },
    ship: {$push: 
      {year: {$year: "$ship_dt"}, month: {$month: "$ship_dt"}, type: {$literal: "ship"}}
    }
  }},

  // Project the activities of each order into a single array
  {$project: {
    _id: "$_id",
    activities: {$setUnion: ["$order", "$start", "$stop", "$ship"]}
  }},

  // Unwind the activities array
  {$unwind: "$activities"},

  // Project the activity types into separate fields
  {$project: {
    _id: {year: "$activities.year", month: "$activities.month"},
    order: {$cond: [ {$eq: ["$activities.type", "order"]}, 1, 0 ]},
    start: {$cond: [ {$eq: ["$activities.type", "start"]}, 1, 0 ]},
    stop:  {$cond: [ {$eq: ["$activities.type", "stop" ]}, 1, 0 ]},
    ship:  {$cond: [ {$eq: ["$activities.type", "ship" ]}, 1, 0 ]}
  }},

  // For each distinct Year-Month, calculate the total number of each activity
  {$group: {
    _id: "$_id",
    order: {$sum: "$order"},
    start: {$sum: "$start"},
    stop: {$sum: "$stop"},
    ship: {$sum: "$ship"}
  }},

  // Sort by Year-Month

  {$sort: {"_id.year": 1, "_id.month": 1}}
)

This query should have an identical output with the previous one:

{
  "_id": {
    "year": 2008,
    "month": 1
  },
  "order": 1,
  "start": 0
,
  "stop": 0,
  "ship": 0
},
{
  "_id": {
    
"year": 2008,
    "month": 2
  },
  "order": 1,
  "start": 0
,
  "stop": 0,
  "ship": 0
},
{
  "_id": {
    
"year": 2008,
    "month": 4
  },
  "order": 0,
  "start": 2,
  "stop": 1
,
  "ship": 0
},
{
  "_id": {
    
"year": 2008,
    "month": 5
  },
  "order": 0,
  "start": 0,
  "stop": 1
,
  "ship": 0
},
{
  "_id": {
    
"year": 2009,
    "month": 5
  },
  "order": 0,
  "start": 0,
  "stop": 0,
  "ship": 2
}

However, I must reiterate my earlier point that if you frequently need this aggregation, please consider using a pre-aggregated report instead.

Best regards,
Kevin

Reply all
Reply to author
Forward
0 new messages