Understanding the performance of MongoDB for queries on large data sets


Matt Lightbourn

Sep 20, 2016, 7:09:37 PM
to mongodb-user
Hi all, I need to trouble somebody who knows the performance capabilities of MongoDB queries over a database with what I consider to be a lot of documents :) Apologies that it's not the shortest thread; I hope I've given you enough information to form a view of what I want to do.

I have a series of business 'sales transactions' which will be stored as individual documents in MongoDb across a community of businesses within a single collection. I want to run queries to get specific information for each of those businesses to provide business insights for each of the business owners.

I expect to receive on average 250-350 transactions per day per business. Currently there are 65,000 businesses, so the number of documents received and created in MongoDB peaks at about 22,750,000 per day.

Over the course of a single year, that will be about 8.3B documents.
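As a quick sanity check, the arithmetic behind those figures works out like this (using the peak of the 250-350 range):

```javascript
// Back-of-envelope volume check based on the figures quoted above.
const businesses = 65000;
const txPerDayPerBusiness = 350;  // peak of the 250-350 range

const docsPerDay = businesses * txPerDayPerBusiness;
const docsPerYear = docsPerDay * 365;

console.log(docsPerDay);   // 22750000 documents per day at peak
console.log(docsPerYear);  // 8303750000, i.e. ~8.3 billion per year
```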

Here's an example of a sales transaction:

{
    "_id": {
        "$oid": "579936bcf36d28088399737a"
    },
    "object_class": "Goods & Services Transaction",
    "object_category": "Revenue",
    "object_type": "Transaction",
    "object_origin": "Kounta",
    "object_origin_category": "Point of Sale",
    "object_creation_date": {
        "$date": "2016-05-15T22:49:35.665Z"
    },
    "party_uuid": "64324d5d-af53-44dc-8a3d-1eb5ee1f93a0",
    "connection_uuid": "4e711224-8a55-403f-be43-f10377c7c1c4",
    "transaction_reference": "1010",
    "transaction_status": "PAID",
    "transaction_date": {
        "$date": "2016-03-05T03:05:00.201Z"
    },
    "transaction_due_date": {
        "$date": "2016-03-05T03:05:00.201Z"
    },
    "transaction_currency": "NZD",
    "goods_and_services": [
        {
            "item_identifier": "11",
            "item_name": "Stussy Classic Tank Top, M, Blue",
            "item_quantity": 1,
            "item_gross_unit_sale_value": 42,
            "item_revenue_category": "Sales Revenue",
            "item_net_unit_cost_value": null,
            "item_net_unit_sale_value": 35,
            "item_unit_tax_value": 7,
            "item_net_total_sale_value": 35,
            "item_gross_total_sale_value": 42,
            "item_tax_value": 7
        },
        {
            "item_identifier": "20",
            "item_name": "Levi 501 Size 32, Straight",
            "item_quantity": 2,
            "item_gross_unit_sale_value": 262.8,
            "item_revenue_category": "Sales Revenue",
            "item_net_unit_cost_value": null,
            "item_net_unit_sale_value": 219,
            "item_unit_tax_value": 43.8,
            "item_net_total_sale_value": 438,
            "item_gross_total_sale_value": 525.6,
            "item_tax_value": 87.6
        },
        {
            "item_identifier": "29",
            "item_name": "Red Chilli Mens Socks, 8-11, 1pk",
            "item_quantity": 1,
            "item_gross_unit_sale_value": 13.2,
            "item_revenue_category": "Sales Revenue",
            "item_net_unit_cost_value": null,
            "item_net_unit_sale_value": 11,
            "item_unit_tax_value": 2.2,
            "item_net_total_sale_value": 11,
            "item_gross_total_sale_value": 13.2,
            "item_tax_value": 2.2
        }
    ],
    "transaction_gross_value": 580.8,
    "transaction_gross_curr_value": 580.8,
    "transaction_net_value": 484,
    "transaction_cost_value": null,
    "transaction_payments_value": 580.8,
    "transaction_payment_extras_value": null,
    "transaction_tax_value": 96.8,
    "payments": [
        {
            "payment_method": "CASH",
            "payment_value": 580.8,
            "payment_extras": 0,
            "payment_date": {
                "$date": "2016-03-05T03:05:00.201Z"
            }
        }
    ],
    "party": {
        "customer": {
            "customer_identifier": "WALKIN",
            "customer_name": "All Customers"
        },
        "staff": {
            "staff_identifier": "1590",
            "staff_name": "Andrew Brouja"
        }
    },
    "resources": {
        "device": {
            "device_identifier": "1",
            "device_name": "Terminal"
        },
        "location": {
            "location_identifier": "1",
            "location_name": "Store"
        }
    }
}

You will notice I have some obvious indexes in here, which I had made bold in the document example above. The first three (object_class, object_category and object_type) are used to categorise my documents during a query.

Example query descriptions:
So, if I wanted to know the sales by day over the course of the past week, I would look for ALL "Goods & Services Transaction" documents where the category is "Revenue" (as opposed to expense) and the type is "Transaction" (as opposed to proposition or instruction), range by transaction_date (from today - 7 days to today), and then sum transaction_net_value (the transaction total excluding taxes) grouped by party_uuid (the business identifier) and transaction_date. The values I would want from this are party_uuid, transaction_date, the summed value and a category called "sales_by_day".
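A sketch of what that aggregation might look like, written as a plain JavaScript array that could be passed to db.transactions.aggregate(...) in the mongo shell (the collection name "transactions" is an assumption; field names follow the example document):

```javascript
// Sales-by-day over the past week, per business.
const weekAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);

const pipeline = [
  // Filter to revenue transactions from the last 7 days; this is where the
  // object_class / object_category / object_type and transaction_date
  // indexes would be used.
  { $match: {
      object_class: "Goods & Services Transaction",
      object_category: "Revenue",
      object_type: "Transaction",
      transaction_date: { $gte: weekAgo }
  } },
  // Sum the net value per business per day.
  { $group: {
      _id: {
        party_uuid: "$party_uuid",
        day: { $dateToString: { format: "%Y-%m-%d", date: "$transaction_date" } }
      },
      summed_value: { $sum: "$transaction_net_value" }
  } },
  // Shape the output documents.
  { $project: {
      _id: 0,
      party_uuid: "$_id.party_uuid",
      transaction_date: "$_id.day",
      summed_value: 1,
      category: { $literal: "sales_by_day" }
  } }
];
```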

I would then $out to a run-time environment which makes it available to each business. That means I have aggregated and calculated prior to it being required and turned it into a document (in a different collection) which can be picked up by the customer when they need it, rather than attempting to do the query in real time when the customer wants it. The query would run every x minutes and replace all those documents as a refreshed set. I wanted to create these cache-type documents so that I can control how often queries are run against the database.
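That caching step might look like the following sketch: an abbreviated pipeline with an $out stage appended, which replaces the previous contents of the target collection when the aggregation completes (the collection name "insights_cache" is an assumption):

```javascript
// Scheduled refresh: $out writes the aggregation result into a separate
// collection, replacing its previous contents when the pipeline finishes.
const salesByDayPipeline = [
  { $match: { object_category: "Revenue" } },  // plus the other filters
  { $group: { _id: "$party_uuid",
              summed_value: { $sum: "$transaction_net_value" } } },
  { $out: "insights_cache" }                   // must be the last stage
];

// Run every x minutes from a scheduler, e.g. in the mongo shell:
//   db.transactions.aggregate(salesByDayPipeline);
```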

This is only one of many such queries. I want to understand how this will perform in the real world. If, a year down the road, I have 8.3B documents: firstly, the query only cares about documents created in the last 7 days; then it only cares about the type and category of the documents before it calculates. There will be other queries which look at an entire year of data and sum grouped by the month of transaction_date.

I would also probably calculate a load of other values in the same queries, so there would be a document from a 7-day set, a 30-day set and a 12-month set. I would not only include sales_by_day but also any other values I can get out with the same group criteria, such as sales_expense_by_day. I would have other queries where maybe it requires an $unwind, like top_selling_product and top_companion_product, etc., along with any queries where the grouping is not by day but by customer (like top_customer_sales) or staff (like top_staff_sales).
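A top_selling_product style query might be sketched like this: $unwind turns each transaction into one document per line item, so the goods_and_services entries can be grouped and summed individually (field names follow the example document; the exact query shape is an assumption):

```javascript
// Top-selling products per business: unwind line items, sum quantities.
const topProductPipeline = [
  { $match: { object_category: "Revenue" } },
  { $unwind: "$goods_and_services" },
  { $group: {
      _id: { party_uuid: "$party_uuid",
             item_name: "$goods_and_services.item_name" },
      units_sold: { $sum: "$goods_and_services.item_quantity" }
  } },
  { $sort: { units_sold: -1 } }  // best sellers first
];
```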

What I want to understand is: is this child's play for Mongo, or am I pushing its performance considerably? Would you $out those documents to another collection and use Mongo for both data warehouse AND run-time, or do you think the data produced from the aggregation should be pushed off Mongo and into something like Cassandra? My main purpose for MongoDB in this project is to ETL transactions from multiple sources into the standardised data schema (as shown above) and keep them all, then apply queries to that data set to produce data values to be consumed by reports and dashboard widgets.

Any opinion and recommendation will be very much appreciated. I would also potentially like some sort of stats - I've looked everywhere and found nothing. Like: how many documents in a collection? How many documents with a matching value in an index before it impacts performance? How will my queries perform? If I am using a single collection for all these transactions, should I consider a collection per year? What's the best way to set up MongoDB to handle my use - sharding, etc.?

Thanks, Matt


Kevin Adistambha

Sep 29, 2016, 1:45:56 AM
to mongodb-user

Hi Matt,

So, if I wanted to know the sales by day over the course of the past week, I would look for ALL “goods & services transaction” documents where it is “revenue” (as opposed to expense) and where it is a “transaction” (as opposed to proposition or instruction) and range by transaction_date (from today - 7 days to today) and then sum the transaction_net_value (the transaction total ex Taxes) grouped party_uuid (the business identifier) and transaction_date. The values I would want from this are party_uuid, the transaction_date, the summed value and a category called “sales_by_day”.

I would then $out to a run-time environment which makes it available to each business. That means I have aggregated and calculated prior to it being required and turned it into a document (in a different collection) which can be picked up by the customer when they need it rather than attempting to do the query in real-time when the customer wants it. The query would run every x minutes and replace all those documents as a refreshed set. I wanted to create these cache type documents so that I can control how often queries are run on the database.

If I understand correctly, you have 65,000 businesses, all of which you intend to give an aggregated report every week/month/year. Currently you are storing detailed documents recording each sale for each business. However, you don’t want to perform the aggregation on-demand, but as a scheduled job instead.

If my understanding is correct, then you may be able to use the pre-aggregated report design pattern. The main idea is that when you are doing an insert into the main collection, you also do an update to a document that keeps a running tally of the inserted data. This is similar to your $out method, although it doesn’t actually use the aggregation framework to achieve it.
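A minimal sketch of that pattern, assuming a "daily_sales" tally collection and document shape of my own invention: on every insert into the main collection, you also issue an upsert that increments the running totals for that business and day.

```javascript
// Pre-aggregated report pattern: build the upsert that keeps a running
// per-business, per-day tally alongside each transaction insert.
function dailyTallyUpdate(tx) {
  const day = tx.transaction_date.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return {
    filter: { party_uuid: tx.party_uuid, day: day },
    update: { $inc: { sales_by_day: tx.transaction_net_value,
                      transaction_count: 1 } },
    options: { upsert: true }  // create the tally document if it doesn't exist
  };
}

// In the mongo shell this would run alongside the insert, e.g.:
//   var u = dailyTallyUpdate(tx);
//   db.daily_sales.update(u.filter, u.update, u.options);

// Example using values from the transaction document above:
const tx = {
  party_uuid: "64324d5d-af53-44dc-8a3d-1eb5ee1f93a0",
  transaction_date: new Date("2016-03-05T03:05:00.201Z"),
  transaction_net_value: 484
};
// dailyTallyUpdate(tx).filter.day → "2016-03-05"
```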

For more information, please see the Pre-Aggregated Reports page.

What I want to understand is, is this child play for Mongo or am I pushing it’s performance considerably? Would you $out those documents to another collection and use Mongo for both Data Warehouse AND run-time or do you think, the data produced from the aggregation should be pushed off Mongo and into something like Cassandra? My main purpose for MongoDb in this project is to ETL transactions from multiple sources into the standardised data schema (as shown above) and keep them all. Then apply queries to that data set to produce data values to be consumed by reports and dashboard widgets.

Any opinion and recommendation will be very much appreciated. I would also potentially like some sort of stats - I’ve looked everywhere for it and found nothing. Like, how many documents in a collection? How many documents with matching value in an index before it impacts performance? How will my queries perform? If I am using a single collection for all these transactions, should I consider a collection per year? What’s the best way to setup MongoDb to handle my use - sharding, etc.

One of the reasons why you don't see many stats on the internet is that, unlike relational databases, where the database design is focused on how the data is stored, MongoDB design focuses on how the data is used. For example, given the same input data as yours, you require an aggregated report. Someone else might require fast reads, others might require fast writes, while others may require a balanced read/write workload. All four use cases also depend on the available budget and hardware, and thus could have radically different document designs, even though the use cases look similar at a glance.

You might find the following links helpful for your consideration:

Best regards,
Kevin

Matt Lightbourn

Sep 29, 2016, 1:59:00 AM
to mongod...@googlegroups.com

Thanks Kevin, I understand your logic here. I'll investigate the pre-aggregated reports you have mentioned. Thanks again, Matt

