Doing Totals/aggregates

240 views
Skip to first unread message

Rajesh Gupta

unread,
Jan 26, 2017, 3:48:26 AM1/26/17
to google-a...@googlegroups.com, objectify...@googlegroups.com
Hi, 

I have the following Entities (Shown in objectify style) 
I want to do several reports for this data with different invoice date ranges.  
 - Sales by customer 
 - Product Qty sold by product
 - Product value by product
 
The reports can be filtered with any Date range like 'Last Month', 'Last Qtr', 'Year to date' or any custom date range.

The no of invoices fetched can be from 100-10000, depending on the date range.

What is the good way to do such aggregate reports.?  

So far, we have used looping with hashmaps.  But this style of reports will increase and the data will also increase.  I am looking for more efficient data structures or schemas.  

@Entity
class SalesInvoice {
  @Id Long id;
  Key<Customer> customerKey;
  
  List<LineItem> lineItems

  double total;
  Key<SalesPerson> salesPersonKey;
  Date invoiceDate;
}

@Embed
class LineItem {
  Key<Product> productKey;
  double qty;
  double price;
  double tax;
  double totalPrice
}

--
Regards,
Rajesh
Accounting/Inventory/Orders/Sales/Purchase on Google Cloud Platform and Mobile

Nicholas Okunew

unread,
Jan 26, 2017, 4:21:46 AM1/26/17
to objectify...@googlegroups.com, google-a...@googlegroups.com
1. You need to remove all doubles and use bigdecimal or joda money. If you don't know why, google it.

2. If you want to do adhoc reporting, consider replicating to bigquery and running from there. If you don't want to do that, queries in batches on backends is the way to go, but this kind of reporting is the datastores weakspot.
--
You received this message because you are subscribed to the Google Groups "objectify-appengine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to objectify-appengine+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Rajesh Gupta

unread,
Jan 26, 2017, 7:17:36 AM1/26/17
to objectify...@googlegroups.com, google-a...@googlegroups.com
Hello,

No, we cannot do bigquery replication, as invoices may get deleted and updated. 

We do these reports in a task, and tasks write to memcache the report data.

Any json technologies here?.  Can the invoice list be written using jackson as json, and use any json query technologies.  

On Thu, Jan 26, 2017 at 2:51 PM, Nicholas Okunew <naok...@gmail.com> wrote:
1. You need to remove all doubles and use bigdecimal or joda money. If you don't know why, google it.

2. If you want to do adhoc reporting, consider replicating to bigquery and running from there. If you don't want to do that, queries in batches on backends is the way to go, but this kind of reporting is the datastores weakspot.
To unsubscribe from this group and stop receiving emails from it, send an email to objectify-appengine+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "objectify-appengine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to objectify-appengine+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeff Schnitzer

unread,
Jan 26, 2017, 2:43:01 PM1/26/17
to objectify-appengine, google-a...@googlegroups.com
I use Postgres. Replicate a subset of your entity data to a datastore that supports aggregations. Assuming your dataset fits in a traditional RDBMS, they’re awesome for aggregations and ad-hoc queries.

The datastore makes an awesome primary datastore because it is zero maintenance and never breaks, but it isn’t the only tool in the toolbox. Use dedicated “slave” indexes for special purposes like fulltext search and aggregation. I use the task queue for replication; whenever an entity changes in a material way, my code transactionally enqueues a sync-task that syncs the relevant state of that entity to the appropriate index.

You might be asking yourself why not just use a RDBMS in the first place? It’s a reasonable question, and for many apps it is the right one. But running an RDBMS as a primary store has a cost - you have to be really careful with it. Running a DDL statement that locks up the db is a minor inconvenience when it takes down your reports pages for a half hour; if it takes down your sales flow it’s catastrophic. Personally I like the two-store solution, but I had to build some infrastructure to make the syncing process easier.

If your dataset doesn’t fit in an RDBMS or has load requirements that make relational databases untenable, you can still use BigQuery and similar. You can delete/change records by adding new records that “cancel out” the old records. It’s not as pretty as an update but adds up just the same.

Jeff


George (Cloud Platform Support)

unread,
Jan 26, 2017, 3:52:06 PM1/26/17
to Google App Engine, objectify...@googlegroups.com
How large can your sales invoice data get in the end? The solutions recommended above may work well for relatively small volumes. If you need to process terabytes of data in the end, Cloud Bigtable might prove speedier and cost less overall.

Jim

unread,
Jan 27, 2017, 2:18:38 PM1/27/17
to Google App Engine, objectify...@googlegroups.com
Another option is to use map-reduce against your datastore tables for aggregation of truly 'big' data sets.  It's nowhere near as flexible as some of the other options mentioned here, but if your requirements are fairly static it works great and will allow you to keep your data in one place.

Rajesh Gupta

unread,
Feb 8, 2017, 4:46:28 AM2/8/17
to google-a...@googlegroups.com
All of the above solutions involve more expertise and time, which is not suitable for small teams. Ofcourse, they achieve good results for large data and big companies.  Small companies will loose the benefit of using the appengine paas with such wide solutions.
It is beneficial to provide some aggregation layer on top of datastore with built in fast google technologies.
Interested developers will choose to use it.

Regards,
Rajesh
Accounting/Inventory/Orders/Sales/Purchase on Google Cloud Platform and Mobile

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengine+unsubscribe@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at https://groups.google.com/group/google-appengine.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/9fe3d5f5-1995-461b-85cf-1006877f3dad%40googlegroups.com.

Nicholas (Google Cloud Support)

unread,
Feb 8, 2017, 9:36:40 AM2/8/17
to Google App Engine
Datastore is not designed for large scale data aggregation.  Its emphasize is on fast and horizontally scalable reads.  Though it can be used to query over large datasets, it is nowhere near optimized in the same way that SQL or SQL-like technologies are.  For full dataset joins and aggregation, BigQuery really shines.  The other options mentioned above Postgres and MapReduce would also be appropriate depending on your data.  I would strongly suggest using BigQuery as it comes with a Datastore loading solution and is designed for precisely what you describe.  In addition, reading all entities in order to aggregate some data from them will be far more costly with Datastore than the equivalent job in BigQuery.

In your earlier comments, you mention that you cannot use BigQuery due to an update and delete requirement.  Data can be updated in BigQuery.  Do such update capabilities meet the requirements you mentioned above?  If not, what is ability do you think is missing from these update and delete features?

Lastly, in your example case, you mentioned that a given query could pull anywhere from 10 to 10,000 entities.  How many entities are there in total?  Knowing the total count of entities through which your query must search is crucial to understanding the likely costs of queries in time and quota usage.  How many total entities do you expect to have in a given dataset within the next 6-12 months based on recent month's growth?  The answer to this should help you decide whether to invest time into replicating data into a aggregation tool like BigQuery and building useful queries for reporting.

Rajesh Gupta

unread,
Feb 10, 2017, 3:13:41 AM2/10/17
to google-a...@googlegroups.com
We are using namespaces to provide ERP SaaS solutions..  We want to do fast aggregations for reports within the namespace.  Each KIND can have 5000-10000 rows within the given time range. 

Combining Datastore with BigQuery is not cheap solution for small teams and for small data. Now, small teams have to worry about two data storage technologies. What about the transactions?  Now, a transaction should extend to the BigQuery as well, when the data is updated in the datastore. 
We have to worry about the schema changes that will happen in the datastore and should be replicated to the BigQuery.
Suddenly now the appengine solution has become too complex with too many technologies for writing simple apps. It defeats the purpose of choosing appengine paas.

It will be beneficial to provide an aggregation layer on top of Datastore.  Google team can incorporate mapreduce as an optional layer for aggregations and make it transparent to the developers and give a easy api to use.

Just my 2 cents.  

Regards,
Rajesh
Accounting/Inventory/Orders/Sales/Purchase on Google Cloud Platform and Mobile


For more options, visit https://groups.google.com/d/optout.



--

Nicholas (Google Cloud Support)

unread,
Feb 10, 2017, 10:22:58 AM2/10/17
to Google App Engine
Exports to BigQuery can be done at regular intervals rather than at every Datastore write removing the bulk of transactional and latency challenges with doing both writes at once.  Given the existing tools available (BigQuery, Cloud SQL, Dataproc, MySQL on GCE, one's own application of MapReduce upon an existing Datastore dataset), the above-mentioned solutions are those available to you at the moment.

As for the aggregation layer on top of Datastore using MapReduce, that sounds like an appropriate idea for a feature request on the App Engine public issue tracker.  I encourage to send your feature requests there as we welcome the opportunities to improve the platform.  Be sure to include the business case you provided above along with some more specific details about how you see this being implemented.  Any relevant details can be helpful in evaluating and realizing such ideas.  After filing said feature request, feel free to post a link to it here so that others in the community can follow through and star it to show their support.

Hope this helps!
Nicholas


On Friday, February 10, 2017 at 3:13:41 AM UTC-5, Rajesh Gupta wrote:
We are using namespaces to provide ERP SaaS solutions..  We want to do fast aggregations for reports within the namespace.  Each KIND can have 5000-10000 rows within the given time range. 

Combining Datastore with BigQuery is not cheap solution for small teams and for small data. Now, small teams have to worry about two data storage technologies. What about the transactions?  Now, a transaction should extend to the BigQuery as well, when the data is updated in the datastore. 
We have to worry about the schema changes that will happen in the datastore and should be replicated to the BigQuery.
Suddenly now the appengine solution has become too complex with too many technologies for writing simple apps. It defeats the purpose of choosing appengine paas.

It will be beneficial to provide an aggregation layer on top of Datastore.  Google team can incorporate mapreduce as an optional layer for aggregations and make it transparent to the developers and give a easy api to use.

Just my 2 cents.  

Regards,
Rajesh
Accounting/Inventory/Orders/Sales/Purchase on Google Cloud Platform and Mobile

PK

unread,
Feb 10, 2017, 10:52:21 AM2/10/17
to google-a...@googlegroups.com
Are you aware of this map/reduce framework?I have been using it for reporting on top of datastore. It is not as convenient as a SQL aggregation query but is very flexible. 


PK
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengi...@googlegroups.com.
To post to this group, send email to google-a...@googlegroups.com.

Jeff Schnitzer

unread,
Feb 10, 2017, 5:27:38 PM2/10/17
to Google App Engine
If you’re doing aggregations across <10k rows, you don’t need (or want) BigQuery or map/reduce or any other “big data” solution. You want a basic SQL database. Use Cloud SQL if you want something easy to integrate with GAE.

You’re not going to get SQL aggregations out of the datastore; it’s just not set up that way. Map/reduce will always perform poorly and provide limited ad-hoc query ability. I’m a huge fan of the datastore myself but it’s not the only tool in the toolbox. If you are unwilling/unable to use multiple tools, just use a SQL database. They work fine with GAE.

That said, it’s pretty easy to replicate data from the datastore to an RDBMS via the task queue. If you’re using the RDBMS as an analytics engine there are no transactional issues; you just replicate data. Any junior engineer on your staff should be able to figure this out.

Jeff

On Fri, Feb 10, 2017 at 12:13 AM, Rajesh Gupta <rajesh...@veersoftsolutions.com> wrote:
We are using namespaces to provide ERP SaaS solutions..  We want to do fast aggregations for reports within the namespace.  Each KIND can have 5000-10000 rows within the given time range. 

Combining Datastore with BigQuery is not cheap solution for small teams and for small data. Now, small teams have to worry about two data storage technologies. What about the transactions?  Now, a transaction should extend to the BigQuery as well, when the data is updated in the datastore. 
We have to worry about the schema changes that will happen in the datastore and should be replicated to the BigQuery.
Suddenly now the appengine solution has become too complex with too many technologies for writing simple apps. It defeats the purpose of choosing appengine paas.

It will be beneficial to provide an aggregation layer on top of Datastore.  Google team can incorporate mapreduce as an optional layer for aggregations and make it transparent to the developers and give a easy api to use.

Just my 2 cents.  

Regards,
Rajesh
Accounting/Inventory/Orders/Sales/Purchase on Google Cloud Platform and Mobile

Reply all
Reply to author
Forward
0 new messages