Datastore: how to design for huge time-series data


Martin Trummer

unread,
Aug 13, 2013, 8:59:52 AM8/13/13
to google-a...@googlegroups.com
I'm a newbie to the App Engine datastore and would like to know how best to design for this use case:
some time-series may hold a huge amount of data: e.g. terabytes for one time-series.
The transactions doc says about entity groups:
  • "Every entity belongs to an entity group, a set of one or more entities that can be manipulated in a single transaction."
  • "every entity with a given root entity as an ancestor is in the same entity group. All entities in a group are stored in the same Datastore node."
So does that mean that all the terabytes of data for the huge time-series would end up on one computer somewhere in the App Engine network?
If so:
  • that's not a good idea, right?
  • how do I avoid it? Should I split the data into sections (e.g. per month), where each section has its own kind/entity group?

Jay

unread,
Aug 13, 2013, 4:42:25 PM8/13/13
to google-a...@googlegroups.com
In my opinion, your biggest takeaway here should be to avoid a mega entity group, and you do this simply by not giving all the entities in question the same parent, or, more pointedly, any parent at all. Unless there is a really strong case to put many thousands of entities in the same entity group, I just wouldn't do it. You can now have transactions across entity groups, so if you need a transaction spanning a few entities you are OK.
If you need to relate the entities, do it by some other means than a parent entity. For example, you could use an ndb.KeyProperty, or possibly just an encoded string, or something along those lines.
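A sketch of that suggestion (model and property names here are illustrative, not from the thread). It is wrapped in a function because the model definitions need the App Engine Python SDK; the point is the KeyProperty link standing in for a parent key:

```python
def define_models():
    # Requires the App Engine Python SDK; shown as a sketch only.
    from google.appengine.ext import ndb

    class Series(ndb.Model):
        name = ndb.StringProperty()

    class DataPoint(ndb.Model):
        # Link to the series by key instead of making Series an ancestor,
        # so every DataPoint is its own entity group and writes to one
        # series are not funneled through a single group.
        series = ndb.KeyProperty(kind="Series")
        timestamp = ndb.DateTimeProperty()
        value = ndb.FloatProperty()

    return Series, DataPoint
```

Queries then filter on the KeyProperty (e.g. `DataPoint.query(DataPoint.series == series_key)`) rather than using an ancestor filter.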

Rafael

unread,
Aug 13, 2013, 7:20:15 PM8/13/13
to google-appengine
I implemented this with these components:

- TimeSeriesIndex: different rows for hour, day, week, month, year, etc. You can squeeze a lot of data into 1 MB :)
- DataPoint: unprocessed data points; thousands of rows per minute.
- a cron job that processes the DataPoints into the indexes
- the UI reads only the TimeSeriesIndex, which contains the timestamps and the data points.
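A rough back-of-the-envelope illustration of the "a lot of data in 1 MB" remark (not from the thread): if each aggregated point is packed as a fixed-width binary record, a single ~1 MB index entity holds tens of thousands of points.

```python
import struct

# One point = 8-byte float value + 4-byte int time offset = 12 bytes,
# packed little-endian with no padding.
POINT = struct.Struct("<di")
points_per_mb = (1024 * 1024) // POINT.size  # how many points fit in 1 MB

# Packing 1,000 points takes only 12 KB:
packed = b"".join(POINT.pack(21.5 + i, i) for i in range(1000))
```

With this layout roughly 87,000 points fit in one entity, which is why one TimeSeriesIndex row can cover a long stretch of pre-aggregated data.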

thanks
rafa


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengi...@googlegroups.com.
To post to this group, send email to google-a...@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Martin Trummer

unread,
Aug 14, 2013, 8:49:29 AM8/14/13
to google-a...@googlegroups.com

On Tuesday, 13 August 2013 22:42:25 UTC+2, Jay wrote:
In my opinion, your biggest takeaway here should be to avoid a mega entity group, and you do this simply by not giving all the entities in question the same parent, or, more pointedly, any parent at all.
That's what I'd like to do for this entity kind, but the docs indicate that it's not possible: "Every entity belongs to an entity group, ..."
What am I missing?

Martin Trummer

unread,
Aug 14, 2013, 8:51:30 AM8/14/13
to google-a...@googlegroups.com
Okay, so you have two entity kinds, "TimeSeriesIndex" and "DataPoint".
But what about the DataPoint entities? You have the same problem there, right?
All your data ends up in the DataPoint kind. Or does your cron job delete the DataPoints after generating the TimeSeriesIndex?

timh

unread,
Aug 14, 2013, 10:49:37 AM8/14/13
to google-a...@googlegroups.com
If you do not specify an ancestor, the entity group of the entity consists of only itself.

So if you create 2 million entities with no parent entity, you have 2 million separate entity groups.

Which is fine for what you are doing.

Anything else will severely limit write throughput.

Martin Trummer

unread,
Aug 14, 2013, 10:51:02 AM8/14/13
to google-a...@googlegroups.com
Great, thanks timh!
That was the point I was missing.

Kind regards,
Martin Trummer


Jeff Schnitzer

unread,
Aug 15, 2013, 2:59:36 AM8/15/13
to Google App Engine
Keep in mind that this can get very expensive very fast, and on-the-fly aggregation is pretty much unavailable. You might consider running a specialized timeseries db on GCE or some other cloud host.

Jeff



Vinny P

unread,
Aug 15, 2013, 4:38:58 AM8/15/13
to google-a...@googlegroups.com
I'd recommend building a test application to load in a bunch of dummy entries, and seeing what performance you get out of it. From there we can discuss specific optimization strategies and so forth depending on where the bottlenecks turn up.
 
 
-----------------
-Vinny P
Technology & Media Advisor
Chicago, IL

App Engine Code Samples: http://www.learntogoogleit.com
 
 


Shailendra Singh

unread,
Jan 29, 2015, 3:32:52 PM1/29/15
to google-a...@googlegroups.com
Hi Rafael

It's an old thread, but can you please share some information on how you stored "different rows for hour, day, week, month, year, etc. You can squeeze a lot of data in 1mb :)" in GAE? I am new to GAE and I am trying to store some time-series data attached to an entity in NDB.

Thanks

timh

unread,
Jan 31, 2015, 3:21:19 AM1/31/15
to google-a...@googlegroups.com
Have a look at Nimbits; it stores time series in the App Engine datastore. It's written in Java, but the data models used should be straightforward to translate into NDB.

T

gregory nicholas

unread,
Jan 31, 2015, 2:34:32 PM1/31/15
to google-a...@googlegroups.com
I've got some code for this from a recent project. Hit me up.

Log individual events, then run MapReduce to aggregate them into time slices, also keyed by field values, to create preaggregated counts.

Querying is not as nimble as, say, Mongo, so this works, but takes a few extra steps.
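The aggregation step described above can be sketched in plain Python (the actual project used the App Engine MapReduce library; names here are illustrative): bucket raw events into fixed time slices and count them per field value.

```python
from collections import Counter, defaultdict

def aggregate(events, slice_seconds=3600):
    """events: iterable of (unix_ts, field_value) pairs.
    Returns {slice_start_ts: Counter of field_value -> count}."""
    slices = defaultdict(Counter)
    for ts, value in events:
        slice_start = ts - ts % slice_seconds  # floor to the slice boundary
        slices[slice_start][value] += 1
    return slices

counts = aggregate([(3600, "login"), (3700, "login"), (7300, "logout")])
# counts[3600]["login"] == 2 and counts[7200]["logout"] == 1
```

The output rows (one per slice and field value) are what would get written back as preaggregated count entities.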

Shailendra Singh

unread,
Feb 1, 2015, 7:17:38 AM2/1/15
to google-a...@googlegroups.com, faction...@gmail.com

I was trying to use repeated properties in GAE to store multiple values inside a property, each with a timestamp, just like a time-series database. Once that is done, we can query for the last 1 hour, 2 hours, 1 day, and the like. Can you guide me to a preferred way, as only about 1 MB of repeated properties fits in a single entity in GAE?

Rafael

unread,
Feb 1, 2015, 1:10:48 PM2/1/15
to google-appengine
To solve that problem you can have DataPoint as a temporary table only.

That way, every 5 minutes you can run a cron job that downloads all DataPoints and deletes them after you summarize their content into another table.

You can summarize into a 5-minute table; then the average from that table goes into the 30-minute summary, and the average of that one goes into the 2-hour one, etc.
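The rollup chain above can be sketched in plain Python (illustrative only; in the real setup each level would be read from and written back to the datastore by the cron job):

```python
def rollup(samples, factor):
    """Average consecutive groups of `factor` values into the next summary level.
    Incomplete trailing groups are left for the next cron run."""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]

# One hour of 5-minute averages:
five_min = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
thirty_min = rollup(five_min, 6)    # -> [3.5, 9.5]
two_hour = rollup(thirty_min, 4)    # only 2 of 4 values so far -> []
```

Each level only ever reads the (much smaller) level below it, so no query ever touches the raw DataPoints.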

Rafael

unread,
Feb 1, 2015, 1:11:26 PM2/1/15
to google-appengine
By the way, this isn't a datastore-specific problem. Even in MySQL, you don't want to be querying millions of rows to draw a simple summary.

Emanuele Ziglioli

unread,
Feb 1, 2015, 3:19:51 PM2/1/15
to google-a...@googlegroups.com
What about using BigQuery? Has anybody tried it for this specific purpose?
Inserting data and exporting a whole table are free at this stage.

By the way, I've tried a couple of strategies involving entity groups. I started storing the timestamp as the key.
That improved things a little bit, in the sense that my entities were smaller, so I could loop over them faster.
Cost-wise it hasn't been too bad, but my sets don't go over 1M rows each.
If you can live with BigQuery's append-only restriction, I would definitely try it.

Emanuele

Shailendra Singh

unread,
Feb 1, 2015, 5:00:30 PM2/1/15
to google-a...@googlegroups.com
This might be a question off track. I figured out how to store multiple values in NDB, i.e. using repeated properties. Now my next step is to create a Google Chart. Can someone guide me on this? I haven't found many tutorials for Google Charts + NDB. Google Charts has documentation, but its queries and functions only cover some data stores, like Google Spreadsheets, Fusion Tables, etc. How can I use NDB queries to get data in a form compatible with Google Charts? Any lead will be helpful.



With regards.
Shailendra Singh




Vinny P

unread,
Feb 3, 2015, 3:42:29 AM2/3/15
to google-a...@googlegroups.com
On Sun, Feb 1, 2015 at 4:00 PM, Shailendra Singh <srj...@gmail.com> wrote:
This might be a question off track. I figured out how to store multiple values in NDB, i.e. using repeated properties. Now my next step is to create a Google Chart. Can someone guide me on this? I haven't found many tutorials for Google Charts + NDB. Google Charts has documentation, but its queries and functions only cover some data stores, like Google Spreadsheets, Fusion Tables, etc. How can I use NDB queries to get data in a form compatible with Google Charts? Any lead will be helpful.



The exact process depends on your use case, but for the simplest case you would retrieve the entity you need, read out all of the repeated properties, then print those values into a web page, similar to this example: https://developers.google.com/chart/interactive/docs/examples#table_example

Do you want to make the data available through an API (since Google Charts can load JSON data), or write it directly into the web page?
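For the JSON route, a minimal sketch of the conversion (function and column names are illustrative; in a real handler the two lists would come from an NDB query over the repeated properties):

```python
import json

def to_chart_rows(timestamps, values):
    """Build the array-of-arrays form that Google Charts'
    arrayToDataTable() accepts: a header row, then data rows."""
    header = [["Time", "Value"]]
    return json.dumps(header + [[t, v] for t, v in zip(timestamps, values)])

rows = to_chart_rows(["10:00", "10:05"], [21.5, 21.7])
# json.loads(rows)[0] == ["Time", "Value"]
```

The resulting string can be embedded in the page or served from a JSON endpoint and passed straight to `google.visualization.arrayToDataTable()` on the client.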

 
 
-----------------
-Vinny P
Technology & Media Consultant
Chicago, IL

Nickolas Daskalou

unread,
Feb 3, 2015, 4:03:10 AM2/3/15
to Google App Engine
+1 for BigQuery if you only need to add records (not edit or delete them).

We use BigQuery to store analytic data for FollowUs.com.

Best thing about it is that it works as advertised.

Biggest downside is that queries can take a few seconds to return results. If you can live with that, I say definitely give it a go.

Nick



Etienne B. Roesch

unread,
Feb 14, 2017, 9:40:44 AM2/14/17
to Google App Engine
Hi,

Sorry for the repeat, but I am trying to wrap my head around the GAE-osphere and I am getting a bit confused.

I need to store and retrieve/analyse time-series data of varying sizes and resolutions. At the moment, the data is received by GAE and stored in Google Cloud SQL (Python). That's not ideal. I foresee I will have to do more analytics than plain storage, and I predict a big throughput of data generally, so I have been looking at BigQuery/Datalab. I don't see an obvious way to load data into BigQuery from Cloud SQL; I would either have to export the data to Cloud Storage in a CSV-ish format, or stream the data directly from the GAE app to BigQuery (which is currently my preferred option).
Alternatively, there is the option of passing through Google Datastore first, which for me might be a more flexible way of preprocessing the data before it enters BigQuery.

Is this the way to do things, or am I missing something?

Thanks!

Etienne

Nick (Cloud Platform Support)

unread,
Feb 14, 2017, 3:11:34 PM2/14/17
to Google App Engine
Hey Etienne,

You've correctly enumerated a few ways to transfer data from Cloud SQL to BigQuery:

* export to CSV and load the CSV into BigQuery
* retrieve the data with an app and stream it into the BigQuery API
* export to Datastore and then import to BigQuery  

There are also other ways, such as using a mysqldump of your SQL DB. You should check the BigQuery "Loading Data" documentation.

Let me know if you have any further questions, and I'll be happy to assist.

Cheers,

Nick
Cloud Platform Community Support

Evan Jones

unread,
Feb 14, 2017, 4:08:08 PM2/14/17
to Google App Engine
I would recommend using the BigQuery streaming API. We do a heck of a lot of that at Bluecore and it works well. Depending on how your data arrives, you may want to use a task queue or similar to collect lots of rows together, so you can insert batches into BigQuery, which is more efficient.
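The batching idea can be sketched like this (illustrative helper, not Bluecore's actual code): collect rows and hand them to a flush function in batches; in a real app the flush function would call BigQuery's streaming-insert API (tabledata.insertAll).

```python
class RowBatcher:
    """Accumulate rows and flush them in fixed-size batches."""

    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn      # e.g. a streaming insert into BigQuery
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []

# Demo: with batch_size=2, five rows go out as batches of 2, 2, and 1.
batches = []
b = RowBatcher(batches.append, batch_size=2)
for i in range(5):
    b.add({"ts": i})
b.flush()  # push the final partial batch
```

In a task-queue setup, `add` would run per incoming event and `flush` on a timer or at the end of a task, trading a little latency for far fewer API calls.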

Good luck!

Evan


