What is the most efficient way to compute a rolling median on appengine?


Mathieu Simard

Nov 12, 2013, 3:07:34 PM
to google-a...@googlegroups.com
Since there is no appengine solution available such as the Redis atomic list, I'm left wondering how to implement a cost effective rolling median.
Has anyone come up with a solution that would be more convenient than running a redis instance on Google Compute Engine?

Vinny P

Nov 13, 2013, 1:23:16 AM
to google-a...@googlegroups.com
On Tue, Nov 12, 2013 at 2:07 PM, Mathieu Simard <mathieu....@gmail.com> wrote:
Since there is no appengine solution available such as the Redis atomic list, I'm left wondering how to implement a cost effective rolling median.



That would depend on the size of your data and the rate it's coming in. Can you detail your use case?
 
 
-----------------
-Vinny P
Technology & Media Advisor
Chicago, IL

App Engine Code Samples: http://www.learntogoogleit.com

Gilberto Torrezan Filho

Nov 13, 2013, 6:59:49 AM
to google-a...@googlegroups.com
I miss some Redis functionality in App Engine as well. Memcache is just an unreliable cache that holds some data for a while... nothing more.

To make such calculations which iterate over large sets of data, I use backends with in-memory processing: loading part of the data from datastore into memory, spawn multiple threads (if applicable) and iterate over data. Ugly, strange, error-prone and sometimes slow, but it works.

A bomb-to-kill-an-ant solution would be using Google BigQuery. I don't like the idea, but depending on your problem it might solve it for you.

You can try some MapReduce processing as well. But since I'm using Java (a not-so-loved language on App Engine; see the Servlet 3.0 discussion), MapReduce (the Mapper, actually) is too experimental to put into production. After the Conversion and Files APIs, I learned my lesson: never, ever use an experimental API on App Engine.

Anyway, you have several options to try. I just recommend avoiding storing large datasets in Memcache: since it's just a cache, your data can be wiped at any time, invalidating your calculations.

Mathieu Simard

Nov 13, 2013, 11:00:47 AM
to google-a...@googlegroups.com
At the moment I'm already receiving ~15 data points per second.
This is growing quickly as more of our clients roll out the app in production.

I want to produce a reliable median of the different metrics being received without incurring the extreme costs of writing all of those points to the datastore.

Mathieu Simard

Nov 13, 2013, 11:08:29 AM
to google-a...@googlegroups.com
Storing the data points isn't an option.
I'm already receiving far too many data points to start writing them all.
Unless they dramatically lower the write costs...

As for the in-memory approach, can you provide a scale at which you're using this technique?
Do you have to ensure that the backend runs on a single thread?

Kaan Soral

Nov 13, 2013, 2:20:21 PM
to google-a...@googlegroups.com
A single datastore entity can hold up to 1 MB.

How big will a single dataset be?

If a dataset is smaller than 1 MB in summarized form, you could build a queue-based solution to handle the >15/s data rate.

You could also probably develop something like a tree, with each entity representing a node and storing data about its leaves; maybe that could lead to a practical median calculator. It's just an idea. The point is:

as Vinny P stated, the solution always depends on exactly what you are doing.

Mathieu Simard

Nov 13, 2013, 8:58:22 PM
to google-a...@googlegroups.com
Here's a better definition of the problem:

I receive a value for a tracked metric (e.g. start-up time) for different systems at a rate of 15/s.
This rate is expected to grow quickly as clients add systems.
I need to produce a rolling median of that metric.

Inserting all entries in the datastore is not an option since that would already require 15 writes per second which is extremely expensive.
Using the memcache is not a good solution since there is no atomic push/pop on arrays. (Hence, my earlier reference to Redis.)
Using a backend instance to hold it all in memory is a quick fix but it won't scale as we add new metrics.

At the same time, I'm trying to keep the cost low since volume is only going to grow.

Vinny P

Nov 14, 2013, 2:16:05 AM
to google-a...@googlegroups.com
On Wed, Nov 13, 2013 at 7:58 PM, Mathieu Simard <mathieu....@gmail.com> wrote:
Here's a better definition of the problem:
I receive a value for a tracked metric (i.e. start-up time) for different systems at a rate of 15/s.
This rate is expected to grow quickly as clients add systems.
I need to produce a rolling median of that metric.
At the same time, I'm trying to keep the cost low since volume is only going to grow.



A few months back someone posted a similar problem to this mailing list: a mobile game needed a backend to collect scores from thousands of mobile clients, compute a leaderboard, then send back the leaderboard to the clients, all in a 10 second window. After the discussion, the consensus IIRC was to either (1) run App Engine backends to reap incoming requests and calculate values within a high-memory backend, or (2) run a Compute Engine machine to reap and calculate values.

The choice is up to you since it depends on what you're comfortable with, but if low cost is an important goal I'd choose the Compute Engine route. Since your application and values can be held entirely within RAM, you can choose a high-memory, diskless instance to optimize your resource usage. 

However if the incoming values will have spiky traffic levels or if your application requires complex services such as Endpoints, hosting on App Engine is the better solution.

Barry Hunter

Nov 14, 2013, 8:14:55 AM
to google-appengine

Has anyone come up with a solution that would be more convenient than running a redis instance on Google Compute Engine?


There is also the option of 'redis as a service' 


Certainly 'convenient', but might have issues with cost and/or latency depending on requirements. 
 

Stephen

Nov 15, 2013, 2:04:58 PM
to google-a...@googlegroups.com
On Tue, Nov 12, 2013 at 8:07 PM, Mathieu Simard <mathieu....@gmail.com> wrote:
Since there is no appengine solution available such as the Redis atomic list, I'm left wondering how to implement a cost effective rolling median.
Has anyone come up with a solution that would be more convenient than running a redis instance on Google Compute Engine?

- batch data in front-end instance memory
- flush data to a pull queue every time window
- run a task every time window to gather all data batches from pull queue
- merge data, compute moving median, write result to data store

Instances will be started as more data is submitted. Tune the frequency of the calculation so that the size of the pending data batches does not overwhelm a single task.
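Stephen's four steps can be sketched in plain Python. The names here (`WindowBuffer`, `merge_and_median`) are mine, and the injected `flush_fn` stands in for adding a pull-queue task, with the periodic task leasing and merging the JSON payloads:

```python
import json
import time


def median(values):
    """Plain median of a list of numbers."""
    s = sorted(values)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2.0


class WindowBuffer(object):
    """Batches data points in instance memory and flushes a JSON batch
    every `window` seconds. On App Engine, flush_fn would add the payload
    to a pull queue; it is injected here so the sketch runs anywhere."""

    def __init__(self, window, flush_fn, clock=time.time):
        self.window = window
        self.flush_fn = flush_fn
        self.clock = clock
        self.points = []
        self.started = clock()

    def add(self, value):
        self.points.append(value)
        if self.clock() - self.started >= self.window:
            self.flush()

    def flush(self):
        if self.points:
            self.flush_fn(json.dumps(self.points))
        self.points = []
        self.started = self.clock()


def merge_and_median(payloads):
    """The periodic task: merge all leased batch payloads and compute the
    median. On App Engine the result would then be written to the datastore."""
    merged = []
    for payload in payloads:
        merged.extend(json.loads(payload))
    return median(merged)
```

A real deployment would also flush on instance shutdown and cap the batch size, since an instance can die with a partial batch still in memory.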

Kaan Soral

Nov 15, 2013, 2:10:39 PM
to google-a...@googlegroups.com
This seems like a bad idea; you could just add each data point as a pull-queue task directly, since instances can die pretty easily.

You also assumed that the data is small enough to process/store easily; however, that doesn't seem to be the case.

Mathieu Simard

Nov 15, 2013, 2:29:11 PM
to google-a...@googlegroups.com
I decided to settle with a solution that will do the trick for now.

I'm building heaps of data per metric, per system, in memcache, and just accepting that sometimes I will not be able to compute my results.
I have no consistency guarantee, since any of my memcache writes can overwrite others.

It's far from perfect, but it does the trick quite well at a fraction of the cost.
If flush rates get too intense I'll simply upgrade to a dedicated memcache.

I will post an update if I hit a wall with that approach.

John Belmonte

Nov 19, 2013, 12:50:51 PM
to google-a...@googlegroups.com
Hi Mathieu,

I've been toying with concurrent data structures on memcache.  I haven't gotten as far as an (indexed) skiplist yet, but I believe it's possible-- and that structure is efficient for maintaining a running percentile.  Is your app in Python by any chance?

Jeff Schnitzer

Nov 19, 2013, 3:33:19 PM
to Google App Engine
Yikes, most of these suggestions are crazy complicated. Even Redis seems excessive.

Write a little server that accepts values, stores them in memory, rolls off old values, and answers requests for the median. Run it on GCE or linode or whatever. This should be a few hours of programming, and even a simplistic implementation in Java should be able to handle hundreds of QPS. Even if you hit capacity, you can always start sampling. Or you can federate.
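As a rough sketch of the in-memory piece of such a little server (the name `RollingMedian` and the injected clock are my choices; the HTTP/TCP layer and the GCE deployment are left out):

```python
import bisect
import time
from collections import deque


class RollingMedian(object):
    """Accepts values, keeps them in memory, rolls off values older than
    max_age seconds, and answers median queries."""

    def __init__(self, max_age, clock=time.time):
        self.max_age = max_age
        self.clock = clock
        self.by_time = deque()   # (timestamp, value) in arrival order
        self.sorted_values = []  # the same values, kept sorted

    def add(self, value):
        now = self.clock()
        self.by_time.append((now, value))
        bisect.insort(self.sorted_values, value)
        self._expire(now)

    def _expire(self, now):
        # Drop values that have aged out of the rolling window.
        while self.by_time and now - self.by_time[0][0] > self.max_age:
            _, old = self.by_time.popleft()
            self.sorted_values.pop(bisect.bisect_left(self.sorted_values, old))

    def median(self):
        self._expire(self.clock())
        s = self.sorted_values
        n = len(s)
        if n == 0:
            return None
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2.0
```

Each insert is O(log n) to find the slot plus O(n) to shift the list, which is still trivial work for one machine at hundreds of QPS over a window of a few minutes.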

Don't treat GAE as "all or nothing".

Jeff


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengi...@googlegroups.com.
To post to this group, send email to google-a...@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
For more options, visit https://groups.google.com/groups/opt_out.

Mathieu Simard

Nov 19, 2013, 3:44:49 PM
to google-a...@googlegroups.com, je...@infohazard.org
Thanks Jeff for your balanced insight.

I'm not treating GAE as an all or nothing. Not at all.
The options you describe were all discussed; I was just wondering whether someone had come up with a clever trick to solve this problem directly in GAE.

Please next time take a few minutes to read the whole thread.

Stephen

Nov 19, 2013, 4:28:32 PM
to google-a...@googlegroups.com

On Fri, Nov 15, 2013 at 7:04 PM, Stephen <sdeasey...@gmail.com> wrote:

On Tue, Nov 12, 2013 at 8:07 PM, Mathieu Simard <mathieu....@gmail.com> wrote:
Since there is no appengine solution available such as the Redis atomic list, I'm left wondering how to implement a cost effective rolling median.
Has anyone come up with a solution that would be more convenient than running a redis instance on Google Compute Engine?

- batch data in front-end instance memory
- flush data to a pull queue every time window
- run a task every time window to gather all data batches from pull queue
- merge data, compute moving median, write result to data store

Instances will be started as more data is submitted. Tune the frequency of the calculation so that the size of the pending data batches does not overwhelm a single task.

As a refinement of the above:

- serve a zero-byte static file at /ok
- have your clients request /ok?k=foo&v=42
- periodically parse the request log: https://developers.google.com/appengine/docs/go/logs/

The advantage here is that:

- you don't have to pay for front-end instance hours
- your clients are served quickly by Google's edge servers
- few moving parts in the critical path that can fail

You could push your rolling medians into BigQuery for analysis, which might become economical since you're dealing with only a sample of the original data.
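The log-parsing step might look like the following sketch. `collect_points` is a made-up name, and the resource strings are passed in as a plain list so it runs anywhere; on App Engine they would come from the `resource` field of the request-log records returned by the logservice fetch call:

```python
from collections import defaultdict

try:
    from urllib.parse import urlsplit, parse_qs   # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs       # Python 2 (GAE runtime)


def collect_points(resources):
    """Group metric values by key from request-log resource paths like
    /ok?k=boot&v=42, skipping anything that isn't a well-formed hit."""
    points = defaultdict(list)
    for resource in resources:
        parts = urlsplit(resource)
        if parts.path != '/ok':
            continue
        qs = parse_qs(parts.query)
        if 'k' in qs and 'v' in qs:
            try:
                points[qs['k'][0]].append(float(qs['v'][0]))
            except ValueError:
                pass  # ignore malformed values
    return dict(points)
```

The per-key value lists can then be fed into whatever median calculation runs in the periodic task.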

Jim

Nov 19, 2013, 9:23:59 PM
to google-a...@googlegroups.com
Are you doing a time-series type analysis where you need the rolling median value for a specific entity, or do you need the median value across a range of entities?

Vinny P

Nov 19, 2013, 10:34:46 PM
to google-a...@googlegroups.com
On Tue, Nov 19, 2013 at 3:28 PM, Stephen <sdeasey...@gmail.com> wrote:
As a refinement of the above:
- periodically parse the request log: https://developers.google.com/appengine/docs/go/logs/
The advantage here is that:
- you don't have to pay for front-end instance hours
- your clients are served quickly by Google's edge servers
- few moving parts in the critical path that can fail

 

Stephen's idea is also a good option, but note that there is a separate quota and charge for accessing the Logs API: https://developers.google.com/appengine/docs/java/logs/#Java_Quotas_and_limits . In short, it costs $0.12 per GB of logging data accessed, with the first 100 MB free.

Also, there is occasionally a delay of a few seconds to a few minutes between a client request ending and when the associated log becomes available in the logging API. If your application is time-sensitive, you may want to try an alternative.

John Belmonte

Nov 20, 2013, 12:42:20 PM
to google-a...@googlegroups.com
You may have missed my message just before Jeff's: consistent data structures can be built on memcache, directly in GAE. If you're using Python I might be able to put together an example that computes a running median.

Luca de Alfaro

Nov 21, 2013, 12:50:41 AM
to google-a...@googlegroups.com
If you can weight recent data more heavily than older data, you might consider, instead of building a rolling average, an exponentially decaying weighted average.

You can store total_amount, total_weight, and a timestamp in ndb, sharded.
Then, when you get an update, you compute the decay_factor, which is equal to exp(- time since update / time constant). 
You then do: 
total_amount = total_amount * decay_factor + amount_now
total_weight = total_weight * decay_factor + weight_now
timestamp = present time
avg = total_amount / total_weight
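Luca's update step as a runnable sketch (the function name and the dict-based state are my choices; on App Engine total_amount, total_weight, and timestamp would live in sharded ndb entities as he describes):

```python
import math


def decayed_update(state, value, now, time_constant, weight=1.0):
    """Apply one exponentially decaying weighted-average update.
    `state` holds total_amount, total_weight, and timestamp; returns
    the current decayed average."""
    decay = math.exp(-(now - state['timestamp']) / time_constant)
    state['total_amount'] = state['total_amount'] * decay + value * weight
    state['total_weight'] = state['total_weight'] * decay + weight
    state['timestamp'] = now
    return state['total_amount'] / state['total_weight']
```

A smaller time constant forgets old data faster; a very large one approaches a plain average.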

Mathieu Simard

Nov 21, 2013, 8:49:35 AM
to google-a...@googlegroups.com
I'm already using that approach.
However, the distribution of my metrics requires a more precise solution for my median.

Mathieu Simard

Nov 21, 2013, 9:12:15 AM
to google-a...@googlegroups.com
I need the median value for multiple entities, but each only compared against itself.
In the future I will probably build the median across entities by taking a median of the per-entity medians.

Mathieu Simard

Nov 21, 2013, 9:13:40 AM
to google-a...@googlegroups.com
Limited as it may be, it does feel like a sweet trick.

Mathieu Simard

Nov 21, 2013, 9:18:08 AM
to google-a...@googlegroups.com
I'm having a look at your project right now John.
This is quite interesting.

Mathieu Simard

Nov 21, 2013, 10:38:54 AM
to google-a...@googlegroups.com
John,

Your solution has a lot of potential.
Especially since you distribute the load across the keyspace.

My only concern is the deque create/bind method.
Shouldn't it be a single method since multiple GAE instances could try to create the deque concurrently?

Jim

Nov 21, 2013, 2:00:42 PM
to google-a...@googlegroups.com
If data points for each entity are not coming in too fast, you could use blobstore/GCS to store the time series for each entity in a blob, then store a pointer to that blob in the entity in the datastore. Updating is expensive but can run off a task queue. Retrieval of the blobs is very fast, and you can quickly parse a blob into memory and compute your stats for a given entity. Cross-entity stats are trickier and require some map-reduce-esque processing.

We use this approach for smart-meter analytics, where data points for a given entity (meter) don't come any faster than once every 15 minutes... not sure if it would work for you.
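A sketch of the blob layout Jim describes, using fixed-width records so a series is cheap to append to and parse (the `<dd` record format and the function names are my choices; on App Engine the bytes would go to blobstore or GCS, with the key stored on the datastore entity):

```python
import struct

# Each record is a (unix_timestamp, value) pair packed as two
# little-endian doubles: 16 bytes per data point.
RECORD = struct.Struct('<dd')


def pack_series(points):
    """Pack (timestamp, value) pairs into a compact blob."""
    return b''.join(RECORD.pack(t, v) for t, v in points)


def unpack_series(blob):
    """Parse a blob back into a list of (timestamp, value) tuples."""
    return [RECORD.unpack_from(blob, offset)
            for offset in range(0, len(blob), RECORD.size)]


def series_median(blob):
    """Median of all values in a packed series."""
    values = sorted(v for _, v in unpack_series(blob))
    n = len(values)
    return values[n // 2] if n % 2 else (values[n // 2 - 1] + values[n // 2]) / 2.0
```

Appending a new point is just writing another 16 bytes, and a year of 15-minute readings is only about 560 KB per entity.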

Jim

Nov 21, 2013, 2:04:11 PM
to google-a...@googlegroups.com
Missed your comment... this is what we're doing, except we avoid the 1 MB limitation by storing the datasets in blobs and storing a pointer to the blob in the entity record.

Mathieu Simard

Nov 21, 2013, 2:15:41 PM
to google-a...@googlegroups.com
The query volume has doubled since I first started this thread and, at the current rate, should double again by the end of next week.
I definitely need a solution that can handle thousands of QPS, because that's where we're heading.
I'm currently running this without consistency guarantees, using a dedicated memcache.



John Belmonte

Nov 25, 2013, 11:57:47 AM
to google-a...@googlegroups.com
Just to be clear, I don't think you could build a running percentile from a deque.

The potential solution is to build an indexed skip list from the MCAS primitive in memcache-collections.  An indexed skip list has O(log n) inserts and indexed lookups.  For median you'd want to index the N/2 entry.
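To make the indexing idea concrete with a stand-in (a plain sorted list instead of a skip list; `SortedMedian` is a made-up name, and `insert` here costs O(n), which is exactly the cost an indexed skip list would cut to O(log n) while also being shareable across instances via memcache):

```python
import bisect


class SortedMedian(object):
    """Stand-in for the indexed skip list: a sorted sequence where any
    percentile is just an indexed lookup. The median is the N/2 entry."""

    def __init__(self):
        self.items = []

    def insert(self, value):
        # O(n) here; the indexed skip list does this in O(log n).
        bisect.insort(self.items, value)

    def percentile(self, p):
        """Value at percentile p in [0, 1); percentile(0.5) is the median."""
        return self.items[int(p * len(self.items))]
```

The data structure is the same either way; the skip list only changes the cost of `insert` and of reaching the N/2 entry.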

I'm not saying this is trivial work-- but I've been meaning to try this out anyway and may be able to help.

The main thing I'm trying to find out from you is whether you're in Python or not.  This library is Python only.