Re: [google-appengine] How to solve the aggregate query with condition in Google App Engine


Takashi Matsuo

Aug 1, 2012, 9:20:18 PM
to google-a...@googlegroups.com
Can you use StackOverflow for this kind of question?
http://stackoverflow.com/questions/tagged/google-app-engine

On Wed, Aug 1, 2012 at 3:29 PM, Neo <rising...@gmail.com> wrote:
> Suppose, In my website, I ask users to input some string. A user can input
> string multiple times. Whenever any user inputs a string, I log it in the
> database along with the time. Many strings can be same, even though inputted
> by different users. In the home page, I need to give the interface such that
> any user can query for top n (say 50) strings in any time period (say 10 Jan
> 2012 to 30 Jan 2012). If it was SQL, I could have written query like:
>
> select string, count(*)
> from userStrings where day >= d1 and day <= d2
> group by string
> order by count(*) desc
> limit n
>
> How do I solve it in GAE environment? For each such user query, I can't
> process the record at query time - there can be millions of records.
>
> I am using JDO. My obvious goal is to minimize the app engine cost : CPU +
> data.
>
> Thanks,
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/-LDkbiRUoQsJ.
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.



--
Takashi Matsuo

hyperflame

Aug 1, 2012, 9:42:12 PM
to Google App Engine
Well, you could just use Google Cloud SQL if you really want a SQL
environment.

Otherwise, here's a quick example (obviously there are some
optimizations you could do). Also, I'm typing this out on my phone, so
forgive me if I miss a semicolon or something.

To add a string to the datastore:
Entity record = new Entity("record");
record.setProperty("some_input", input_string);
record.setProperty("add_date", new Date());
datastore.put(record);

To search for strings in a given time period (a real date range needs
both an upper and a lower bound, combined with CompositeFilterOperator.and):
Query q = new Query("record");
q.setFilter(new Query.FilterPredicate("add_date",
    Query.FilterOperator.LESS_THAN, some_date_that_you're_searching_for));
PreparedQuery pq = datastore.prepare(q);

And then you can pull the top x entities out of pq as a List, e.g.
pq.asList(FetchOptions.Builder.withLimit(x)), and extract their
properties to find the strings.

Neo

Aug 5, 2012, 7:20:47 AM
to google-a...@googlegroups.com
Hi Hyperflame. Thanks for your reply. The problem is that within any given time period there can be millions of such records. For each such query, I would have to process those millions of records to get the top x entities. That is very inefficient and not acceptable.

-Neo

hyperflame

Aug 5, 2012, 3:19:26 PM
to Google App Engine
Instead of counting the number of times a string has appeared at
query time, count the number of times the string has been input at
input time.

E.g. when your user puts in a string, search the datastore to see if
that string already appears in an entity. If so, increment the count
property on that entity by one. If not, add a new entity and set the
count property to 1. Then at query time, you can sort by the count
property, in addition to the date limitations.

Neo

Aug 6, 2012, 12:49:45 AM
to google-a...@googlegroups.com
Hi Hyperflame,

Again, the problem with this strategy is that it gives the top 'n' items without considering the time period - it is the absolute count of each string to date. It doesn't take the time window into account.

Joakim

Aug 6, 2012, 5:49:18 AM
to google-a...@googlegroups.com
I'd solve this by having one unique entity per combination of date and string, storing the string's total for that day in said entity. You can achieve this uniqueness by formatting special entity names (string ids) to include both the date and the string. Then aggregate each day's totals in an entity keyed on the date, holding a map/dictionary of string->count.
Your query would become something like this:
SELECT * FROM DayCount WHERE __key__ >= Key('DayCount', startDate) AND __key__ <= Key('DayCount', endDate)

Or just generate all possible keys from start to end and do a batch fetch. Do the counting and sorting yourself. It's easy to see that this could become slow and expensive, so make sure to cache the generated results - especially if endDate is in the past - in an entity, and cache that entity in memcache. Keep the key to that entity predictable (formatted string with both dates) and you're down to a single fetch. This is also why I said to aggregate all of the day's totals in the DayCount-entity, so that there will be fewer fetches at query-time.
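The "do the counting and sorting yourself" step can be sketched in plain Java. This is only an illustration under assumptions: it pretends each fetched DayCount entity's map has already been deserialized into a Map<String, Long>, and the TopStrings class and topN method names are invented for the example, not part of any App Engine API.

```java
import java.util.*;

public class TopStrings {
    // Merge the per-day string->count maps (one per DayCount entity in the
    // requested range) and return the top n strings by total count.
    static List<Map.Entry<String, Long>> topN(List<Map<String, Long>> days, int n) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> day : days) {
            for (Map.Entry<String, Long> e : day.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        // Sort descending by total count and keep the first n entries.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        Map<String, Long> day1 = new HashMap<>();
        day1.put("foo", 5L); day1.put("bar", 2L);
        Map<String, Long> day2 = new HashMap<>();
        day2.put("foo", 1L); day2.put("baz", 4L);
        for (Map.Entry<String, Long> e : topN(Arrays.asList(day1, day2), 2)) {
            System.out.println(e.getKey() + "=" + e.getValue()); // foo=6, then baz=4
        }
    }
}
```

This is exactly the in-memory work Joakim says to cache: once computed for a closed date range, the result never changes.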

You may be asking yourself why you should bother having one entity per string per day at all, when you could do this with just the DayCount-entity. The reason is that there are limits to how many times per second you can write to a single entity (entity group to be correct). For more on this, see the App Engine talks from Google I/O 2011 (relevant link).
Depending on how important consistency is to your application, you could update the DayCount in the same transaction as the per-string-and-day entity - which would be a bit slower, or defer that job to ASAP-but-not-now by scheduling a task on the TaskQueue, or just have a cron job checking every 5 or ten minutes.

Rising India

Aug 6, 2012, 9:35:18 AM
to google-a...@googlegroups.com
Hi Joakim ,

If I understand it correctly, the string-day entity would be something like this:
Suppose there are 10000 different strings S1, S2, ..., S10000 on some particular day, each asked many a times in that day d1, then we will have 10000 of entities: d1S1, d1S2, ...,d1S10000. That means, suppose there are 't' days in the given time period, and on all of the days in this interval, there are only 10000 different strings (S1 to S10000), and assuming that each of the string is inputted on half of the days in the time interval - that means we will have to process 10000 * (t/2) records - for t = 200 days, it turns out to be 1 Million of records processing for a single query. This too, when we have assumed that there are only 10000 different queries in those 200 days - in a real scenario this can well be 100000. 


hyperflame

Aug 6, 2012, 12:53:14 PM
to Google App Engine
On Aug 6, 1:14 am, Steve James <stevejamesuk...@gmail.com> wrote:
> I'd say horses for courses on this one.
>
> IMHO the Datastore may not be the best fit to solving your requirement.
>
> Perhaps Cloud SQL or BigQuery might be better suited?


Have to agree with Steve, this is increasingly sounding like a storage
schema that the datastore is not appropriate for.


On Aug 6, 4:49 am, Joakim <joakim.edenh...@gmail.com> wrote:
> I'd solve this by having one unique entity per combination of date and
> string, storing the string's total for that day in said entity. You can
> achieve this uniqueness either by formatting special entity names (string
> ids) to include both the date and the string. Aggregate the day's totals in
> an entity keyed on the date, holding a map/dictionary of string->count.
> Your query would become something like this:
> SELECT * FROM DayCount WHERE __key__ >= Key('DayCount', startDate) AND
> __key__ <= Key('DayCount', endDate)

Nice one, Joakim!

Just to turn this into code:

To insert an Entity into the datastore (this is not production-quality
code, just showing an example):

try {
    Key key = KeyFactory.createKey("someentityname",
        unique_string + CURRENT_MONTH + CURRENT_DAY + CURRENT_YEAR);
    Entity entity = datastore.get(key);
    // If that get succeeded, there is already an entity with that
    // unique string that was added today. Increase the count by one.
    long count = (Long) entity.getProperty("count");
    entity.setProperty("count", count + 1);
    entity.setProperty("add_date", new Date());
    datastore.put(entity);
} catch (EntityNotFoundException e) {
    // This means the unique string has not been added so far today.
    Entity entity = new Entity("someentityname",
        unique_string + CURRENT_MONTH + CURRENT_DAY + CURRENT_YEAR);
    entity.setProperty("count", 1L);
    entity.setProperty("add_date", new Date());
    datastore.put(entity);
}

(Note that the datastore returns integer properties as Long, so cast
to Long, not Integer.)

And then at query time, you can do a standard search using the Query
class. Of course, if your app is being accessed a lot, you need to
shard, or do some memcaching of data.
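One caveat on the key names above: for Joakim's __key__ range query to work, the key names must sort chronologically, so a year-first, zero-padded format is safer than MONTH + DAY + YEAR. A small sketch of what I mean (the class, method names, and separator are my own invention, not from any API):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DayKeys {
    // Build a key name that sorts chronologically, so that a
    // __key__ >= start AND __key__ <= end filter matches a date range.
    // A format like "8152012" (month-day-year) does not sort correctly.
    static String dayKey(Date d) {
        return new SimpleDateFormat("yyyy-MM-dd").format(d);
    }

    // Per-string-per-day key: the day prefix keeps one day's counters
    // contiguous in key order.
    static String stringDayKey(String input, Date d) {
        return dayKey(d) + ":" + input;
    }
}
```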

Joakim

Aug 6, 2012, 7:22:55 PM
to google-a...@googlegroups.com
The string-day entities are only needed for the current day, as a way to parallelize concurrent writes. In fact, it sounds like you will want to delete the string-day entities when they are no longer needed. The digested DayCount entity also described is the way to process queries, and it means fetching one entity per day in the requested range. Remember to put the generated results in the datastore for re-use.

The model I laid out can be made to work for the numbers you mention, and the more instantaneous consistency you're willing to sacrifice, the easier it gets.
  • You can push the string-writes to asynchronous tasks on the TaskQueue and then have the task operate on the string-day counter. Generally these are called within 100 ms of being added.
  • At the end of the day, have a cron job process each of the string-day entities, generating and updating the DayCount entity for that day and then deleting the string-day entities. You won't be able to process all of these in one transaction; either do 4 string-day entities per transaction (hitting the maximum of 5 entity groups for an XG transaction, including the DayCount entity), or just stick to one per transaction.

Notice that this means today's data will not show up in your queries, since the DayCount is only updated at the end of the day. We can fix that too.
  • After incrementing a string-day's counter, set a property called updated to [current date+time] before writing the entity and committing the transaction. As an optimization, ignore it if updated was set within the last few minutes.
  • Have a cron job run every X minutes (as often as you want the current day's DayCount updated) to fetch today's DayCount and read its update-time, then query for string-day entities where updated > dayCountUpdated.
  • Pull all of the counts from these string-day entities into your DayCount entity and set its update-time to the highest observed update-time in the list of string-day entities you got. Commit your changes to the current DayCount.
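The merge step in those last two bullets can be sketched as plain logic, away from the datastore API. This is a hypothetical illustration: the IncrementalRollup and StringDay names are invented, timestamps are plain millis, and I assume each string-day counter holds that string's running total for the day (so merging overwrites rather than adds).

```java
import java.util.*;

public class IncrementalRollup {
    public static class StringDay {
        public final String name;
        public final long count;    // running total for this string today
        public final long updated;  // last-modified time, in millis
        public StringDay(String name, long count, long updated) {
            this.name = name; this.count = count; this.updated = updated;
        }
    }

    // Fold string-day counters modified after 'since' into the day's totals.
    // Returns the highest 'updated' timestamp seen, which becomes the
    // DayCount's new update-time for the next cron run.
    public static long merge(Map<String, Long> dayTotals,
                             List<StringDay> counters, long since) {
        long newest = since;
        for (StringDay sd : counters) {
            if (sd.updated > since) {
                dayTotals.put(sd.name, sd.count); // overwrite with running total
                newest = Math.max(newest, sd.updated);
            }
        }
        return newest;
    }
}
```

Storing the max observed update-time (rather than "now") avoids missing counters committed between the query and the write.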

Be smart about reusing old query results.
  • At the end of the week, generate weekly count aggregations
  • Same for end of month
  • Identify the largest aggregate result or group thereof which fit inside the requested date range, and use these as the base of your calculation
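Identifying the largest aggregates that fit in the range amounts to a greedy cover of the date interval. A sketch under assumptions: the "month:"/"day:" key names are invented for the example, and it uses java.time purely for brevity (not something the 2012 App Engine runtime had); weekly aggregates would slot in the same way.

```java
import java.time.LocalDate;
import java.util.*;

public class RangeCover {
    // Decompose the inclusive range [start, end] into whole-month
    // aggregate keys plus leftover per-day keys, so that a query
    // fetches as few pre-aggregated entities as possible.
    static List<String> cover(LocalDate start, LocalDate end) {
        List<String> keys = new ArrayList<>();
        LocalDate d = start;
        while (!d.isAfter(end)) {
            LocalDate monthEnd = d.withDayOfMonth(d.lengthOfMonth());
            if (d.getDayOfMonth() == 1 && !monthEnd.isAfter(end)) {
                // The whole month fits: use the monthly aggregate.
                keys.add(String.format("month:%d-%02d", d.getYear(), d.getMonthValue()));
                d = monthEnd.plusDays(1);
            } else {
                // Partial month: fall back to per-day entities.
                keys.add("day:" + d);
                d = d.plusDays(1);
            }
        }
        return keys;
    }
}
```

For 2012-01-30 to 2012-03-02 this yields two January days, the February monthly aggregate, and two March days: 5 fetches instead of 33.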

Remember to use unindexed properties for any parts of the entities you're not going to query on; that saves on cost.

Also, as someone who has used JDO, I suggest you give Objectify a look.

Neo

Aug 7, 2012, 8:15:14 AM
to google-a...@googlegroups.com
Hi Joakim,

Thanks for the detailed reply. The problem is that, though there will be only a single entity for each day, it may hold thousands of string->count items, and I will have to process all of them. If we start aggregating each week, each month and so on, and find the largest such intervals that fit inside the given interval, we reduce the processing drastically. The only remaining problem is what happens if there are half a million different input strings in the given interval - it seems that even aggregating on months won't help then.

Having said that, I must say that I really like the idea. Do you think that it is scalable?