A difficult App Engine optimisation problem - selecting distinct entities across a large table


Edward Hartwell Goose

unread,
Feb 14, 2011, 9:59:45 AM2/14/11
to Google App Engine
Hi,

I was wondering if anyone can help me with this problem. We have an
idea we'd like to implement, and we're currently unable to do this
efficiently.

I've anonymised the data as far as possible, but the structure is the
same.

We have two entities, Car and CarJourney.

Each Car has zero to many CarJourneys.

Each Car Journey has (amongst other properties) a date associated with
it - the date the journey was started.

I wish to query by time over car journeys. I'll have two dates, a
startDate and an endDate, where startDate <= endDate, and I want to
receive the most recently started journey in that period.

So, if I had a particular car in mind, say car 123, I'd write a query
that filters by Car.key and Journey.startDate, where Car.key == 123
and Journey.startDate >= startDate and Journey.startDate <= endDate,
with an ordering on Journey.startDate descending and a limit of 1.

e.g. Car A has 3 journeys, taken on the 1st, 2nd and 3rd of the month.
The query start date is the 1st and the query end date is the 2nd. The
result of this query would be one car journey, the one taken on the 2nd.
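
For concreteness, here is roughly what that single-car query looks
like in the Python db API (a sketch only; the model and property names
are my assumptions from the description above):

    from google.appengine.ext import db

    class CarJourney(db.Model):
        car = db.ReferenceProperty()        # assumed link back to the Car entity
        startDate = db.DateTimeProperty()   # when the journey was started

    def latest_journey(car_key, start_date, end_date):
        # Most recently started journey for one car in [start_date, end_date].
        q = (CarJourney.all()
             .filter('car =', car_key)
             .filter('startDate >=', start_date)
             .filter('startDate <=', end_date)
             .order('-startDate'))
        return q.get()  # the "limit of 1"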

Once the result of that query is returned, a very small amount of
processing is done to return a result to the user.

That's the easy bit.

But, instead of 1 car, I want this over a list of cars, where the list
contains N keys to cars.

So, I want to run the above query N times, once for every car. And I
want the latest journey for each car.

Because the time range is flexible (and thus can't be known
beforehand) we can't implement a "isMostRecent" flag, because while it
might be the most recent for now, it might not be the most recent for
the specified date parameters.

We also need to ensure that this returns promptly (current queries are
around the 3-5 second mark for a small set of data) as this goes
straight back to the user. This means that we can't use task queues,
and because the specified dates are arbitrary we can't implement mass
indexing of "isWithinDate" fields.

We tried using an async query, but because the amount of processing is
negligible the bottleneck is still the queries on the datastore
(because the async api still sends the requests synchronously, it just
doesn't block).

Ideally, we'd implement this as a select on car journeys ordered by
startDate where the Car.key is distinct, but we can't seem to pull
this off in GAE.

There are numerous small optimisations we can make (for example, some
memcaching of repeated queries) but none have made a significant dent
in our query time. And memcaching can only help for a maximum of 1-2
minutes (due to the inevitable forward march of time!).

Any ideas are most welcome and highly appreciated.

Thanks,
Ed




Stephen Johnson

unread,
Feb 14, 2011, 12:39:12 PM2/14/11
to google-a...@googlegroups.com
Hi Ed,
Here's an idea. I would suggest using the CarId and the StartDate, concatenated so that they sort lexicographically, as the key of the CarJourney entity. For example, for car 123 and start date Monday, Feb 14, 2011 at 5:13:33 pm (if time of day is not important it could be dropped from the key), the CarJourney key would be 123:2011:02:14:17:13:33. I would also add a DESCENDING index on the __key__ property, since these are not added automatically like they are for other properties. Then you can query for the entities that meet your specified criteria using a key range like this, with a limit of 1.

    select CarJourney where __key__ >= MAKEKEY(Car + StartDate) && __key__ <= MAKEKEY(Car + EndDate) order by __key__ DESC

You should be able to fire these off asynchronously for multiple cars. As another optimisation, you might make these keys-only queries: retrieve just the key, check first whether the CarJourney is in memcache, and only if it isn't retrieve the full entity from the datastore and put it into memcache.
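
A sketch of this in the Python db API, assuming CarJourney entities
keyed by a name of the form CarId:YYYY:MM:DD:HH:MM:SS (the helper
names here are mine):

    from google.appengine.ext import db

    class CarJourney(db.Model):
        pass  # addressed purely by its key name in this scheme

    def journey_key(car_id, dt):
        # Lexicographically sortable key name, e.g. "123:2011:02:14:17:13:33".
        name = '%s:%s' % (car_id, dt.strftime('%Y:%m:%d:%H:%M:%S'))
        return db.Key.from_path('CarJourney', name)

    def latest_journey_key(car_id, start_date, end_date):
        # Keys-only range scan; the first key in descending order is the latest.
        q = (db.Query(CarJourney, keys_only=True)
             .filter('__key__ >=', journey_key(car_id, start_date))
             .filter('__key__ <=', journey_key(car_id, end_date))
             .order('-__key__'))
        return q.get()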

Hope this helps,
Steve







Calvin

unread,
Feb 14, 2011, 1:47:28 PM2/14/11
to google-a...@googlegroups.com
Can you do filtering in memory?

This query would give you all of the journeys for a list of cars within the date range:
import datetime

carlist = ['123', '333', '543', '753', '963', '1236']
start_date = datetime.datetime(2011, 1, 30)
end_date = datetime.datetime(2011, 2, 10)

journeys = Journey.all().filter('start >', start_date).filter('start <', end_date).filter('car IN', carlist).order('-start').fetch(100)
len(journeys)
43 # <- since it's less than 100 I know I've gotten them all

Then, since the list is sorted, I know the first entry per car is the most recent journey:

results = {}
for journey in journeys:
    if journey.car in results:
        continue
    results[journey.car] = journey

len(results)
6

for result in results.values():
    print("%s : %s" % (result.car, result.start))
753 : 2011-02-09 12:38:48.887976
1236 : 2011-02-06 13:59:35.221003
963 : 2011-02-08 14:03:54.587609
333 : 2011-02-09 10:40:09.466700
543 : 2011-02-09 15:28:53.197123
123 : 2011-02-09 14:09:02.680870


Edward Hartwell Goose

unread,
Feb 14, 2011, 2:38:56 PM2/14/11
to Google App Engine
Hi Calvin & Stephen,

Thanks for the ideas.

Calvin:
We can't do the filtering in memory. We potentially have a car (the
car analogy isn't so good...) making a journey every 3 seconds, and we
could have up to 2,000 cars.

We need to be able to look back up to 2 months, so it could be up to
1.8 billion rows in this table.

Stephen:
That's an interesting idea. However, the asynchronous API actually
fires the requests synchronously; it just doesn't block (or at least,
that's my experience).

So, at the moment we fire off 1 query (which Google turns into 2) for
each car. And although the method call returns instantly, it still
takes ~5 seconds in total with basic test data. If each call takes
12ms, we still have to wait 24 seconds for 2,000 cars.

So, the first call starts at time 0, the second call starts at 0+12,
the third at 0+12+12... etc. With 2,000 cars, this works out at about
24 seconds. Once you've added in the overheads and getting the list of
cars in the first place, it's too long.

If we could start even 100 queries simultaneously at time 0, that'd
be superb. We thought we could do it with multithreading, but that's
not allowed on App Engine.

Finally - I've also posted this on StackOverflow -
http://stackoverflow.com/questions/4993744/selecting-distinct-entities-across-a-large-google-app-engine-table/4994494#4994494

I'll try and keep both updated.

Any more thoughts welcome!
Ed

Calvin

unread,
Feb 14, 2011, 3:35:41 PM2/14/11
to google-a...@googlegroups.com
Are you returning results to a web browser, or a specialized client?  One of the Google I/O talks demonstrates spawning a crapload of tasks in parallel to collect results, and in that demo most of the tasks had completed by the time the web page had refreshed (it might be the one someone linked to on StackOverflow).

If it's webpage responsiveness that's the issue, you could provide results using the channel api, or with ajax polling.

If it's a specialized client, could it poll for results every few seconds until they're exhausted?

Stephen Johnson

unread,
Feb 14, 2011, 4:09:24 PM2/14/11
to google-a...@googlegroups.com
Are you using asList (which I think blocks like you describe)? I thought asIterable or asIterator wasn't supposed to (if you're using Java).

Stephen Johnson

unread,
Feb 14, 2011, 4:12:25 PM2/14/11
to google-a...@googlegroups.com
Or maybe it blocks on different result sets, just not on getting the next fetch block? Hmm. Sounds like a tough problem.

Stephen Johnson

unread,
Feb 14, 2011, 4:51:17 PM2/14/11
to google-a...@googlegroups.com
Okay, I think I've got something that might work. Reverse the StartDate and CarId in the key from what I said above, so the key would look like this: 2011:02:14:17:13:33:123, and the KEYS ONLY query then is:

select __key__ where __key__ >= MAKEKEY(StartDate + CarId) && __key__ <= MAKEKEY(EndDate + CarId) order by __key__ DESC

Now, you can use the async query to start processing. You're going to get entries that you're not interested in, but you're only getting the key field back and not the whole CarJourney entry, and this key/id contains the date and car ID, so the first time you hit a car ID, you have the ID of the latest CarJourney for that car. Once you've found all the car IDs you're looking for, you can abort the query; otherwise you'll reach the end of the query results. As you're looping, store the keys of the entries you're looking for, then do a batch GET on memcache to retrieve as many Car (you've got the car ID) and CarJourney (you've got the CarJourney ID) entries as might be stored there, and then for any you didn't get from memcache, do a batch GET on the datastore using the keys/IDs that you have.
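
In Python, the scan might look something like this (a sketch only;
scan_key is a hypothetical helper, and the '~' suffix closes the range
because it sorts after any digit):

    from google.appengine.ext import db

    class CarJourney(db.Model):
        pass  # key name is "YYYY:MM:DD:HH:MM:SS:CarId"

    def scan_key(dt, suffix=''):
        return db.Key.from_path('CarJourney',
                                dt.strftime('%Y:%m:%d:%H:%M:%S') + suffix)

    def latest_journey_keys(car_ids, start_date, end_date):
        # One keys-only scan; the first key seen per car is its latest journey.
        wanted = set(str(c) for c in car_ids)
        found = {}
        q = (db.Query(CarJourney, keys_only=True)
             .filter('__key__ >=', scan_key(start_date))
             .filter('__key__ <=', scan_key(end_date, ':~'))
             .order('-__key__'))
        for key in q.run():
            car_id = key.name().rsplit(':', 1)[-1]
            if car_id in wanted and car_id not in found:
                found[car_id] = key
                if len(found) == len(wanted):
                    break  # every car accounted for; abort the scan
        # Batch-get the entities behind found.values() from memcache
        # first, then from the datastore for any misses.
        return found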

I think that if you memcache things appropriately and use the batch gets for memcache and datastore then this might just work for you.

Let me know what you think. It's an interesting problem,
Stephen

nickmilon

unread,
Feb 14, 2011, 6:53:05 PM2/14/11
to Google App Engine
I faced the same kind of problem some time ago.
I tried some of the solutions suggested here (in-memory sort and
filtering, encoding things into keys, etc.) and benchmarked them
for both latency and CPU cycles using some test data of around 100K
entities.
Another approach I have taken is encoding the date as an integer (days
since the start of the epoch, or days since the start of the year; the
same goes for hour of day or month, depending on how much detail you
need in your output) and saving this into a property. This way you
turn your date query filter into an equality-only filter, which does
not even need a specified index; you can then sort or filter on other
properties.
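
A sketch of that encoding, assuming days since the Unix epoch stored
in a startDay property (the names here are mine):

    import datetime
    from google.appengine.ext import db

    EPOCH = datetime.date(1970, 1, 1)

    def day_number(d):
        # Encode a date as an integer day count since the epoch.
        return (d - EPOCH).days

    class CarJourney(db.Model):
        startDate = db.DateTimeProperty()
        startDay = db.IntegerProperty()  # set to day_number(...) on every write

    # A date window then becomes one equality-only filter per day:
    def journeys_on_day(day):
        return CarJourney.all().filter('startDay =', day).fetch(1000)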
Benchmarking this last solution, I found that when the filtered
result set is a small fraction of the unfiltered original set, it is
an order of magnitude (or more) faster and more CPU-efficient. In the
worst case, when filtering gives no reduction of the result set,
latency and CPU usage were comparable to the previous solutions.

Hope this helps, or did I miss something?
Happy coding :-)


Robert Kluin

unread,
Feb 14, 2011, 10:42:58 PM2/14/11
to google-a...@googlegroups.com
Hi Ed,
I think Stephen's most recent idea (key name as date + car id) might
work well. I have a suggestion that is somewhat similar to Nick's
idea.

How wide and variable is the query window? How variable are the
occurrences of car journey data? Once a car starts having journeys,
will it continue having journeys for a (relatively) long time? You
said there would be up to 2,000 cars, are new cars frequently
introduced and old cars removed?

Depending on those questions, perhaps you could build your own
index. Think about an entity that stores a (non-indexed) list of cars
/ car journeys. If you make the key_name for that entity "lower
resolution" than the start time, you store the most recent journey for
each car. If you construct these entities properly, then you could
easily build a list of the keys needed to completely enclose a
particular query window by fetching a small number of these lists. If
you also create higher-level groups, you can also efficiently query
over very large periods of time.

The key names might look something like:
201102151430: 15 Feb 2011 14:30: Journeys that occurred between
14:25 and 14:30.
2011021514: 15 Feb 2011 14: Journeys that occurred from 13:01 to 14:00
20110215: 15 Feb 2011: Journeys that occurred on 15 Feb 2011
etc...

You might use a list of tuples on the entity that stores the car id,
journey id, and start time. That will allow you to efficiently fetch
the cars and the journeys and correctly combine and filter the results
in memory when you need more than one index entity.
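
A sketch of such an index entity, under my own assumed names, with
day-resolution buckets only and "car|journey|start" strings in the list:

    import datetime
    from google.appengine.ext import db

    class JourneyIndex(db.Model):
        # key_name is a time bucket, e.g. "20110215" for a whole day.
        entries = db.StringListProperty(indexed=False)  # "car_id|journey_id|start"

    def bucket_names(start, end):
        # Naive version: one day bucket per day in the window. A real
        # implementation would mix day/hour/minute buckets at the edges.
        d, names = start.date(), []
        while d <= end.date():
            names.append(d.strftime('%Y%m%d'))
            d += datetime.timedelta(days=1)
        return names

    def latest_per_car(start, end):
        keys = [db.Key.from_path('JourneyIndex', n)
                for n in bucket_names(start, end)]
        latest = {}
        for index in filter(None, db.get(keys)):  # db.get gives None on a miss
            for entry in index.entries:
                car_id, journey_id, started = entry.split('|')
                # ISO-formatted start times compare correctly as strings;
                # edge-bucket entries outside [start, end] still need filtering.
                if started > latest.get(car_id, ('',))[0]:
                    latest[car_id] = (started, journey_id)
        return latest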

You would need to construct the key names based on your query
windows. With some planning, the technique works very well even for
'continuous' query windows that are highly variable in size. I've
used this technique for reporting and am very pleased with it. It is
also relatively easy to implement. You just need a background process
that maintains your indexes; it can be (almost) real-time or an
offline background process.


Robert

Edward Hartwell Goose

unread,
Feb 15, 2011, 4:24:18 AM2/15/11
to Google App Engine
We're returning results to a web browser. Specifically as JSON.

I'll look into that video!

Edward Hartwell Goose

unread,
Feb 15, 2011, 4:26:20 AM2/15/11
to Google App Engine
Copy and pasted from StackOverflow so everyone sees the answer:

It's a really good idea, but: because we're looking for the latest
journey in an unknown time frame, you can't store how many days old
it is (or equivalent), because if a car stopped making journeys for a
week, the last journey would have a "days old" or equivalent of -7.
But in a time frame between now (0) and 2 weeks ago (-14) we would
need that journey (because it's the latest). In a time frame of now
(0) to 3 days ago (-3), though, we must not show that journey. You end
up doing a > 0 and <= 3 again.

The alternative is to turn your singular field into N fields (say, 1
for each day or hour) for the time frame we need. So in our example, we
want to look at journeys over a 2 month period. We could then have a
true/false value on each field as to whether to show that journey in a
given selection. So in the example above, the journey made 7 days ago
would have -7, -6, -5 ... 0 all set to true and -8 onwards set to
false. But because of the inevitable forwardness of time, maintaining
that set of fields gets difficult. Add in the 100-index limit and
things get harder still.

Edward Hartwell Goose

unread,
Feb 15, 2011, 4:32:39 AM2/15/11
to Google App Engine
The query window is between now and 2 months ago, with granularity of
a minute. So 15/02/2011 09:28:59 is the same as 15/02/2011 09:28:01.

The number of cars is likely to increase very slowly; we probably
won't see big jumps at all. The majority will start at the beginning
of our beta period. Once a car starts reporting journeys, it's likely
to continue reporting journeys at the same rate (which could be
anywhere between every 3 seconds and every 24 hours) indefinitely.

Both Stephen and your ideas sound like really good steps. I'll discuss
them with my team later and try and get one implemented. Fingers
crossed!

And of course - thanks for the help!

Stephen Johnson

unread,
Feb 15, 2011, 11:40:55 AM2/15/11
to google-a...@googlegroups.com
Hi Ed,
One issue you'll have is where you have a large time frame to query over, with some cars that make infrequent journeys and some that make very frequent journeys, so you'll have to scan a lot of key entries to find all the cars. As a further optimisation, if you are able to somehow classify a car by its general frequency of journey entries, you could assign slot numbers to the cars. It wouldn't matter if they sometimes deviate from this general frequency. For example, create five slots, call them A, B, C, D, E, and say car 123 makes journeys infrequently, so it goes in slot A, while car 789 makes frequent journeys, so it goes in slot E. You can then add that slot number to the Car entity and prepend it to the CarJourney keys, which now look like:
  
    SLOT:DATE:CAR-ID
    A:2011:02:14:17:13:33:123

So now, when doing your query for cars, you get the cars you're interested in, determine the slot each one is in, group the cars by slot, and run the following query per slot for the cars in that slot:
 
select __key__ where __key__ >= MAKEKEY(Slot + StartDate + CarId) && __key__ <= MAKEKEY(Slot + EndDate + CarId) order by __key__ DESC

Now you'll have more queries to run (up to five, or maybe three if you choose three slots, etc.), but if your classification of cars is even mildly accurate over a long period, you'll probably save yourself a lot of scanning.
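
Extending the earlier key-scan sketch, the slot simply becomes a
key-name prefix, and you run one scan per occupied slot (slot
assignment itself is up to you; the helper names are assumptions):

    from google.appengine.ext import db

    class CarJourney(db.Model):
        pass  # key name is "SLOT:DATE:CAR-ID", e.g. "A:2011:02:14:17:13:33:123"

    def slot_key(slot, dt, suffix=''):
        name = '%s:%s%s' % (slot, dt.strftime('%Y:%m:%d:%H:%M:%S'), suffix)
        return db.Key.from_path('CarJourney', name)

    def latest_per_car(cars_by_slot, start_date, end_date):
        # cars_by_slot maps slot letter -> list of car ids in that slot.
        results = {}
        for slot, car_ids in cars_by_slot.items():
            wanted, found = set(car_ids), set()
            q = (db.Query(CarJourney, keys_only=True)
                 .filter('__key__ >=', slot_key(slot, start_date))
                 .filter('__key__ <=', slot_key(slot, end_date, ':~'))
                 .order('-__key__'))
            for key in q.run():
                car_id = key.name().rsplit(':', 1)[-1]
                if car_id in wanted and car_id not in found:
                    found.add(car_id)
                    results[car_id] = key
                    if found == wanted:
                        break  # this slot is done; move on to the next
        return results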

Let us know how your testing goes and what you end up doing,
Stephen

Tom Gibara

unread,
Feb 14, 2011, 5:23:45 PM2/14/11
to google-a...@googlegroups.com
Hi Edward,

(I'm by no means expert, so feel free to discount the following)

In my somewhat limited experience, I've identified only three ways of
making queries faster:

1. denormalizing data (pushing more work into the writes)
2. deferring work (typically via task queues)
3. admitting inconsistency (typically reconciling with cron jobs)

and I try to apply these techniques in that order. Disregarding (3)
(under the assumption that users need accurate results), you've
already explained that the data goes back to the client on the same
request, so (2) is only an option if the client is able to page the
results of queries over smaller lists of Cars, or poll for the
completion of a task. Perhaps this is an option.

As I see it, the crux of your problem is that you are trying to
retrieve the CarJourneys based on a list of Keys. Since App Engine
doesn't provide an equivalent to the SQL IN operator (one that won't
just repeat the query N times on your behalf), no amount of
denormalization is going to help you here, because you're always going
to be stuck making separate queries against each Car key OR plucking
out the small set of Cars you want from all the results.

IF there is something that identifies the Cars in your list (e.g. that
they are all associated with the same Company, say) then it might be
possible to denormalize that (e.g. adding the Company key to every
CarJourney entity would allow you to narrow down the dataset returned,
even if some application-level filtering was still necessary). Without
that, I can't see a direct way of significantly improving the query.

Tom.

Jay

unread,
Feb 15, 2011, 10:43:19 PM2/15/11
to google-a...@googlegroups.com
I am going to post two replies as I have a small idea that might help a little bit, and then a further post on the whole problem in general.

First, Ed, what you have said is consistent and the problem of identifying the "most recent Journey within a constrained time window" does appear to imply looking at all Journeys. So the question comes up, is there a way to exclude some of the Journeys? Is there perhaps some relationship between Journeys that has not been mentioned yet that could be used to make the data set smaller?
The only thing I came up with is the relationship to the search fidelity/granularity. As stated above, there is no differentiation _within_ one minute of time, and further, the boundary of that minute is well defined.

So, for any GIVEN minute, the system can at least determine, prior to search, that some Journeys are not the latest in that minute. Therefore, every time a new Journey starts, the system can (asynchronously of course) get all other Journeys that started in that same minute, and mark them as possibly_latest = false. By default, all Journeys are possibly_latest = true. And your search now includes this extra term, to narrow the results. [and one would use the obvious optimization of storing the minute on the Journey as an integer to make this search faster]

And of course, the system should also have a cron job that runs at some appropriate frequency to mark all Journeys that are more than 2 months old as possibly_latest = false. This is given the other constraint that the oldest searched Journeys start should be no more than 2 months ago.
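
A sketch of that write-time flagging (I'm reading "all other Journeys" as other journeys of the same car, since only those compete for "latest"; the property names are mine):

    import time
    from google.appengine.ext import db

    class CarJourney(db.Model):
        car = db.ReferenceProperty()
        startDate = db.DateTimeProperty()
        startMinute = db.IntegerProperty()  # minutes since the epoch
        possibly_latest = db.BooleanProperty(default=True)

    def to_minute(dt):
        return int(time.mktime(dt.timetuple())) // 60

    def mark_superseded(new_journey):
        # Run asynchronously after each write: within one minute there is
        # no differentiation, so older journeys of the same car in that
        # minute cannot be the latest.
        car_key = CarJourney.car.get_value_for_datastore(new_journey)
        q = (CarJourney.all()
             .filter('car =', car_key)
             .filter('startMinute =', new_journey.startMinute)
             .filter('possibly_latest =', True))
        stale = [j for j in q if j.key() != new_journey.key()]
        for j in stale:
            j.possibly_latest = False
        db.put(stale)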

This is no magic solution. But depending on the data, this could significantly reduce the number of Journeys that are searched (through index). 

Jay

unread,
Feb 15, 2011, 11:15:40 PM2/15/11
to google-a...@googlegroups.com
I also wanted to add this comment which is probably not helpful, but speaks more to the root of the problem. :)
And that is this: "you can't get there from here". The fundamental problem does not appear capable of being reduced beyond looking at a bunch of Journeys. I'm going to summarize a couple of things here, but in the end it just comes down to looking at a bunch of data ... unless you change something else.

A quick summary of some points:

1. You can de-normalize Car fields onto Journey as it relates to this search. So the problem does reduce, slightly, to looking only at Journeys. The "two lookup" aspect of it should not be much of a problem. I know this was mentioned above.

2. You can try to squash aspects of your search into the Key as Stephen suggested. This is a good idea but it does not change the nature of the problem - looking at a whole lot of entities, if only the keys of those entities.

These other possibilities are excluded from being viable because of the problem description.

3. Make the processing more parallel by turning it into a counting problem and using a map reduce approach or similar. This is the same as changing the requirements from "query" to "report" in practical terms. There is also the sticky issue of not having a reduce implementation on App Engine right now.

4. Changing the search criteria to be more restricted in some way that introduces a further relationship between Journeys. This additional information could probably be used to make the search perform better. In fact, one would engineer it in such a way, there would probably be no need to make a change in the way this search worked unless it resulted in this benefit.

5. There is also the "break the fourth wall" kind of solution. Take this part of the problem off of app engine. You would pay a lot more for this. Both in terms of complexity and cost (and maybe some other things too). It may not be realistic, but how realistic is the alternative? I hope you will report back at some stage and tell us.
(this might allow for a truly parallel query for example ... come on Datastore Plus!)

What I am getting at is that the system probably needs a higher-level change to achieve the desired goal, given the constraints that exist at this time. That is the "you can't get there from here". I think this has been virtually proven at this point, though I would be happy to be proved wrong. It comes down to trade-offs, like everything. You can often trade space for speed, etc. So you may need to look at trades between the existing use cases / user expectations and speed.

Thanks Ed for sharing with the group. This has been a good discussion and interesting to think about.

-- Jay

Edward Hartwell Goose

unread,
Feb 16, 2011, 5:39:25 AM2/16/11
to Google App Engine
Thanks very much all. We've had discussions with our client and have
opted to make a short-term fix to this problem by changing the
requirements (the easy option ;) ).

So, this means that we're not going to be trying to solve this problem
over the next 1-2 weeks. But we will be looking into it after that.

I'm also going to see if I can tell you who the client (and thus the
real problem) is as well, as your help has been invaluable.

Tom & Jay - thanks very much for your input too. I think there are
some great ideas in there!

I'll also ensure that when we do come to a solution - I'll post it for
all to see!

If anyone does have any more ideas, do continue posting, it's been
really interesting to get everyone's perspective!

Cheers,
Ed