Datastore is slow on queries involving many entities, but a smallish dataset

Eric Rannaud

unread,

Dec 1, 2009, 4:55:47 AM12/1/09

to google-a...@googlegroups.com

At http://www.911pagers.org/, a relatively simple query such as
http://www.911pagers.org/#rangeId0-128, which does:

SELECT * FROM MessageS where id >= 0 && id < 128 order by id

and fetches 128 entities, takes anywhere between 1s and 4s, usually
around +2.5s, measured on the server, in wall-clock time, with the
following code:

Calendar c = Calendar.getInstance();
long t0 = c.getTimeInMillis();
qmsgr = (List<MessageS>) qmsg.execute(lo, hi);
System.err.println("getCMIdRange:qmsg: " + (c.getTimeInMillis() - t0));

id is not the primary key of MessageS, but a 'long' field, _unique_ in
each entity (so it maps 1-to-1 with the entities). Note: such a query
doesn't require a specific index.

There are about 500,000 MessageS entities in the datastore, for a
total of 140 MB, excluding metadata (of which there are about 85 MB).

Repeating the same request several times in a row doesn't improve
response time. The delay has been pretty consistent over several days.
Note again that the delay is measured on the server, the latency of my
connection is not a factor.

I am aware App Engine is a preview and all, but 2.5s, really?

I can find the data faster in a text file split across 4 machines over
a shared Ethernet 10Mb with a bunch of Python for loops running on
OLPC XOs, for $DEITY's sake.

Eric.

Stephen

unread,

Dec 1, 2009, 2:02:25 PM12/1/09

to Google App Engine

On Dec 1, 9:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:
> Athttp://www.911pagers.org/, a relatively simple query such ashttp://www.911pagers.org/#rangeId0-128, which does:

Are you fetching all 128 entities in one batch? If you don't, the
result is fetched in batches of 20, incurring extra disk reads and rpc
overhead.

Not sure how you do that with the Java API, but with python you pass
'128' to the .fetch() method of a query object.

Eric Rannaud

unread,

Dec 1, 2009, 4:12:30 PM12/1/09

to google-a...@googlegroups.com

On Tue, Dec 1, 2009 at 11:02 AM, Stephen <sde...@gmail.com> wrote:
> On Dec 1, 9:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:
>> Calendar c = Calendar.getInstance();
>> long t0 = c.getTimeInMillis();
>> qmsgr = (List<MessageS>) qmsg.execute(lo, hi);
>> System.err.println("getCMIdRange:qmsg: " + (c.getTimeInMillis() - t0));
>

> Are you fetching all 128 entities in one batch? If you don't, the
> result is fetched in batches of 20, incurring extra disk reads and rpc
> overhead.
>
> Not sure how you do that with the Java API, but with python you pass
> '128' to the .fetch() method of a query object.

As far as I can tell, there is no such equivalent in the Java API. The
query.execute() statement returns a collection that is meant to
contain all the results. I don't know how they implement the
Collection object returned by query.execute(). Google may well manage
that in batches internally, inside the object with interface
List<MessageS>, but that would be nasty for performance.

I should say that a query with 1 result takes about 30ms. 128*30 =
3840 ms. That's pretty close to what I'm seeing for 128, indicating a
linear scaling in the number of entities. Which would be really bad,
and unexpected.

It's really hard to guess what's going on internally, without any
visibility of the architecture.

To see the impact of number of entities on response time, I did some
systematic testing:

Querying elements [0,10), [0,10), [0,10), [0,20), [0,20), [0,20),
[0,30), [0,30), [0,30), ... [0, 260), [0, 260), [0, 260) by increments
of 10, in a quick succession, three times each, actually shows a
pretty good performance behavior, the largest query with 260 entities
returned taking 300ms. So there is some kind of caching happening,
maybe. I didn't see that caching behavior earlier, but I wasn't doing
queries in such a quick succession.

But if I hit randomly in the datastore, i.e., [X+0,X+10), [X+0,X+20),
[X+0,X+30), ... [X+0, X+260), where X is random and different for
each request, 0 <= X < 500000, then pretty much all the queries take
between 1s and 4s, and we're back to more or less linear scaling in
the number of entities fetched. (With a query returning a single
entitiy taking 3s every so often.)

It does make some sense for random queries to take longer than a bunch
of queries in the same area of the datastore (except that there are no
guarantees that the locality in the datastore is related to the
ordering with respect to the field 'id'). But with the near linear
scaling in response time with the number of entities, say 30 ms per
entity, of average size 463 B, that's an implied bandwidth in the
backend of 120Kb/s. Which is not very good.

A last point, the field 'id' and the PrimaryKey of the entity MessageS
are effectively uncorrelated (with respect to their ordering). The
PrimaryKey is a String containing a MD5 hash of the content, the 'id'
is a long set incrementally.

Has anybody looked (publicly) at datastore performance depending on
query size, locality, etc? If not, I might try to gather some
extensive data, and write it up.

Thanks,
Eric.

yejun

unread,

Dec 1, 2009, 5:51:54 PM12/1/09

to Google App Engine

Does MessageS have a relation? Maybe those related field being
converted to separated queries because there's no table join.
In JPA you can specify those fields as Lazy fetch to avoid that.

On Dec 1, 4:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:
> Athttp://www.911pagers.org/, a relatively simple query such ashttp://www.911pagers.org/#rangeId0-128, which does:

Eric Rannaud

unread,

Dec 1, 2009, 6:06:55 PM12/1/09

to google-a...@googlegroups.com

On Tue, Dec 1, 2009 at 2:51 PM, yejun <yej...@gmail.com> wrote:
> Does MessageS have a relation? Maybe those related field being
> converted to separated queries because there's no table join.
> In JPA you can specify those fields as Lazy fetch to avoid that.

No, no relation.

yejun

unread,

Dec 1, 2009, 6:21:10 PM12/1/09

to Google App Engine

On Dec 1, 4:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:

> Athttp://www.911pagers.org/, a relatively simple query such ashttp://www.911pagers.org/#rangeId0-128, which does:

>
> SELECT * FROM MessageS where id >= 0 && id < 128 order by id
>
> and fetches 128 entities, takes anywhere between 1s and 4s, usually
> around +2.5s, measured on the server, in wall-clock time, with the
> following code:
>
> Calendar c = Calendar.getInstance();
> long t0 = c.getTimeInMillis();
> qmsgr = (List<MessageS>) qmsg.execute(lo, hi);
> System.err.println("getCMIdRange:qmsg: " + (c.getTimeInMillis() - t0));
>
> id is not the primary key of MessageS, but a 'long' field, _unique_ in
> each entity (so it maps 1-to-1 with the entities). Note: such a query
> doesn't require a specific index.

Why "such a query doesn't require a specific index"?
There's no such thing as unique constraint on google data store.

Eric Rannaud

unread,

Dec 1, 2009, 6:33:53 PM12/1/09

to google-a...@googlegroups.com

On Tue, Dec 1, 2009 at 3:21 PM, yejun <yej...@gmail.com> wrote:
>> id is not the primary key of MessageS, but a 'long' field, _unique_ in
>> each entity (so it maps 1-to-1 with the entities). Note: such a query
>> doesn't require a specific index.
>
> Why "such a query doesn't require a specific index"?

I mean, according to the documentation, you can execute such a query
without building an index for it, as App Engine provides automatic
indexes in such a case.

"App Engine provides automatic indexes for the following forms of queries:
- queries using only equality and ancestor filters
- queries using only inequality filters (which can only be of a
single property)
- queries with no filters and only one sort order on a property,
either ascending or descending
- queries using equality filters on properties and inequality or
range filters on keys"

My query is of the second type.

> There's no such thing as unique constraint on google data store.

I don't understand what you mean.

yejun

unread,

Dec 1, 2009, 7:21:17 PM12/1/09

to Google App Engine

Did you enable auto index in datastore-indexes.xml and generated
datastore-indexes-auto.xml is correct?

This is related to issue #178 http://code.google.com/p/googleappengine/issues/detail?id=178

Eric Rannaud

unread,

Dec 1, 2009, 7:32:10 PM12/1/09

to google-a...@googlegroups.com

On Tue, Dec 1, 2009 at 4:21 PM, yejun <yej...@gmail.com> wrote:
> Did you enable auto index in datastore-indexes.xml and generated
> datastore-indexes-auto.xml is correct?

Yes. But that's irrelevant, as an index is not needed. And by the way,
you can try to define an index over a single property, but it will be
rejected by appcfg, as App Engine provides those implicitly.

>> > There's no such thing as unique constraint on google data store.
>>
>> I don't understand what you mean.
>
> This is related to issue #178 http://code.google.com/p/googleappengine/issues/detail?id=178

Oh I see. I never talked about a "unique(ness) constraint". I merely
stated that the value of the field id *happens* to be unique accross
all MessageS entities.

Thanks,
Eric.

yejun

unread,

Dec 1, 2009, 7:50:06 PM12/1/09

to Google App Engine

You still need to enable auto index.
http://code.google.com/appengine/docs/java/config/indexconfig.html

On Dec 1, 7:32 pm, Eric Rannaud <eric.rann...@gmail.com> wrote:
> On Tue, Dec 1, 2009 at 4:21 PM, yejun <yej...@gmail.com> wrote:
> > Did you enable auto index in datastore-indexes.xml and generated
> > datastore-indexes-auto.xml is correct?
>
> Yes. But that's irrelevant, as an index is not needed. And by the way,
> you can try to define an index over a single property, but it will be
> rejected by appcfg, as App Engine provides those implicitly.
>
> >> > There's no such thing as unique constraint on google data store.
>
> >> I don't understand what you mean.
>

> > This is related to issue #178http://code.google.com/p/googleappengine/issues/detail?id=178

Eric Rannaud

unread,

Dec 1, 2009, 7:58:09 PM12/1/09

to google-a...@googlegroups.com

On Tue, Dec 1, 2009 at 4:50 PM, yejun <yej...@gmail.com> wrote:
> You still need to enable auto index.
> http://code.google.com/appengine/docs/java/config/indexconfig.html

Not at all. "auto index" as you call it, is different from the
implicit indexing feature of datastore that I quoted in a previous
email.

datastore-indexes-auto.xml is just an helper, which concerns solely
the development server. When running on the development server, when
you perform a query for which an explicit index would have to be
built, instead of rejecting the query (which is what the real app
engine does), it accepts the query, and adds to
datastore-indexes-auto.xml the configuration you would have to put in
your datastore-indexes.xml, just to help you figure out what do.

This later file is used by appcfg to demand an explicit index build.
If configured to do so, appcfg can use the content of
datastore-indexes-auto.xml as well, as another configuration file for
demanding an index build.

Besides, I do have this "auto index" enabled, and always had.

Thanks,
Eric.

Stephen

unread,

Dec 1, 2009, 9:49:27 PM12/1/09

to Google App Engine

On Dec 1, 9:12 pm, Eric Rannaud <eric.rann...@gmail.com> wrote:

> On Tue, Dec 1, 2009 at 11:02 AM, Stephen <sdea...@gmail.com> wrote:
> > On Dec 1, 9:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:
> >> Calendar c = Calendar.getInstance();
> >> long t0 = c.getTimeInMillis();
> >> qmsgr = (List<MessageS>) qmsg.execute(lo, hi);
> >> System.err.println("getCMIdRange:qmsg: " + (c.getTimeInMillis() - t0));
>
> > Are you fetching all 128 entities in one batch? If you don't, the
> > result is fetched in batches of 20, incurring extra disk reads and rpc
> > overhead.
>
> > Not sure how you do that with the Java API, but with python you pass
> > '128' to the .fetch() method of a query object.
>
> As far as I can tell, there is no such equivalent in the Java API. The
> query.execute() statement returns a collection that is meant to
> contain all the results. I don't know how they implement the
> Collection object returned by query.execute(). Google may well manage
> that in batches internally, inside the object with interface
> List<MessageS>, but that would be nasty for performance.

Something like this..?

DatastoreService datastore =
DatastoreServiceFactory.getDatastoreService();

Query query = new Query("MessageS");
query.addFilter("id", Query.FilterOperator.GREATER_THAN_OR_EQUAL, 0);

List<Entity> messages = datastore.prepare(query)
.asList(FetchOptions.Builder.withLimit(128));

You might also have to tweak chunkSize and/or prefetchSize, or ask on
the Java list.

> A last point, the field 'id' and the PrimaryKey of the entity MessageS
> are effectively uncorrelated (with respect to their ordering). The
> PrimaryKey is a String containing a MD5 hash of the content, the 'id'
> is a long set incrementally.

If you are querying mostly by id then it may make sense to make the
primary key an integer id and the hash a property.

Then you would have two options for fetching: a query, like before,
but this time against the __key__ pseudo property which should be
faster than looking at the secondary index; and you can also do a
direct 'get' which should be faster than any query:

messages = MessageS.get_by_id(range(0, 128))

(Java has similar APIs)

> I should say that a query with 1 result takes about 30ms. 128*30 =
> 3840 ms. That's pretty close to what I'm seeing for 128, indicating a
> linear scaling in the number of entities. Which would be really bad,
> and unexpected.
>
> It's really hard to guess what's going on internally, without any
> visibility of the architecture.

This is a recent change. It used to be that you were charged whatever
api_cpu it took to run your query, as measured on the machines. Now
there seems to be an algorithm that generates the cost based on your
entities and query type, so it will be the same from query to query.

This is good because now your costs do not suddenly go up 30% because
Google's infrastructure is having a bad day. The incentives used to be
all wrong. The change is bad because Google didn't announce it. Are
the api_cpu costs exactly the same as before? If not, it is an
unannounced price in/decrease.

> Has anybody looked (publicly) at datastore performance depending on
> query size, locality, etc? If not, I might try to gather some
> extensive data, and write it up.

It would be nice to work out what the algorithm for api_cpu is...

Eric Rannaud

unread,

Dec 1, 2009, 10:02:05 PM12/1/09

to google-a...@googlegroups.com

On Tue, Dec 1, 2009 at 6:49 PM, Stephen <sde...@gmail.com> wrote:
>> I should say that a query with 1 result takes about 30ms. 128*30 =
>> 3840 ms. That's pretty close to what I'm seeing for 128, indicating a
>> linear scaling in the number of entities. Which would be really bad,
>> and unexpected.
>>
>> It's really hard to guess what's going on internally, without any
>> visibility of the architecture.
>
>
> This is a recent change. It used to be that you were charged whatever
> api_cpu it took to run your query, as measured on the machines. Now
> there seems to be an algorithm that generates the cost based on your
> entities and query type, so it will be the same from query to query.
>
> This is good because now your costs do not suddenly go up 30% because
> Google's infrastructure is having a bad day. The incentives used to be
> all wrong. The change is bad because Google didn't announce it. Are
> the api_cpu costs exactly the same as before? If not, it is an
> unannounced price in/decrease.
>
>
>> Has anybody looked (publicly) at datastore performance depending on
>> query size, locality, etc? If not, I might try to gather some
>> extensive data, and write it up.
>
>
> It would be nice to work out what the algorithm for api_cpu is...

Note: I am *not* using api_cpu time (not available in the Java
runtime, as far I know). I am using system wall-clock time.
I am doing the Java equivalent of two gettimeofday() and computes the
difference.

So the way api_cpu and co are computed is an entirely different question.

Eli Jones

unread,

Dec 1, 2009, 10:33:51 PM12/1/09

to google-a...@googlegroups.com

If you get bored, you could try creating a python version of your app
that just times how long it takes to fetch(128) of the entities.

Maybe the Java stuff isn't as optimized as the Python.. It seems odd
that python defaults to 20 results at a time while Java just grabs the
full result set. There may be something else going on under the hood.

> --
>
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>
>
>

--
Sent from my mobile device

Julian Namaro

unread,

Dec 2, 2009, 12:48:15 AM12/2/09

to Google App Engine

Hi Eric,

Of course this kind of performance is in no way normal.
Such a simple query typically takes below 50ms.
So either there is a problem in your code, or you hit a bug in the
java SDK.
I suggest you post a more complete code sample in the App Engine Java
group.

And no need to be condescending here. You'll see that the datastore is
a great piece of engineering :)
Query performance is completely independent of the number of entities
you have.
It depends on the size of the result set, but results are fetched in
parallel so the overhead is mainly the time needed to deserialize
entities. For the same reason locality is not a factor.

On Dec 2, 6:12 am, Eric Rannaud <eric.rann...@gmail.com> wrote:

Eric Rannaud

unread,

Dec 2, 2009, 1:22:09 AM12/2/09

to google-a...@googlegroups.com

On Tue, Dec 1, 2009 at 9:48 PM, Julian Namaro <namaro...@gmail.com> wrote:
> Of course this kind of performance is in no way normal.
> Such a simple query typically takes below 50ms.

Good to hear; I will post more code. Thanks.

> And no need to be condescending here. You'll see that the datastore is
> a great piece of engineering :)

Point taken.

It is the fourth major performance problem that I reported in my whole
5 days of using App Engine, at no point getting feedback from the
Google people that are active here. My apologies for growing somewhat
frustrated.

- Index on 120MB of raw data taking 12h+ to build:
http://groups.google.com/group/google-appengine/browse_thread/thread/828d543bd5d03683/498ab1b297b15a44

- Ordered queries returning smaller dataset than non-ordered queries:
http://groups.google.com/group/google-appengine/browse_thread/thread/3d0ce8d09242db60/dd15fbc070e0abbe

- This thread.

- Index deletion taking many hours, and counting. Index in "Error"
state that cannot be deleted.
http://code.google.com/p/googleappengine/issues/detail?id=2460

Thanks.

Stephen

unread,

Dec 2, 2009, 6:45:30 AM12/2/09

to Google App Engine

On Dec 2, 3:02 am, Eric Rannaud <eric.rann...@gmail.com> wrote:

It should be in the logs, which are accessible via the admin console.

Eric Rannaud

unread,

Dec 2, 2009, 3:03:50 PM12/2/09

to google-a...@googlegroups.com, google-app...@googlegroups.com

Crossposting to App Engine Java group: the original thread is at
http://groups.google.com/group/google-appengine/browse_thread/thread/22018ef2e132ac13/54485b787d5a80b5

In a few words: I have a problem with reasonable queries taking a very
long time (several seconds). These queries return 128 entities, from a
total of 500,000 entities of that type in the datastore. Each entity
is about 400 bytes.

On Tue, Dec 1, 2009 at 6:49 PM, Stephen <sde...@gmail.com> wrote:
> On Dec 1, 9:12 pm, Eric Rannaud <eric.rann...@gmail.com> wrote:
>> On Tue, Dec 1, 2009 at 11:02 AM, Stephen <sdea...@gmail.com> wrote:
>>> On Dec 1, 9:55 am, Eric Rannaud <eric.rann...@gmail.com> wrote:

>>>> SELECT * FROM MessageS where id >= 0 && id < 128 order by id
>>>>

>>>> Calendar c = Calendar.getInstance();
>>>> long t0 = c.getTimeInMillis();
>>>> qmsgr = (List<MessageS>) qmsg.execute(lo, hi);
>>>> System.err.println("getCMIdRange:qmsg: " + (c.getTimeInMillis() - t0));
>>
>>> Are you fetching all 128 entities in one batch? If you don't, the
>>> result is fetched in batches of 20, incurring extra disk reads and rpc
>>> overhead.
>>
>>> Not sure how you do that with the Java API, but with python you pass
>>> '128' to the .fetch() method of a query object.
>>
>> As far as I can tell, there is no such equivalent in the Java API. The
>

> Something like this..?
>
> DatastoreService datastore =
> DatastoreServiceFactory.getDatastoreService();
>
> Query query = new Query("MessageS");
> query.addFilter("id", Query.FilterOperator.GREATER_THAN_OR_EQUAL, 0);
>
> List<Entity> messages = datastore.prepare(query)
> .asList(FetchOptions.Builder.withLimit(128));
>
> You might also have to tweak chunkSize and/or prefetchSize, or ask on
> the Java list.

I did some tests with the code you proposed. The performance remains
essentially the same as with the JDO API, i.e. between 1 and 4 second
per "execute"/"prepare" statement (2.5s on average).

DatastoreService datastore =
DatastoreServiceFactory.getDatastoreService();
Query query = new Query("MessageS");

query.addFilter("id", Query.FilterOperator.GREATER_THAN_OR_EQUAL, lo);
query.addFilter("id", Query.FilterOperator.LESS_THAN, hi);

long t0 = Calendar.getInstance().getTimeInMillis();

List<Entity> r = datastore.prepare(query)
.asList(FetchOptions.Builder
.withLimit(128)
.prefetchSize(128)
.chunkSize(128));

System.err.println("LOW:getCMIdRange:qmsg: "
+ (Calendar.getInstance().getTimeInMillis() - t0)
+ " " + r.size());

Thanks.

Julian Namaro

unread,

Dec 3, 2009, 1:07:36 PM12/3/09

to Google App Engine

Eric,

Sorry my previous answer was wrong. The query time increases with the
number of entities fetched faster than I thought.
From a small benchmark in Python average time for your query is
150-200ms.
This is for fetching 130 entities of about 400 bytes, with 5
properties each, using an inequality filter.

But still, one second is definitively too long. Hope the Java gurus
can help you!

On Dec 3, 5:03 am, Eric Rannaud <eric.rann...@gmail.com> wrote:
> Crossposting to App Engine Java group: the original thread is athttp://groups.google.com/group/google-appengine/browse_thread/thread/...

Reply all

Reply to author

Forward