Large datastore queries much slower with Python than Java?


Darshan Shaligram

Oct 11, 2010, 9:53:53 AM
to Google App Engine
This is a followup query to my question on stackoverflow:
http://stackoverflow.com/questions/3886341/is-appengine-python-datastore-query-much-3x-slower-than-java

I've been evaluating App Engine to decide between Python and Java,
and I noticed a large performance difference in datastore queries:
large queries are much slower in Python (by a factor of more than 3x)
than in Java. I'd like to confirm that this performance difference is
known behaviour, and not some mistake I'm making in my Python code.

My test entity looks like this:

Person
======
firstname (length 8)
lastname (length 8)
address (length 20)
city (length 10)
state (length 2)
zip (length 5)

I populate the datastore with 2000 Person records, with each field
exactly the length noted here, all filled with random data and with no
fields indexed (just so the inserts go faster).

I then query 1k Person records from Python (no filters, no ordering):

from google.appengine.api import datastore

q = datastore.Query("Person")
objects = list(q.Get(1000))

And 1k Person records from Java (likewise no filters, no ordering):

// Assumes: import static
// com.google.appengine.api.datastore.FetchOptions.Builder.withLimit;
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("Person");
PreparedQuery pq = ds.prepare(q);
// Force the query to run and return objects so we can be sure
// we've timed a full query.
List<Entity> entityList =
    new ArrayList<Entity>(pq.asList(withLimit(1000)));

With this code, the Java code returns results in ~200ms; the Python
code takes much longer, averaging >700ms. Both apps are on the same
app id (with different versions), so they use the same datastore and
should be on a level playing field.

I repeated the same test with much smaller fetches (fetch size 10-30)
and the small fetches show essentially the same performance for both
Python and Java, so the Python slowness affects only large fetches.


All my code is available here, in case I've missed any details:
http://github.com/greensnark/appenginedatastoretest


I also instrumented the sample apps with appstats (as suggested on
stackoverflow), and reran the tests (1k record fetch). Appstats
reports times like this "datastore_v3.RunQuery real=122ms api=9179ms"
for Java and times like "datastore_v3.RunQuery real=377ms api=9179ms"
for Python. I'm not entirely clear on how to read the appstats times.
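For reference, enabling appstats in the Python runtime takes only a small
appengine_config.py; a minimal sketch (assuming the classic Python 2
runtime with a webapp-based app):

```python
# appengine_config.py -- minimal appstats setup (sketch; assumes the
# classic Python 2 App Engine runtime and a webapp/WSGI application)
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
    # Wrap every request so appstats records RPC timings
    # (visible at /_ah/stats).
    return recording.appstats_wsgi_middleware(app)
```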

From my examination of the Python code in
google.appengine.api.datastore, it looks like most of the extra
slowdown in the Python code involves decoding the queried entities
from their protocol buffers, but I haven't benchmarked this to be
sure.
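One way to sanity-check that suspicion off App Engine is to time pure
deserialization in isolation. The sketch below uses pickle as a stand-in
codec (the real entity_pb decoder isn't available outside the SDK), with
records shaped like the Person test entities -- it only illustrates the
shape of such a micro-benchmark, not actual datastore decode costs:

```python
import pickle
import time

# Stand-in "entities": dicts shaped like the Person test records.
entities = [
    {"firstname": "a" * 8, "lastname": "b" * 8, "address": "c" * 20,
     "city": "d" * 10, "state": "ef", "zip": "12345"}
    for _ in range(1000)
]
# Serialize once up front, so the timed loop measures decoding only.
blobs = [pickle.dumps(e) for e in entities]

start = time.time()
decoded = [pickle.loads(b) for b in blobs]
elapsed = time.time() - start

print("decoded %d records in %.1f ms" % (len(decoded), elapsed * 1000.0))
```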

Could anyone confirm whether large datastore queries are simply slower
in Python because Python is intrinsically slower than Java, or whether
my code is broken in some way that's hurting the performance of the
Python version?

Eli Jones

Oct 11, 2010, 5:18:26 PM
to google-a...@googlegroups.com
Well.. for one.. you are doing a datastore.Query() instead of a db.Query()

Almost all documentation on working with the datastore says to use db from google.appengine.ext instead of datastore from google.appengine.api.

Maybe there is a difference in how they perform in this context?
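For comparison, the same fetch through the high-level db API would look
roughly like this (a sketch that only runs inside the App Engine runtime;
the Person model definition here is assumed, not taken from the original
code):

```python
from google.appengine.ext import db

# Assumed model matching the test entity's fields.
class Person(db.Model):
    firstname = db.StringProperty()
    lastname = db.StringProperty()
    address = db.StringProperty()
    city = db.StringProperty()
    state = db.StringProperty()
    zip = db.StringProperty()

q = Person.all()        # no filters, no ordering
people = q.fetch(1000)  # high-level counterpart of datastore Query.Get(1000)
```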

Also, are you doing these tests on App Engine or in the dev_appserver? (I'm presuming you're running them on live App Engine.. but just to be sure.)




Eli Jones

Oct 11, 2010, 5:35:14 PM
to google-a...@googlegroups.com
From glancing at the two (Get() from datastore.Query and fetch() from db.Query):

It seems like query.fetch(limit) from db just wraps query.Get(limit) from datastore.. but it never hurts to do it the kosher way.. in case something else is happening under the hood on live App Engine.

Also, maybe your Stopwatch() does something to slow it down? Try using appstats on it without the Stopwatch(). (Though I can't imagine how that would result in it running several hundred milliseconds slower.)

Darshan Shaligram

Oct 11, 2010, 6:26:18 PM
to google-a...@googlegroups.com
On Mon, Oct 11, 2010 at 5:18 PM, Eli Jones <eli....@gmail.com> wrote:
> Well.. for one.. you are doing a datastore.query() instead of a db.query()
> Most all documentation on working with the datastore indicates to use db
> from google.appengine.ext instead of datastore from google.appengine.api.
> Maybe there is a difference in how they perform in this context?

The performance of the high-level db API is pretty similar to the
performance of the low-level datastore API. I used the low-level API
so that I could keep the code reasonably similar for both Python and
Java. I'm much less familiar with the Java datastore API and I didn't
want to use Java layers that might muddy the waters performance-wise.
Having used the low-level API for Java, I used the closest
corresponding APIs for Python.

The reason I wrote these simple Python and Java projects was to
investigate datastore query performance issues I had in real Python
code (using google.appengine.ext.db), and comparing notes with a
colleague who was familiar with Java datastore performance.

> Also, are you doing these tests on Appengine or in the Dev_appserver?

These tests are on App Engine, not the dev server. I use the same
application id for both the Java and Python code, deployed as
different versions (Java = v1, Python = v2) so that they share the
same datastore.

Remigius

Oct 12, 2010, 3:52:26 AM
to Google App Engine
Darshan,

Your API times being the same for both implementations suggests that
you are indeed doing the same operations in the data store.

I read the timings "datastore_v3.RunQuery real=122ms api=9179ms"
and "datastore_v3.RunQuery real=377ms api=9179ms" as showing
real=<elapsed time> and api=<CPU time spent inside the GAE API>, which
also explains why the latter can be much larger than the former: for
data access, I assume the CPU time may be spent on several server
nodes (the datastore is distributed). In the logs accessible from the
dashboard I usually also see the total CPU time (GAE API + user),
which lets me calculate the CPU time spent in my own code. In addition
I have my own timer that starts on entering my request-handling code
and ends on exit (this separates my own elapsed time from the total
elapsed time per request), plus some counters for the datastore API
calls.
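A per-request timer like the one described can be as simple as a
decorator around the handler; a generic Python sketch (the names here are
illustrative, not from any of the original code):

```python
import functools
import time

def timed(fn):
    """Record wall-clock time spent inside the wrapped handler."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            # Stash the elapsed time where callers (or a log line)
            # can read it after the request completes.
            wrapper.last_elapsed_ms = (time.time() - start) * 1000.0
    return wrapper

@timed
def handle_request():
    time.sleep(0.01)  # stand-in for real request handling
    return "ok"

result = handle_request()
print("%s in %.1f ms" % (result, handle_request.last_elapsed_ms))
```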

Cheers, R.


Tim Hoffman

Oct 12, 2010, 7:43:45 AM
to Google App Engine
Hi

I know this is a nit, but with the Python version, is there any
particular reason why you are doing

objects = list(q.Get(fetch_size))
nfetched += len(objects)

at line 108 in test_load_query.py

the list() is redundant; the result from Get() is for all purposes a
list, or at least so exceedingly list-like that you wouldn't bother
creating another list from it:

objects = q.Get(fetch_size)
nfetched += len(objects)

It really won't make any difference to the performance.

Rgds

T


Darshan Shaligram

Oct 12, 2010, 10:08:23 AM
to google-a...@googlegroups.com
On Tue, Oct 12, 2010 at 7:43 AM, Tim Hoffman <zute...@gmail.com> wrote:

> I know this is a nit, but with the python version, is there any
> particular reason why you are doing

> objects = list(q.Get(fetch_size))
> nfetched += len(objects)

> at line 108 in test_load_query.py

> the list() is redundant, the result from Get is for all purposes is a
> list or at least exceedingly list like that you wouldn't bother
> creating another list from it.

You're right, that was unnecessary, removed.

[...]


> It really won't make any difference to the performance.

Yes, confirmed. :-)

sodso

Oct 14, 2010, 6:01:09 PM
to google-a...@googlegroups.com
People say doing batch gets and puts does improve performance:

batch get = db.get(list of keys)
batch put = db.put(list of model instances)

Hope this helps.
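A sketch of what batching looks like with the db API (this only runs
inside the App Engine runtime; the Person model here is assumed for
illustration):

```python
from google.appengine.ext import db

# Assumed minimal model for the example.
class Person(db.Model):
    firstname = db.StringProperty()

# Batch put: one datastore round trip for many entities,
# instead of one put() per entity.
people = [Person(firstname="f%d" % i) for i in range(100)]
keys = db.put(people)   # db.put() accepts a list, returns a list of keys

# Batch get: one round trip to fetch them all back by key.
fetched = db.get(keys)  # db.get() accepts a list of keys
```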