performance issue with SDK datastore with large volume (>1000 rows)

blep

Aug 3, 2008, 1:01:04 PM8/3/08
to Google App Engine
I'm running into a performance issue with the datastore stub
provided with the SDK: the initial insertion time for a row is about
0.3 seconds when no rows are present, but the insertion time
clearly increases with the number of rows (now about 5s per row with a
few thousand rows, increasing roughly linearly with the row count).

All of this was absorbed by the App Engine production datastore (i.e. not
the SDK datastore) at a constant time of 0.3 seconds, including
internet round-trip time. The final volume is reported as about
30 MB in the dashboard for 43,000 rows.

This makes it difficult to develop using the local devenv, as I cannot
reproduce the production environment locally (my application is about
querying a large dataset, and all the complexity is in handling
complex queries efficiently)...

The original dataset is stored in a Python bsddb (250 MB) and I have
no such issue when querying or feeding it.

Did anyone run into a similar issue?

Platform: Windows XP SP3 - Python 2.5.2 - GAE SDK 1.1.1

nchauvat (Logilab)

Aug 3, 2008, 1:14:22 PM8/3/08
to Google App Engine
On Aug 3, 19:01, blep <baptiste.lepill...@gmail.com> wrote:
> I'm running into a performance issue with the datastore stub
> provided with the SDK: ...

This was reported before and the only answer was "the SDK is for making
development easier, not for simulating the performance of the actual
production environment".

> The original dataset is stored in a Python bsddb (250 MB) and I have no such issue when querying or feeding it.

You have the source code, so you could look for the bottleneck and
fix it, but I would suggest running performance tests on the actual
server instead, since the local db will never have the performance of
the actual servers anyway. If it does not block at 1e3, it will block
at 1e6 or 1e9.

Since you say that the underlying bsddb does not have the same speed
issues, you might open a ticket for tracking this at
http://code.google.com/p/googleappengine/issues/list

blep

Aug 3, 2008, 2:19:15 PM8/3/08
to Google App Engine
On Aug 3, 19:14, "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
wrote:
> On Aug 3, 19:01, blep <baptiste.lepill...@gmail.com> wrote:
>
> > I'm running into a performance issue with the datastore stub
> > provided with the SDK: ...
>
> This was reported before and the only answer was "the SDK is for making
> development easier, not for simulating the performance of the actual
> production environment".

I know that. I'm not trying to do performance simulation, just to
load enough data into the local devenv to be able to do basic tests
(checking/debugging algorithms...). I've already seen that the performance
of the production environment is vastly different from my local one.

> > The original dataset is stored in a Python bsddb (250 MB) and I have no such issue when querying or feeding it.
>
> You have the source code, so you could look for the bottleneck and
> fix it, but I would suggest running performance tests on the actual
> server instead, since the local db will never have the performance of
> the actual servers anyway. If it does not block at 1e3, it will block
> at 1e6 or 1e9.

I don't expect the local test system to be able to play around with
terabytes of data like the production environment could. But currently,
my local datastore is barely 2 MB and has a few thousand rows. Being
able to handle at least 100,000 rows locally seems like a reasonable
target to me. This of course implies being able to insert them in a
reasonable time.

I just took a quick look at the code in datastore_file_stub.py,
and it seems that _Dynamic_Put(), which if I guess correctly is called
somehow by model.put(), calls __WriteDatastore(), which seems to
simply pickle all entities into a new file each time. So unless I'm
mistaken, the stub implementation goes through all the entities and
pickles them into a new file each time a put/commit is done, which
explains the linear increase in time taken to put an entity...
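
For illustration, here is a minimal sketch of that pattern (simplified,
hypothetical names, not the actual SDK code): because every put
re-pickles the entire entity map to a file, the cost of one insert grows
with the number of entities already stored, so loading N rows costs
roughly O(N^2) overall.

import cPickle as pickle

class NaiveFileStub(object):
    """Toy model of the behaviour described above, not the real stub."""

    def __init__(self, path):
        self.path = path
        self.entities = {}  # key -> entity, everything kept in memory

    def put(self, key, entity):
        self.entities[key] = entity
        self._write_datastore()  # rewrites the whole file: O(total entities)

    def _write_datastore(self):
        f = open(self.path, 'wb')
        try:
            pickle.dump(self.entities, f, pickle.HIGHEST_PROTOCOL)
        finally:
            f.close()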

> Since you say that the underlying bsddb does not have the same speed
> issues, you might open a ticket for tracking this at
> http://code.google.com/p/googleappengine/issues/list

I'm just saying that bsddb, which is available with the standard Python
distribution, does not have this issue (I'm using it with more than
250K rows without any performance problem). The SDK could use it.
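
For context, the standard-library module mentioned here behaves roughly
like a persistent string-to-string dict; a minimal, stand-alone example
(the file name and keys are made up):

import bsddb

db = bsddb.hashopen('dataset.db', 'c')       # 'c': create the file if missing
db['row-00001'] = 'serialized entity bytes'  # inserts stay fast as the file grows
value = db['row-00001']
db.sync()                                    # flush changes to disk
db.close()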

Do you know if there is already an open issue for this? I could not
find it, and you said this has already been reported...

nchauvat (Logilab)

Aug 3, 2008, 4:06:57 PM8/3/08
to Google App Engine
On Aug 3, 20:19, blep <baptiste.lepill...@gmail.com> wrote:
> I know that. I'm not trying to do performance simulation, just to
> load enough data into the local devenv to be able to do basic tests
> (checking/debugging algorithms...). I've already seen that the performance
> of the production environment is vastly different from my local one.

I misunderstood your question then.

> I just took a quick look at the code in datastore_file_stub.py,
> and it seems that _Dynamic_Put(), which if I guess correctly is called
> somehow by model.put(), calls __WriteDatastore(), which seems to
> simply pickle all entities into a new file each time. So unless I'm
> mistaken, the stub implementation goes through all the entities and
> pickles them into a new file each time a put/commit is done, which
> explains the linear increase in time taken to put an entity...
> [...]
> I'm just saying that bsddb, which is available with the standard Python
> distribution, does not have this issue (I'm using it with more than
> 250K rows without any performance problem). The SDK could use it.
>
> Do you know if there is already an open issue for this? I could not
> find it, and you said this has already been reported...

Sounds like you have almost solved it already. This
http://code.google.com/p/googleappengine/issues/detail?id=390 is the
same problem as the one you are about to fix. I suggest you add the
above information there and post a patch if you ever find time to
write one (I would not be surprised if the Google gang were to include
it in the next release, btw).

blep

Aug 4, 2008, 4:31:28 AM8/4/08
to Google App Engine


On Aug 3, 22:06, "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
wrote:
> On Aug 3, 20:19, blep <baptiste.lepill...@gmail.com> wrote:
> [...]
> > Do you know if there is already an open issue for this? I could not
> > find it, and you said this has already been reported...
>
> Sounds like you have almost solved it already. This
> http://code.google.com/p/googleappengine/issues/detail?id=390 is the
> same problem as the one you are about to fix. I suggest you add the
> above information there and post a patch if you ever find time to
> write one (I would not be surprised if the Google gang were to include
> it in the next release, btw).

I've made a small patch, which is more of a work-around than anything
else. I added an option so that the file is only saved every N
seconds. I now get "constant" time insertion at 0.1s per row. The
only potential issue is memory usage, with a 10 MB pickled file
using more than 300 MB of memory, but that is easier to deal with. From
what I've seen it should be possible to use an embedded database back-
end, as the requirement is just to be able to update multiple key/value
pairs in a single transaction. Though I've seen thread locking, and such
databases usually cannot be accessed from multiple threads... Is the
devenv server multi-threaded?
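
For reference, a rough sketch of that kind of work-around (hypothetical
names and default interval, not the actual patch): puts update the
in-memory dict immediately, and the pickle file is only rewritten when at
least flush_interval seconds have passed since the last write.

import time
import cPickle as pickle

class ThrottledFileStub(object):
    """Toy illustration of the work-around: flush to disk at most once
    every flush_interval seconds instead of on every put."""

    def __init__(self, path, flush_interval=30):
        self.path = path
        self.flush_interval = flush_interval
        self.entities = {}
        self._last_flush = 0.0

    def put(self, key, entity):
        self.entities[key] = entity
        if time.time() - self._last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        f = open(self.path, 'wb')
        try:
            pickle.dump(self.entities, f, pickle.HIGHEST_PROTOCOL)
        finally:
            f.close()
        self._last_flush = time.time()

The trade-off is that a flush() is still needed on shutdown, otherwise
the last window of writes is lost.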

Dumb question: how do you "star" an issue? I haven't found anything in
the help (or I missed it)...

nchauvat (Logilab)

Aug 4, 2008, 6:38:34 AM8/4/08
to Google App Engine
On Aug 4, 10:31, blep <baptiste.lepill...@gmail.com> wrote:
> I've made a small patch, which is more of a work-around than anything
> ...

Nice.

> what I've seen it should be possible to use an embedded database back-
> end, as the requirement is just to be able to update multiple key/value
> pairs in a single transaction. Though I've seen thread locking, and such
> databases usually cannot be accessed from multiple threads... Is the
> devenv server multi-threaded?

That sounds like a new project: implement a better backend for
dev_appserver :)

> Dumb question: how do you "star" an issue? I haven't found anything in
> the help (or I missed it)...

When you are logged in, click the white star to the left of the
title when viewing the issue details.

Aral Balkan

Aug 4, 2008, 9:26:14 AM8/4/08
to Google App Engine
Hey blep,

> I've made a small patch, which is more of a work-around than anything
> else. I added an option so that the file is only saved every N
> seconds. I now get "constant" time insertion at 0.1s per row.

That's awesome. Downloading the patch now.

This is going to help so much with restoring datastore backups
locally. (I've just gotten datastore backups working).

> databases usually cannot be accessed from multiple threads... Is the
> devenv server multi-threaded?

Nope. It handles just one request at a time.

Thanks again for the patch! :)

Aral

blep

Aug 4, 2008, 4:29:37 PM8/4/08
to Google App Engine
On Aug 4, 15:26, Aral Balkan <aralbal...@gmail.com> wrote:
> Hey blep,
>
> > I've made a small patch, which is more of a work-around than anything
> > else. I added an option so that the file is only saved every N
> > seconds. I now get "constant" time insertion at 0.1s per row.
>
> That's awesome. Downloading the patch now.

Note that you may run into another "performance" issue with the
datastore stub: from what I've seen, query processing is also linear in
complexity (it loops over all the entities of a given kind and applies
the filters to each one). But this should not be as much of a show-stopper
as writing the file on each put, since all entities are in memory.
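
As a rough illustration of that behaviour (simplified, not the stub's
actual code): each query walks every stored entity of the requested kind
and applies the filters one by one, so query time grows linearly with the
number of entities of that kind.

def run_query(entities_by_kind, kind, filters):
    """Naive full scan: cost is proportional to the number of entities
    of the given kind, no matter how selective the filters are."""
    results = []
    for entity in entities_by_kind.get(kind, []):
        if all(predicate(entity) for predicate in filters):
            results.append(entity)
    return results

# Hypothetical usage: all Person entities older than 30.
# adults = run_query(store, 'Person', [lambda e: e.get('age', 0) > 30])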