performance issue with SDK datastore with large volume (>1000 rows)
blep  
From: blep <baptiste.lepill...@gmail.com>
Date: Sun, 3 Aug 2008 10:01:04 -0700 (PDT)
Local: Sun, Aug 3 2008 1:01 pm
Subject: performance issue with SDK datastore with large volume (>1000 rows)
I'm running into a performance issue with the datastore stub
provided with the SDK: inserting a row takes about 0.3 seconds when
the datastore is empty, but the insertion time clearly increases with
the number of rows (now about 5 s per row with a few thousand rows,
growing roughly linearly with the row count).

The App Engine production datastore (i.e. not the SDK datastore)
absorbed the same data at a constant 0.3 seconds per row, including
Internet round-trip time. The dashboard reports a final volume of about
30 MB for 43,000 rows.

This makes it difficult to develop using the local devenv, as I cannot
reproduce the production environment locally (my application is about
querying a large dataset, and all the complexity is in handling
complex queries efficiently)...

The original dataset is stored in a Python bsddb database (250 MB), and I
have no such issue when querying and feeding it.

Did anyone run into a similar issue?

Platform: Windows XP SP3 - python 2.5.2 - gae sdk: 1.1.1


nchauvat (Logilab)  
From: "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
Date: Sun, 3 Aug 2008 10:14:22 -0700 (PDT)
Local: Sun, Aug 3 2008 1:14 pm
Subject: Re: performance issue with SDK datastore with large volume (>1000 rows)
On 3 Aug, 19:01, blep <baptiste.lepill...@gmail.com> wrote:

> I'm running into some performance issue with the datastore stub
> provided with the SDK: ...

This was reported before and the only answer was "SDK is for making
development easier, not for simulating the performance of the actual
production environment".

> The original dataset is stored in a Python bsddb database (250 MB), and I have no such issue when querying and feeding it.

You have the source code, so you could look for the bottleneck and
fix it, but I would suggest running performance tests on the actual
server instead, since the local db will never match the performance of
the actual servers anyway. If it does not block at 1e3, it will block
at 1e6 or 1e9.

Since you say that the underlying bsddb does not have the same speed
issues, you might open a ticket for tracking this at
http://code.google.com/p/googleappengine/issues/list


blep  
From: blep <baptiste.lepill...@gmail.com>
Date: Sun, 3 Aug 2008 11:19:15 -0700 (PDT)
Local: Sun, Aug 3 2008 2:19 pm
Subject: Re: performance issue with SDK datastore with large volume (>1000 rows)
On 3 Aug, 19:14, "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
wrote:

> On 3 Aug, 19:01, blep <baptiste.lepill...@gmail.com> wrote:

> > I'm running into some performance issue with the datastore stub
> > provided with the SDK: ...

> This was reported before and the only answer was "SDK is for making
> development easier, not for simulating the performance of the actual
> production environment".

I know that. I'm not trying to simulate performance, just to load
enough data into the local devenv to be able to run basic tests
(checking/debugging algorithms...). I've already seen that the
performance of the production environment is vastly different from
that of my local one.

> > The original dataset is stored in a Python bsddb database (250 MB), and I have no such issue when querying and feeding it.

> You have the source code, so you could look for the bottleneck and
> fix it, but I would suggest running performance tests on the actual
> server instead, since the local db will never match the performance of
> the actual servers anyway. If it does not block at 1e3, it will block
> at 1e6 or 1e9.

I don't expect the local test system to be able to play around with
terabytes of data like the production environment could. But currently,
my local datastore is barely 2 MB and has a few thousand rows. Being
able to handle at least 100,000 rows locally seems like a reasonable
target to me. This of course implies being able to insert them in a
reasonable time.

I just had a quick look at the code in datastore_file_stub.py, and it
seems that _Dynamic_Put(), which if I guess correctly is called
somehow by model.put(), calls __WriteDatastore(), which seems to
simply pickle all entities into a new file each time. So unless I'm
mistaken, the stub implementation goes through all the entities and
pickles them into a new file each time a put/commit is done, which
explains the linear increase in the time taken to put an entity...
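
The behaviour described above can be illustrated with a toy model. This is only a sketch of the pattern (rewrite the whole pickle on every put), not the SDK's actual code: the class `RewriteOnPutStore` and the helper `pickled_after` are hypothetical names, and it is written for Python 3 rather than the Python 2.5 used in this thread.

```python
import os
import pickle
import tempfile


class RewriteOnPutStore(object):
    """Toy store (hypothetical) that re-pickles every entity on each put."""

    def __init__(self, path):
        self.path = path
        self.entities = {}
        self.entities_pickled = 0  # running total of entities serialized

    def put(self, key, value):
        self.entities[key] = value
        self._write_all()

    def _write_all(self):
        # Serialize the whole dict into a new file, then swap it in.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "wb") as handle:
            pickle.dump(self.entities, handle, pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, self.path)
        self.entities_pickled += len(self.entities)


def pickled_after(n):
    """Return how many entities were serialized in total after n puts."""
    with tempfile.TemporaryDirectory() as directory:
        store = RewriteOnPutStore(os.path.join(directory, "datastore.pickle"))
        for i in range(n):
            store.put("key%d" % i, {"value": i})
        return store.entities_pickled
```

With this model the k-th put serializes k entities, so n puts serialize n(n+1)/2 entities in total: each put costs time proportional to the current row count, which matches the linear growth per insertion reported above.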

> Since you say that the underlying bsddb does not have the same speed
> issues, you might open a ticket for tracking this at
> http://code.google.com/p/googleappengine/issues/list

I'm just saying that bsddb, which ships with the standard Python
distribution, does not have this issue (I'm using it with more than
250K rows without any performance issue). The SDK could use it.

Do you know if there is already an open issue for this? I could not
find one, and you said this has already been reported...


nchauvat (Logilab)  
From: "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
Date: Sun, 3 Aug 2008 13:06:57 -0700 (PDT)
Local: Sun, Aug 3 2008 4:06 pm
Subject: Re: performance issue with SDK datastore with large volume (>1000 rows)
On 3 Aug, 20:19, blep <baptiste.lepill...@gmail.com> wrote:

> I know that. I'm not trying to simulate performance, just to load
> enough data into the local devenv to be able to run basic tests
> (checking/debugging algorithms...). I've already seen that the
> performance of the production environment is vastly different from
> that of my local one.

I misunderstood your question then.

> I just had a quick look at the code in datastore_file_stub.py, and it
> seems that _Dynamic_Put(), which if I guess correctly is called
> somehow by model.put(), calls __WriteDatastore(), which seems to
> simply pickle all entities into a new file each time. So unless I'm
> mistaken, the stub implementation goes through all the entities and
> pickles them into a new file each time a put/commit is done, which
> explains the linear increase in the time taken to put an entity...
> [...]
> I'm just saying that bsddb, which ships with the standard Python
> distribution, does not have this issue (I'm using it with more than
> 250K rows without any performance issue). The SDK could use it.

> Do you know if there is already an open issue for this? I could not
> find one, and you said this has already been reported...

Sounds like you have almost solved it already. This issue,
http://code.google.com/p/googleappengine/issues/detail?id=390, is the
same problem as the one you are about to fix. I suggest you add the
above information there and post a patch if you ever find time to
write one (I would not be surprised if the Google gang were to include
it in the next release, btw).

blep  
From: blep <baptiste.lepill...@gmail.com>
Date: Mon, 4 Aug 2008 01:31:28 -0700 (PDT)
Local: Mon, Aug 4 2008 4:31 am
Subject: Re: performance issue with SDK datastore with large volume (>1000 rows)

On 3 Aug, 22:06, "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
wrote:

> On 3 Aug, 20:19, blep <baptiste.lepill...@gmail.com> wrote:
> [...]
> > Do you know if there is already an open issue for this? I could not
> > find one, and you said this has already been reported...

> Sounds like you have almost solved it already. This issue,
> http://code.google.com/p/googleappengine/issues/detail?id=390, is the
> same problem as the one you are about to fix. I suggest you add the
> above information there and post a patch if you ever find time to
> write one (I would not be surprised if the Google gang were to include
> it in the next release, btw).

I've made a small patch, which is more a workaround than anything
else. I added an option so that the file is only saved every N
seconds. I now get "constant" insertion time of 0.1 s per row. The
only potential issue is memory usage, with a 10 MB pickled file
using more than 300 MB of memory, but that is easier to deal with. From
what I've seen, it should be possible to use an embedded database
backend, as the only requirement is to be able to update multiple
key/value pairs in a single transaction. However, I've seen thread
locking, and such databases usually cannot be accessed from multiple
threads... Is the devenv server multi-threaded?
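
For readers without the patch at hand, the "save only every N seconds" idea can be sketched as follows. This is not the patch itself: `ThrottledStore`, `save_interval`, the injectable `clock` and `demo` are all hypothetical names chosen for illustration, and the code targets Python 3.

```python
import os
import pickle
import tempfile
import time


class ThrottledStore(object):
    """Toy store (hypothetical) that writes its pickle at most every N seconds."""

    def __init__(self, path, save_interval=10.0, clock=time.time):
        self.path = path
        self.save_interval = save_interval
        self.clock = clock  # injectable so the behaviour can be tested
        self.entities = {}
        self.last_save = clock()
        self.dirty = False
        self.saves = 0  # number of times the file was written

    def put(self, key, value):
        self.entities[key] = value
        self.dirty = True
        # Only pay the full-rewrite cost when the interval has elapsed.
        if self.clock() - self.last_save >= self.save_interval:
            self.flush()

    def flush(self):
        """Write pending changes, if any; call this on shutdown as well."""
        if not self.dirty:
            return
        with open(self.path, "wb") as handle:
            pickle.dump(self.entities, handle, pickle.HIGHEST_PROTOCOL)
        self.last_save = self.clock()
        self.dirty = False
        self.saves += 1


def demo():
    """Do 35 puts, one per simulated second, and return the file write count."""
    now = [0]
    with tempfile.TemporaryDirectory() as directory:
        store = ThrottledStore(os.path.join(directory, "ds.pickle"),
                               save_interval=10, clock=lambda: now[0])
        for second in range(35):
            now[0] = second
            store.put("key%d" % second, second)
        store.flush()  # final save of whatever is still pending
        return store.saves
```

In `demo()`, 35 puts cause only 4 file writes (at simulated seconds 10, 20 and 30, plus the final flush) instead of 35. The trade-off is that anything put since the last save is lost if the process dies before the next flush.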

Dumb question: how do you "star" an issue? I haven't found anything in
the help (or I missed it)...


nchauvat (Logilab)  
From: "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
Date: Mon, 4 Aug 2008 03:38:34 -0700 (PDT)
Local: Mon, Aug 4 2008 6:38 am
Subject: Re: performance issue with SDK datastore with large volume (>1000 rows)
On 4 Aug, 10:31, blep <baptiste.lepill...@gmail.com> wrote:

> I've made a small patch, which is more a work-around than anything
> ...

Nice.

> what I've seen, it should be possible to use an embedded database
> backend, as the only requirement is to be able to update multiple
> key/value pairs in a single transaction. However, I've seen thread
> locking, and such databases usually cannot be accessed from multiple
> threads... Is the devenv server multi-threaded?

That sounds like a new project: implement a better backend for
dev_appserver :)
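
As a rough idea of what such a backend would need, here is a minimal sketch of the stated requirement (updating several key/value pairs in one transaction). It uses Python's standard `sqlite3` module as a stand-in for the bsddb mentioned in the thread, which is an assumption made purely for illustration; the class `SqliteKeyValueStore` and its table layout are hypothetical and unrelated to the SDK.

```python
import pickle
import sqlite3


class SqliteKeyValueStore(object):
    """Toy key/value store (hypothetical) backed by an embedded database."""

    def __init__(self, path=":memory:"):
        self.connection = sqlite3.connect(path)
        with self.connection:
            self.connection.execute(
                "CREATE TABLE IF NOT EXISTS entities "
                "(key TEXT PRIMARY KEY, value BLOB NOT NULL)")

    def put_many(self, items):
        """Insert or overwrite several key/value pairs in one transaction."""
        rows = [(key, pickle.dumps(value)) for key, value in items.items()]
        # The connection context manager commits on success and rolls
        # back if any statement raises, so the update is all-or-nothing.
        with self.connection:
            self.connection.executemany(
                "INSERT OR REPLACE INTO entities (key, value) VALUES (?, ?)",
                rows)

    def get(self, key):
        """Return the stored value for key, or None if it is absent."""
        row = self.connection.execute(
            "SELECT value FROM entities WHERE key = ?", (key,)).fetchone()
        return pickle.loads(row[0]) if row is not None else None
```

Each put then touches only the rows being written instead of rewriting the whole file. On the threading question, a `sqlite3` connection by default refuses to be used from a thread other than the one that created it, so a multi-threaded server would need one connection per thread or explicit locking.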

> Dumb question: how do you "star" an issue? I haven't found anything in
> the help (or I missed it)...

When you are logged in, click on the white star to the left of the
title when viewing the issue details.

Aral Balkan  
From: Aral Balkan <aralbal...@gmail.com>
Date: Mon, 4 Aug 2008 06:26:14 -0700 (PDT)
Local: Mon, Aug 4 2008 9:26 am
Subject: Re: performance issue with SDK datastore with large volume (>1000 rows)
Hey blep,

> I've made a small patch, which is more a workaround than anything
> else. I added an option so that the file is only saved every N
> seconds. I now get "constant" insertion time of 0.1 s per row.

That's awesome. Downloading the patch now.

This is going to help so much with restoring datastore backups
locally. (I've just gotten datastore backups working).

> databases usually cannot be accessed from multiple threads... Is the
> devenv server multi-threaded?

Nope. It handles just one request at a time.

Thanks again for the patch! :)

Aral


blep  
From: blep <baptiste.lepill...@gmail.com>
Date: Mon, 4 Aug 2008 13:29:37 -0700 (PDT)
Local: Mon, Aug 4 2008 4:29 pm
Subject: Re: performance issue with SDK datastore with large volume (>1000 rows)
On 4 Aug, 15:26, Aral Balkan <aralbal...@gmail.com> wrote:

> Hey blep,

> > I've made a small patch, which is more a workaround than anything
> > else. I added an option so that the file is only saved every N
> > seconds. I now get "constant" insertion time of 0.1 s per row.

> That's awesome. Downloading the patch now.

Note that you may run into another "performance" issue with the
datastore stub: from what I've seen, query processing is also linear in
complexity (it loops over all the entities of a given type and applies
the filters to each one). But this should not be as much of a show
stopper as writing the file on each put, since all entities are in memory.
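
The query behaviour described above amounts to a full scan per query. A minimal sketch of that pattern, assuming entities are plain dicts grouped by kind and filters are `(property, operator, value)` tuples; `run_query` and `OPERATORS` are hypothetical names, not the stub's API:

```python
import operator

# Comparison operators supported by this toy filter syntax (hypothetical).
OPERATORS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def run_query(entities_by_kind, kind, filters):
    """Return the entities of `kind` that satisfy every filter.

    Every entity of the kind is visited once, so the cost grows
    linearly with the number of entities of that kind.
    """
    matches = []
    for entity in entities_by_kind.get(kind, []):
        if all(
            prop in entity and OPERATORS[op](entity[prop], value)
            for prop, op, value in filters
        ):
            matches.append(entity)
    return matches
```

For example, `run_query(store, "Book", [("year", ">", 2000)])` checks every `Book` entity even if only one matches. Since there is no index, the cost does not depend on how selective the filter is.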
