Datastore backup solution (almost ready)
From: Aral <a...@aralbalkan.com>
Date: Sun, 3 Aug 2008 12:53:23 -0700 (PDT)
Local: Sun, Aug 3 2008 3:53 pm
Subject: Datastore backup solution (almost ready)
I've been able to back up my deployment datastore on Google App Engine
and restore it into the local SDK but I'm running into quite a few
issues -- it's definitely not stable or ready for production.

Basically, what I'm doing is this:

I break down everything into tiny little tasks, getting data out of
the datastore and generating Python code to restore it (this is the
backup).

Then, I run the generated code to restore.

It works perfectly when testing locally with the SDK.

It works _to an extend_ when backing up the deployment environment.
I'm tweaking the settings to try and get everything stable. Currently,
it backs up 5 rows at a time before redirecting (and that's not
generating High CPU/Over Quota errors). However, I am randomly getting
InternalErrors. Does anyone know what could cause this (a high number of
puts over a short period of time)?

Currently, the best I've gotten is 1,227 rows out of the datastore
which I've successfully restored in the local SDK. Of course, due to
how slow the SDK gets when putting 100s of entries, it did take
forever for the restore to complete (but it did).

If I can solve the stability issues with the backup, I am planning on
making this into a Django app that you can include in your own
applications.

In the meanwhile, if the good folks at Google are about to release a
much better solution, I'd appreciate a heads up so I can devote my
efforts to building my app again instead of building infrastructure.

Is anyone else working on something similar? Any ideas on the
InternalErrors? Thoughts/suggestions?

Thanks,
Aral


From: Aral <a...@aralbalkan.com>
Date: Sun, 3 Aug 2008 12:57:26 -0700 (PDT)
Local: Sun, Aug 3 2008 3:57 pm
Subject: Re: Datastore backup solution (almost ready)
Oh, and before I forget:

Due to the 1MB limit on datastore entities, I'm breaking up the
generated code into shards. And due to the 1MB limit on data
structures in general, I can't even stitch and return the code in a
single piece so you have to do that manually currently (looking into
whether I can stream it). (Trying to create a string over 1MB gives
MemoryErrors).
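
A minimal sketch of the sharding idea described above, assuming the generated code is available as a list of lines. The function name and the default shard size are illustrative assumptions, not the actual implementation; the only requirement taken from the post is that each shard stays well under 1MB.

```python
def split_into_shards(lines, max_bytes=300 * 1024):
    """Group lines of generated code into shards of at most max_bytes each.

    max_bytes is an assumed, illustrative limit chosen to stay below 1MB.
    A single line longer than max_bytes still becomes its own shard.
    """
    shards, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_size > max_bytes:
            shards.append("\n".join(current) + "\n")
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        shards.append("\n".join(current) + "\n")
    return shards


# Hypothetical generated lines, split with a small limit for demonstration.
lines = ["restore(%d)" % i for i in range(1000)]
shards = split_into_shards(lines, max_bytes=1024)
```

Splitting on line boundaries keeps every shard a syntactically complete piece of code, so the shards can later be concatenated in order without repair.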

About the InternalErrors: I'm wondering if putting a small amount of
sleep() in the calls will help with these (if they're caused by too
many puts over a short period of time). Does anyone know what impact
sleep() has on CPU use? Is CPU use calculated directly from the amount
of time a call takes or from what it does?

Thanks,
Aral

On Aug 3, 8:53 pm, Aral <a...@aralbalkan.com> wrote:

> I've been able to back up my deployment datastore on Google App Engine
> and restore it into the local SDK but I'm running into quite a few
> issues -- it's definitely not stable or ready for production.

<snip>

From: Aral <a...@aralbalkan.com>
Date: Sun, 3 Aug 2008 12:59:01 -0700 (PDT)
Local: Sun, Aug 3 2008 3:59 pm
Subject: Re: Datastore backup solution (almost ready)

> It works _to an extend_ when backing up the deployment environment.

I meant "_to an extent_" :) Proof read, Aral, proof read!

From: "nchauvat (Logilab)" <nicolas.chau...@logilab.fr>
Date: Sun, 3 Aug 2008 13:11:45 -0700 (PDT)
Local: Sun, Aug 3 2008 4:11 pm
Subject: Re: Datastore backup solution (almost ready)
On Aug 3, 21:53, Aral <a...@aralbalkan.com> wrote:

> [...]
> Currently, the best I've gotten is 1,227 rows out of the datastore
> which I've successfully restored in the local SDK. Of course, due to
> how slow the SDK gets when putting 100s of entries, it did take
> forever for the restore to complete (but it did).
> [...]
> Is anyone else working on something similar?

I suggest you read http://groups.google.com/group/google-appengine/browse_thread/thread/...
regarding the SDK slowness.

From: "Barry Hunter" <barrybhun...@googlemail.com>
Date: Sun, 3 Aug 2008 21:45:27 +0100
Local: Sun, Aug 3 2008 4:45 pm
Subject: Re: [google-appengine] Datastore backup solution (almost ready)

On Sun, Aug 3, 2008 at 8:53 PM, Aral <a...@aralbalkan.com> wrote:

<SNIP>

> In the meanwhile, if the good folks at Google are about to release a
> much better solution, I'd appreciate a heads up so I can devote my
> efforts to building my app again instead of building infrastructure.

They probably are:
http://groups.google.com/group/google-appengine/browse_thread/thread/...

> Thanks,
> Aral

--
Barry

- www.nearby.org.uk - www.geograph.org.uk -


From: "Jorge Vargas" <jorge.var...@gmail.com>
Date: Sun, 3 Aug 2008 15:57:19 -0600
Local: Sun, Aug 3 2008 5:57 pm
Subject: Re: [google-appengine] Datastore backup solution (almost ready)

I don't see how this is a big issue. I wrote some code a while ago
that could be transformed into a backup utility. All you need is a view
that will call entity.to_xml() and write that to a file for each entity
in the db. Sure, it's ugly, but I didn't use it for a backup; I'm
actually using another (desktop) client. Where it gets
complicated is with relationships and "foreign keys" because as far as
I know there is no way to reproduce the exact same key.


From: Aral Balkan <aralbal...@gmail.com>
Date: Mon, 4 Aug 2008 00:55:55 -0700 (PDT)
Local: Mon, Aug 4 2008 3:55 am
Subject: Re: Datastore backup solution (almost ready)
Hi Nicolas,

> I suggest you read http://groups.google.com/group/google-appengine/browse_thread/thread/...
> regarding the SDK slowness.

Thanks for the link. I'd come across it yesterday and I agree with
blep on that thread: the slowness is affecting our ability to test
with real data (not to test performance but to test _functionality_).

I've added a note to that effect to the related issue also
(http://code.google.com/p/googleappengine/issues/detail?id=390).

Aral


From: Aral Balkan <aralbal...@gmail.com>
Date: Mon, 4 Aug 2008 01:06:04 -0700 (PDT)
Local: Mon, Aug 4 2008 4:06 am
Subject: Re: Datastore backup solution (almost ready)
Hi Barry,

I replied on that thread but I could only see a "Reply to author"
option, so here's what I wrote there:

Hi Pete,

The solution that I'm working on for this involves backing up the
datastore to Python code which can then be run to restore the data.

I've successfully done this with my app and I'm working on ironing out
a final issue (the max number of redirects getting exhausted while
running the backup script).

Otherwise, it works really well and meets my needs for the following
use cases:

1. Backup data from deployment server
2. Restore data on local SDK for testing
3. Restore data on a separate App Engine instance to use as a staging
server

The way I've done it is to break everything into little pieces. You
hit a backup handler in Django that kicks off the process. It gets 5
rows out of the datastore at a time, generates the code to restore
them, and saves that in Text shards of around 300KB each in the
datastore.
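
As a rough sketch of the "back up to Python code" step, assuming each row is available as a plain dict of simple values: the helper names, the `Game` kind, and the row layout are all hypothetical. It relies on `repr()` round-tripping the values, which only holds for basic types such as strings and numbers.

```python
def generate_restore_code(kind, rows):
    """Return Python source lines that recreate the given rows.

    Each line is a constructor call followed by .put(); this assumes the
    restore script defines a class named `kind` with a put() method.
    """
    lines = []
    for row in rows:
        args = ", ".join("%s=%r" % (name, row[name]) for name in sorted(row))
        lines.append("%s(%s).put()" % (kind, args))
    return lines


def batches(rows, size=5):
    """Yield the rows in small batches (5 at a time, as in the post)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical data standing in for rows fetched from the datastore.
rows = [{"name": "game %d" % i, "score": i} for i in range(12)]
code = []
for batch in batches(rows):
    code.extend(generate_restore_code("Game", batch))
```

In the setup described above, each batch would be handled in its own request, with the generated lines appended to the current shard before redirecting to the next batch.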

Of course, I ran into all sorts of limits while testing this out. To
work around short-term Over Quota errors, I reduced the number of
rows dealt with to 5. That seems sustainable (but my app's quotas
_have_ been raised). To work around MemoryErrors, I implemented the
code shards so that no data structure is over 1MB in size. Finally, I
tried using a generator in my HttpResponse to stitch the code shards
into a single .py file to be downloaded but I hit the 1MB limit on
responses ("HTTP response was too large: 3457738. The limit is:
1048576") so you currently have to stitch the output into a single
file yourself.
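
The manual stitching step could look something like the following once the shards have been downloaded. This is a sketch under the assumption that each shard is stored with a numeric index; the function names and sample shards are hypothetical. A generator is used so that only one shard is held in memory at a time.

```python
import io


def iter_shards(shards):
    """Yield shard contents in index order, one shard at a time."""
    for index in sorted(shards):
        yield shards[index]


def stitch(shards, out):
    """Write all shards, in order, to a file-like object."""
    for chunk in iter_shards(shards):
        out.write(chunk)


# Hypothetical shards keyed by index, deliberately out of order.
shards = {1: "b = 2\n", 0: "a = 1\n", 2: "c = a + b\n"}
out = io.StringIO()
stitch(shards, out)
```

Replacing `io.StringIO()` with an opened `.py` file gives a single restore script on disk.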

I also haven't implemented an iterative restore process yet, which
will be necessary to populate other App Engine instances.

Anyway, all this to say that I feel that backing up to Python code
that can be run to restore the data is, IMHO, the most practical
route.

Aral

On Apr 15, 6:44 am, Pete <pkoo...@google.com> wrote:


From: Aral Balkan <aralbal...@gmail.com>
Date: Mon, 4 Aug 2008 01:18:50 -0700 (PDT)
Local: Mon, Aug 4 2008 4:18 am
Subject: Re: Datastore backup solution (almost ready)
Hi Jorge,

> I don't see how this is a big issue.

Being able to move data off of my deployment server is important for
several reasons:

1. Backups (data safety)
2. Testing (testing locally with real data)
3. Staging (having a staging server with real data for testing on the
App Engine environment without testing on your deployment instance)

> I wrote some code a while ago
> that could be transformed into a backup utility. All you need is a view
> that will call entity.to_xml() and write that to a file for each entity
> in the db.

It's not that simple if you want a _manageable_ solution that you can
_restore_ easily :) I'm not concerned with simply backing up the data
if I cannot easily restore it. And, as I mentioned above, I want to be
able to restore that data not just to the deployment instance but to
(a) the local SDK and, (b) to a separate App Engine instance to be
used as a staging environment.

> where it gets
> complicated is with relationships and "foreign keys" because as far as
> I know there is no way to reproduce the exact same key.

You don't have to reproduce the exact same key. I use the actual key
as the key name when generating the new keys. Any entities with the
original key are removed prior to the creation of the new key.
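
A small sketch of this idea, using a plain dict as a stand-in for the datastore. The `"k"` prefix, the function names, and the sample key string are assumptions made for illustration; the post only says that the original key is used as the key name and that any existing copy is removed first.

```python
def restored_key_name(original_key):
    """Derive a key name from the original key string (prefix is assumed)."""
    return "k" + original_key


def restore(store, kind, original_key, properties):
    """Insert an entity under a name derived from its original key.

    Any entity previously restored from the same original key is removed
    first, so running the restore twice does not create duplicates.
    """
    name = restored_key_name(original_key)
    store.pop((kind, name), None)
    store[(kind, name)] = dict(properties)
    return (kind, name)


store = {}  # stand-in for the datastore
first = restore(store, "Game", "agdteWFwcA", {"title": "Chess"})
second = restore(store, "Game", "agdteWFwcA", {"title": "Chess v2"})
```

Because the name is a pure function of the original key, references can be rewritten the same way on any target instance.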

The way I am handling references is to make sure that I observe the
source order of the model classes (I tried using inspect to do this
but it relies on imp and os.readlink -- monkeypatching these worked on
the local SDK but not on the deployment environment so I ended up
simply reading in the source file myself and running a regex on it.)
Since the reference properties require a model to be defined before it
is referenced, this guarantees that the referred to entities are
created before the entities that reference them. This is working fine
currently.
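
The "read the source file and run a regex on it" step might be sketched as follows. The pattern is an assumption: it only matches classes that subclass `db.Model` directly and are written exactly as `db.Model` at the start of a line. The sample source is hypothetical and is only scanned as text, never imported.

```python
import re

# Matches top-level "class Name(db.Model):" declarations.
MODEL_CLASS = re.compile(
    r"^class\s+(\w+)\s*\(\s*db\.Model\s*\)\s*:", re.MULTILINE
)


def model_names_in_source_order(source):
    """Return model class names in the order they appear in the source."""
    return MODEL_CLASS.findall(source)


source = '''
from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Book(db.Model):
    author = db.ReferenceProperty(Author)

class Helper(object):
    pass
'''
order = model_names_in_source_order(source)
```

Backing up the kinds in this order means referenced entities such as `Author` are restored before the entities that point at them.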

As soon as I fix the max redirection issue (raising the
network.http.redirection-limit in FireFox didn't work so I'm going to
try the META refresh approach instead), I should have a working proof
of concept. Once I have that stable, I'll start work on making it a
generic Django solution that you can pop into any existing app.

Once that's done, we should be able to back up and restore data and
that will allow us to test locally with real data and to create
staging servers.

Aral


From: Aral Balkan <aralbal...@gmail.com>
Date: Mon, 4 Aug 2008 06:29:06 -0700 (PDT)
Local: Mon, Aug 4 2008 9:29 am
Subject: Re: Datastore backup solution (almost ready)
Just to update the thread, blep now has a patch that remedies the
local SDK slowness considerably.

See: http://groups.google.com/group/google-appengine/browse_thread/thread/...

Aral


From: Aral Balkan <aralbal...@gmail.com>
Date: Mon, 4 Aug 2008 06:30:18 -0700 (PDT)
Local: Mon, Aug 4 2008 9:30 am
Subject: Re: Datastore backup solution (almost ready)

> As soon as I fix the max redirection issue (raising the
> network.http.redirection-limit in FireFox didn't work so I'm going to
> try the META refresh approach instead), I should have a working proof
> of concept. Once I have that stable, I'll start work on making it a
> generic Django solution that you can pop into any existing app.

The proof of concept is working properly now with my app. Going to
implement blep's patch and try a local SDK restore of the full backup
from the datastore (2360 rows of data).

Aral


From: "Chris Marasti-Georg" <c.marastige...@gmail.com>
Date: Tue, 5 Aug 2008 08:38:39 -0400
Local: Tues, Aug 5 2008 8:38 am
Subject: Re: [google-appengine] Re: Datastore backup solution (almost ready)

When I update my model, I have an update.py script that I run.  Depending on
the type of update I'll update anywhere from 5-40 rows of data, then output
a new page that points back to the action with start=40 or whatever the next
row to process is.  I set a timeout in the body.onload function to click the
link after 1 or 2 seconds, and I haven't had quota problems.  When I had the
onload function click the link right away, I would run into server errors
after about 10 successive clicks.

Actual code:

    # Emit a page whose onload handler follows the "More" link after 2 seconds.
    self.response.out.write("""<html><body
onload="setTimeout('window.location=document.getElementById(\\\'next\\\').href',2000);">Updated
"""+str(count)+" Games.")
    if count < 15:
      self.response.out.write("  You're done!")
    else:
      self.response.out.write("  <a id='next' href='update?action=update_games'>More</a>")
    self.response.out.write("</body></html>")
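
The batching logic behind this kind of handler can be sketched independently of the web framework. The function below is a hypothetical illustration, not the actual update.py: it processes one batch per call and returns the `start` value for the next request, or `None` when there is nothing left.

```python
def process_batch(rows, start, batch_size, update):
    """Apply update() to one batch and return the next start offset.

    Returns None once the last row has been processed, which is the
    signal to stop emitting a "More" link.
    """
    batch = rows[start:start + batch_size]
    for row in batch:
        update(row)
    next_start = start + len(batch)
    return next_start if next_start < len(rows) else None


def mark_done(row):
    row["done"] = True


# Simulate successive requests over 95 hypothetical rows, 40 per request.
rows = [{"id": i, "done": False} for i in range(95)]
start, requests = 0, 0
while start is not None:
    start = process_batch(rows, start, 40, mark_done)
    requests += 1
```

In a real handler the loop would be replaced by the browser following the link, with `start` carried in the query string.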


From: "Jorge Vargas" <jorge.var...@gmail.com>
Date: Wed, 6 Aug 2008 23:30:33 -0600
Local: Thurs, Aug 7 2008 1:30 am
Subject: Re: [google-appengine] Re: Datastore backup solution (almost ready)

Well, the restore part of that script was to take the XML, parse it
(with ElementTree in my case), and then just pass that to the
constructor of the datastore objects.
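
A minimal sketch of that restore path using the standard library's ElementTree. The XML layout here is a simplified, assumed format and not the one App Engine's XML export produces; the `Game` class and `KINDS` mapping are hypothetical stand-ins for datastore model classes. All property values come back as strings, so a real restore would need type conversion.

```python
import xml.etree.ElementTree as ET


class Game(object):
    """Stand-in for a datastore model class."""

    def __init__(self, **properties):
        self.properties = properties


# Maps the kind named in the XML to the class used to rebuild it.
KINDS = {"Game": Game}


def restore_from_xml(xml_text):
    """Parse the (assumed) XML layout and build one object per entity."""
    entities = []
    for node in ET.fromstring(xml_text):
        kind = KINDS[node.get("kind")]
        properties = {
            child.get("name"): child.text for child in node.findall("property")
        }
        entities.append(kind(**properties))
    return entities


xml_text = """<entities>
  <entity kind="Game">
    <property name="title">Chess</property>
    <property name="players">2</property>
  </entity>
</entities>"""
games = restore_from_xml(xml_text)
```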

With that method I don't have any real slowness, although I haven't
tested with millions of records.

> And, as I mentioned above, I want to be
> able to restore that data not just to the deployment instance but to
> (a) the local SDK and, (b) to a separate App Engine instance to be
> used as a staging environment.

Why would you need a staging environment? The way versions are handled
in App Engine, you could deploy both on the same account. You have that
built in, with the admin console and setting the correct version to be
deployed.

>> where it gets
>> complicated is with relationships and "foreign keys" because as far as
>> I know there is no way to reproduce the exact same key.

> You don't have to reproduce the exact same key. I use the actual key
> as the key name when generating the new keys. Any entities with the
> original key are removed prior to the creation of the new key.

Then it's not a backup, just a duplicate. Be careful to find a way to
check if the db has already been restored on that instance; otherwise
you take the chance of having a DB with each piece of information
twice.

> The way I am handling references is to make sure that I observe the
> source order of the model classes (I tried using inspect to do this
> but it relies on imp and os.readlink -- monkeypatching these worked on
> the local SDK but not on the deployment environment so I ended up
> simply reading in the source file myself and running a regex on it.)

Why do you need that? If you are going to use the db, then import it and
use it; if you want, clone the file inside the backup. Since it's all
App Engine, you could duplicate the module somewhere in the backup data
and then use the __import__ function
(http://docs.python.org/lib/built-in-funcs.html). After all, the model is
part of the backup, isn't it?

if you are storing to python code, which means you are doing some sort
of code generation, then it means you are creating very big files of
constructor calls, are you sure the slowness isn't due to this
approach? depending on the code you could be eating up a lot of memory
on both sides of the transaction.

> Since the reference properties require a model to be defined before it
> is referenced, this guarantees that the referred to entities are
> created before the entities that reference them. This is working fine
> currently.

yes, this is one of the limitations of appengine's db, which is an
advantage here.

> As soon as I fix the max redirection issue (raising the
> network.http.redirection-limit in FireFox didn't work so I'm going to
> try the META refresh approach instead), I should have a working proof
> of concept. Once I have that stable, I'll start work on making it a
> generic Django solution that you can pop into any existing app.

I see no reason why that's happening, are you doing a page call for
each object?


From: Aral <a...@aralbalkan.com>
Date: Thu, 7 Aug 2008 11:13:59 -0700 (PDT)
Local: Thurs, Aug 7 2008 2:13 pm
Subject: Re: Datastore backup solution (almost ready)
Hi Jorge,

> Well, the restore part of that script was to take the XML, parse it
> (with ElementTree in my case), and then just pass that to the
> constructor of the datastore objects.

> With that method I don't have any real slowness, although I haven't
> tested with millions of records.

You don't need millions to run into App Engine quota limits :) Have
you tested with thousands?

> Why would you need a staging environment? The way versions are handled
> in App Engine, you could deploy both on the same account. You have that
> built in, with the admin console and setting the correct version to be
> deployed.

Versions are nice but they are just that: versions of the same
application. What I want is a publicly deployed version and one where
I can test changes on the live environment but not on my live instance
(i.e., you would not want your users to use the staging "version"
while you tested). In other words, I need a staging environment for
all the reasons you normally need a staging environment for :)

> Then it's not a backup, just a duplicate. Be careful to find a way to
> check if the db has already been restored on that instance; otherwise
> you take the chance of having a DB with each piece of information
> twice.

The rows check for existing versions and replace them so duplicates
are not created.

It's not an exact duplicate/backup since there is no way to create the
exact same keys and IDs as on the deployment server.

> if you are storing to python code, which means you are doing some sort
> of code generation, then it means you are creating very big files of
> constructor calls, are you sure the slowness isn't due to this
> approach? depending on the code you could be eating up a lot of memory
> on both sides of the transaction.

I am generating code but there isn't any real slowness there.

> yes, this is one of the limitations of appengine's db, which is an
> advantage here.

Yes :)

> I see no reason why that's happening, are you doing a page call for
> each object?

Firefox's default limit is to display an error after 20 redirects. The
solution I went for eventually is to use META refresh which means that
I can give a visual update and do not run into redirection limits.
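
For illustration, a page using META refresh instead of an HTTP redirect could be generated along these lines. The function name, the `/backup?start=5` URL, and the message text are hypothetical; only the use of a `<meta http-equiv="refresh">` tag comes from the description above.

```python
from html import escape


def refresh_page(next_url, message, delay_seconds=0):
    """Build an HTML page that shows progress and then loads next_url.

    The browser follows a META refresh as a normal page load, so it does
    not count against the HTTP redirect limit.
    """
    return (
        "<html><head>"
        '<meta http-equiv="refresh" content="%d;url=%s">'
        "</head><body>%s</body></html>"
    ) % (delay_seconds, escape(next_url, quote=True), escape(message))


page = refresh_page("/backup?start=5", "Backed up 5 rows so far")
```

Each response both reports progress in the body and schedules the next step, which is what makes the visual update possible.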

I realize that code generation is not the traditional approach to
backups/restores but it works very well for Google App Engine's unique
constraints.

Unfortunately, the one big issue is maintaining key() references saved
in ListProperty(db.Key) entities. That's where the inability to set
the exact same keys and the inability to set IDs or to set key_names
to existing IDs become a real problem. I've raised that on another
thread and would love any ideas anyone might have.

Otherwise, the only solution at the moment is to use a separate ID
field and always use that to refer to the entity if you want datastore
portability.
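
One way to picture the separate ID field, as a sketch only: every row carries its own stable identifier, and lists of references store those identifiers rather than datastore keys. Using `uuid` for the identifier, the `portable_id` field name, and the dict-based "restore" are all assumptions made for illustration.

```python
import uuid


def new_portable_id():
    """Create an identifier that does not depend on datastore keys."""
    return uuid.uuid4().hex


def restore(rows):
    """Simulate a restore in which every row receives a new datastore key."""
    return {"key-%d" % n: dict(row) for n, row in enumerate(rows, 1000)}


def find_by_portable_id(store, portable_id):
    """Look a row up by its portable ID rather than by its key."""
    for row in store.values():
        if row["portable_id"] == portable_id:
            return row
    return None


author = {"portable_id": new_portable_id(), "name": "Ann"}
book = {
    "portable_id": new_portable_id(),
    "title": "Notes",
    "author_ids": [author["portable_id"]],  # list of portable IDs, not keys
}
store = restore([author, book])
restored_book = find_by_portable_id(store, book["portable_id"])
restored_author = find_by_portable_id(store, restored_book["author_ids"][0])
```

The references survive even though every key changed, at the cost of a lookup by property instead of by key.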

Aral


From: Aral <a...@aralbalkan.com>
Date: Thu, 7 Aug 2008 11:15:55 -0700 (PDT)
Local: Thurs, Aug 7 2008 2:15 pm
Subject: Re: Datastore backup solution (almost ready)
Hi Chris,

Thanks for sharing your setup.

Here's mine: I've limited my puts to no more than 5 at a time and I
use a META refresh of 0 seconds: I don't get any short-term quota
errors or long-term ones when backing up > 2300 rows with that setup.

Aral

On Aug 5, 1:38 pm, "Chris Marasti-Georg" <c.marastige...@gmail.com>
wrote:

> When I update my model, I have an update.py script that I run.  Depending on
> the type of update I'll update anywhere from 5-40 rows of data, then output
> a new page that points back to the action with start=40 or whatever the next
> row to process is.  I set a timeout in the body.onload function to click the
> link after 1 or 2 seconds, and I haven't had quota problems.  When I had the
> onload function click the link right away, I would run into server errors
> after about 10 successive clicks.

<snip>
