Datastore backup solution (almost ready)


Aral

Aug 3, 2008, 3:53:23 PM
to Google App Engine
I've been able to back up my deployment datastore on Google App Engine
and restore it into the local SDK, but I'm running into quite a few
issues -- it's definitely not stable or ready for production.

Basically, what I'm doing is this:

I break down everything into tiny little tasks, getting data out of
the datastore and generating Python code to restore it (this is the
backup).

Then, I run the generated code to restore.
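
To give a rough idea, the generated restore script is essentially a
flat list of constructor calls, one per backed-up row -- something
along these lines (a simplified sketch, not the actual output; Game
and its properties are made-up placeholders, not my real models):

    from google.appengine.ext import db

    class Game(db.Model):
        # Made-up example model; the real script uses the app's own models.
        title = db.StringProperty()
        score = db.IntegerProperty()

    # One constructor call + put() per backed-up row. The original key
    # string is reused as the key_name, so re-running the script replaces
    # entities rather than duplicating them.
    Game(key_name='original-key-1', title=u'First game', score=12).put()
    Game(key_name='original-key-2', title=u'Second game', score=7).put()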

It works perfectly when testing locally with the SDK.

It works _to an extend_ when backing up the deployment environment.
I'm tweaking the settings to try and get everything stable. Currently,
it backs up 5 rows at a time before redirecting, and that's not
generating High CPU/Over Quota errors. However, I am randomly getting
InternalErrors. Does anyone know what could cause these (a high number
of puts over a short period of time?)

Currently, the best I've gotten is 1,227 rows out of the datastore
which I've successfully restored in the local SDK. Of course, due to
how slow the SDK gets when putting 100s of entries, it did take
forever for the restore to complete (but it did).

If I can solve the stability issues with the backup, I am planning on
making this into a Django app that you can include in your own
applications.

In the meantime, if the good folks at Google are about to release a
much better solution, I'd appreciate a heads up so I can devote my
efforts to building my app again instead of building infrastructure.

Is anyone else working on something similar? Any ideas on the
InternalErrors? Thoughts/suggestions?

Thanks,
Aral

Aral

Aug 3, 2008, 3:57:26 PM
to Google App Engine
Oh, and before I forget:

Due to the 1MB limit on datastore entities, I'm breaking up the
generated code into shards. And due to the 1MB limit on data
structures in general, I can't even stitch the code together and
return it in a single piece, so currently you have to do that manually
(I'm looking into whether I can stream it). (Trying to create a string
over 1MB gives MemoryErrors.)
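
The sharding itself is straightforward -- roughly along these lines (a
sketch, not the actual code; CodeShard is a hypothetical model name,
and the ~300KB limit is just the figure I'm using to stay well clear
of the 1MB cap):

    from google.appengine.ext import db

    SHARD_LIMIT = 300 * 1024  # stay comfortably under the 1MB entity limit

    class CodeShard(db.Model):
        # Hypothetical model holding one piece of the generated restore code.
        index = db.IntegerProperty()
        code = db.TextProperty()

    def append_code(snippet, shard_index):
        # Append one batch's generated code to the current shard, rolling
        # over to a new shard when it would grow past SHARD_LIMIT, so no
        # single string or entity ever approaches 1MB.
        shard = CodeShard.get_by_key_name('shard-%05d' % shard_index)
        if shard is None:
            shard = CodeShard(key_name='shard-%05d' % shard_index,
                              index=shard_index, code=db.Text(''))
        if len(shard.code) + len(snippet) > SHARD_LIMIT:
            shard_index += 1
            shard = CodeShard(key_name='shard-%05d' % shard_index,
                              index=shard_index, code=db.Text(''))
        shard.code = db.Text(shard.code + snippet)
        shard.put()
        return shard_index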

About the InternalErrors: I'm wondering if putting a small sleep()
between the calls will help with these (if they're caused by too many
puts over a short period of time). Does anyone know what impact
sleep() has on CPU use? Is CPU use calculated from the amount of time
a call takes, or from what it actually does?

Thanks,
Aral

On Aug 3, 8:53 pm, Aral <a...@aralbalkan.com> wrote:
> I've been able to backup my deployment datastore on Google App Engine
> and restore it into the local SDK but I'm running into quite a few
> issues -- it's definitely not stable or ready for production.
<snip>

Aral

Aug 3, 2008, 3:59:01 PM
to Google App Engine
> It works _to an extend_ when backing up the deployment environment.

I meant "_to an extent_" :) Proofread, Aral, proofread!

nchauvat (Logilab)

Aug 3, 2008, 4:11:45 PM
to Google App Engine
On Aug 3, 21:53, Aral <a...@aralbalkan.com> wrote:
> [...]
> Currently, the best I've gotten is 1,227 rows out of the datastore
> which I've successfully restored in the local SDK. Of course, due to
> how slow the SDK gets when putting 100s of entries, it did take
> forever for the restore to complete (but it did).
> [...]
> Is anyone else working on something similar?

I suggest you read http://groups.google.com/group/google-appengine/browse_thread/thread/eed4b8ae210ed6f5#
regarding the SDK slowness.

Barry Hunter

Aug 3, 2008, 4:45:27 PM
to google-a...@googlegroups.com
On Sun, Aug 3, 2008 at 8:53 PM, Aral <ar...@aralbalkan.com> wrote:
>

<SNIP>

> In the meanwhile, if the good folks at Google are about to release a
> much better solution, I'd appreciate a heads up so I can devote my
> efforts to building my app again instead of building infrastructure.
>

They probably are:
http://groups.google.com/group/google-appengine/browse_thread/thread/18d246b30e267da4/dc2a10eb6749339a?q=export+datastore&lnk=ol&


>
> Thanks,
> Aral
> >
>

--
Barry

- www.nearby.org.uk - www.geograph.org.uk -

Jorge Vargas

Aug 3, 2008, 5:57:19 PM
to google-a...@googlegroups.com
I don't see how this is a big issue. I wrote some code a while ago
that could be transformed into a backup utility. All you need is a
view that will call entity.toXML() and write that to a file for each
entity in the db. Sure, it's ugly, but I didn't use it for a backup --
I'm actually using it for another (desktop) client. Where it gets
complicated is with relationships and "foreign keys", because as far
as I know there is no way to reproduce the exact same key.
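
Something along these lines (a rough sketch, not my actual code; Game
and the handler are placeholders, I'm assuming db.Model's to_xml()
method here, and it writes to the HTTP response for the client to save
rather than to a file, since App Engine can't write to disk):

    from google.appengine.ext import webapp
    from google.appengine.ext import db

    from models import Game  # placeholder; substitute your own model(s)

    class ExportHandler(webapp.RequestHandler):
        # Dump every Game entity as XML; save the response to a file
        # on the client side.
        def get(self):
            self.response.headers['Content-Type'] = 'text/xml'
            self.response.out.write('<entities>\n')
            for entity in Game.all():
                self.response.out.write(entity.to_xml())
                self.response.out.write('\n')
            self.response.out.write('</entities>\n')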

> Thanks,
> Aral
> >
>

Aral Balkan

Aug 4, 2008, 3:55:55 AM
to Google App Engine
Hi Nicolas,

> I suggest you read http://groups.google.com/group/google-appengine/browse_thread/thread/...
> regarding the SDK slowness.

Thanks for the link. I'd come across it yesterday and I agree with
blep on that thread: the slowness is affecting our ability to test
with real data (not to test performance but to test _functionality_).

I've added a note to that effect to the related issue as well
(http://code.google.com/p/googleappengine/issues/detail?id=390).

Aral

Aral Balkan

Aug 4, 2008, 4:06:04 AM
to Google App Engine
Hi Barry,

> They probably are: http://groups.google.com/group/google-appengine/browse_thread/thread/...

I replied on that thread but I could only see a "Reply to author"
option, so here's what I wrote there:

Hi Pete,

The solution that I'm working on for this involves backing up the
datastore to Python code which can then be run to restore the data.

I've successfully done this with my app and I'm working on ironing out
a final issue (the max number of redirects getting exhausted while
running the backup script).

Otherwise, it works really well and meets my needs for the following
use cases:

1. Back up data from the deployment server
2. Restore data on the local SDK for testing
3. Restore data on a separate App Engine instance to use as a staging
server

The way I've done it is to break everything into little pieces. You
hit a backup handler in Django that kicks off the process. It gets 5
rows out of the datastore at a time, generates the code to restore
them, and saves that in Text shards of around 300KB each in the
datastore.
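
In outline, the backup handler looks something like this (a heavily
simplified sketch, not the actual code: the Game and CodeShard models
and the /backup URL are placeholders, the real version packs several
batches into each ~300KB shard, and offset-based fetching like this
only holds up for fairly small datasets):

    from google.appengine.ext import webapp
    from google.appengine.ext import db

    BATCH_SIZE = 5  # small enough to stay clear of short-term quota errors

    class Game(db.Model):
        # Placeholder application model.
        title = db.StringProperty()
        score = db.IntegerProperty()

    class CodeShard(db.Model):
        # Placeholder model holding a piece of the generated restore code.
        index = db.IntegerProperty()
        code = db.TextProperty()

    class BackupHandler(webapp.RequestHandler):
        def get(self):
            start = int(self.request.get('start', '0'))
            rows = Game.all().fetch(BATCH_SIZE, start)

            # Generate one constructor call per row, keyed on the original key.
            code = ''
            for row in rows:
                code += "Game(key_name=%r, title=%r, score=%r).put()\n" % (
                    str(row.key()), row.title, row.score)
            CodeShard(index=start // BATCH_SIZE, code=db.Text(code)).put()

            if len(rows) < BATCH_SIZE:
                self.response.out.write('Backup complete.')
            else:
                # Send the browser on to the next batch; a long run of these
                # redirects is what eventually hits the browser's limit.
                self.redirect('/backup?start=%d' % (start + BATCH_SIZE))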

Of course, I ran into all sorts of limits while testing this out. To
work around short-term Over Quota errors, I reduced the number of
rows handled per request to 5. That seems sustainable (but my app's
quotas _have_ been raised). To work around MemoryErrors, I implemented
the code shards so that no data structure is over 1MB in size.
Finally, I tried using a generator in my HttpResponse to stitch the
code shards into a single .py file to be downloaded, but I hit the 1MB
limit on responses ("HTTP response was too large: 3457738. The limit
is: 1048576"), so you currently have to stitch the output into a
single file yourself.

I also haven't implemented an iterative restore process yet, which
will be necessary to populate other App Engine instances.

Anyway, all this to say that backing up to Python code that can be
run to restore the data is, IMHO, the most practical route.

Aral

On Apr 15, 6:44 am, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader. You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html
>
> Conversely, we know many of you would like a better tool for moving
> your data *off* of Google App Engine, and we'd like to ask for your
> feedback on what formats would be the most useful for a data
> exporter. XML output? CSV transform? RDF? Let us know what you
> think!

Aral Balkan

Aug 4, 2008, 4:18:50 AM
to Google App Engine
Hi Jorge,

> I don't see how this is a big issue.

Being able to move data off of my deployment server is important for
several reasons:

1. Backups (data safety)
2. Testing (testing locally with real data)
3. Staging (having a staging server with real data for testing on the
App Engine environment without testing on your deployment instance)

> I wrote some code a while ago
> that could be transform into a backup utility. all you need is a view
> that will call entity.toXML() and write that to a file for each entity
> in the db.

It's not that simple if you want a _manageable_ solution that you can
_restore_ easily :) I'm not concerned with simply backing up the data
if I cannot easily restore it. And, as I mentioned above, I want to be
able to restore that data not just to the deployment instance but to
(a) the local SDK and, (b) to a separate App Engine instance to be
used as a staging environment.

> where it gets
> complicated is with relationships and "foreign keys" because as far as
> I know there is no way to reproduce the exact same key.

You don't have to reproduce the exact same key. I use the actual key
as the key name when generating the new keys. Any entities with the
original key are removed prior to the creation of the new key.
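
In other words, the generated restore code does something along these
lines for each row (a simplified sketch, not the actual code;
model_class and props stand in for whatever the generator emits):

    from google.appengine.ext import db

    def restore_row(model_class, original_key_str, **props):
        # Recreate one backed-up row, reusing the original key string as
        # the key_name of the new entity.
        original_key = db.Key(original_key_str)
        existing = db.get(original_key)
        if existing is not None:
            # If an entity with the original key still exists (e.g. when
            # restoring onto the same instance), remove it first so the
            # key_name copy replaces it rather than duplicating it.
            db.delete(existing)
        entity = model_class(key_name=original_key_str, **props)
        entity.put()
        return entity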

The way I am handling references is to make sure that I observe the
source order of the model classes. (I tried using inspect to do this,
but it relies on imp and os.readlink -- monkeypatching these worked on
the local SDK but not in the deployment environment, so I ended up
simply reading in the source file myself and running a regex on it.)
Since reference properties require a model to be defined before it is
referenced, this guarantees that the referred-to entities are created
before the entities that reference them. This is working fine
currently.
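
The ordering step itself boils down to a regex over the models source
-- roughly (a sketch, assuming all the models live in a single
models.py and subclass db.Model directly):

    import re

    def model_order(source_path='models.py'):
        # Return the model class names in the order they are defined in
        # the source, so referenced models can be restored before the
        # models that reference them.
        source = open(source_path).read()
        return re.findall(r'^class\s+(\w+)\s*\(\s*db\.Model\s*\)',
                          source, re.M)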

As soon as I fix the max redirection issue (raising the
network.http.redirection-limit in FireFox didn't work so I'm going to
try the META refresh approach instead), I should have a working proof
of concept. Once I have that stable, I'll start work on making it a
generic Django solution that you can pop into any existing app.

Once that's done, we should be able to back up and restore data, and
that will allow us to test locally with real data and to create
staging servers.

Aral

Aral Balkan

Aug 4, 2008, 9:29:06 AM
to Google App Engine
Just to update the thread: blep now has a patch that considerably
reduces the local SDK slowness.

See: http://groups.google.com/group/google-appengine/browse_thread/thread/eed4b8ae210ed6f5#

Aral

Aral Balkan

Aug 4, 2008, 9:30:18 AM
to Google App Engine
> As soon as I fix the max redirection issue (raising the
> network.http.redirection-limit in FireFox didn't work so I'm going to
> try the META refresh approach instead), I should have a working proof
> of concept. Once I have that stable, I'll start work on making it a
> generic Django solution that you can pop into any existing app.

The proof of concept is working properly now with my app. Going to
implement blep's patch and try a local SDK restore of the full backup
from the datastore (2360 rows of data).

Aral

Chris Marasti-Georg

Aug 5, 2008, 8:38:39 AM
to google-a...@googlegroups.com
When I update my model, I have an update.py script that I run.  Depending on the type of update I'll update anywhere from 5-40 rows of data, then output a new page that points back to the action with start=40 or whatever the next row to process is.  I set a timeout in the body.onload function to click the link after 1 or 2 seconds, and I haven't had quota problems.  When I had the onload function click the link right away, I would run into server errors after about 10 successive clicks.

Actual code:

    self.response.out.write("""<html><body onload="setTimeout('window.location=document.getElementById(\\\'next\\\').href',2000);">Updated """+str(count)+" Games.")
    if count < 15:
      self.response.out.write("  You're done!")
    else:
      self.response.out.write("  <a id='next' href='update?action=update_games'>More</a>")
    self.response.out.write("</body></html>")

Jorge Vargas

Aug 7, 2008, 1:30:33 AM
to google-a...@googlegroups.com
On Mon, Aug 4, 2008 at 2:18 AM, Aral Balkan <aralb...@gmail.com> wrote:
>
> Hi Jorge,
>
>> I don't see how this is a big issue.
>
> Being able to move data off of my deployment server is important for
> several reasons:
>
> 1. Backups (data safety)
> 2. Testing (testing locally with real data)
> 3. Staging (having a staging server with real data for testing on the
> App Engine environment without testing on your deployment instance)
>
>> I wrote some code a while ago
>> that could be transform into a backup utility. all you need is a view
>> that will call entity.toXML() and write that to a file for each entity
>> in the db.
>
> It's not that simple if you want a _manageable_ solution that you can
> _restore_ easily :) I'm not concerned with simply backing up the data
> if I cannot easily restore it.

Well, the restore part of that script was to take the XML, parse it
(with ElementTree in my case), and then just pass that to the
constructor of the datastore objects.

With that method I don't have any real slowness, although I haven't
tested with millions of records.
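
Roughly like this (a rough sketch, not the actual code; Game is a
placeholder model, the XML layout is assumed from what the export
emits, and the type handling here is naive -- real code would coerce
ints, dates, keys and so on):

    from xml.etree import ElementTree

    from models import Game  # placeholder model

    def restore_from_xml(xml_string):
        # Parse the exported XML and rebuild each entity by passing its
        # properties straight to the model constructor.
        root = ElementTree.fromstring(xml_string)
        for entity_node in root.findall('entity'):
            props = {}
            for prop in entity_node.findall('property'):
                props[prop.get('name')] = prop.text
            Game(**props).put()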

> And, as I mentioned above, I want to be
> able to restore that data not just to the deployment instance but to
> (a) the local SDK and, (b) to a separate App Engine instance to be
> used as a staging environment.
>

Why would you need a staging environment? The way versions are
handled in App Engine, you could deploy both on the same account. You
have that built in, with the admin console and setting the correct
version to be deployed.

>> where it gets
>> complicated is with relationships and "foreign keys" because as far as
>> I know there is no way to reproduce the exact same key.
>
> You don't have to reproduce the exact same key. I use the actual key
> as the key name when generating the new keys. Any entities with the
> original key are removed prior to the creation of the new key.
>

Then it's not a backup, just a duplicate. Be careful to find a way to
check whether the db has already been restored on that instance,
otherwise you run the risk of having a DB with each piece of
information twice.

> The way I am handling references is to make sure that I observe the
> source order of the model classes (I tried using inspect to do this
> but it relies on imp and os.readlink -- monkeypatching these worked on
> the local SDK but not on the deployment environment so I ended up
> simply reading in the source file myself and running a regex on it.)

Why do you need that? If you are going to use the db then import it
and use it; if you want, clone the file inside the backup. Since it's
all App Engine, you could duplicate the module somewhere in the backup
data and then use the built-in __import__ function
(http://docs.python.org/lib/built-in-funcs.html); after all, the model
is part of the backup, isn't it?

If you are storing to Python code, which means you are doing some sort
of code generation, then you are creating very big files of
constructor calls. Are you sure the slowness isn't due to this
approach? Depending on the code, you could be eating up a lot of
memory on both sides of the transaction.

> Since the reference properties require a model to be defined before it
> is referenced, this guarantees that the referred to entities are
> created before the entities that reference them. This is working fine
> currently.
>

yes, this is one of the limitations of appengine's db, which is an
advantage here.

> As soon as I fix the max redirection issue (raising the
> network.http.redirection-limit in FireFox didn't work so I'm going to
> try the META refresh approach instead), I should have a working proof
> of concept. Once I have that stable, I'll start work on making it a
> generic Django solution that you can pop into any existing app.
>

I see no reason why that's happening. Are you doing a page call for
each object?

Aral

Aug 7, 2008, 2:13:59 PM
to Google App Engine
Hi Jorge,

> well the restore part of that script was to take the XML and parse it
> with (elementtree in my case) and then just pass that to the
> constructor of the datastore objects.
>
> With that method I don't have any real slowness, although I haven't
> tested with millions of records.

You don't need millions to run into App Engine quota limits :) Have
you tested with thousands?

> why will you need a staging environment? the way versions are handle
> in appengine you could deploy both on the same account. You have that
> build in, with the admin console and setting the correct version to be
> deployed.

Versions are nice but they are just that: versions of the same
application. What I want is a publicly deployed version and one where
I can test changes in the live environment but not on my live instance
(i.e., you would not want your users to use the staging "version"
while you tested). In other words, I need a staging environment for
all the reasons you normally need a staging environment :)

> then it's not a backup, just a duplicate. Be careful to find a way to
> check if the db has already been restored on that instance otherwise
> you take the chance of having a DB with each peace of information
> twice.

The rows check for existing versions and replace them so duplicates
are not created.

It's not an exact duplicate/backup since there is no way to create the
exact same keys and IDs as on the deployment server.

> if you are storing to python code, which means you are doing some sort
> of code generation, then it means you are creating very big files of
> constructor calls, are you sure the slowness isn't due to this
> approach? depending on the code you could be eating up a lot of memory
> on both sides of the transaction.

I am generating code but there isn't any real slowness there.

> yes, this is one of the limitations of appengine's db, which is an
> advantage here.

Yes :)

> I see no reason why that's happening, are you doing a page call for
> each object?

Firefox's default limit is to display an error after 20 redirects. The
solution I went for eventually is to use META refresh which means that
I can give a visual update and do not run into redirection limits.

I realize that code generation is not the traditional approach to
backups/restores, but it works very well given Google App Engine's
unique constraints.

Unfortunately, the one big issue is maintaining key() references saved
in ListProperty(db.Key) properties. That's where the inability to set
the exact same keys and the inability to set IDs or to set key_names
to existing IDs become a real problem. I've raised that on another
thread and would love any ideas anyone might have.

Otherwise, the only solution at the moment is to use a separate ID
field and always use that to refer to the entity if you want datastore
portability.
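
i.e., something along these lines (a sketch with made-up models; the
uid field is ordinary data, so it survives a backup/restore unchanged,
unlike datastore keys and IDs):

    import uuid
    from google.appengine.ext import db

    class Player(db.Model):
        uid = db.StringProperty()   # portable identifier stored as plain data
        name = db.StringProperty()

    class Game(db.Model):
        title = db.StringProperty()
        # Refer to players by their portable uids rather than by db.Key,
        # so the references stay valid on a restored instance.
        player_uids = db.StringListProperty()

    # Creating and looking up by the portable id:
    player = Player(uid=str(uuid.uuid4()), name='Alice')
    player.put()
    same_player = Player.all().filter('uid =', player.uid).get()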

Aral

Aral

Aug 7, 2008, 2:15:55 PM
to Google App Engine
Hi Chris,

Thanks for sharing your setup.

Here's mine: I've limited my puts to no more than 5 at a time and I
use a META refresh of 0 seconds. I don't get any short-term or
long-term quota errors when backing up > 2300 rows with that setup.
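
The refresh step itself is tiny -- something like this inside the
backup handler (a sketch; the URL and parameter names are
placeholders, not my actual code):

    from google.appengine.ext import webapp

    class BackupHandler(webapp.RequestHandler):
        # ...batching and code generation elided...

        def continue_backup(self, rows_done):
            # Emit a zero-second META refresh instead of an HTTP redirect,
            # so the browser's redirection limit never triggers and each
            # page can show progress.
            next_url = '/backup?start=%d' % rows_done
            self.response.out.write(
                '<html><head><meta http-equiv="refresh" content="0;url=%s">'
                '</head><body>Backed up %d rows so far...</body></html>'
                % (next_url, rows_done))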

Aral

On Aug 5, 1:38 pm, "Chris Marasti-Georg" <c.marastige...@gmail.com>
wrote:
> When I update my model, I have an update.py script that I run. Depending on
> the type of update I'll update anywhere from 5-40 rows of data, then output
> a new page that points back to the action with start=40 or whatever the next
> row to process is. I set a timeout in the body.onload function to click the
> link after 1 or 2 seconds, and I haven't had quota problems. When I had the
> onload function click the link right away, I would run into server errors
> after about 10 successive clicks.
<snip>