Backing up AppEngine

Joshua Fox יהושע פוקס

Oct 25, 2016, 2:29:17 AM
to Google App Engine
What do you do for backing up AppEngine data?

From what I can see, there is no simple way to say "back up everything in this GAE project" with a button-click or API call, nor even a way to do this for most specific data-storage mechanisms.

E.g., Datastore has a backup tool, but it has serious bugs that make it inoperable. Blobstore has no backup mechanism at all.

Do major users of AppEngine not have off-Google backups? True, GAE deals with disk failure, but you have to protect against hacker vandalism as well as  team-member error. (One response I have received is "Don't make mistakes." I won't bet on that.) 

The Google Data Liberation Front seems to have missed GAE. 

So, do you make do without? Do you write your own tools to back up (and regularly restore) where Google tools are lacking?

Joshua

Jeff Schnitzer

Oct 27, 2016, 4:16:40 PM
to Google App Engine
I pretty much live without traditional backups. I use the cheesy old backup tool to make a copy of everything meaningful once every few days, but it’s pretty much just a backstop against the worst-case scenario. If we had to rely on it, it would be a TON of work. And keep in mind that the backup tool is not transactional, so it is likely backing up not-quite-consistent data.

On the other hand, pretty much any data-restore scenario is going to be a TON of work. If you revert to a day-old backup, what are you going to do about the thousands of transactions you just “forgot” about? Chances are this will not be a “restore” so much as a “load all the old data onto a separate datastore and write code to carefully merge the stuff you mangled”. In which case the cheesy backup strategy is not much worse than a “real” backup.

Unless you can afford to lose a day’s worth of data, traditional backups aren’t as helpful as they sound. Don’t make mistakes.

OH: Also, don’t use the Blobstore. It’s ancient and vastly inferior to GCS. Don’t even use the GAE Blobstore API on top of GCS; just use the native APIs. You can back it up with gsutil rsync if you want.
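The rsync approach Jeff mentions is a one-liner; the bucket names below are placeholders:

```sh
# Mirror a GCS bucket into a separate backup bucket.
# Without -d, objects deleted from the source are still kept in the
# backup, which is usually what you want from a backup.
gsutil -m rsync -r gs://my-app-data gs://my-app-backup
```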

Jeff


George (Cloud Platform Support)

Oct 27, 2016, 5:17:46 PM
to Google App Engine, joshu...@gmail.com

Hello Joshua,

The need to back up data in your App Engine environment is not as pressing as it may seem. A data protection policy is in place, covering backups and all kinds of risks to data in the cloud. If you back up your data yourself, you are repeating what has already been done for you in a thorough and systematic manner.

There is still a legitimate need to control your data and download it, and tools are in place to back up and restore. If you would like to set up an object versioning policy for your app environment, Cloud Storage buckets allow that.
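For concreteness, the versioning setup George refers to is a single gsutil command, and archived generations can then be listed and restored (bucket, object name, and generation number here are placeholders):

```sh
# Turn on object versioning for a bucket
gsutil versioning set on gs://my-app-bucket

# List every generation of an object, including overwritten/deleted ones
gsutil ls -a gs://my-app-bucket/report.csv

# Restore one generation by copying it back over the live object
gsutil cp "gs://my-app-bucket/report.csv#1477000000000000" gs://my-app-bucket/report.csv
```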

Each tool individually offers ways to back up data. Note that the Datastore backup process does not include values stored in Blobstore or Cloud Storage.


I hope this helps. Cheers!

Joshua Fox יהושע פוקס

Oct 28, 2016, 5:07:13 AM
to George (Cloud Platform Support), Google App Engine
On Fri, Oct 28, 2016 at 12:17 AM, George (Cloud Platform Support) <gsuce...@google.com> wrote:

Hello Joshua,

The need to back up data in your App Engine environment is not as pressing as it may seem. A data protection policy is in place, covering backups and all kinds of risks to data in the cloud. If you back up your data yourself, you are repeating what has already been done for you in a thorough and systematic manner.


But this data protection mechanism does not protect against team-member error (nor hacker vandalism). Every team writes code and runs admin tools to purposely delete and overwrite data. Although everyone  works hard to avoid bugs and mistakes, we are all human. Backups are needed to deal with that.

My approach to this topic seems quite different from that of the creators of GAE, and I am wondering if I am missing something fundamental. Are most of the thousands of GAE customers, not to mention Google itself (as a GAE "customer"), really doing without protection against such scenarios? The Google Data Liberation Front showed a clear understanding of the need for easy data exportability, yet that is missing here.

 

There is still a legitimate need to control your data and download it, and tools are in place to back up and restore. If you would like to set up an object versioning policy for your app environment, Cloud Storage buckets allow that.


That is only for Storage, not Datastore. (Of course it might not be reasonable to expect that for Datastore, but the point is that versioning, which would help meet my requirement, is not a possibility, unless I am really missing something.)

Each tool individually offers ways to back up data.

Neither the Google Datastore Backup utility nor the Managed Backup utility works; I have confirmed these bugs with Google support. Hopefully these bugs will be fixed, and perhaps other users' datasets do not trigger them, but again this raises questions about how hard it is to back up data.
 
Storage can be backed up easily with cp.

But there is absolutely no backup tool for Blobstore. (Is there?) Again, do most customers write their own, or do they just leave their data in a GAE Blobstore  and hope for the best?



I hope this helps. Cheers!



Thank you. It is helping me understand the philosophy behind the paucity of backup tools, but I still don't get the difference in attitude towards accidental deletion or overwrite.

Joshua 

George (Cloud Platform Support)

Oct 28, 2016, 11:35:33 AM
to Google App Engine, joshu...@gmail.com

Hello Joshua,

Leaving data in the Blobstore and hoping for the best, as you say, is not such an unreasonable policy: as mentioned, there is a standards-validated data protection policy in place, covering backups and all kinds of risks to data in the cloud, including the hacker interference you mention.

Your view on versioning and team-member error seems to be influenced by a code-development mindset. GAE is not specifically built for developing code online; for that, you may refer to specialized third-party tools, for instance Git. These tools take care of versioning and are able to revert data to a healthy previous state in case of team-member/human error. There are in fact versioning features offered for the instances running in the cloud; please refer to the “Versioning and instances” section in the App Engine overview.

One can find a “Backing up or restoring data” chapter on the “Managing Datastore from the Console” page.

You are perfectly right about the Blobstore: the backup process does not include values stored in Blobstore or Cloud Storage. Applications do not create or modify blob data directly; instead, blobs are created indirectly, by a submitted web form or other HTTP POST request. This means data is not modified within the Blobstore itself, so it remains identical to the original data uploaded by the user. There is no immediate need to back up cloud data that is already stored in identical form in your own system.

Some tools may not yet work 100% as expected. You have already submitted bugs for those, and we hope these issues are going to be addressed within a reasonable timeframe.

In short, the difference in attitude towards accidental deletion or overwrite originates in our confidence in the highest standards implemented by our data protection policy.

Evan Jones

Oct 28, 2016, 2:31:02 PM
to Google App Engine, gsuce...@google.com, joshu...@gmail.com
To chime in on this: I agree with you that backups are important to protect against operator error. As a concrete example: we made a thankfully minor error with BigQuery, so now we periodically back up all our BigQuery tables.

The Datastore backup tool is not great, but we do it for the same reason, and it does seem to work for us. Although now that I think about it, we haven't carefully checked our backups manually in a while, so I should go do that.

Worst case: we've discussed multiple times writing something ourselves that would scan our Datastore and write entities out in a more useful way, but we haven't prioritized it.

Alexey

Oct 29, 2016, 5:53:09 PM
to Google App Engine, joshu...@gmail.com
I suppose it greatly depends on which persistence technology we're talking about and the reasons for wanting backups. When using Cloud SQL, some standard MySQL tools should work. For GCS, code might have to be written to efficiently extract and load files, but with a low-level storage solution like that and with multi-regional options, disaster recovery is not too problematic. So the only reasons to have good extraction and loading capabilities for GCS are moving your project to or from other cloud providers and ETL integrations.

Datastore gives a lot of flexibility as far as data management is concerned; it can be handled by the application in the most suitable way. One intriguing feature of Datastore is that indexes can be chosen for each entity stored. This can be a boon to a data-management strategy that stores multiple versions or deltas of entities and uses a batch process to consolidate, replicate, or back up new entities. Such a batch process can use a special index on a field that flags fresh entities as up for processing, and then remove them from this index upon completion.
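That flag-and-sweep idea can be sketched in a few lines. Here a plain-Python list stands in for a Datastore kind, and a `needs_backup` attribute plays the role of the specially indexed flag field; the names are illustrative, not a real Datastore API:

```python
# Sketch of the "flag fresh entities, sweep them in a batch" pattern.
# A plain list stands in for a Datastore kind; `needs_backup` plays the
# role of the specially indexed flag field described above.

class Entity:
    def __init__(self, key, payload):
        self.key = key
        self.payload = payload
        self.needs_backup = True  # set on every create/update

def backup_batch(entities, archive):
    """Back up every flagged entity, then clear its flag
    (the analogue of removing it from the special index)."""
    for e in entities:
        if e.needs_backup:              # in Datastore: a query on the flag's index
            archive[e.key] = e.payload  # in reality: write to GCS, BigQuery, etc.
            e.needs_backup = False

entities = [Entity("a", 1), Entity("b", 2)]
archive = {}
backup_batch(entities, archive)   # first sweep archives both entities

entities[0].payload = 99
entities[0].needs_backup = True   # an update re-flags the entity
backup_batch(entities, archive)   # second sweep re-archives only "a"
```

Each sweep touches only the entities changed since the last one, which is what keeps the batch cheap compared with a full-table scan.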


Joshua Fox יהושע פוקס

Oct 30, 2016, 9:05:30 AM
to Google App Engine, evan....@triggermail.io, George (Cloud Platform Support)
George, thank you for that reply. I don't want to clutter the list, so feel free to not respond, but there seems to be a real difference of approaches here.  I now understand your approach better, based on responses from you and other Googlers, but to me it is so counter-intuitive that I want to understand where you're coming from.

I seem to be exceptionally concerned about the risk of accidental deletion/overwriting. Doesn't it make sense to handle that as a high priority?

Maybe I'll write up an article if I get my head around this.

On Fri, Oct 28, 2016 at 5:35 PM, George (Cloud Platform Support) <gsuce...@google.com> wrote:

Hello Joshua,

Leaving data in the Blobstore and hoping for the best, as you say, is not such an unreasonable policy: as mentioned, there is a standards-validated data protection policy in place, covering backups and all kinds of risks to data in the cloud, including the hacker interference you mention.


How does the policy cover that? If a hacker deletes our Blobstore or Datastore -- or indeed, if we delete them with bugs in our code or by accidentally hitting the Delete button at https://console.cloud.google.com/appengine/blobstore?project=myproject or https://ah-builtin-python-bundle-dot-myproject.appspot.com/_ah/datastore_admin -- does your data protection policy allow us to recover?





Your view on versioning and team-member error seems to be influenced by a code-development mindset. GAE is not specifically built for developing code online; for that, you may refer to specialized third-party tools, for instance Git.


Sure, we use Git for code. 

These tools take care of versioning and are able to revert data to a healthy previous state in case of team-member/human error.


Of course we test new code on our dev laptops, then in special GAE test-projects. We often delete the data in these test projects. We pay very close attention to make sure we don't delete the data in the production project -- but we are all human, and mistakes can happen.

 

There are in fact versioning features offered for the instances running in the cloud; please refer to the “Versioning and instances” section in the App Engine overview.

 
But from what I can see, Git and AppEngine Versions do not allow reverting data, just code.


One can find a “Backing up or restoring data” chapter on the “Managing Datastore from the Console” page.

You are perfectly right about the Blobstore: the backup process does not include values stored in Blobstore or Cloud Storage. Applications do not create or modify blob data directly; instead, blobs are created indirectly, by a submitted web form or other HTTP POST request. This means data is not modified within the Blobstore itself, so it remains identical to the original data uploaded by the user. There is no immediate need to back up cloud data that is already stored in identical form in your own system.

Some tools may not yet work 100% as expected. You have already submitted bugs for those, and we hope these issues are going to be addressed within a reasonable timeframe.

In short, the difference in attitude towards accidental deletion or overwrite originates in our confidence in the highest standards implemented by our data protection policy.

On Fri, Oct 28, 2016 at 8:31 PM, Evan Jones <evan.jones@triggermail.io> wrote:
Worst case: we've discussed multiple times writing something ourselves that would scan our Datastore and write entities out in a more useful way, but we haven't prioritized it.

Evan, can you explain why this is not a high priority? Do you see accidental deletion as unlikely? 

Thank you,

Joshua

Evan Jones

Oct 30, 2016, 1:30:54 PM
to joshu...@gmail.com, Google App Engine, George (Cloud Platform Support)
On Sun, Oct 30, 2016 at 9:04 AM, Joshua Fox יהושע פוקס <joshu...@gmail.com> wrote:
Worst case: we've discussed multiple times writing something ourselves that would scan our Datastore and write entities out in a more useful way, but we haven't prioritized it.

Evan, can you explain why this is not a high priority? Do you see accidental deletion as unlikely? 


So far we are trusting that the existing Datastore backup tool does something useful. :) The reason we've considered doing our own backup thing is to be able to do more selective restores -- for example, per entity, per namespace, or per date (we have a bunch of entities that include a timestamp attribute), depending on the reason. We haven't had the right disaster yet to make us prioritize that work.

I agree with you: I think protecting against programmer error is valuable! We are just hoping that we do enough of it already, and haven't been burned enough to put that ahead of new product work. Sadly, human nature being what it is, it's hard to invest resources in a possible future disaster that has not yet happened.

Still: this discussion has reminded me that we still have a bit more work to do on our Datastore and BigQuery backup stuff, so thanks!

Evan


George (Cloud Platform Support)

Oct 31, 2016, 11:39:00 AM
to Google App Engine, joshu...@gmail.com

Hello Joshua,

Protecting against human error is important, we all agree on that. You mention your production project along with test projects. The platform allows granting various levels of access to team members; IAM lets you adopt the security principle of least privilege, so you grant only the necessary access to your resources. Only programmers and team members can access test projects.

Access to data from the production process is granted to users in various forms, and is controlled by your app’s security features. Users have the data access rights you grant them.

One may notice that developer team members usually have a different level of expertise with data, and are able to control risks to a higher extent than an app user. They are generally considered more reliable.

It may be worth mentioning that automated backups are possible and can be scheduled, and that Cloud SQL allows you to enable binary logging.
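For reference, both of George's suggestions are single gcloud commands; the instance name and backup window below are placeholders:

```sh
# Schedule automated daily backups and enable binary logging
# (binary logs allow point-in-time recovery on Cloud SQL)
gcloud sql instances patch my-sql-instance \
    --backup-start-time 03:00 \
    --enable-bin-log
```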

Nick

Oct 31, 2016, 5:20:33 PM
to Google App Engine
I think it would be great if GAE had better support for restoring from accidental deletes. For example, if you could store a unique ID for a transaction and somehow undo all its writes/deletes later.

Given the nature of distributed systems, the complexity required in engineering them, and the backup tools already provided, it seems obvious that the most likely error states with respect to data loss will be bugged code or devops error. Effectively, managed data removes the traditional data-loss mechanisms, so it makes sense for GAE to provide supportive tools in this space.

I'd be interested if anyone has used the data store versioning features - do they scale and perform well in the real world? I've not heard anyone else even acknowledge them let alone that they use them. I'd love to get a recommendation on whether it's worth using.

Re: deleting data - historically I've favored using a delete flag on important entities, and filtering those out in queries. This was pragmatic and also relatively favored by the platform before the delete costs changed.
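That soft-delete pattern is easy to sketch. A plain-Python dict stands in for a Datastore kind here; on GAE the `deleted` flag would be an indexed property and the filter an ndb query condition, and all names are illustrative:

```python
# Soft-delete pattern: never destroy important entities; flip a flag and
# filter flagged entities out of normal queries. A dict stands in for a
# Datastore kind.

records = {
    "inv-1": {"total": 100, "deleted": False},
    "inv-2": {"total": 250, "deleted": False},
}

def soft_delete(key):
    # recoverable: an "undelete" just flips the flag back
    records[key]["deleted"] = True

def active_records():
    # the filter every normal query applies
    # (ndb equivalent: Model.query(Model.deleted == False))
    return {k: v for k, v in records.items() if not v["deleted"]}

soft_delete("inv-1")
```

The cost of the pattern is that every query must remember the filter, which is why it is usually confined to the most important kinds.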

Mark Cummins

Nov 4, 2016, 4:08:15 PM
to Google App Engine
We use the old crusty Datastore Admin backup tool. We run it nightly from a cron. We back up the most important entity types nightly, and some less important types weekly. We have tested disaster recovery, and it all seems to work fine.
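For anyone wanting to replicate this, the scheduled-backup recipe from the Datastore Admin documentation looked roughly like the following cron.yaml; the entity kinds and bucket name are placeholders:

```yaml
cron:
- description: nightly backup of the important entity kinds
  url: /_ah/datastore_admin/backup.create?name=NightlyBackup&kind=User&kind=Order&filesystem=gs&gs_bucket_name=my-backup-bucket
  schedule: every day 03:00
  target: ah-builtin-python-bundle
```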

We do this to defend against programmer error or attack, e.g. If some programmatic process were to delete or mangle our whole Datastore.

The major downside of this approach is that it is insanely expensive. We have to read every Datastore entity every day, which takes a lot of reads and a lot of instance hours. For a modest-sized datastore this costs us several hundred dollars per month. I really wish there were a better approach.
