Data Transfer to AWS

Millisecond

Sep 1, 2011, 8:22:33 AM
to Google App Engine
Won't rehash the pricing discussions, but because of the pricing
changes and the way it's been handled we're 90% sure we're going to
move to AWS.

Anyways... now I'm tasked with figuring out how to get our almost TB
of data over there either into SimpleDB or into an RDS instance, not
sure yet.

Have other people done the move, how did it go? Did you pull from
AWS / push from GAE? Take an intermediate backup and then load from
some other mechanism (thinking S3) into SDB / RDS?

Our app is adding over 1 GB / day with 20MM reads and 20MM writes, and
we'd rather not take it offline for too long... Thinking of some
crazy scheme based on descending keys to move data over from key A
"downwards", shut off app, move over everything from key A "upwards"
as we're mostly only writing new data. Maybe special-case a few
object classes and use timestamps to detect deltas. Realize it's
pretty app-specific, but wondering if other people have tackled that
same problem and what their experiences were.
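The two-phase idea above can be sketched as plain delta logic. This is a hypothetical, framework-free sketch: dicts stand in for the GAE datastore and the AWS-side store, and integer keys stand in for entity keys; none of these names are real GAE or AWS APIs.

```python
# Hypothetical sketch only: dicts stand in for the source (GAE) and
# target (AWS) stores, integers stand in for entity keys.

def two_phase_migration(source, pivot_key, live_writes):
    """Copy keys below the pivot while the app is live, then shut off the
    app and copy from the pivot 'upwards' to catch writes made meanwhile."""
    target = {}
    # Phase 1 (app live): sweep old data from the pivot key "downwards".
    for key in sorted((k for k in source if k < pivot_key), reverse=True):
        target[key] = source[key]
    # The app keeps writing during phase 1; new keys land at/above the pivot.
    source.update(live_writes)
    # Phase 2 (app off): move everything from the pivot key "upwards".
    for key in sorted(k for k in source if k >= pivot_key):
        target[key] = source[key]
    return target

existing = {1: "old-a", 2: "old-b"}
migrated = two_phase_migration(existing, pivot_key=3, live_writes={3: "new-c"})
```

This assumes new keys really do land above the pivot; as noted later in the thread, auto-generated datastore keys are not strictly increasing, so timestamps may be the safer delta marker.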

Thanks for any and all help,
-Casey

Daniel Upton

Sep 1, 2011, 1:15:17 PM
to Google App Engine
We will be facing the same problem as we leave App Engine. I believe
I'm going to attempt to use the bulk downloader to get the data to a
local computer. I am slightly worried about the cost of transferring
11TB. I believe that the process may take a couple of days, so I plan
on moving inbound records to the new database before I start the d/l,
that way I can have reduced functionality over the data transfer
period but no data will be lost.

-Daniel

Waleed Abdulla

Sep 1, 2011, 1:23:34 PM
to google-a...@googlegroups.com
Another tool I used in the past to move my data into GAE (it works both ways) is App Rocket. It uses timestamps to do real-time replication between GAE and MySQL. You might need to tweak it a little bit to fit your needs exactly, but it gets you most of the way there.




--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.


David

Sep 1, 2011, 2:14:04 PM
to Google App Engine
I'm having to go through the same thing. My plan was to have a
modified date on all the data. Then transfer over all data up to the
current date and keep track of the timestamp. Do this while the site
is live. Then put the site in read only mode then do the same
transfer starting at that timestamp up until the latest and that will
get all the extra difference, but it'll be a lot smaller. You also
have to account for data that was modified and might already be in
AWS from the first transfer, and just update it instead of inserting
it.
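The modified-date plan above can be sketched as a small delta-transfer function. Everything here is illustrative (records are (modified, data) tuples keyed by ID, timestamps are plain integers), not a real datastore or AWS client.

```python
# Hypothetical sketch of the modified-date plan; records are
# (modified_timestamp, data) tuples keyed by ID.

def transfer_since(source, target, since):
    """Upsert every record modified at or after `since`; return the newest
    timestamp seen so the next pass can start from there."""
    checkpoint = since
    for key, (modified, data) in source.items():
        if modified >= since:
            target[key] = (modified, data)   # update-or-insert, never duplicate
            checkpoint = max(checkpoint, modified)
    return checkpoint

source = {"a": (1, "v1"), "b": (2, "v1")}
target = {}
checkpoint = transfer_since(source, target, since=0)   # pass 1, site live
source["a"] = (3, "v2")                                # edited during pass 1
transfer_since(source, target, since=checkpoint)       # pass 2, site read-only
```

Using `>=` re-copies records sitting exactly on the checkpoint boundary, which is harmless precisely because the write is an upsert rather than a blind insert.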

I'm also planning on pruning as much data as I can beforehand.

I haven't done any of this, but that's what my plan is right now. Let
me know if you come up with something different. Hope that helps.

David

Millisecond

Sep 1, 2011, 2:44:10 PM
to Google App Engine
App Rocket looks awesome.

Assuming I can start it up in a non-default Python instance (I'm
normally Java) the same way I've done for some of the Google-supplied
Python tools, that'll really help.

Thanks!
-Casey

Robert Kluin

Sep 2, 2011, 2:20:06 AM
to google-a...@googlegroups.com

Hey Casey,
  One other thought: you could run over the data, bundling and dumping it to the blobstore, then pull the blobs to Amazon. That might let you get a little more efficiency in the transfer. I've done it going in; it should work just as well going out.

  Also note that auto-generated keys are not sequential nor strictly increasing, so you could potentially lose data. A solution I've used is to make a small adjustment to my models: I'll add an indexed 'batch' field that gets put on all new / updated entities. Do your main transfer using key order, then when you're ready, grab everything with a batch value. Old data won't have a value for batch, so it won't be picked up in your final conversion. With a couple of iterations you should be able to minimize your downtime.
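The 'batch' field trick above can be sketched with a toy in-memory store. The put() and query helpers below are illustrative stand-ins, not the real datastore API.

```python
# Toy in-memory version of the 'batch' field trick.

def put(store, key, data, batch=None):
    """During the migration window, every new/updated entity carries a batch id."""
    entity = {"key": key, "data": data}
    if batch is not None:
        entity["batch"] = batch
    store[key] = entity

def entities_in_batch(store, batch):
    """Stand-in for an indexed query on batch == value; entities without the
    field (i.e. all the untouched old data) can never match."""
    return sorted((e for e in store.values() if e.get("batch") == batch),
                  key=lambda e: e["key"])

store = {}
put(store, 1, "old")                               # pre-migration, no batch field
# ... main transfer runs in key order here ...
put(store, 1, "touched-during-transfer", batch=1)  # update gets stamped
put(store, 2, "written-during-transfer", batch=1)  # new write gets stamped
delta = entities_in_batch(store, 1)                # only the two stamped rows
```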

  You could also use the remote API to fetch data during the final transfer stage. That should let you have zero downtime.

Robert





Robert Kluin

Sep 2, 2011, 2:22:31 AM
to google-a...@googlegroups.com

Hi Daniel,
  I strongly suspect you're going to need a different solution to transfer that much data out in a timely manner. The best solution depends on your write rates and update patterns.

Robert

Casey Haakenson

Sep 2, 2011, 3:12:58 AM
to google-a...@googlegroups.com
Thanks for the suggestions, Robert. 

Is there documentation on using the remote API anywhere?  A wrapper lib would be even better...

Thanks,
-Casey

Brandon Wirtz

Sep 2, 2011, 3:57:28 AM
to google-a...@googlegroups.com

File a ticket. I think this should be part of Google Takeout:

http://www.dataliberation.org/takeout-products

Robert Kluin

Sep 3, 2011, 1:35:45 AM
to google-a...@googlegroups.com

Hey Casey,
  I think you can find some useful stuff in the SDK and maybe in the docs. I'm mobile now so I don't have links.

Robert




Natalie

Sep 4, 2011, 6:08:44 AM
to google-a...@googlegroups.com

We are in the same situation. How’s everyone’s progress?

We’re planning to migrate to AWS but we have quite a bit of data to move, like you guys. We’re trying to minimize service disruption and keep our site online while making the move. Here’s our scheme (would love to hear suggestions from you guys):

1) Upload a new app version that adds a Boolean value to every type of entity in the Datastore. Call it "updated". All calls to put() will set this Boolean to true and push the put() data to a GAE pull queue.

2) Use the Remote API to batch-get all entities with the Boolean set to false. This will get any unmodified data from the Datastore. Data modified after the fetch can be retrieved from the pull queue later.

3) Transform the data and push them to AWS.

4) From AWS, lease the data from the pull queue and fill up the database.

5) Modify the DNS record to point to the new AWS site, keeping the GAE app alive until it receives no traffic.


We are trying to take advantage of the GAE pull queue's ability to be accessed from outside of GAE. Do you guys foresee any problem with this scheme? We're busy coding this at the moment and would love to hear your input. Thank you.
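Steps 1 and 4 of the scheme above can be sketched with a toy queue. The lease/delete semantics are heavily simplified versus the real pull queue (here a lease deletes immediately), and none of these classes or functions are the actual Task Queue API.

```python
from collections import deque

class PullQueue:
    """Toy stand-in for a GAE pull queue."""
    def __init__(self):
        self._tasks = deque()

    def add(self, payload):
        self._tasks.append(payload)

    def lease(self, n):
        """Hand out up to n tasks in FIFO order (deleting them immediately,
        unlike the real lease-then-delete protocol)."""
        return [self._tasks.popleft() for _ in range(min(n, len(self._tasks)))]

def put(datastore, queue, key, data):
    """Step 1: every put() flags the entity and mirrors the write to the queue."""
    datastore[key] = {"data": data, "updated": True}
    queue.add((key, data))

def drain_to_aws(queue, aws_db, batch_size=100):
    """Step 4: the AWS side leases queued writes and applies them in order."""
    while True:
        batch = queue.lease(batch_size)
        if not batch:
            break
        for key, data in batch:
            aws_db[key] = data

ds, queue, aws = {}, PullQueue(), {}
put(ds, queue, "k1", "v1")
put(ds, queue, "k1", "v2")   # a later update; FIFO order preserves it
drain_to_aws(queue, aws)
```

Because the queue preserves write order, replaying it on the AWS side converges to the latest value for each key.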

Also, we are planning to use AWS Elastic Beanstalk to ease the Tomcat admin effort. Could anyone share their experience with this technology?


Robert Kluin

Sep 4, 2011, 2:08:27 PM
to google-a...@googlegroups.com

The only problem I see is that you won't be able to get existing data with 'updated=False'; the old entities won't have the property at all, so they won't be indexed for that query. You'll need to just get everything and maybe skip the stuff you have already pushed. Otherwise, sounds like the idea might work.
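A tiny illustration of the point above (toy query function, not the datastore API): property filters only match entities that actually have the property, so pre-existing entities with no 'updated' field are invisible to an updated == False query.

```python
def query_updated(entities, value):
    """Toy property filter: an entity must have the property to match,
    mirroring how the datastore only indexes entities that set it."""
    return [e for e in entities if "updated" in e and e["updated"] == value]

old = {"name": "pre-migration"}                    # property absent entirely
new = {"name": "post-deploy", "updated": False}
matches = query_updated([old, new], False)         # only the new entity matches
```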




Raymond C.

Sep 4, 2011, 9:20:39 PM
to google-a...@googlegroups.com
I think stopping everything and doing a complete data dump once and for all is the only way to do a migration, especially if you are on HR (the High Replication datastore) already.

Millisecond

Sep 5, 2011, 11:55:31 AM
to Google App Engine
Update on our progress:

-Currently rather busy optimizing what we can on GAE as we can't take
a $7,000 / mo bill until the transfer is complete and it's going to
take a non-trivial amount of time to move.

-Very app-specific, but our data is really broken out into two
categories, configuration and data. We're going to move the
configuration over, start the servers on the new side and then have a
process pull the data, taking however many days are required. Users
will just see data appear over time.

-Planning on using the remote API from a command-line utility (we'll
probably have to write from scratch as it's so specific) to pull into
SDB newer entries first. We'll run it on the AWS side to do the
transfer. http://code.google.com/appengine/docs/java/tools/remoteapi.html

-Biggest outstanding problem is ID matching and allocation. We'll
need to start new data records with an ID higher than what's allocated
in GAE so the subsequent pull doesn't clobber those allocated in the
meantime. Haven't tackled that yet, but assuming the solution won't
be super hard.
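One simple way to attack the ID problem above is to reserve the AWS-side auto-ID range well above GAE's current high-water mark, so the later backfill pull can't collide with rows created on the new side in the meantime. The numbers and names below are entirely hypothetical.

```python
class IdAllocator:
    """Toy sequential ID allocator for the AWS side."""
    def __init__(self, start):
        self._next = start

    def allocate(self):
        """Hand out sequential IDs starting from the reserved floor."""
        allocated = self._next
        self._next += 1
        return allocated

def safe_start(gae_high_water_mark, headroom=1_000_000):
    """Leave generous headroom above anything GAE has already allocated
    (assumes the GAE-side maximum ID can actually be observed)."""
    return gae_high_water_mark + headroom

alloc = IdAllocator(safe_start(5_000_000))   # pretend GAE's max ID is 5M
first = alloc.allocate()
```

With the floor reserved, the backfill can copy GAE records under their original IDs while new AWS records stay safely above the gap.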

Also spending some time wishing that AppScale would run on Beanstalk /
SDB, so our code changes would be super minimal. Digging through
their code now to see how hard it would be to add Beanstalk / SDB
support to their current setup.