Cautionary Tale: Abusive price for data migration and deletion


Yohan Launay

Dec 27, 2011, 8:43:33 PM
to Google App Engine, appengine_up...@google.com
Hi fellow developers, just a cautionary tale for the new members out
there and people building up large datasets.

We already know that the difference between the actual data size and the total
reported datastore size is due to the indexes and the various voodoo
the datastore does to keep our data safe. It
becomes even more relevant when you are trying to migrate your data out of
GAE or simply delete it in bulk.

I was storing about 500 GB of data, which translated into > 2 TB in
the datastore (x4...). After spending days reprocessing most of this
data to remove the unused indexes (losing flexibility in my
queries and a few hundred dollars in the process), it went down to 1.6 TB, still
costing me about $450 / month for storage alone. An important note is
that a lot of this data comes from individual small entities (about 1
billion of them), generated by reports and such. I don't deny that I
could have come up with a better design, and my latest codebase stores
the data more efficiently (aggregating into serialized Text or
Blobs), but I still have to make do with the v1 data set sitting there.

I started migrating the data out of GAE into a simple MySQL
instance running on EC2. In reality, after migration, the entire
dataset weighs < 150 GB in MySQL (including indexes), so I have
no idea where the extra TB is coming from. The migration process was a
pain in the a** and took me 5 freaking weeks to complete. I tried the
bulk export from Python, which sucks because it only exports textual
data and integers but skips blobs and binary data (it seems they don't
teach base64 encoding at Google...). So I resorted to the remote API
after a quick email chat with Greg d'Alesandre and Ikai Lan, which
basically concluded with "sorry, cannot help, and the remote API is not a
solution". Cool, then what is? The remote API is damn slow and
expensive: I basically had to read the entities one by one, store the
extracted file somewhere, and process it on the fly with backups and
failsafes everywhere, because the GAE remote API will just break from
time to time (due to datastore exceptions, mostly). The extraction job
had to be restarted a couple of times because of cursors getting screwed
up. So reading 1 billion entities from the datastore takes weeks and costs
a lot of dough.

But then comes the axe: your data is still sitting on
GAE and you have to delete it. With 1 billion entries in the
datastore and a x3 / x4 write amplification factor, it will cost you $2-3k to empty
your dust bin. I seriously don't mind paying for datastore writes, but
having to pay $2,000 to delete data that already costs me $450 / month
is seriously pushing it.
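Back-of-the-envelope, the deletion bill works out like this (a sketch only; the $1-per-million-ops price and the 3-ops-per-delete factor are rough assumptions for illustration, not official numbers):

```python
def delete_cost_usd(num_entities, ops_per_delete=3.0, usd_per_million_ops=1.0):
    """Rough estimate of the bill for bulk-deleting datastore entities.

    Each delete costs roughly one op for the entity, one for the key,
    plus one per index entry -- hence the ops_per_delete multiplier.
    Both default figures are illustrative assumptions.
    """
    total_ops = num_entities * ops_per_delete
    return total_ops / 1_000_000 * usd_per_million_ops

# 1 billion small entities at a x3 write factor:
print(delete_cost_usd(1_000_000_000))  # -> 3000.0
```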

Any MySQL / NoSQL solution that I know of has some sort of flushing
mechanism that doesn't require deleting each entry one by one. How come
the datastore doesn't? I am not paying the outrageous $500 / month for
support, but I'm paying far more in platform usage (I have an open
credit of $300 / day), and so far I didn't get any satisfying answer or
support from the GAE team. I love the platform, but seriously, knowing
what I know now, vendor lock-in has never rung so true as with GAE,
and I would not commit so much time and energy to GAE for my big/
serious projects, leaving it instead to small quick-and-dirty jobs.

Please share and comment.

Cheers

Brandon Wirtz

Dec 27, 2011, 9:35:11 PM
to google-a...@googlegroups.com
The cold hearted bastard in me has the following thoughts.

You wrote code that treated the Datastore like SQL.
You didn't set Do Not Index on the things you didn't need to index.
You changed the structure of your data midway but didn't flush and start
over; you just changed it.
Likely you aren't doing any cleanup.
Likely you aren't using the right typing for your data.

So what I hear is "Whine, whine, whine, I built my stuff wrong, Google tried
to help me but I wanted to move to Amazon so they didn't have many
suggestions I liked, so now I'm sad, whine, whine, whine, woe is me. Please
tell others so I can get sympathy for not understanding the platform I was
working on."

Did I miss anything?

Please share and comment.

Cheers

--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to
google-appengi...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en.


Jeff Schnitzer

Dec 27, 2011, 9:56:30 PM
to google-a...@googlegroups.com
That's not quite fair. It's easy to get stuck in this trap.

Seems like there's a simple solution to deleting all the data, though:
After you've moved the important data to a new app, just stop billing
on the old app. Make reclaiming it Google's problem.

Jeff

--
We are the 20%

Brandon Wirtz

Dec 27, 2011, 11:33:15 PM
to google-a...@googlegroups.com
I'd be less abusive if the title of the thread was less so.

"Cautionary tale: Building a large-scale data set can cost a lot if the
Datastore isn't fully understood"

"Cautionary tale: Failing to scrap and restart your Datastore when making
structural changes can be expensive"

But I don't think that "abusive price" is accurate.

André Pankraz

Dec 28, 2011, 3:05:41 AM
to google-a...@googlegroups.com, appengine_up...@google.com
Sry Brandon... he has a point - deleting data should be cheaper, even if it's technically the same as writing.
Maybe he made some mistakes, but you sometimes sound like a fanboy with GAE Stockholm syndrome. ;) See what I did there... annoying accusations.
You have very good experience with Python, caching, the edge cache etc., but do you really have experience with a multi-hundred-GB datastore to talk like this?
E.g.: I have also seen some answers from you (often very helpful) that are just plain wrong in the Java environment.

Brandon Wirtz

Dec 28, 2011, 4:45:29 AM
to google-a...@googlegroups.com

Yes,

While the primary app I talk about is the edge cache, that's because it's the thing people can most benefit from and don't seem to be using.

As part of my SEO tools we have what is now a 60 TB database of backlink and crawler data about websites in the top 200k Alexa sites.

Why should deleting be cheaper? The operation takes the same amount of CPU, and after you do the delete you don't have to pay for storage.

I don't do nearly as much in the Java space, but there doesn't seem to be much difference between Python and Java. I ported both of the primary apps to both languages to do comparative cost analysis, and there have been a few things we found were faster or cheaper with one or the other; as a result, in some cases we deploy both and use different versioning so they can both be live and attached to the same data.


Leandro Rezende

Dec 28, 2011, 7:48:06 AM
to google-a...@googlegroups.com
You pay to write, you pay to keep it stored... delete should be free.


Yohan Launay

Dec 28, 2011, 8:26:32 AM
to Google App Engine
Hi Brandon,

Although I agree with you that the original dataset wasn't fully
optimized (that was over 2 years ago), I believe that I have a good
understanding of datastore vs SQL, caching, etc. I'm not building public-facing
websites; I'm dealing with private APIs, and I am already
stretching memcache and a custom-built Java cache to the limits.

I am also not talking about the reasons why I'm migrating out of GAE.
The points I highlighted were:

- no easy way to get your data out
- no cheap way to get your big data out
- the bulk export in Python doesn't handle binary/blob data
- the remote API is unstable
- running datastore queries using cursors over long periods of time is
unreliable (many times the cursor got reset for some reason, or the
query would return a 0000000 cursor, thus screwing up 1 week of data
processing)
- it cost me an arm to delete my data

To answer other questions:
- of course I thought about migrating the remaining data to a new app
and then aliasing from the old app to the new one. But that means interrupting
the service (disabling datastore writes), and I can't afford that. Plus
the remaining data is still quite big.
- the multiple indexes: every time I changed the data structure I would
reprocess everything to conform to the new schema. I'm not using any
framework like Objectify or JDO; I'm working with the raw API directly
(which is way more elegant).
- I'm not criticizing the platform; I am criticizing the lack of tools
to export and the prohibitive cost of manipulating large data sets. I
actually love GAE, it is just not for this kind of dataset, that's all.

@Brandon: If you have a way to delete 2 billion entities (whatever
their size) on the cheap, please let me know.



Jeff Schnitzer

Dec 28, 2011, 2:41:15 PM
to google-a...@googlegroups.com
It looks like you've discovered the hard way something that is not
wholly obvious at first: GAE is not good for Big Data.

The HRD is super-cool and perfect for building reliable web
applications. But it is way too slow and expensive for large-scale
data processing. And the uber-reliability is usually pointless - when
dealing with massive data volumes, your collection system is likely
somewhat lossy in the first place. Losing a few bits probably won't
hurt you, and "synchronously replicated to more than three data
centers" is massive overkill.

You probably have the right idea moving to another platform. Use the
right tool for the right job; maybe something like MongoDB or Hadoop.
You'll get much better map/reduce support, higher performance, and
lower cost. GAE is not a box that you're stuck in; you might still
run part of your application on GAE if it makes sense. Just keep an
eye on latency and communication costs.

This isn't a scathing indictment of GAE so much as a realization that
it's not a universal tool. There are a lot of things that are easier
to build with other tools... and a lot of things that are easier to
build on app engine. And some things that are best hybrids of GAE and
something else.

Jeff

--
We are the 20%

jon

Dec 29, 2011, 12:25:54 AM
to Google App Engine
Yohan, I agree that there should be an easy and cheap way to get your
data out. I think it's a little unfair that leaving GAE is made this
hard.

How much did you spend on your custom data download tool? Would you
consider open sourcing it for other developers who are caught in the
same position? I'd hate spending weeks building a custom tool just to
get my data out.

Thanks for sharing your experience.

Raymond

Dec 29, 2011, 3:13:02 AM
to Google App Engine
Dear Yohan,

On my side, I thank you for sharing your experience. I am beginning
with GAE and know that whatever time I put into this project I
will be making beginner mistakes, and this kind of info is precious.
I have limited experience with GAE so far and have to compare it with
what I know, and in some areas GAE looks very bad. For example, I can't
imagine Oracle, DB2, Informix, MsSQL, etc. having any
commercial success if they had not implemented rock-solid
solutions to import and export data, back up, build and drop tables and
databases, and of course calculate precisely the data space required for
a given structure, in some cases down to the byte.
Although I understand the very different nature of GAE compared to
these traditional DB engines, I think that any professional developer,
IT manager, project manager, or person responsible for a budget would
feel very uncomfortable building a system without a firm grip on its
costs, or without a reasonable way to modify an initial implementation or
migrate away from it. Also, the fact that part of the GAE tools are
simply not reliable enough to plan the effort and time required
to do something is another big minus for this solution.

Although DBs are not my main competence, my very first paid job
20+ years ago was to migrate a critical database to a new structure on
a new machine (an HP 9000 running Unix), using a long-forgotten database
engine. The first attempt, using SQL, took 1 week to migrate; the second,
using low-level C calls, took months to develop but migrated in the required
3.5 hours. The important thing to note is that it never crossed my
mind to question the reliability of the machine, the database, or the C
calls I was making to the DB. It just worked. The server could be
locked up for minutes swapping to disk because of lack of memory or
overload, but it never failed once, and it repeated the exercise time and
time again, reliably and in a predictable timeframe.

All this said, there are advantages to GAE that are worth fighting its
limitations for. I have not yet found anything else that is so
immediately and massively scalable while at the same time not
requiring me to manage the software and hardware; this is invaluable.
And although I know that I could have an easier job moving to MySQL, I
just don't want to manage an OS and a DB engine. I don't have the
time, I have done it before, and I don't think that's where I am going to
earn my bacon.

I will always envy some of the people answering your message for the
depth of knowledge they have of this platform, and for the fact that they
always have the right solution and right answer to everything; it must
be great to never make mistakes.

-R

Brandon Wirtz

Dec 29, 2011, 3:31:30 AM
to google-a...@googlegroups.com
Development is not about not making mistakes; it is about doing structured
performance testing and cost analysis.

My team writes 500 lines of code for every 50 that make it into the final
product.

We know things about the efficiency of Do While vs. ForEach that quite
possibly Google doesn't even know. We are that anal about testing. We test
query speed done different ways and compare cost and performance based on
the anticipated ratios of use.

We just never let "mistakes" grow to the point where we can't control them.

Yohan Launay

Dec 29, 2011, 4:41:25 AM
to Google App Engine
Hi Jeff,

I am actually still on Master-Slave. I would expect that using HRD
would have cost me even more.
Like you pointed out, I am indeed working on hybrid solutions now, not
leaving GAE in charge of everything.

Yohan Launay

Dec 29, 2011, 4:45:41 AM
to Google App Engine
Hi Jon,

*cheap* is relative; I wouldn't mind receiving a hard drive from Google
with my entire datastore on it for $X00, which would still have been
cheaper in the time, energy and money put into migrating out.

I built the tools myself: simple Java programs reading from the
datastore in batches of 30 entities and dumping them to disk, saving
the cursor and continuing from there. A few lines of code, really, using
the Java remote API. The issue lies in error management, because the
datastore will break at least a few times a day due to high latency
and such (the same issues you see directly within GAE, except you
experience them remotely). So you continuously have to restart the job
(manually or not). That's where cursors are crucial, since there is no
other way to iterate through the database in order. And if the cursor gets
corrupted, which happened to me 3 times in 5 weeks, you have to erase
everything you've done and start from scratch. Very frustrating...
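That checkpoint-and-resume pattern can be sketched like this (in Python rather than Java, with a stand-in `fetch_batch` and a local JSON file instead of the real remote API and cursor token, so the names are illustrative only):

```python
import json
import os

CHECKPOINT = "cursor.json"
BATCH = 30  # small batches keep each retry cheap


def fetch_batch(data, cursor, size):
    """Stand-in for a remote-API query; `cursor` is just an offset here.
    The real datastore cursor is an opaque token, but the loop shape
    is the same."""
    batch = data[cursor:cursor + size]
    return batch, cursor + len(batch)


def export(data):
    """Export all records, resuming from the last saved cursor if a
    previous run died partway through."""
    cursor = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            cursor = json.load(f)["cursor"]
    out = []
    while True:
        batch, cursor = fetch_batch(data, cursor, BATCH)
        if not batch:
            break
        out.extend(batch)  # in reality: dump the batch to disk
        with open(CHECKPOINT, "w") as f:
            json.dump({"cursor": cursor}, f)  # checkpoint after every batch
    return out
```

A second invocation after a crash picks up from the last checkpoint instead of restarting from entity zero, which is the whole point when a full pass takes weeks.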

Yohan Launay

Dec 29, 2011, 4:52:58 AM
to Google App Engine
Hi Raymond,

Don't misunderstand me. GAE is a great tool; I seriously love it and
advocate it everywhere I go. But since my last experience I would
recommend a hybrid solution instead of full-steam GAE, at least for
data gathering and processing.

The cost structure is quite clear:
$1 / 1 million writes
Each entity write / delete = at least 2 writes (entity + key) + N
writes for the indexes (+ maybe replication, I don't know if that's
counted)
So if you have 100 million entities, that's an easy $200-300 to delete
it all. And believe me, it is really easy to generate that many entities
when your app processes 1,500 req/s. Even by aggregating you are still
limited to 1 MB per entity member (and I don't like to play near the
byte limits due to potential serialization overhead, so I won't store
more than 75-80% of 1 MB per entity member).
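The aggregation trick he describes can be sketched in plain Python (the 1 MB property limit is real; the 75% safety margin is his rule of thumb, not an API requirement):

```python
LIMIT = 1024 * 1024       # 1 MB per entity property
SAFE = int(LIMIT * 0.75)  # stay well clear of the limit


def pack(records):
    """Aggregate many small serialized records into a few large blobs,
    each kept under ~75% of the 1 MB property limit.

    The caller must ensure a single record never exceeds SAFE.
    """
    blobs, current = [], b""
    for rec in records:
        if current and len(current) + len(rec) > SAFE:
            blobs.append(current)
            current = b""
        current += rec
    if current:
        blobs.append(current)
    return blobs


# 10,000 records of 100 bytes each (~1 MB total) pack into 2 blobs
blobs = pack([b"x" * 100 for _ in range(10_000)])
```

Storing two aggregated entities instead of ten thousand small ones is what cuts both the per-entity write overhead and, later, the per-entity delete bill.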

I believe that the datastore (or even GAE memcache) doesn't offer a
simple flush mechanism because the entire platform is shared among
multiple (all?) apps, and the way data is stored doesn't allow for
a simple flush (counting is a different matter). I just hope that the
GAE team will read my article and maybe lower the price of deletion a
bit.


Yohan Launay

Dec 29, 2011, 4:59:39 AM
to Google App Engine
Hi Brandon,

Well, I started using GAE simply because 2 years ago I was a tech team
of 1 and couldn't afford to hire full-time sysadmins. I'm migrating
some of my stuff out now that I have more guys to help me. And GAE is
a great platform that runs on its own and doesn't require much
administration (I launched games and apps on it that just ran for
months with no major issues). So, great for starting up. But as soon as
you enter the big-data domain, you need more control over the way you
can process and move your data around (the big companies all have
their own datacenters because they need full control over the
infrastructure), and thus a PaaS may not be suited anymore.

It's hard to plan for your business growing 10x within a few months,
with the tech infrastructure suddenly having to grow from 50 req/s to
5,000 req/s. BTW, GAE can't handle such load well (a minimum latency of
500 ms on Java seriously sucks, not to mention write contention on the
datastore). It is easy to plan when everything can be defined in
advance (with budgets and such), but you don't always have that option.

But thanks for sharing your inputs anyway, always appreciated ;)

Brandon Wirtz

Dec 29, 2011, 5:58:45 AM
to google-a...@googlegroups.com
If you check the archives, I have shared times when my requests were well
over 5,000/s.

I would say GAE handles big data really well. But you have to do testing to
make sure your structure is correct and that your indexes are well thought
out.

Planning is always possible. Testing is always possible. But it's like driving
my Mini Cooper around Laguna Seca vs. driving a Ferrari around it: the
Ferrari is only faster if you can handle it. My mom can run laps in the
Mini Cooper but would end up in the wall in a Ferrari.

Or like the discussion about executing code from students.

GAE is cycles on demand, so if you can build your app to be efficient, it is
cheap. If you build it with errors, it is expensive.

I recently found I could knock 3% off my bill by disabling logging.
That's the level of testing we do. People say "but how can you afford to
pay devs to write code if you worry that much?" Well, we are betting on the
long haul. We only need to learn a lesson once to capitalize on it for
years.

You say you can't predict growth. Sure I can. I either engineer something to
work for me and 3 of my friends, or I engineer it to be the next Facebook.
There is room for some differences along the way, but I could build Facebook
on GAE. No worry about big data or scaling. (I think the GAE team would
deploy servers for me as fast as I could fill them.)

Things that are designed for you and your friends you don't market, you
don't tell people about, so they don't grow. When CDNinabox went from
something Brandon uses for his sites to being a product, it got lots of
complete rewrites. Testing in Java and Python, the
caching mechanism we use ended up using 4 different models based on the type
of traffic the site we are accelerating gets. 1 hack for me became
software with 40+ optimizations that can be turned on and off to make things
run up to 80% cheaper than the defaults. And to pick those settings, we test.
We even schedule changes to test real traffic for periods of time.

I think the real lesson I'm trying to convey is one I learned at MSFT. For
every dev there is 1/40th of a CTO, 1/10th of a product manager, 2 test
engineers, 1/5th of a release manager, and 1/5th of a performance engineer. That
is 2.5 support staff for every programmer. If you are just writing code, you
are working in a vacuum that makes it hard to plan, test, debug, and run
scalability metrics.

Yohan Launay

Dec 29, 2011, 6:40:35 AM
to Google App Engine

Hi Brandon,

Interesting story, but you rarely design Facebook for 500 million
people right from the start, and alone...

Anyway, I would love to know how much it would cost you, and how long
you would need, to get your data out of your super/big apps.

Please share.

Cheers

Brandon Wirtz

Dec 29, 2011, 2:41:39 PM
to google-a...@googlegroups.com
Lots.

Did you see the thread about the push-the-button, check-back-in-48-hours
export?

Though to be fair, on RDS we just did a data dump to move to a new system
which we won't mention here, and our SQL export took 288 hours 17 minutes.

Data migration over the internet is tough when you get above 1 TB. And
making sure you don't have corruption during the move is rough.



Jeff Schnitzer

Dec 29, 2011, 3:39:44 PM
to google-a...@googlegroups.com
Just a thought (and it would probably be expensive), but perhaps you
should use a two-phase export strategy:

1) Export data into the Blobstore as very large blobs
2) Suck the data out of the Blobstore

The export can run at map/reduce speeds... as fast as you want to pay
for. Bulk downloads from the Blobstore should be fast. Unless each
of your entities is huge, fetching 30 at a time is an awfully small
number.

Jeff

--
We are the 20%

Brandon Wirtz

Dec 29, 2011, 3:54:19 PM
to google-a...@googlegroups.com
I haven't written code to do it, but I had been thinking about writing
something that serializes entities into blobs, zip-compresses them, and puts
them in the Blobstore, then sucks the blobs down later.

This was also what I was thinking about as a backup strategy.
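A sketch of that idea in plain Python (zlib standing in for the zip compression; the Blobstore write itself is left as a hypothetical comment, since the actual upload call depends on the SDK):

```python
import json
import zlib


def entities_to_blob(entities):
    """Serialize a batch of entity dicts and compress them into a single
    blob. JSON is used here for simplicity; binary fields would need
    base64 or a binary serialization format."""
    raw = json.dumps(entities).encode("utf-8")
    return zlib.compress(raw, 9)


def blob_to_entities(blob):
    """Reverse step, run wherever the blob was downloaded to."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


batch = [{"key": i, "report": "..."} for i in range(1000)]
blob = entities_to_blob(batch)
# blobstore_write(blob)  # hypothetical call; the real upload depends on the SDK
assert blob_to_entities(blob) == batch
```

Shipping one compressed blob instead of a thousand entity fetches is what would make the later bulk download cheap; repetitive report data also tends to compress very well.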



Jeff Schnitzer

Dec 29, 2011, 3:55:19 PM
to google-a...@googlegroups.com
On Thu, Dec 29, 2011 at 2:58 AM, Brandon Wirtz <dra...@digerat.com> wrote:
>
> I would say GAE handles big data really well. But you have to do testing to
> make sure your structure is correct, and that your indexes are well thought
> out.

I think we are talking about two different things. I'm thinking of
Big Data like this:

http://en.wikipedia.org/wiki/Big_data

Typically characterized by:

* Large data volumes
* Batch updates
* Frequent need to analyze/sift through large quantities of data

The GAE datastore performs poorly in this regard. Map/reduce support
is anemic at best. Per-gigabyte storage is expensive. Raw I/O
performance is *dreadful*. Indexes consume excessive amounts of
space.

I love the GAE datastore; I think it's hands-down the Best Storage
Around for web applications that need scalability and availability.
But there's no way in hell I would use it to build a large-scale OLAP
system or any other kind of serious analytics product. You don't want
EC2 either. You need something like Hadoop on bare-metal hardware
with really fat I/O pipes. It will cost you a tiny fraction of what
you'd spend at Google and perform 10X better.

Jeff

Yohan Launay

Dec 29, 2011, 6:43:14 PM
to Google App Engine
Yep, the blobstore approach is a better solution IMHO, but costs will
still be prohibitive. I'm afraid there's not much you can do about that.
Initially I thought we could get an export of the datastore, download
it as a binary file and process it locally (like the local .bin used
for dev), but I'm dreaming out loud :-)
