|Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/27/11 5:43 PM|
Hi fellow developers, just a cautionary tale for the new members out
there and people building up large datasets.
We already know that the difference in reported datastore size between
the actual data and the total size is due to the indexes and various
voodoo stuff that the datastore does to keep our data safe. It
is even more relevant when you are trying to migrate your data out of
GAE or simply delete your data in bulk.
I was storing about 500 GB of data, which translated into > 2 TB of data in
the datastore (x4...). After spending days reprocessing most of this
data to remove the unused indexes (thus losing flexibility in my
queries, and costing me a few hundred dollars), it went down to 1.6 TB, still
costing me about $450 / month for storage alone. An important note is
that a lot of this data comes from individual small entities (about 1
billion of them), coming from reports and stuff. I don't deny that I
could have come up with a better design, and my latest codebase stores
the data in more efficient ways (aggregating into serialized Text or
Blobs), but I still have to make do with the v1 data set sitting there.
I started a migration of the data out of GAE into a simple MySQL
instance running on EC2. In reality, after migration, the entire
dataset only weighs < 150 GB (including indexes) in MySQL, so I have
no idea where the extra TB is coming from. The migration process was a
pain in the a** and took me 5 freaking weeks to complete. I tried the
bulk export from Python, which sucks because it only exports textual
data and integers but skips blobs and binary data (it seems they don't
learn base64 encoding at Google...). So I resorted to the remote API
after a quick email chat with Greg d'Alesandre and Ikai Lan, which
basically concluded with "sorry, cannot help, and remote api is not a
solution". Cool, then what is? The remote API is damn slow and
expensive: I basically had to read the entities one by one, store the
extracted file somewhere and process it on the fly, with backups and
failsafes everywhere, because the GAE remote API will just break from
time to time (due to datastore exceptions mostly). The extraction job
had to be restarted a couple of times because of cursors being screwed
up. So reading 1 billion entities from the datastore takes weeks and costs
a lot of dough. But then comes the axe: your data is still sitting on
GAE and you have to delete it. With 1 billion entries in the
datastore and a x3 / x4 write factor, it will cost you 2-3 k$ to empty
your dustbin. I seriously don't mind paying for datastore writes, but
having to pay $2000 to delete data that already costs me $450 / month
is seriously pushing it.
Every MySQL / NoSQL solution that I know of has some sort of flushing
mechanism that doesn't require deleting each entry 1 by 1. How come
the datastore doesn't? I am not paying the outrageous $500 / month for
support, but I'm paying far more in platform usage (I have an open
credit of $300 / day) and so far I haven't gotten any satisfying answer or
support from the GAE team. I love the platform, but seriously, knowing
what I know now, vendor lock-in has never rung so true as with GAE,
and I would not commit so much time and energy to GAE for my big/
serious projects, just leaving it to small quick and dirty jobs.
Please share and comment.
|RE: [google-appengine] Cautionary Tale: Abusive price for data migration and deletion||Brandon Wirtz||12/27/11 6:35 PM|
The cold hearted bastard in me has the following thoughts.
You wrote code that treated DataStore Like SQL.
So what I hear is "Whine, whine, whine, I built my stuff wrong, Google Tried
Did I miss anything?
Please share and comment.
|Re: [google-appengine] Cautionary Tale: Abusive price for data migration and deletion||Jeff Schnitzer||12/27/11 6:56 PM|
That's not quite fair. It's easy to get stuck in this trap.
Seems like there's a simple solution to deleting all the data, though:
|RE: [google-appengine] Cautionary Tale: Abusive price for data migration and deletion||Brandon Wirtz||12/27/11 8:33 PM|
I'd be less abusive if the title of the thread was less so.
"Cautionary tale: Building large Scale Data can cost lots if Datastore isn't
"Cautionary tale: Failure to be Scrap and Restart your DataStore when making
But I don't think that "abusive price" is accurate.
|Re: Cautionary Tale: Abusive price for data migration and deletion||André Pankraz||12/28/11 12:05 AM|
Sorry Brandon... he has a point: deleting data should be cheaper, even if it's technically the same as writing.
Maybe he made some mistakes, but you sometimes sound like a fanboy with GAE Stockholm syndrome. ;) See what I did here... annoying accusations.
You have very good experience with Python, cache stuff, edge cache etc., but do you really have experience with datastores of multiple 100 GB to talk like this?
E.g.: I have also seen some answers from you (often very helpful) that are just plain wrong in the Java environment.
|RE: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Brandon Wirtz||12/28/11 1:45 AM|
While the primary app I talk about is Edge Cache, that’s because it’s the thing people can benefit from most and don’t seem to be using.
As part of my SEO tools we have what is now a 60 TB database of Backlinks and Crawler data about websites in the top 200k Alexa Sites.
Why should Deleting be Cheaper? The Operation takes the same amount of CPU, and after you do the delete you don’t have to pay for storage.
I don’t do nearly as much in the Java space, but it doesn’t seem there should be much difference between Python and Java. I ported both the primary apps to both languages to do comparative cost analysis, and there have been a few things that we found were faster or cheaper with one or the other. As a result, in some cases we deploy both and use different versioning so they can both be live and attached to the same data.
|Re: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Leandro Rezende||12/28/11 4:48 AM|
|Re: Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/28/11 5:26 AM|
Although I agree with you that the original dataset wasn't fully
optimized (that was over 2 years ago), I believe that I have a good
understanding of datastore vs SQL, caching etc. I'm not building public
facing websites, I'm dealing with private APIs, and I am already
stretching memcache and a custom built Java cache to the limits.
I am also not talking about the reasons why I'm migrating out of GAE.
The points I highlighted were:
- no easy way to get your data out
- no cheap way to get your big data out
- bulk export in Python doesn't handle binary/blob data
- remote API is unstable
- running database queries using cursors for long periods of time is
unreliable (many times the cursor got reset for some reason, or the
query would return a 0000000 cursor, thus screwing 1 week of data)
- it cost me an arm to delete my data
To answer other questions:
- of course I thought about migrating the remaining data to a new app
and then aliasing from the old app to the new one. But it means interrupting
the service (disabling datastore writes) and I can't afford that. Plus
the remaining data is still quite big.
- the multi indexes: every time I changed the data structure I would
reprocess everything to conform it to the new schema. I'm not using any
framework like Objectify or JDO, I'm working with the raw API directly
(which is way more elegant)
- I'm not criticizing the platform, I am criticizing the lack of tools
to export and the prohibitive cost of manipulating large data sets. I
actually love GAE, it is just not for this kind of dataset, that's all.
@Brandon: If you have a way to delete 2 billion entities (whatever
their size) on the cheap please let me know.
|Re: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Jeff Schnitzer||12/28/11 11:41 AM|
It looks like you've discovered the hard way something that is not
wholly obvious at first: GAE is not good for Big Data.
The HRD is super-cool and perfect for building reliable web
You probably have the right idea moving to another platform. Use the
This isn't a scathing indictment of GAE so much as a realization that
|Re: Cautionary Tale: Abusive price for data migration and deletion||jon||12/28/11 9:25 PM|
Yohan I agree that there should be an easy and cheap way to get your
data out. I think it's a little unfair that leaving GAE is made that
How much did you spend on your custom data download tool? Would you
consider open sourcing it for other developers who are caught in the
same position? I'd hate spending weeks building a custom tool just to
get my data out.
Thanks for sharing your experience.
|Re: Cautionary Tale: Abusive price for data migration and deletion||Raymond||12/29/11 12:13 AM|
On my side I thank you for sharing your experience, I am beginning
with GAE and know that whatever the time I will put on this project I
will be making beginner mistakes and this kind of info is precious.
I now have limited experience with GAE and have to compare it with
what I know, and in some areas GAE looks very bad. For example, I can't
imagine Oracle, DB2, Informix, etc, ...MsSQL, etc having any
commercial success if they had not implemented rock solid
solutions to import and export data, back up, build and drop tables and
databases and of course calculate precisely the data space required to
build a data structure, in some cases down to the byte.
Although I understand the very different nature of GAE compared to
these traditional DB engines, I think that any professional developer,
IT manager, project manager, or person responsible for budget would
feel very uncomfortable building a system without a firm grip on its
costs or a reasonable solution to modify an initial implementation or
migrate away from it. Also, the fact that some of the GAE tools are
simply not reliable enough to plan the effort and time required
to do something is another big minus for this solution.
Although DB's are not my main competence, my very first paid job
20+years ago was to migrate a critical database to a new structure on
a new machine (HP 9000 unix), using a long forgotten database engine,
the first attempt using SQL took 1 week to migrate, the second using
low level C calls took months to develop and migrated in the required
3.5 hours, but the important thing to note is that It never crossed my
mind to question the reliability of the machine, the database or the C
calls I was making to the DB, it just worked, the Server could be
locked for minutes swapping to disk because of lack of memory or
overload, but it never failed once and repeated the exercise time and
time again, reliably and in a predictable timeframe.
All this said, there are advantages to GAE that make it worth fighting with
its limitations: I have not yet found anything else that is so
immediately and massively scalable and at the same time does not
require me to manage the software and hardware. This is invaluable,
and although I know that I could have an easier job moving to MySQL, I
just don't want to manage an OS and a DB engine, I don't have the
time, I have done it and don't think that's where I am going to earn
I will always envy some of the people answering your message for the
depth of knowledge they have of this platform and the fact that they
always have the right solution and right answer to everything, it must
be great to never make mistakes.
|RE: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Brandon Wirtz||12/29/11 12:31 AM|
Development is not about not making mistakes, it is about doing structured
performance testing and cost analysis.
My team writes 500 lines of code for every 50 that make it in to the final
We know things about the efficiencies of Do While vs. ForEach that quite
We just never let "mistakes" grow to the point we can't control them.
|Re: Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/29/11 1:41 AM|
I am actually still on Master-Slave. I would expect that using HRD
would have cost me even more.
Like you pointed out, I am indeed working on hybrid solutions now, not
leaving GAE in charge of everything.
|Re: Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/29/11 1:45 AM|
*cheap* is relative, I wouldn't mind receiving a hard drive from Google
with all my datastore on it for $X00, which would still have been
cheaper than the time, energy and money put into migrating out.
I built the tools myself: simple Java programs reading from the
datastore in batches of 30 entities and dumping them to disk, saving
the cursor and continuing from there. A few lines of code really, using
the Java remote API. The issue lies in error management, because the
datastore will break at least a few times a day due to high latency
and stuff (the same issues you see directly within GAE, but you experience
them remotely). So you continuously have to restart the job (manually or
not). That's where cursors are crucial, since there is no way to
iterate through the database in order. And if the cursor gets
corrupted, which happened to me 3 times in 5 weeks, you have to erase
everything you've done and start from scratch. Very frustrating...
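For anyone caught in the same position, here is a minimal sketch of that batch-and-checkpoint loop. It is written in Python rather than Java and it is not the tool described above. `fetch_batch`, `TransientError`, the integer cursor and the file layout are all hypothetical stand-ins (a real datastore cursor is an opaque string), and the in-memory source only simulates a datastore that fails now and then.

```python
import json
import os

BATCH_SIZE = 30  # batch size mentioned in the post


class TransientError(Exception):
    """Hypothetical stand-in for a datastore timeout or similar failure."""


def make_fake_source(total, fail_on_calls=()):
    """Build a hypothetical fetch_batch(cursor, size) over an in-memory range."""
    calls = {"n": 0}

    def fetch_batch(cursor, size):
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise TransientError("simulated datastore failure")
        start = cursor or 0
        rows = [{"id": i} for i in range(start, min(start + size, total))]
        return rows, start + len(rows)

    return fetch_batch


def save_checkpoint(path, cursor, offset):
    """Write the checkpoint to a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"cursor": cursor, "offset": offset}, f)
    os.replace(tmp, path)


def export(fetch_batch, out_path, ckpt_path, max_retries=5):
    """Dump entities as JSON lines, checkpointing after every batch."""
    cursor, offset = None, 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
        cursor, offset = state["cursor"], state["offset"]

    mode = "r+b" if os.path.exists(out_path) else "wb"
    retries = 0
    with open(out_path, mode) as out:
        # Discard anything written after the last good checkpoint.
        out.truncate(offset)
        out.seek(offset)
        while True:
            try:
                rows, next_cursor = fetch_batch(cursor, BATCH_SIZE)
            except TransientError:
                retries += 1
                if retries > max_retries:
                    raise
                continue
            retries = 0
            if not rows:
                break
            data = "".join(json.dumps(r) + "\n" for r in rows).encode("utf-8")
            out.write(data)
            out.flush()
            os.fsync(out.fileno())
            offset += len(data)
            cursor = next_cursor
            save_checkpoint(ckpt_path, cursor, offset)
```

The checkpoint stores the output file's byte offset next to the cursor, so a restart truncates any half-written batch instead of duplicating it. It does not help with the corrupted-cursor case described above. Keeping a few older checkpoints around might limit how far back you have to go, but that is a guess, not something tested against the real datastore.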
|Re: Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/29/11 1:52 AM|
Don't misunderstand me. GAE is a great tool, I seriously love it and
advocate it everywhere I go. But since my last experience I would
recommend a hybrid solution instead of full steam GAE. At least for
data gathering and processing.
The cost structure is quite clear:
$1 / 1 million writes
Each entity write / delete = at least 2 writes (entity + key) + N
writes for the indexes (+ maybe replication, I don't know if that's
So if you have 100 million entities that's an easy $200-300 to delete
it. And believe me it is really easy to generate that many entities
when your app processes 1500 req/s, even by aggregating you are still
limited to 1MB / entity member (but i don't like to play near the
bytes limits due to potential serialization overhead so i won't store
more than 75-80% of 1MB / entity member).
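To make that arithmetic concrete, here is a small sketch of the estimate. The $1 per million writes and the "2 writes plus N index writes" rule are the figures quoted in this thread, not checked against the official price list. The function and parameter names are made up for illustration.

```python
PRICE_PER_MILLION_WRITES = 1.00  # USD, figure quoted in this thread


def delete_cost(entities, index_writes_per_entity, base_writes=2,
                price_per_million=PRICE_PER_MILLION_WRITES):
    """Estimated USD cost of deleting `entities` one by one.

    Assumes each delete is billed as base_writes (entity + key) plus
    one write per index entry, as described in the post.
    """
    writes = entities * (base_writes + index_writes_per_entity)
    return writes * price_per_million / 1_000_000


# 100 million entities with 0 or 1 index writes each
print(delete_cost(100_000_000, 0))  # 200.0
print(delete_cost(100_000_000, 1))  # 300.0
```

With the same assumptions, 1 billion entities at a total of 3 writes each comes to $3000, which is in line with the 2-3 k$ range mentioned at the start of the thread.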
I believe that the datastore (or even GAE memcache) doesn't offer a
simple flush mechanism because the entire platform is shared among
multiple (all?) the apps and the way data is stored doesn't allow for
a simple flush. (counting is a different matter). I just hope that the
GAE team will read my article and maybe lower the price of deletion a
|Re: Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/29/11 1:59 AM|
Well, I started using GAE simply because 2 years ago I was a tech team
of 1 and I couldn't afford to hire full time sysadmins. I'm migrating
some of my stuff out now that I have more guys to help me. And GAE is
a great platform that runs on its own and doesn't require much
administration (I launched games and apps on it that just run for
months with no major issues). So great for starting up. But as soon as
you enter the big data domain, you need more control over the way you
can process and move your data around (the big companies all have
their own datacenters because they need full control over the
infrastructure) and thus a PaaS may not be suitable anymore.
It's hard to plan for your business growing 10x within a few months
and the tech infrastructure suddenly having to grow from 50 req/s to 5,000
req/s. BTW GAE can't handle such load well (latency of min 500ms on
Java seriously sucks, not to mention write contention on the
datastore). It is easy to plan when everything can be defined in
advance (with budgets and stuff) but you don't always have the option.
But thanks for sharing your inputs anyway, always appreciated ;)
|RE: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Brandon Wirtz||12/29/11 2:58 AM|
If you check the archives I have shared times when my requests were well
I would say GAE handles big data really well. But you have to do testing to
Planning is always possible. Testing is always possible. But like driving
Or like the discussion about executing code from students.
GAE is cycles on demand, so if you can build your app to be efficient it is
I recently found I could knock 3% off of my bill by disabling logging.
You say you can't predict growth. Sure I can. I either engineer something to
Things that are designed for you and your friends you don't market, you
I think the real lesson I'm trying to convey is one I learned at MSFT. For
|Re: Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/29/11 3:40 AM|
Interesting story, but you rarely design Facebook for 500 million
people right from the start and alone...
Anyway, I would love to know how much it would cost you and how long
you would need to get your data out of your super/big apps.
|RE: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Brandon Wirtz||12/29/11 11:41 AM|
Did you see the thread about the push the button check back in 48 hours?
Though to be fair on RDS we just did a data dump to move to a new system
Data migration over the internet is tough when you get above 1 TB. And
|Re: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Jeff Schnitzer||12/29/11 12:39 PM|
Just a thought (and it would probably be expensive) but perhaps you
should do a two-phase export strategy:
1) Export data into Blobstore as very large blobs
The export can run at Map/Reduce speeds... as fast as you want to pay
|RE: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Brandon Wirtz||12/29/11 12:54 PM|
I haven't written code to do it, but I had been thinking about writing stuff
that serializes entities into blobs, zip-compresses them and puts them in the
Blobstore, then sucks down the blobs later.
This was also what I was thinking about for a backup strategy.
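A rough sketch of the packing half of that idea, under stated assumptions: entities are plain dicts, binary fields are base64-encoded so they survive JSON, and zlib (DEFLATE) stands in for "zip". The `__b64__` marker, the function names and the 1000-entity chunk size are all made up for illustration. Writing the resulting bytes into the Blobstore is not shown.

```python
import base64
import json
import zlib


def encode_entity(entity):
    """Make one entity JSON-safe by base64-encoding its bytes values."""
    out = {}
    for key, value in entity.items():
        if isinstance(value, (bytes, bytearray)):
            encoded = base64.b64encode(bytes(value)).decode("ascii")
            out[key] = {"__b64__": encoded}
        else:
            out[key] = value
    return out


def decode_entity(record):
    """Reverse encode_entity, turning marked values back into bytes."""
    out = {}
    for key, value in record.items():
        if isinstance(value, dict) and set(value) == {"__b64__"}:
            out[key] = base64.b64decode(value["__b64__"])
        else:
            out[key] = value
    return out


def _compress(lines):
    return zlib.compress("\n".join(lines).encode("utf-8"))


def pack(entities, max_entities_per_blob=1000):
    """Yield compressed chunks of JSON lines, one chunk per future blob."""
    chunk = []
    for entity in entities:
        chunk.append(json.dumps(encode_entity(entity)))
        if len(chunk) == max_entities_per_blob:
            yield _compress(chunk)
            chunk = []
    if chunk:
        yield _compress(chunk)


def unpack(blob):
    """Decompress one chunk back into a list of entity dicts."""
    text = zlib.decompress(blob).decode("utf-8")
    return [decode_entity(json.loads(line)) for line in text.split("\n")]
```

Packing many small entities into one blob means the later download is a handful of large reads instead of a billion small ones, which is the point of the two-phase approach. It does not avoid the cost of reading every entity once inside GAE.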
|Re: [google-appengine] Re: Cautionary Tale: Abusive price for data migration and deletion||Jeff Schnitzer||12/29/11 12:55 PM|
I think we are talking about two different things. I'm thinking of
Typically characterized by:
* Large data volumes
The GAE datastore performs poorly in this regard. Map/reduce support
I love the GAE datastore, I think it's hands-down the Best Storage
|Re: Cautionary Tale: Abusive price for data migration and deletion||Yohan||12/29/11 3:43 PM|
Yep, the blobstore approach is a better solution IMHO, but costs will
still be prohibitive. I'm afraid there is
nothing much you can do about that. Initially I thought that we could
get an export of the datastore, download it as a binary file and
process it locally (like the local .bin used for devs) but I'm
dreaming out loud :-)