Cost of mapreduce was $6,500 to update a ListProperty on 14.1 million entities


Petey

Jan 4, 2012, 8:41:27 PM
to Google App Engine
Before the pricing change it cost us about $100.

I understand the prices need to go up, but I don't think this is even
what App Engine had in mind when they changed the numbers. My guess
for how much this cost Google is about $5 on their end.
The main reason why this cost so much of course is because they now
charge per write and our list property, which has about 18 values per
entity, is indexed.

If there was some way to set a lower priority or rate on the writes,
and thus cost us less money, that would be great. We don't necessarily
need those writes to go through instantly and at least with instances
we can make sure to spread out our processing to save us instance
cost. With writes there is no low priority option. It's sad because
with regular servers you don't get charged for writes, just for how
much processing power/memory it takes to do something in a certain
amount of time.

We are going to have to update these values every once in a while and
can't afford for it to cost us thousands of dollars every time we need
to change some data.

Please help. We love App Engine, except for this issue.

Nickolas Daskalou

Jan 5, 2012, 2:14:54 AM
to google-a...@googlegroups.com
Is this something you could move into Google Cloud SQL (http://code.google.com/apis/sql/) once it's up and running? You could request beta access and take it for a test run.

Nick



--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.


Brian Peterson

Jan 5, 2012, 2:30:05 AM
to google-a...@googlegroups.com
There's no guarantee that the pricing for that with our data and usage needs will be any cheaper. Plus, storage in the preview is limited to only 10GB, and we don't want to move to something that isn't stable yet.

Richard Watson

Jan 5, 2012, 9:54:29 AM
to google-a...@googlegroups.com
A couple thoughts.

Maybe the GAE team should borrow the idea of spot prices from Amazon. That's a great way to have lower-priority jobs that can run when there are instances available. We set the price we're willing to pay, if the spot cost drops below that, we get the resources. It creates a market where more urgent jobs get done sooner and Google makes better use of quiet periods.

On your issue:
Do you need to update every entity when you do this? How many items on the listproperty need to be changed? Could you tell us a bit more of what the data looks like?

I'm thinking that 14 million entities x 18 items each is the number of entries you really have, each distributed across at least 3 servers and then indexed. That seems like a lot of writes if you're re-writing everything. It's likely a bad idea to rely on an infrastructure change to fix this (recurring) issue, but there is hopefully a way to reduce the number of writes you have to do.

Also, could you maybe run your mapreduce on smaller sets of the data to spread it out over multiple days and avoid adding too many instances? Has anyone done anything like this?

sb

Jan 5, 2012, 10:32:33 AM
to Google App Engine
Google Cloud SQL looks interesting.

From
http://code.google.com/apis/sql/faq.html#cost
"We will give you at least 30 days’ advance notice before we begin
billing in the future."

30 days is not enough notice to respond to changes/decisions that may
be made.


On Jan 5, 2:14 am, Nickolas Daskalou <n...@daskalou.com> wrote:
> Is this something you could move into Google Cloud SQL (http://code.google.com/apis/sql/) once it's up and running? You could
> request beta access and take it for a test run.
>
> Nick
>

Petey

Jan 5, 2012, 1:08:16 PM
to Google App Engine
In this one case we had to change all of the items in the
listproperty. In our most common case we might have to add and delete
a couple items to the list property every once in a while. That would
still cost us well over $1,000 each time.

Most of the reason for this type of data in our product is to
compensate for the fact that there isn't full text search yet. I know
they are beta testing full text, but I'm still worried that it also
might be too expensive per write.

Ikai Lan (Google)

Jan 5, 2012, 2:58:58 PM
to google-a...@googlegroups.com
Brian (apologies if that is not your name),

How much of the costs are instance hours versus datastore writes? There's probably something going on here. The largest costs are to update indexes, not entities. Assuming $6500 is the cost of datastore writes alone, that breaks down to:

~$0.00046 per entity put

Pricing is $0.10 per 100k operations, so that means using this equation:

(6500.00 / 14000000) / (0.10 / 100000)

You're doing about 464 write operations per put, which roughly translates to 6.5 billion writes. 

I'm trying to extrapolate what you are doing, and it sounds like you are doing full text indexing or something similar ... and having to update all the indexes. When you update a property, it takes a certain amount of writes. Assuming you are changing String properties, each property you update takes this many writes:

- 2 indexes deleted (ascending and descending)
- 2 indexes updated (ascending and descending)

So if you were only updating list properties, that works out to roughly 100 list property values changed per entity (464 / 4 ≈ 116).
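
The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the calculation as stated in this post: the variable names are illustrative, and the inputs (a $6,500 bill, 14 million entities, $0.10 per 100k write operations, 4 index writes per changed value) are the figures quoted in the thread, not independently verified.

```python
# Figures quoted in the thread (assumptions, not verified billing data).
total_cost_usd = 6500.00
entities = 14000000
price_per_write_op = 0.10 / 100000   # $0.10 per 100k write operations

cost_per_put = total_cost_usd / entities
write_ops_per_put = cost_per_put / price_per_write_op
total_write_ops = write_ops_per_put * entities

# Per changed value: 2 index rows deleted + 2 index rows written.
writes_per_changed_value = 4
changed_values_per_put = write_ops_per_put / writes_per_changed_value

print("cost per put:           $%.6f" % cost_per_put)
print("write ops per put:      %.0f" % write_ops_per_put)
print("total write ops:        %.2e" % total_write_ops)
print("changed values per put: %.0f" % changed_values_per_put)
```

With these inputs the script gives about 464 write operations per put and about 116 changed values per put, which is where the "roughly 100" estimate comes from.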

Given that this is a regular thing you need to do, perhaps there is an engineering solution for what you are trying to do that will be more cost effective. Can you describe why you're running this job? What features does this support in your product?

--
Ikai Lan 
Developer Programs Engineer, Google App Engine




Iván Rodríguez

Jan 5, 2012, 3:48:40 PM
to google-a...@googlegroups.com
I think your problem is similar to mine.

http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74

Ikai, could you please explain how many write ops we should expect when updating an indexed list property by adding X items to the list?

For example

Modeling (Objectify annotations)

@Entity
class RelationIndex {
    @Parent
    Key<User> ownerKey;
    @Indexed
    List<Key> receiverKeyList;
}

Define

X = number of new items to add to the list.
Y = number of entities to update (same entity group), 1 indexed list property per entity
Z = number of items in the list before the update.


Magic calculator 

Total write ops = Y * ???? 





Amy Unruh

Jan 5, 2012, 4:00:49 PM
to google-a...@googlegroups.com
Iván,

2012/1/6 Iván Rodríguez <ivan....@gmail.com>

> I think your problem is similar to mine.
>
> http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74
>
> Ikai, could you please explain how many write ops we should expect when updating an indexed list property by adding X items to the list?

This page can help you work out the costs for your particular entities and indexes:
E.g., it details the costs for the different datastore operations given an entity's properties and indexes.

Amy Unruh

Jan 5, 2012, 4:05:35 PM
to google-a...@googlegroups.com
On Fri, Jan 6, 2012 at 8:00 AM, Amy Unruh <amyu+...@google.com> wrote:
> Iván,
>
> 2012/1/6 Iván Rodríguez <ivan....@gmail.com>
>> I think your problem is similar to mine.
>>
>> http://groups.google.com/group/google-appengine-java/browse_thread/thread/1ace5bd8658d89d/a62d0b3f2b3c4e74#a62d0b3f2b3c4e74
>>
>> Ikai, could you please explain how many write ops we should expect when updating an indexed list property by adding X items to the list?
>
> This page can help you work out the costs for your particular entities and indexes:
> E.g., it details the costs for the different datastore operations given an entity's properties and indexes.

See this as well: http://code.google.com/appengine/docs/python/datastore/entities.html#Understanding_Write_Costs , which discusses multi-value properties.  These can lead to expensive indexes.

 -Amy

Corey [Firespotter]

Jan 5, 2012, 6:24:22 PM
to Google App Engine
I work with Petey on this and can help clarify some of the details.

The Entities:
We have a lot of entities (~14 million), each of which has a
StringListProperty called "geoboxes". Like so:

from google.appengine.ext import db
from google.appengine.ext import search

class Place(search.SearchableModel):
    name = db.StringProperty()
    ...
    # Location specific fields.
    coordinates = db.GeoPtProperty(default=None)
    geohash = db.StringProperty()
    geoboxes = db.StringListProperty()  # indexed list of geobox strings

Background (details on geoboxing at bottom):
We're running a mapreduce to change the geobox sizes/precision for a
large number of entities. These entities currently have a 'geoboxes'
StringListProperty with ~20 strings. For example:
geoboxes = [u'37.341|-121.894|37.339|-121.892',
            u'37.341|-121.892|37.339|-121.891', ...]
We are changing those 20 strings to 20 new strings. Example:
geoboxes = [u'37.3411|-121.8940|37.3395|-121.8926',
            u'37.3411|-121.8929|37.3395|-121.8916', ...]

The Cost:
We did almost this same mapreduce when we first added the geoboxes
back in July. In that case we were populating the list for the first
time so we can assume half as many operations were required (no
removing of old values). Total cost in July was ~$160 for the CPU
time.

When we ran the mapreduce again this week to change the box sizes the
cost was $18 for Frontend Instance Hours, $15 for Datastore Reads
(21 million) and $2,500 for Datastore Writes (2,500 million). This was not a
complete run of the mapreduce. We aborted it after 5.4 million (38%) of
the entities were updated. Hence Petey's estimate that the full
update would cost $6,500.

The Operations:
Each entity update is removing ~20 existing strings from the geoboxes
StringList and adding 20 more. The geobox property is indexed (and
has to be) and is involved in 3 composite indexes so as best I
understand it this means each string change results in 10 writes (4 +
2 * 3). So on every entity we update the geoboxes we perform 401
write operations (1 + 10 * 40).

This agrees pretty well with the charges (2,500,000,000 ops /
5,424,000 entities) = 460 ops per entity.
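
The estimate above can be written out as a small Python sketch. It simply follows the reasoning in this post (1 write for the entity, plus 4 writes per changed value and 2 more per composite index, counted once for each removed and each added string); the function name and parameters are illustrative and this is not an official cost formula.

```python
def writes_per_entity(values_removed, values_added, composite_indexes,
                      writes_per_value=4, writes_per_composite=2):
    """Estimate write ops for one entity put, following the post's reasoning."""
    per_change = writes_per_value + writes_per_composite * composite_indexes
    return 1 + per_change * (values_removed + values_added)

# 20 strings removed, 20 added, geoboxes used in 3 composite indexes.
estimated = writes_per_entity(20, 20, 3)

# Observed: billed write ops divided by entities updated before the abort.
observed = 2500000000 / 5424000.0

# $2,500 was spent on writes after 38% of the entities were processed.
projected_full_cost = 2500 / 0.38

print("estimated ops per entity: %d" % estimated)
print("observed ops per entity:  %.0f" % observed)
print("projected full run:       $%.0f" % projected_full_cost)
```

The estimate (401) lands below the observed figure (about 461), so the model is an approximation at best, but it is close enough to show that index writes dominate the bill.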

That's a lot of writes and likely the core of the surprising cost.
However, I'm not sure how we could avoid that with App Engine (open to
ideas!), and since we could pay for dedicated servers for that amount,
I think the pricing is probably off as well.

Even if we treat the geobox update as a one-time cost, we have other
properties like scores, labels, etc that require occasional tweaking.
Updating even a single indexed property across all these entities
costs us $60-$100 and typically many times that in practice because
these interesting fields tend to be used in composite indexes.

-Corey

Geoboxing Details
Geoboxing is a technique used to search for entities near a point on
the earth in a database that can only perform equality queries (like
App Engine). In short, you break up the world into boxes and record
which box each entity belongs to as well as any nearby boxes. Then
you break up the world into larger boxes and repeat until you have a
good range of sizes covered.
There's a good article on the logic of the algorithm here:
http://code.google.com/appengine/articles/geosearch.html
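
A minimal sketch of the idea in Python, assuming a simple scheme that is not taken from the article: each box is written as "north|west|south|east", and for every box size we record the containing box plus its 8 neighbours. The function names, the neighbour count, and the chosen sizes are all hypothetical; the real application stores about 20 strings per entity and may pick its boxes differently.

```python
from decimal import Decimal, ROUND_FLOOR

def box_string(south, west, size, places):
    """Format one box as 'north|west|south|east' with fixed decimal places."""
    north, east = south + size, west + size
    return "|".join(format(v, ".%df" % places)
                    for v in (north, west, south, east))

def geoboxes(lat, lon, resolutions):
    """List the containing box and its 8 neighbours at each resolution.

    lat, lon: coordinates as strings (Decimal avoids float rounding).
    resolutions: (size_in_degrees, decimal_places) pairs, e.g. ("0.002", 3).
    """
    lat, lon = Decimal(lat), Decimal(lon)
    boxes = []
    for size, places in resolutions:
        size = Decimal(size)
        # Grid row/column of the box that contains the point.
        row = (lat / size).to_integral_value(rounding=ROUND_FLOOR)
        col = (lon / size).to_integral_value(rounding=ROUND_FLOOR)
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                boxes.append(box_string((row + d_row) * size,
                                        (col + d_col) * size,
                                        size, places))
    return boxes

boxes = geoboxes("37.3405", "-121.8931", [("0.002", 3), ("0.02", 2)])
```

A "near this point" search then becomes an equality filter: compute the box string for the query point at the desired size and look for entities whose list contains it. Changing the box sizes changes every stored string, which is why the re-index described above touches every value.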

Yohan Launay

Jan 5, 2012, 6:57:06 PM
to Google App Engine

Hi,

I feel your pain. It cost me a few thousand dollars to delete my
millions of entities from the datastore after a migration job (Ikai never
replied to my post, though...) and I'm still paying since the deletion is
not completed yet (spending $100-300 a day for the past 2 weeks
now!!). I'm not doing much, just running the "delete all" mapreduce job
from the admin panel.

There is totally something wrong with the way datastore writes are
priced and Google should seriously do something about it before they
lose their big customers (i.e. the ones affected by this problem).

It is simply too costly to go through your data to change an index or
update stuff or delete your data. And in your case (like mine) even if
you want to take your data out to externalize
your custom search and storage it will cost you $X,000+ to take it out
and another $XX,000 to clean up behind you (you seem to have a lot of
indexed properties in your dataset).

Please keep me posted on how things go with you, as I'm still hoping I
can get some credit/refund/assistance from Google at this stage,
although I haven't heard from them.




de Witte

Jan 5, 2012, 7:35:38 PM
to google-a...@googlegroups.com
What if you disable the app for maintenance, doing the following steps:
- make sure the users can't access the app.
- delete all indexes.
- redeploy a subset of your application for updating the 15 million entities.
- redeploy the full application.
- rebuild the indexes.
- open access to your application.

A rebuild of an index is likely much cheaper in cost than updating one for each write.

-Wendel

salim

Jan 5, 2012, 8:08:47 PM
to Google App Engine
Our pricing went up by 5X! This is after the 50% discount, so it would
have been 10X. I had to disable a bunch of stuff and we are trying to find
a way to move our site away.
Here is a quick graph of our pricing change:

https://plus.google.com/114790424055754975707/posts/eUMhYDVf6i5

salim

Jan 5, 2012, 8:10:39 PM
to Google App Engine
I just don't think Google can do that; our price was increased by 5X!
We are stuck right now and need to re-engineer the entire site
to move it away. I was considering seeing a lawyer regarding this.
Perhaps if there are enough of us we can do it all together.

https://plus.google.com/114790424055754975707/posts/eUMhYDVf6i5

Jon Stevens

Jan 5, 2012, 8:23:30 PM
to google-a...@googlegroups.com
For our application, we used Geohashing based on this article: http://code.google.com/apis/maps/articles/geospatial.html

It is a slightly different twist on the indexes which I think would prevent you from having to re-index every time you want to change the precision. 

"Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision)."

Instead of an indexed property that looks like this (what you currently have):

[u'37.3411|-121.8940|37.3395|-121.8926', u'37.3411|-121.8929|37.3395|-121.8916', ...] 

We have this...

/** Geocells in which this entity resides */
@Index
List<String> cells;

Which looks like this:

[8, 8f, 8f1, 8f12, 8f12a, 8f12ac, 8f12ac6, 8f12ac60, 8f12ac605, 8f12ac605f, 8f12ac605fb, 8f12ac605fb3, 8f12ac605fb34]

The query then looks like this:

List<String> cells = [compute list of cells from a lat/lng, there is code out there to do that]
Query<Entity> query = ofy.load().type(Entity.class).filter("cells in", cells);
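
For illustration, the prefix list shown above can be derived from the most precise cell with a one-line helper. This is a sketch in Python rather than the Java used in this post, and `cell_prefixes` is a hypothetical name, not part of any geocell library.

```python
def cell_prefixes(cell):
    """Return every prefix of a geocell string, shortest first."""
    return [cell[:i] for i in range(1, len(cell) + 1)]

cells = cell_prefixes("8f12ac605fb34")
print(cells)
```

Because every coarser precision is already stored as a shorter prefix, querying at a different precision only changes which prefix you search for; the stored values stay the same. The trade-off is that each entity still carries one indexed value per precision level.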

hope that helps,

jon

Vivek Puri

Jan 5, 2012, 10:28:26 PM
to Google App Engine
I also have a table with 1.5TB of data. I need to truncate it but don't
want to pay thousands to delete data (I paid thousands under the old
pricing model for another table, and I'm not sure how much more it will cost
now), while I pay hundreds for the data to be there. The App Engine team
really needs to have a cheaper way to delete data.

Andrin von Rechenberg

Jan 6, 2012, 5:10:35 AM
to google-a...@googlegroups.com
I find Google's posted solution quite suboptimal as it is too expensive.

There is a simple trick to get rid of this problem. Instead of indexing
all geocells in a StringListProperty you only index the most detailed
cell: instead of [7, 7e, 7e3, 7e3a, 7e3a4] you only index "7e3a4"
converted to an int64. To search you do range scans. Finding all
items in the cell 7e3 is a range scan like
"geohash >= 7e3 and geohash < 7e4"

I have a library in Python that does all this, plus some more performance
tricks like merging 2 cells next to each other into a single range scan,
etc. I found that my solution performs a tiny bit better and is
much cheaper because I don't need a StringListProperty in my index but
just a simple IntegerProperty. Of course my solution has one major
drawback: you cannot do additional inequality searches, since my
range scans already use the inequality (but you can still do bucketing to
solve this issue), and of course you can do additional filters.
If enough people are interested in my solution I'll open source it.
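
A rough sketch of the range-scan idea, assuming the cells are fixed-length hexadecimal strings as in the example above. This is not the library mentioned in this post; `prefix_range` is a hypothetical helper written only to show the integer bounds.

```python
def prefix_range(prefix, full_length):
    """Return (low, high) such that low <= value < high matches every
    full_length-digit hex cell that starts with prefix."""
    shift = 4 * (full_length - len(prefix))   # 4 bits per hex digit
    low = int(prefix, 16) << shift
    high = (int(prefix, 16) + 1) << shift
    return low, high

# Stored once per entity: the most detailed cell as an integer.
cell_value = int("7e3a4", 16)

# Query bounds for "everything inside cell 7e3".
low, high = prefix_range("7e3", 5)
print(low <= cell_value < high)
```

The datastore query would then filter on `geohash >= low` and `geohash < high` against a single integer property, so each entity carries one indexed value instead of one per precision level.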

Cheers,
-Andrin

Richard Watson

Jan 6, 2012, 10:47:54 AM
to google-a...@googlegroups.com
What if you had the gps data as children of each entry and then used a keys-only query to match, and then fetch the parents. I forget the technique's name, maybe someone else remembers.  The benefit is that when you need to edit gps coords you leave the parent alone. Data in the parent isn't duplicated and all changes only happen to the children. No parent data is re-indexed so you reduce datastore charges on updates. I'm not 100% sure it'd help but it might be worth testing.

Also, I don't know the order in which you're going through your data, but there could be a hot tablet issue on indexes if you're changing lots of closely ordered data as you go. See http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/ - not sure if it applies in this case, but it's possible you might get some weird side effects if things are being retried, etc.

Aside from that, if I were you I'd pay up for a premium account (at least for a few months, not sure if that can be done) if you don't have one already and the team isn't helping you offline. Hopefully they are!

Jeff Schnitzer

Jan 6, 2012, 2:59:45 PM
to google-a...@googlegroups.com
On Fri, Jan 6, 2012 at 7:47 AM, Richard Watson <richard...@gmail.com> wrote:
> What if you had the gps data as children of each entry and then used a
> keys-only query to match, and then fetch the parents. I forget the
> technique's name, maybe someone else remembers.  The benefit is that when
> you need to edit gps coords you leave the parent alone. Data in the parent
> isn't duplicated and all changes only happen to the children. No parent data
> is re-indexed so you reduce datastore charges on updates. I'm not 100% sure
> it'd help but it might be worth testing.

This shouldn't help. Re-putting an entity won't cause index updates if
the indexed values don't change. The "relation index entity" pattern
is only useful when you have very large #s of index items (many
thousands). You wouldn't want to do it for 20 short strings.

Jeff

George

Jan 6, 2012, 11:50:51 AM
to Google App Engine
Corey,

Did you guys consider something along the lines of SimpleGeo to
outsource your spatial stuff?

Is there a political or philosophical reason to keep everything inside
of GAE?

-- George




Corey

Jan 6, 2012, 6:44:46 PM
to Google App Engine
Wow! Thanks very much to everyone who posted suggestions and to those
who sent me direct replies.

First, please let me say that I believe it's in Google's best interest
and my company's for us to keep core portions of our application in
AppEngine. Furthermore, given the amount of investment we have in
this infrastructure, I intend to pursue any avenues we have to
continue using AppEngine. I believe that with Google's help we can
find an engineering solution and/or pricing model that allows for both
the platform and its customers to be successful.

Ok. That said, let me summarize and respond to some of the concerns/
recommendations above:

[NickolasD] Is this something you could move into Google Cloud SQL?
Yes, but it's not clear what the pricing model will be for CloudSQL
and whether it will be any cheaper than AppEngine.

[RichardW] Maybe the GAE team should borrow the idea of spot prices
from Amazon.
Love this idea. It would serve to spread out resource usage,
provide market pricing, and benefit all involved.

[RichardW] Maybe run your mapreduce on smaller sets of the data to
spread it out over multiple days and avoid adding too many instances?
As detailed above, the costly component here is the database
operations charge wrt large datasets and indexed properties.

[sb] Google Cloud SQL looks interesting, but 30 days is not enough
notice to respond to changes/decisions that may be made.
Totally agree. I get 45 days notice on my rent increases and it
takes far less effort for me to change apartments.

[de Witte] What if you disable the app for maintenance, doing the
following steps...
Really interesting suggestion! Would love to hear if someone's
tried this.
a) We'd really like to avoid turning off the app for the 1-2 days it
would take to create the indexes.
b) I'm not certain a rebuild would be any cheaper. If it is, that's
probably an unintentional pricing discrepancy that I'd prefer not to
rely on.

[JonS] For our application, we used Geohashing.
We used geohashing before geoboxing, but it didn't work for us.
a) AppEngine requires that a query have at most one inequality
comparison, and geohashing uses it.
b) We found that geohashing queries were much slower than geoboxing
for the same parameters, adding human noticeable delay (>400ms).

[VivekP] I have a table with 1.5TB of data. It costs me tens of
thousands (one-time) to delete it and a few thousand (per year) to
keep it.

[Andrin] I use a version of geohashing which only uses the most
precise value.
Our geohash does the same thing, but the above limitations still
exist.

[RichardW] What if you had the gps data as children of each entry and
then used a keys-only query to match?
Love the suggestions! We thought about doing something like this as
well, but we'd still have one entity per StringListProperty.
That's not so bad, but we'd also need to copy down other properties
so we could restrict the query based on other values on the entity.
For example: Finding entities within a certain distance sorted by
popularity or filtered by user.
I'm not certain there would be cost benefits but I am certain it
would add substantial complexity to the data + app.

[IkaiL] please describe the engineering details and business purpose
Provided in an earlier post. Do you need any more info? Any
thoughts? We tried registering for Premier support a few days ago but
haven't heard back yet.

Thanks again,
-Corey

blackpawn

Jan 7, 2012, 12:16:05 AM
to Google App Engine
I'm in the same boat and glad to hear I'm not alone! It's way too
expensive to delete things right now, it makes me afraid to add any
more data to GAE. :-/

HOCINE BENFERHAT

Jan 7, 2012, 7:28:23 PM
to Google App Engine
How about buffering the index changes, so only one index update needs
to be written?

Corey

Jan 8, 2012, 6:02:09 PM
to Google App Engine
Johnson, could you please elaborate a bit? Are you suggesting that
AppEngine provide a way to buffer the index updates or that there's
already a way to do that?

Corey

Jan 12, 2012, 8:40:40 PM
to Google App Engine
Ikai Lan, any suggestions or are we just SOL?