CPU accounting for Datastore writes and pricing feedback


diomedes

Feb 14, 2009, 6:18:03 PM
to Google App Engine
Hi all,

I have a question regarding the planned pricing and how it relates to
the relatively high API CPU cost associated with each write (400-500 ms).

I started working on my first GAE app about a month ago, and overall
I am both excited and very satisfied. My initial plan was to run it on
Amazon's EC2, but eventually I took the plunge, started learning
Python, moved to GAE, and (almost) never looked back :-)

My app, a Cacti-like web service that monitors the performance (think
response time) of a website using Google Analytics-like beacons, is
rather resource-demanding. On top of that, GAE best practices imply
that any expensive reports, aggregates, etc. should be precalculated
and stored instead of produced dynamically on demand. All of that
results in many writes, and given that the simplest write (a single
key-value pair, no indexes) gets "charged" approximately 500 ms of API
CPU time (see the related thread by Matija:
http://groups.google.com/group/google-appengine/browse_thread/thread/9db986d7ea3ff901/0ad0767787d67a97),
a normal DB design that would have been cost-effective on EC2 becomes
impossible on GAE.

Because I am a Google-aholic, I decided to change the app design to
minimize writes: I fetch a bunch of pickled data as a blob, update it
in memory, and write it back as a blob (just like people did before
DBs came along :-) )
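A minimal sketch of that pattern, with a plain dict standing in for the datastore entity (the names `fake_datastore`, `load_records`, and `save_records` are made up for illustration; in the real app the blob would live in a single entity's blob property):

```python
import pickle

# Stand-in for the datastore: maps a key name to one blob of bytes.
fake_datastore = {}

def load_records(key):
    """Fetch one blob and unpickle it into a list of records (one read)."""
    blob = fake_datastore.get(key)
    return pickle.loads(blob) if blob is not None else []

def save_records(key, records):
    """Pickle the whole list and store it back as one blob (one write)."""
    fake_datastore[key] = pickle.dumps(records)

# One datastore write now covers many logical records:
records = load_records("site-123")
records.append({"ts": 1234567890, "response_ms": 412})
records.append({"ts": 1234567895, "response_ms": 388})
save_records("site-123", records)
print(len(load_records("site-123")))  # -> 2
```

The trade-off is that every update pays for unpickling and repickling the whole blob, so the blob size has to stay bounded.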
Before I commit to that design, I wanted confirmation that my
understanding is correct:
- Google is going to charge 10-12 cents per CPU-hour, and that will
include all the CPU used by API calls etc.
(http://googleappengine.blogspot.com/2008/05/announcing-open-signups-expected.html)
- This means that if your site does 10M pageviews a month and does a
couple of writes per pageview at 500 ms per write, that is 10M CPU-
seconds/month just from the writes, i.e. 10M/3600 hours * $0.10/hr ≈
$280/mo just from the writes.
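As a back-of-the-envelope check, the arithmetic above (using only the figures stated in the post: 10M pageviews/month, 2 writes per pageview, 0.5 CPU-seconds charged per write, $0.10 per CPU-hour) works out like this:

```python
# All inputs are the assumptions from the post, not confirmed prices.
pageviews_per_month = 10 * 1000 * 1000
writes_per_pageview = 2
cpu_sec_per_write = 0.5
price_per_cpu_hour = 0.10

cpu_seconds = pageviews_per_month * writes_per_pageview * cpu_sec_per_write
cpu_hours = cpu_seconds / 3600.0
monthly_cost = cpu_hours * price_per_cpu_hour
print("$%.2f/month" % monthly_cost)  # $277.78/month, i.e. the ~$280 quoted
```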

Is this correct?

For the record, I find Google's planned pricing extremely attractive
compared to Amazon's, primarily because Amazon charges 10c per
CPU-hour of the machine while Google will charge 10c per CPU-hour
*actually used* by your requests. This makes a huge difference: a
server running at 50+% capacity (that's rather aggressive, but with an
Amazon/RightScale combination you can be aggressive) will still use
less than 20% of its CPU during that time. However, when comparing the
cost of writes between Google and the corresponding [high-CPU EC2
server + elastic storage] combo (able to provide considerably more
than 20-50 "simple" writes per second), Amazon is much cheaper than
Google.

Ok, that's all I had to say.
Sorry for the rather long post; looking forward to hearing comments.

Ah, and thank you very, very much for lifting the high-CPU quota!!

Diomedes

diomedes

Feb 17, 2009, 12:01:16 AM
to Google App Engine
Ok,

(A bit bummed that nobody bothered to reply to my first post to the
group.)
I spent the last few days reading related posts and transcripts of
chats with Marzia, and went through the related tickets, and I still
have no answer.

Let me try once more to explain my question:
It is not about quotas.
Nor is it about performance per se.
My question is about cost.
My app implements a beacon service: the client websites that will be
using the beacon have 5-10M pageviews per month, and these pageviews
will result in actual beacon hits to my app.
These requests do not hit the datastore; I buffer them in memcache,
doing minimal processing to keep the per-beacon-hit cost low, and
"process them" in batches every few seconds.
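The buffering just described can be sketched as follows; here a plain list stands in for memcache and a list of batches stands in for the periodic batch write (the names `buffer_hit`, `flush_batch`, and `BATCH_SIZE` are made up for illustration):

```python
BATCH_SIZE = 100  # assumed batch size, not from the post

_buffer = []          # stand-in for the memcache-held buffer
batches_written = []  # stand-in for the batched datastore writes

def flush_batch():
    """Process all buffered hits in one go (one batch write instead of N)."""
    global _buffer
    if _buffer:
        batches_written.append(list(_buffer))
        _buffer = []

def buffer_hit(hit):
    """Cheap per-request path: append only, no datastore access."""
    _buffer.append(hit)
    if len(_buffer) >= BATCH_SIZE:
        flush_batch()

for i in range(250):
    buffer_hit({"beacon": i})
flush_batch()  # flush the remainder, as the every-few-seconds task would

print(len(batches_written))  # -> 3 batches: 100 + 100 + 50
```

With real memcache the buffer is shared across instances and can be evicted at any time, which is exactly the gamble discussed later in the thread.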

Still, in spite of all the buffering, if the cheapest write (a
single-attribute entity, no indexes) gets charged 250-500 ms, that
makes the app design for such a high-throughput service non-obvious.

I understand that I can decrease the number of writes by using a
pickled blob attribute that contains lots of records inside; that is
what I am about to do.
I just wanted confirmation, from any gurus out there or from the
Google team, that my understanding of the cost of the "cheapest
write" is correct.

I tried a batch db.put() of 100 single-attribute objects; it still
took 23 seconds of datastore CPU, i.e. my cheapest write = 230 ms, a
bit less than before but not by much.
I made all 100 objects children of the same parent and performed the
same test: again, the same datastore CPU utilization.

Is there another method to update/insert multiple records that costs
less? I.e., is there any cheaper write? (I am very flexible :-) )
And is my understanding correct that the planned 10-12 cents per
CPU-hour will be applied to datastore CPU usage as well?

Thanks a lot,

diomedes



Sharp-Developer.Net

Feb 17, 2009, 10:47:59 AM
to Google App Engine
Hi Diomedes,

Pity you have not got any reply yet, as the question is very valid and
interesting.

I'm not a Google representative, just a newbie GAE developer, and will
share my assumptions.

As I understand it, GAE is more suitable for apps where writes are
triggered by user actions.

And any kind of statistics/logging/etc. system is not the best
candidate to implement on GAE.

This is all about BigTable design limitations. My understanding is
that it was designed to handle far fewer writes than reads.

Like one user writing a post/tweet and hundreds reading it.

In your case, even with all the optimizations, you have exactly the
opposite: lots of writes and just a few reads.

So I think you could have big problems using GAE effectively and
benefiting from all its features.
--
Alexander Trakhimenok

Dave Warnock

Feb 17, 2009, 11:15:12 AM
to google-a...@googlegroups.com
Diomedes,

I wonder if you can spread the load in some way. For example:

- a report application that grabs the data via API calls to a data
store app, does the processing, and shoves it into memcache.

- a set of capture apps that collect data in batches and then use API
calls to the data store app to dump the data.

- you could even shard the data store app by segmenting the data. The
capture and report apps would write to the appropriate data store.

Each customer then gets their own three separate apps (capture, store,
and report), which should keep costs as low as possible. If required,
you can have multiple versions of each layer.

Obviously the key problems are complexity (though clean REST APIs
should keep that to a minimum) and the time lag between capture and
report.

Just an idea off the top of my head (and you would need to check that
the terms of service are OK with this).

Dave
--
Dave Warnock: http://42.blogs.warnock.me.uk

Geoffrey Spear

Feb 17, 2009, 12:05:56 PM
to Google App Engine


On Feb 17, 11:15 am, Dave Warnock <dwarn...@gmail.com> wrote:
> - a report application that grabs the data via api calls to a data
> store app, does processing and shoves it in memcache.

Relying on memcache as reliable storage even temporarily is almost
certainly a bad idea.

Dave Warnock

Feb 17, 2009, 12:25:18 PM
to google-a...@googlegroups.com
Geoffrey,

>> - a report application that grabs the data via api calls to a data
>> store app, does processing and shoves it in memcache.
>
> Relying on memcache as reliable storage even temporarily is almost
> certainly a bad idea.

Agreed. But a report server would not be doing so. If the data is not
available via memcache, it would grab it via API calls to the data
store app. OK, slower than direct Bigtable access, but the same
principle (except maybe you add a bit more processing between Bigtable
and memcache to get the data ready for the report).
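The fallback Dave describes is the classic cache-aside pattern; a minimal sketch, with a dict standing in for memcache and a stub function standing in for the REST call to the data store app (both names, `cache` and `fetch_from_datastore_app`, are hypothetical):

```python
cache = {}  # stand-in for memcache; entries may vanish at any time

def fetch_from_datastore_app(key):
    # Placeholder for the slower API call to the data store app.
    return "report-data-for-" + key

def get_report_data(key):
    """Try the cache first; on a miss, rebuild from the source of truth."""
    data = cache.get(key)
    if data is None:                      # cache miss (e.g. eviction)
        data = fetch_from_datastore_app(key)
        cache[key] = data                 # repopulate for the next reader
    return data

print(get_report_data("site-123"))
```

This is why the cache is never *relied on*: losing an entry only costs a slower request, never data.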

Dave
--
Dave Warnock: http://42.blogs.warnock.me.uk

diomedes

Feb 17, 2009, 1:59:03 PM
to Google App Engine
@alexander
I agree that implementing a "few reads/lots of writes" app on GAE is
not the easy/typical case.
GAE, with the currently planned pricing, is a perfect deal for small
lots-of-reads/few-writes sites (almost free), a great deal for large
lots-of-reads/few-writes sites (better than EC2), and not such a good
deal for a few-reads/lots-of-writes app.
On the other hand, running a beacon service (whether you are serving
ads or running stats) has significant scalability challenges anyway,
and I am willing to take on the challenge of implementing it on GAE to
leverage its scalability, even if that means there will not be much
"profit margin" left.

@dave
I have followed exactly the architecture you suggest, internally, as
three web services:
The first I call the "recorder" (your capturer): it just captures the
beacon hit.
The second I call the "processor" (your "store"): it updates the
structures and incurs most of the CPU cost (in your suggestion you
spread some of that CPU load to the reporter, and that could be a
promising alternative).
The third, the "reporter", which is really a Google Data Source
implementation, fetches the precalculated chart data from the
datastore and drives Google Chart-based reports.
I do the breakup to 1) improve batching, 2) make beacon hits fast, and
3) make the reporting fast.
Doing that breakup as fully independent apps doesn't actually change
the cost per se (except if you take into account the free first 5M
hits).

@geoffrey
> Relying on memcache as reliable storage even temporarily is almost
> certainly a bad idea.
Geoffrey, I am taking a gamble with this:
GAE currently lacks the concept of a file-system "logfile": a very
efficient (think cheap) append-only buffered file that sequences
chronologically all the writes received from the web servers. That's
what they use internally to implement the Apache log facility.
The only way to simulate this is via memcache.
So I have implemented the equivalent of a "buffered append-only
logfile", which accumulates writes in memory and flushes to disk every
100 lines or 1 minute (whichever happens first).
Remember that OS-based logfiles also do not guarantee persistence
until the flush has actually happened.
If it behaves as expected, I will suggest it to the gaeutilities guys.
My hope is that a frequently accessed memcache item doesn't disappear
within 30-60 seconds except in rare cases, and log files are OK with
that; they are not truly transactional storage.
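The buffered append-only log just described, with its flush-on-100-lines-or-60-seconds policy, can be sketched like this; the class and its names are made up for illustration, a list stands in for persistent storage, and real memcache adds the eviction risk discussed above:

```python
import time

class BufferedLog:
    """Lines accumulate in memory (memcache in the real app) and are
    flushed when either max_lines lines or max_age seconds accumulate."""

    def __init__(self, max_lines=100, max_age=60.0):
        self.max_lines = max_lines
        self.max_age = max_age
        self.lines = []
        self.first_write = None
        self.flushed = []  # stand-in for persistent storage

    def append(self, line):
        if self.first_write is None:
            self.first_write = time.time()
        self.lines.append(line)
        if (len(self.lines) >= self.max_lines or
                time.time() - self.first_write >= self.max_age):
            self.flush()

    def flush(self):
        """One batch write instead of one write per line."""
        if self.lines:
            self.flushed.extend(self.lines)
            self.lines = []
            self.first_write = None

# Small thresholds just to show the flushing behaviour:
log = BufferedLog(max_lines=3, max_age=60.0)
for i in range(7):
    log.append("hit %d" % i)
print(len(log.flushed), len(log.lines))  # -> 6 1 (one line still buffered)
```

As the post notes, anything still in `lines` when the buffer is lost is gone, which is the same guarantee (or lack of one) an OS-buffered logfile gives before fsync.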

diomedes

Marzia Niccolai

Feb 17, 2009, 4:55:55 PM
to google-a...@googlegroups.com
Hi,

We are still determining how to appropriately calculate the CPU to take into account both the runtime and the various APIs' CPU.  I don't have any specific details yet on how datastore CPU will factor into this calculation, but rest assured we'll let you know.

-Marzia

boson

Feb 17, 2009, 7:40:19 PM
to Google App Engine
I agree this is going to be an issue for applications that are not
read-heavy. The more interactivity and state changing done by users
in your app, the more expensive it is going to be to host on GAE.

Googlers have hinted that they're working on getting the CPU cost of
datastore operations down, but it's likely to be a matter of shaving
some percentage off rather than anything drastic.

Dennis

Feb 23, 2009, 12:49:50 PM
to Google App Engine
I wonder what Google uses for their own apps.
Personalized tracking (like search histories and Gmail) means lots of
writes.
Do they simply use Bigtable writes and eat the high CPU cost
associated with those services, or do they have a different
architecture for high-volume write apps?
Social networks (Facebook, FriendFeed) seem to have a huge volume of
writes as well, as they distribute a user's actions to the news feeds
of all their friends...

Dennis