Understanding the value proposition of App Engine for data processing

Duo Zmo

Aug 11, 2015, 2:30:03 PM

to Google App Engine

I'm just digging into map reduce on Google App Engine, and my early results are discouraging. I had in mind that I'd process about 10GB of data for an analysis I want to do, and I didn't even think that'd be that big a deal (given all the talk about petabyte-scale storage and such), but it's currently looking impossible.

I did a simple word count mapreduce on some Gutenberg books (63MB zipped, 166MB unzipped), once using Google's Python mapreduce example (https://cloud.google.com/appengine/docs/python/dataprocessing/) and once using the dumb-as-rocks standalone Python scripts posted at the top of Michael Noll's Hadoop tutorial (http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/).
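For reference, a standalone word count of this kind boils down to something like the sketch below. This is not the code from either linked page: the function names (`map_words`, `reduce_counts`, `word_count`) are invented for illustration, and lowercasing plus splitting on whitespace is an assumed tokenization.

```python
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for each whitespace-separated word."""
    # Assumption: lowercase and split on whitespace; punctuation is left attached.
    for word in line.lower().split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines):
    """Run the map step over every line and feed the pairs to the reduce step."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    import sys

    # Read text from stdin and print the ten most frequent words.
    for word, count in word_count(sys.stdin).most_common(10):
        print(word, count)
```

Everything here runs in a single process on one machine, which is why there is no shuffle or scheduling overhead to pay for.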

Experimental results:

Simple Python: 1 minute 22 seconds

GAE dev server: 2 hours 17 minutes 12 seconds

Given the staggering difference in run time, even if computation in the cloud were free, I'd still opt to compute locally unless my hand were forced somehow (e.g. input files that didn't fit on my disk). Of course, the computation is not free, which means you're not only enduring all that overhead, but paying for it too.

I did try running this same test "in production," i.e. on Google Cloud's infrastructure. At first it failed, because just getting the job started exceeded the 128MB memory limit for the free tier. I turned on billing, bumped up the instance class to F4, and let it go. It chewed through the free tier quickly, then about USD$8 of instance time before one of the shuffle-merge shards seemed to enter an infinite loop (ran for 2 hours, no errors in logs). I aborted and gave up at that point.

Everything I hear about cloud computing makes it sound like the gleaming, glossy future, but these results make it seem expensive and slow. $8+ to do a mapreduce across 60MB of data just doesn't seem like a good deal to me. At that rate, there's no way I can afford to process my 10GB dataset on App Engine. I understand that with the pipeline model you get fault tolerance and status reports and basic job management, but none of that is worth the expense or a 100x performance hit.

I think there are two possible problems going on here:
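As a rough check on those figures, the arithmetic can be sketched as follows. It assumes cost scales linearly with input size and uses the 63MB zipped size from above; since the production job never finished, the $8 is only a lower bound. Variable names are illustrative.

```python
# Run times reported above, converted to seconds.
local_seconds = 1 * 60 + 22                    # 1 minute 22 seconds
dev_server_seconds = 2 * 3600 + 17 * 60 + 12   # 2 hours 17 minutes 12 seconds

# How many times slower the dev server run was than the local scripts.
slowdown = float(dev_server_seconds) / local_seconds

# Assumption: cost grows linearly with input size.
spent_usd = 8.0            # instance time spent before the job was aborted
input_mb = 63.0            # zipped size of the test input
target_mb = 10 * 1024.0    # the 10GB dataset
projected_usd = spent_usd * target_mb / input_mb

print("slowdown: about %.0fx" % slowdown)
print("projected cost: at least $%.0f" % projected_usd)
```

Under those assumptions the slowdown comes out at roughly 100x and the projected cost at well over a thousand dollars.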

1) I made a technical mistake in my experiment and my results are invalid

2) I'm not understanding the benefits / value proposition of App Engine

Are my results consistent with what others would expect? Do either or both of my candidate explanations ring true? What else am I not considering? I think this is more a discussion topic than a discrete ask-and-answer, which is why I'm posting here instead of Stack Overflow.

Barry Hunter

Aug 11, 2015, 2:58:26 PM

to google-appengine

Experimental results:

Simple Python: 1 minute 22 seconds
GAE dev server: 2 hours 17 minutes 12 seconds

The dev server isn't designed for performance. It's designed to be a close approximation of the functionality of the online service, close enough that developers can work.

Much functionality is emulated in software, so it will not perform like real 'production'. It will be a lot slower due to lots of overhead.

Given the staggering difference in run time, even if computation in the cloud were free, I'd still opt to compute locally

If you *can* do the computation locally, it will nearly always be quicker. The overhead of a solution that CAN scale is too much if you don't need that scale.

The crunch, as you say, comes when the data gets too big.

Everything I hear about cloud computing makes it sound like the gleaming, glossy future, but these results make it seem expensive and slow.

It's not 'magic'; you don't get access to supercomputers.

What you get is access to lots and lots (and lots) of small computers. And someone else manages them, and gives you pre-configured algorithms.

It's horizontal scaling, rather than vertical scaling.

$8+ to do a mapreduce across 60MB of data just doesn't seem like a good deal to me.

If you're looking at App Engine purely for data processing, then no, it's not a good place.

There are much better-value ways of processing data.

At that rate, there's no way I can afford to process my 10GB dataset on App Engine. I understand that with the pipeline model you get fault tolerance and status reports and basic job management, but none of that is worth the expense or a 100x performance hit.

I think there are two possible problems going on here:

1) I made a technical mistake in my experiment and my results are invalid

Can't really comment on that. It sounds like with a bit more work, it should at least execute cleanly.

2) I'm not understanding the benefits / value proposition of App Engine

You do seem to have an unrealistic idea of what App Engine is.

It's more of a hosting platform* (that can do data processing) rather than a data processing system (that can do hosting too).

* Technically a PaaS: https://en.wikipedia.org/wiki/Platform_as_a_service

I'm just digging into map reduce on Google App Engine, and my early results are discouraging. I had in mind that I'd process about 10GB of data for an analysis I want to do, and I didn't even think that'd be that big a deal (given all the talk about petabyte-scale storage and such), but it's currently looking impossible.

Going back to this. It sounds like App Engine is not an ideal platform for your task.

Compute Engine sounds like a better fit:

https://cloud.google.com/solutions/hadoop/

Tom Kaitchuck

Aug 11, 2015, 3:21:41 PM

to google-a...@googlegroups.com

I think you may be more interested in Cloud Dataflow:

https://cloud.google.com/dataflow/what-is-google-cloud-dataflow


Christian F. Howes

Aug 11, 2015, 10:42:09 PM

to Google App Engine

I've found map reduce to be painful, but I don't do tons of large-scale data processing. I do run a mobile application API on App Engine, and for that App Engine is simply amazing! With a carefully designed schema, and using the "limits" of App Engine to my advantage, we get great performance without the overhead of managing our own servers.

At this point I'm doing my large-scale data crunching in BigQuery (though your use case sounds like it might not be easy to fit in there).

christian

Duo Zmo

Aug 12, 2015, 3:21:31 PM

to Google App Engine

Okay, thanks for your responses.

Looking forward to a Python SDK for the Dataflow service.
