Why not let httpmr work like Hadoop


Woody

Aug 15, 2008, 4:39:34 AM
to httpmr-discuss
Hi,

First, I want to tell you that your idea, the HTTP map-reduce, is so
cool. GAE is only a web server and has so many restrictions; I didn't
think it counted as cloud computing until I found your httpmr. (I am
doing research on cloud computing now.)

But your httpmr can only run within a single GAE application. Since
every account gets at least 10 applications, why not use them all?

I do think the idea of Nikhil S' Perl HTTP mapreduce is great. My idea
is to make every GAE application a slave node and let a master (e.g.
your driver.py) assign tasks to them. The slaves get the map/reduce
function and the input key-value pairs from the master, map/reduce
them, and return the output key-value pairs to the master, just like
Hadoop slave nodes do.
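
To make it concrete, here is a very rough sketch of the master side in
Python, running on the user's machine. The slave application URLs and
the /map endpoint are made-up names, just to show the dispatch idea:

# A rough sketch of the "master" that runs on the user's own machine.
# The slave URLs and the /map path are made-up names for illustration.
import urllib2
import simplejson  # or the json module on newer Pythons

SLAVE_APPS = [
    'http://slave-one.appspot.com/map',
    'http://slave-two.appspot.com/map',
]

def dispatch_map_tasks(input_pairs, chunk_size=100):
    """Send chunks of key-value pairs to the slaves round-robin and
    collect the mapped output. Note that the master keeps all the
    intermediate data locally (limitation 2 in the list below)."""
    results = []
    for i in range(0, len(input_pairs), chunk_size):
        chunk = input_pairs[i:i + chunk_size]
        url = SLAVE_APPS[(i / chunk_size) % len(SLAVE_APPS)]
        request = urllib2.Request(
            url, simplejson.dumps({'pairs': chunk}),
            {'Content-Type': 'application/json'})
        response = urllib2.urlopen(request)
        # Each slave maps its chunk and returns the output pairs.
        results.extend(simplejson.loads(response.read())['pairs'])
    return results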

Of course, this approach has some limitations too:
1. The data sent/received quota.
2. The master has to store the intermediate data locally.
3. Maybe others ...

Thanks.
Woody

Peter Dolan

Aug 15, 2008, 11:45:52 AM
to httpmr-...@googlegroups.com
Hey Woody,

Thanks! I'm glad you like the idea; have you had a chance to use it yet?

Distributing the computation across multiple application instances is an interesting idea, thanks for bringing it up.  Basically (correct me if I misunderstand), the idea trades CPU quota for transfer quota, on the assumption that the 'master' application will have enough transfer quota to handle the data.  I can see it as a workable optimization for some problems, specifically those with very high CPU needs, but I believe the vast majority of applications that would find HTTPMR useful don't fit that description.  In my development, the quota that typically runs out first is the database query quota (an undocumented database CPU quota), which is quickly exhausted by the large volume of intermediate and final data writes.  Distributing computation across multiple application instances wouldn't help with that quota, since the 'master' application would be doing exactly as many database accesses as in the current model.

In Hadoop's Map/Reduce implementation, I believe the slave nodes do not actually transfer their output data back to the Map/Reduce master.  The slave nodes write their data directly to the persistent data store (HDFS, HBase, etc.) and send messages to the master informing it of their status.  The master issues and coordinates commands, not commands and data.  I may be wrong about this; let me know if I am.
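
For what it's worth, here's a rough sketch of what that pattern could look like on App Engine: the slave writes its output straight to the datastore and sends only a small status message back to the master.  The /map_task path, the IntermediateValue model, the mapper() stub, and the payload format are all invented for illustration.

# Sketch of a Hadoop-style slave handler on App Engine. All names
# here (path, model, mapper stub, payload format) are hypothetical.
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from django.utils import simplejson  # bundled with the App Engine SDK

class IntermediateValue(db.Model):
    pair_key = db.StringProperty()
    pair_value = db.TextProperty()

def mapper(key, value):
    # Placeholder map function; the real one comes from the job.
    return value

class MapTaskHandler(webapp.RequestHandler):
    def post(self):
        pairs = simplejson.loads(self.request.body)['pairs']
        # Output goes straight to the persistent store, like a Hadoop
        # slave writing to HDFS...
        db.put([IntermediateValue(pair_key=k, pair_value=mapper(k, v))
                for k, v in pairs])
        # ...and only a small status message goes back to the master.
        self.response.out.write(simplejson.dumps(
            {'status': 'done', 'pairs_written': len(pairs)}))

application = webapp.WSGIApplication([('/map_task', MapTaskHandler)])

def main():
    run_wsgi_app(application)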

Though there are a few things that could be done to work around the current quota limitations, I'd rather leave those sorts of hacks alone, since one of App Engine's top priorities is a system for purchasing additional quota.

- Peter

Woody

Aug 17, 2008, 9:44:30 PM
to httpmr-discuss
Hi Peter,

Yes, I totally agree with you that my idea is not suitable for all
applications.

One comment: the 'master' in the idea I mentioned last time would run
on the user's own machine; that is how it trades CPU quota for
transfer quota. Your project reminds me of the "power server model" in
grid computing, which works just like what I described in my last
mail. :-)

Thanks,
- Woody