Hey Woody,
Thanks! I'm glad you like the idea. Have you had a chance to use it yet?
Distributing the computation across multiple application instances is an interesting idea; thanks for bringing it up. Basically (correct me if I misunderstand), the idea trades CPU quota for transfer quota, on the assumption that the 'master' application will have enough transfer quota to handle the data. I can see it as a workable optimization for some problems, specifically those with very high CPU needs, but I believe the vast majority of applications that would find HTTPMR useful don't fit that description. In my development, the quota that typically runs out first is the database query quota (an undocumented database CPU quota), which is quickly exhausted by the large volume of intermediate and final data writes. Distributing computation across multiple application instances wouldn't help with that quota, since the 'master' application would perform exactly as many database accesses as in the current model.
In Hadoop's Map/Reduce implementation, I believe the slave nodes do not actually transfer their output data back to the Map/Reduce master. Instead, the slaves write their output directly to the persistent data store (HDFS, HBase, etc.) and send the master messages reporting their status. The master issues and coordinates commands, not data. I may be wrong about this, so let me know if I am.
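To make that distinction concrete, here is a minimal Python sketch of the pattern as I understand it (this is not Hadoop's or HTTPMR's actual code, and every name in it is invented): workers persist their output directly to a shared store and send the master only small status messages, so no payload data ever flows through the master.

```python
# Hypothetical sketch of the master/worker coordination pattern
# described above. The dicts below stand in for real infrastructure.

DATA_STORE = {}   # stands in for HDFS/HBase/the datastore
STATUS_LOG = []   # status messages received by the "master"

def worker(task_id, records):
    """Process a chunk of input and persist the result directly."""
    output = [r.upper() for r in records]   # toy 'map' step
    DATA_STORE[task_id] = output            # write to the store, not the master
    # Only a small status message goes back to the master:
    return {"task": task_id, "status": "done", "count": len(output)}

def master(tasks):
    """Issue tasks and collect status messages only, never data."""
    for task_id, records in tasks.items():
        STATUS_LOG.append(worker(task_id, records))
    return all(msg["status"] == "done" for msg in STATUS_LOG)

tasks = {"shard-0": ["a", "b"], "shard-1": ["c"]}
finished = master(tasks)
```

The point of the sketch is just that the master's bandwidth scales with the number of status messages, not with the size of the intermediate data.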
Though there are a few things that could be done to work around the current quota limitations, I'd rather leave those sorts of hacks alone, since one of AppEngine's top priorities is a system for purchasing additional quota.
- Peter