Using HTTPMR to solve numerical problems

3 views
Skip to first unread message

Isaac Llamas

unread,
May 18, 2010, 6:41:14 AM5/18/10
to httpmr-discuss
Hi,

I am investigating how to solve numerical problems using App Engine
and I came across your implementation of Map Reduce which could ease
my pain...

My idea is to have 2 applications, one would be the master and other
would be the slave. Since App Engine is in charge of the scalability,
if I do some Map on the master it would be distributed to several
slaves (since one wouldn't be able to process everything) where some
operation would take place, then several results should be given back
to the master (which may be scaled automatically...) so the Reduce
function would be performed.

Does this idea seem reasonable to you ? Could HTTPMR be applied in any
way to help me with this issue ?

I'm afraid I have never touched Python so the coding seems a bit
strange to me at the moment, but if this solves my problem I might
want to implement it in Java.


Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "httpmr-discuss" group.
To post to this group, send email to httpmr-...@googlegroups.com.
To unsubscribe from this group, send email to httpmr-discus...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/httpmr-discuss?hl=en.

Peter Dolan

unread,
May 18, 2010, 11:25:07 AM5/18/10
to httpmr-...@googlegroups.com
Hi Isaac,

Yeah, you basically just described map-reduce.  Take a look at http://en.wikipedia.org/wiki/MapReduce for a broad overview of how map-reduce works, and read some tutorials online, you should get a better idea of how people usually structure their computations.

- Peter

Isaac Llamas

unread,
May 18, 2010, 12:04:10 PM5/18/10
to httpmr-...@googlegroups.com
Peter,

I don't really know how the distribution of the data mapped (emisions) would be done since it uses the HTTP protocol and the use of threads is forbidden on GAE.

Correct me if I am wrong, but the way I see HTTPMR works is:

A uses map
A emits the data using HTTP to communicate with Bs
Each B receives from A
Each B does some calculation
Each B uses HTTP to communicate with A
A receives from Bs
A uses reduce

I don't really get the idea of emisions, since A will send data to one B, wouldn't it wait for a response from this B, making it impossible to manage the other requests to the rest of Bs ?


2010/5/18 Peter Dolan <peter...@gmail.com>

Peter Dolan

unread,
May 18, 2010, 12:33:08 PM5/18/10
to httpmr-...@googlegroups.com
Close, but not quite.  All is done within one application.

HTTPMR uses map, iterating over map inputs within HTTP requests, writes its map output (key -> value pair) to persistent storage
  (wait for all input to be mapped)
Persistent storage is responsible for sorting the output by key, so that all map output values for a key are contiguous.
HTTPMR uses reduce, iterating over sorted map outputs by key, computing reduce outputs, writes them to persistent storage (final job output)
HTTPMR iterates over the intermediate map output, deleting it from persistent storage.

Isaac Llamas

unread,
May 18, 2010, 1:07:11 PM5/18/10
to httpmr-...@googlegroups.com
I still don't know at which step the scalability comes into place. I belive you have to do several requests to the application for it to scale, but I just can't see where or when.

Thanks for your replies, they are helping a lot.

2010/5/18 Peter Dolan <peter...@gmail.com>

Peter Dolan

unread,
May 18, 2010, 7:05:10 PM5/18/10
to httpmr-...@googlegroups.com
Mapping is done in parallel, and Reducing is done in parallel.  Take a look at the wikipedia page, it should clarify the concept for you.  This is just an implementation of that, using CPU time on an HTTP server to do the mapping processing and reducing processing, and handling parallelizing the requests that trigger those operations.

Isaac Llamas

unread,
May 24, 2010, 8:15:09 AM5/24/10
to httpmr-discuss
I was thinking about this for a bit, so...

Map would take the input and write into the persistent storage
something like:

A 1
B 1
A 1
...

Then it is sorted like:

A 1
A 1
...
B 1

Then the input for the reduce is a key and a list of values for the
same key, as in:

A 1 1 ...
B 1 ...

Right ?

On May 19, 12:05 am, Peter Dolan <peterjdo...@gmail.com> wrote:
> Mapping is done in parallel, and Reducing is done in parallel.  Take a look
> at the wikipedia page, it should clarify the concept for you.  This is just
> an implementation of that, using CPU time on an HTTP server to do the
> mapping processing and reducing processing, and handling parallelizing the
> requests that trigger those operations.
>
>
>
> On Tue, May 18, 2010 at 10:07 AM, Isaac Llamas <kpu...@gmail.com> wrote:
> > I still don't know at which step the scalability comes into place. I belive
> > you have to do several requests to the application for it to scale, but I
> > just can't see where or when.
>
> > Thanks for your replies, they are helping a lot.
>
> > 2010/5/18 Peter Dolan <peterjdo...@gmail.com>
> >>> 2010/5/18 Peter Dolan <peterjdo...@gmail.com>
>
> >>>>  Hi Isaac,
>
> >>>> Yeah, you basically just described map-reduce.  Take a look at
> >>>>http://en.wikipedia.org/wiki/MapReducefor a broad overview of how
> >>>>> httpmr-discus...@googlegroups.com<httpmr-discuss%2Bunsu...@googlegroups.com>
> >>>>> .
> >>>>> For more options, visit this group at
> >>>>>http://groups.google.com/group/httpmr-discuss?hl=en.
>
> >>>>  --
> >>>> You received this message because you are subscribed to the Google
> >>>> Groups "httpmr-discuss" group.
> >>>> To post to this group, send email to httpmr-...@googlegroups.com.
> >>>> To unsubscribe from this group, send email to
> >>>> httpmr-discus...@googlegroups.com<httpmr-discuss%2Bunsu...@googlegroups.com>
> >>>> .
> >>>> For more options, visit this group at
> >>>>http://groups.google.com/group/httpmr-discuss?hl=en.
>
> >>>  --
> >>> You received this message because you are subscribed to the Google Groups
> >>> "httpmr-discuss" group.
> >>> To post to this group, send email to httpmr-...@googlegroups.com.
> >>> To unsubscribe from this group, send email to
> >>> httpmr-discus...@googlegroups.com<httpmr-discuss%2Bunsu...@googlegroups.com>
> >>> .
> >>> For more options, visit this group at
> >>>http://groups.google.com/group/httpmr-discuss?hl=en.
>
> >>  --
> >> You received this message because you are subscribed to the Google Groups
> >> "httpmr-discuss" group.
> >> To post to this group, send email to httpmr-...@googlegroups.com.
> >> To unsubscribe from this group, send email to
> >> httpmr-discus...@googlegroups.com<httpmr-discuss%2Bunsu...@googlegroups.com>
> >> .
> >> For more options, visit this group at
> >>http://groups.google.com/group/httpmr-discuss?hl=en.
>
> >  --
> > You received this message because you are subscribed to the Google Groups
> > "httpmr-discuss" group.
> > To post to this group, send email to httpmr-...@googlegroups.com.
> > To unsubscribe from this group, send email to
> > httpmr-discus...@googlegroups.com<httpmr-discuss%2Bunsu...@googlegroups.com>
> > .
> > For more options, visit this group at
> >http://groups.google.com/group/httpmr-discuss?hl=en.
>
> --
> You received this message because you are subscribed to the Google Groups "httpmr-discuss" group.
> To post to this group, send email to httpmr-...@googlegroups.com.
> To unsubscribe from this group, send email to httpmr-discus...@googlegroups.com.
> For more options, visit this group athttp://groups.google.com/group/httpmr-discuss?hl=en.

Peter Dolan

unread,
May 24, 2010, 11:10:55 AM5/24/10
to httpmr-...@googlegroups.com
Yep, exactly.

Also, Map is done in parallel, and Reduce is done in parallel (parallelized by keys).

Isaac Llamas

unread,
May 24, 2010, 1:24:23 PM5/24/10
to httpmr-...@googlegroups.com
Does this persistence storage use Google's storage by any means ?

Wouldn't it be better if instead of storing the key/value pair, every time it tries to store them, check if a key is present (using a query) and then adding its value to its list of values ? This way you shouldn't sort it and then arrange the values by key.


2010/5/24 Peter Dolan <peter...@gmail.com>

Peter Dolan

unread,
May 24, 2010, 1:41:45 PM5/24/10
to httpmr-...@googlegroups.com
Yes, the persistence layer uses AppEngine storage.  However, the system is structured such that you can insert any other kind of persistence layer by implementing a simple interface.  I encourage you to take a look at the code to see how this is done.

Writing a single new record to AppEngine's datastore is orders of magnitude faster than doing a transactional query-read-write operation, and avoids the transactions that would reduce the system's parallelism.

Isaac Llamas

unread,
May 25, 2010, 7:19:08 AM5/25/10
to httpmr-...@googlegroups.com
Well, I have been looking at some code now, specially the driver.py code, and I just read that it is a "Simple multithreaded HTTP request driver" while I can also read here that App Engine cannot "spawn a sub-process or thread". Would you mind explaining that contradiction for me please ?

2010/5/24 Peter Dolan <peter...@gmail.com>

Peter Dolan

unread,
May 25, 2010, 12:31:33 PM5/25/10
to httpmr-...@googlegroups.com
Hi Isaac,

Driver.py is meant to be run on a machine that you control, not from AppEngine.

- Peter

Isaac Llamas

unread,
May 25, 2010, 1:18:33 PM5/25/10
to httpmr-...@googlegroups.com
Then that was the part I was trying to figure out from the very beginning ! I didn't know how you were making all those parallel calls (with different threads now) to HTTPMR, I got confused when you said it was "done within one application". I totally misunderstood it, now everything is clear.

Have you tried using AppEngine's task queue instead of running a local application to run different threads ? This is the solution I came up with, but it won't allow any user defined map or reduce function like yours, besides, my idea wasn't about allowing them to do it anyways.

Thank you very much.

2010/5/25 Peter Dolan <peter...@gmail.com>

Peter Dolan

unread,
May 25, 2010, 1:22:44 PM5/25/10
to httpmr-...@googlegroups.com
Hey Isaac,

Yeah, task queues would be a good way to go about implementing this, but I wrote the whole damn thing about a year before task queues existed, and I don't have time to update it to work with them.  There are a bunch of people out there that have done similar things / talked about how to do safely do this with task queues.

- Peter
Reply all
Reply to author
Forward
0 new messages