Using EventListeners to implement MapReduce?

Tomas Olsson

Sep 23, 2011, 11:56:34 AM

to terrastore-discussions

Hi,
I have been considering using Terrastore for some data analysis using
its MapReduce functionality. However, I would like to store
intermediate results and only update them when needed, that is, when a
document is updated. Specifically, I want to compute the Euclidean
distance between all pairs of documents. Thus, the current MapReduce model is
not suitable. But when I looked at the EventListeners I got the
following idea: why not use one event listener for the map part and
another for the reduce/aggregation part? The map listener can store
its results in a separate bucket, which is then processed by the reduce
listener. Of course, that depends on how the listeners are executed. If
they are executed independently on each server, as I have come to believe,
this might work. What do you think?

/Tomas

Sergio Bossa

Sep 23, 2011, 6:28:53 PM

to terrastore-...@googlegroups.com

Hi Tomas!

My answers in-line.

> I have been considering using Terrastore for some data analysis using
> its MapReduce functionality. However, I would like to store
> intermediate results and only update them when needed, that is, when a
> document is updated. Specifically, I want to compute the Euclidean
> distance between all pairs of documents. Thus, the current MapReduce model is
> not suitable.

You could try the following:
1) Run M/R and return both your result of interest and all the processed keys.
2) Store the result inside the same bucket under a "special" key and
move all documents whose keys have been processed into another bucket.
3) As more new documents accumulate inside the bucket, run M/R again,
this time starting from the intermediate stored result.
4) Continue from #2.
Would that work for you?
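
As a rough illustration of steps 1 to 4, here is a minimal Python sketch. It does not use the Terrastore API: plain dicts stand in for the two buckets, and `RESULT_KEY`, `run_incremental`, `pending` and `processed` are names made up for the example. The map and reduce functions (summing a `value` field) are placeholders as well.

```python
from functools import reduce

RESULT_KEY = "__result__"  # hypothetical "special" key holding the stored result


def run_incremental(pending, processed, map_fn, reduce_fn, initial):
    """Run one round: fold the not-yet-processed documents into the stored result.

    `pending` and `processed` are plain dicts standing in for two buckets.
    """
    # Step 3: start from the intermediate result if an earlier round stored one.
    result = pending.get(RESULT_KEY, initial)
    # Step 1: map/reduce over every document still waiting in the bucket.
    keys = [k for k in pending if k != RESULT_KEY]
    result = reduce(reduce_fn, (map_fn(pending[k]) for k in keys), result)
    # Step 2: store the result under the special key, move processed documents away.
    pending[RESULT_KEY] = result
    for k in keys:
        processed[k] = pending.pop(k)
    return result, keys


pending = {"a": {"value": 1}, "b": {"value": 2}}
processed = {}

first_total, first_keys = run_incremental(
    pending, processed, lambda doc: doc["value"], lambda x, y: x + y, 0)

# Step 4: a new document arrives, and the next round only touches that one.
pending["c"] = {"value": 4}
second_total, second_keys = run_incremental(
    pending, processed, lambda doc: doc["value"], lambda x, y: x + y, 0)
```

The second round reads only document "c" plus the stored result, which is the point of moving processed documents out of the bucket.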

> But when I looked at the EventListeners I got the
> following idea: why not use one event listener for the map part and
> another for the reduce/aggregation part? The map listener can store
> its results in a separate bucket, which is then processed by the reduce
> listener. Of course, that depends on how the listeners are executed. If
> they are executed independently on each server, as I have come to believe,
> this might work. What do you think?

I'm not sure I follow how you would use event listeners. By the way, you
could simply use active event listeners, raising an update action every
time a new document is inserted: the update action would then
update a document containing your result of interest.

Let me know if either of the two would work for you.
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Tomas Olsson

Sep 26, 2011, 6:22:29 AM

to terrastore-...@googlegroups.com

Yes, the second option is what I was thinking of.

One listener would react to the update/insertion of a document by
recomputing only the part of the Euclidean distance that has changed.
That part is stored in a separate bucket containing the distance between
each attribute/feature of two documents.

A second listener would react to changes in the separate bucket and
compute/update the total sum of all attribute distances.
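
The two-listener design described above can be sketched in plain Python. This is only an illustration under stated assumptions: the dicts `docs`, `partials` and `totals` stand in for buckets, the function names are hypothetical, the first listener calls the second directly where a real store would raise an event, and a missing attribute is treated as 0.0.

```python
import math

docs = {}      # document id -> {attribute: number}
partials = {}  # (id_a, id_b) -> {attribute: squared difference}
totals = {}    # (id_a, id_b) -> Euclidean distance


def on_document_changed(doc_id, new_doc):
    """First listener: refresh only the attribute distances that changed."""
    old_doc = docs.get(doc_id, {})
    changed = {a for a in set(old_doc) | set(new_doc)
               if old_doc.get(a) != new_doc.get(a)}
    docs[doc_id] = new_doc
    for other_id, other in docs.items():
        if other_id == doc_id:
            continue
        pair = tuple(sorted((doc_id, other_id)))
        # A pair seen for the first time needs every attribute computed once.
        attrs = changed if pair in partials else set(new_doc) | set(other)
        entry = partials.setdefault(pair, {})
        for a in attrs:
            # Assumption: an attribute missing from a document counts as 0.0.
            entry[a] = (new_doc.get(a, 0.0) - other.get(a, 0.0)) ** 2
        on_partial_changed(pair)  # stands in for the event on the second bucket


def on_partial_changed(pair):
    """Second listener: sum the attribute distances into the total."""
    totals[pair] = math.sqrt(sum(partials[pair].values()))


on_document_changed("a", {"x": 0.0, "y": 0.0})
on_document_changed("b", {"x": 3.0, "y": 4.0})
distance_before = totals[("a", "b")]

# Only "y" changes here, so only the "y" entry of the pair is recomputed.
on_document_changed("b", {"x": 3.0, "y": 0.0})
distance_after = totals[("a", "b")]
```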

Would that work? Would each node process the updates in parallel, so that
I gain speed compared to computing this sequentially? Would the overhead
of distribution (communication) be an obstacle?

/Tomas

Tomas Olsson

Sep 26, 2011, 6:54:09 AM

to terrastore-...@googlegroups.com

In addition, I want to use Terrastore for storing energy reports. I
would like to recompute only when a document changes, and only for the
part of the document that has changed. An update action event seems to
refer to the entire document. Would it be possible to somehow get a
reference to the part of the document that was updated?

/Tomas

Sergio Bossa

Sep 26, 2011, 11:17:40 AM

to terrastore-...@googlegroups.com

On Mon, Sep 26, 2011 at 12:22 PM, Tomas Olsson <t...@sics.se> wrote:

> One listener would react to the update/insertion of a document by
> recomputing only the part of the Euclidean distance that has changed. That
> part is stored in a separate bucket containing the distance between each
> attribute/feature of two documents.
> A second listener would react to changes in the separate bucket and
> compute/update the total sum of all attribute distances.

Got it.

> Would that work? Would each node process the updates in parallel, so that
> I gain speed compared to computing this sequentially?

Yes: events caused by documents belonging to the same bucket will be
processed in FIFO order to preserve consistency, but events caused by
documents in different buckets, as in your case, will be processed
independently.
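
To make the ordering guarantee concrete, here is a small Python model of "FIFO within a bucket, independent across buckets". It is not Terrastore's implementation: the `BucketDispatcher` class and its methods are hypothetical, and one queue plus one worker thread per bucket is simply one way to get that behaviour.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to finish


class BucketDispatcher:
    """One FIFO queue and one worker thread per bucket (illustrative only)."""

    def __init__(self, handler):
        self._handler = handler
        self._queues = {}
        self._workers = []
        self._lock = threading.Lock()

    def publish(self, bucket, event):
        with self._lock:
            q = self._queues.get(bucket)
            if q is None:
                # First event for this bucket: give it its own queue and worker.
                q = queue.Queue()
                self._queues[bucket] = q
                worker = threading.Thread(target=self._drain, args=(bucket, q))
                self._workers.append(worker)
                worker.start()
        q.put(event)

    def _drain(self, bucket, q):
        # A single consumer per queue keeps events of one bucket in FIFO order.
        while True:
            event = q.get()
            if event is _STOP:
                break
            self._handler(bucket, event)

    def close(self):
        for q in self._queues.values():
            q.put(_STOP)
        for worker in self._workers:
            worker.join()


seen = {"documents": [], "distances": []}


def record(bucket, event):
    seen[bucket].append(event)


dispatcher = BucketDispatcher(record)
for i in range(5):
    dispatcher.publish("documents", i)
    dispatcher.publish("distances", i * 10)
dispatcher.close()
```

Each bucket's list ends up in publication order, while the relative interleaving between the two buckets is left to the scheduler.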

> Would the overhead of
> distribution (communication) be an obstacle?

It shouldn't.

Let us know how it goes.

Sergio Bossa

Sep 26, 2011, 11:19:13 AM

to terrastore-...@googlegroups.com

On Mon, Sep 26, 2011 at 12:54 PM, Tomas Olsson <t...@sics.se> wrote:

> An update action event seems to refer to the
> entire document. Would it be possible to somehow get a reference to the part
> of the document that was updated?

Unfortunately it wouldn't, nor are there any plans to implement such a
feature in the near future (but contributions and patches are
obviously very welcome).
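
Since the event refers to the whole document, one possible workaround is to compare the previous and the new version yourself. The Python sketch below assumes the listener can get hold of both versions (for example by keeping a copy of the last version it saw); whether Terrastore hands the old value to a listener is not stated in this thread. The function name `changed_fields` and the sample energy report fields are made up for illustration, and only top-level fields are compared.

```python
_MISSING = object()  # marks a field that is absent from one version


def changed_fields(old_doc, new_doc):
    """Return {field: (old, new)} for top-level fields that differ.

    A field absent from one version is reported with None on that side.
    """
    changes = {}
    for field in old_doc.keys() | new_doc.keys():
        before = old_doc.get(field, _MISSING)
        after = new_doc.get(field, _MISSING)
        if before != after:
            changes[field] = (old_doc.get(field), new_doc.get(field))
    return changes


old_report = {"meter": "m1", "kwh": 10.5, "status": "ok"}
new_report = {"meter": "m1", "kwh": 12.0, "note": "revised"}

delta = changed_fields(old_report, new_report)
```

A listener could then recompute only the values that depend on the fields present in `delta`.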
