My answers are inline.
> I have been considering using Terrastore for some data analysis using
> its MapReduce functionality. However, I would like to store the
> intermediate result and only update it when needed, that is, when a
> document is updated. Specifically, I want to compute the Euclidean
> distance between all documents. Thus, the current MapReduce model is
> not suitable.
You could try the following:
1) Run M/R and return both your result of interest and all the processed keys.
2) Store the result inside the same bucket under a "special" key and
move all documents whose keys have been processed into another bucket.
3) As new documents accumulate inside the bucket, run M/R again,
this time starting from the stored intermediate result.
4) Repeat from step 2.
Would that work for you?
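The steps above can be sketched roughly as follows. This is only an illustration: plain Python dicts stand in for the two buckets, nothing here is Terrastore API, and RESULT_KEY, run_incremental and the example aggregate are made-up names.

```python
from functools import reduce

RESULT_KEY = "__result__"  # hypothetical "special" key for the stored result


def run_incremental(pending, processed, map_fn, reduce_fn, initial):
    """Fold every unprocessed document into the stored intermediate result."""
    # Step 3: start from the previously stored result, if there is one.
    result = pending.get(RESULT_KEY, initial)
    keys = [k for k in pending if k != RESULT_KEY]
    # Step 1: map each document and reduce onto the intermediate result.
    result = reduce(reduce_fn, (map_fn(pending[k]) for k in keys), result)
    # Step 2: store the result and move the processed documents away.
    pending[RESULT_KEY] = result
    for k in keys:
        processed[k] = pending.pop(k)
    return result, keys


# Example aggregate (an assumption): sum of squared norms of feature vectors.
def squared_norm(doc):
    return sum(x * x for x in doc["features"])


def add(a, b):
    return a + b


pending = {"a": {"features": [3.0, 4.0]}, "b": {"features": [1.0, 0.0]}}
processed = {}
first, first_keys = run_incremental(pending, processed, squared_norm, add, 0.0)

# A new document arrives; only it is processed on the second run.
pending["c"] = {"features": [0.0, 2.0]}
second, second_keys = run_incremental(pending, processed, squared_norm, add, 0.0)
```

The second run touches only document "c" and folds it into the result stored by the first run. This works as written only for aggregates that can be updated one document at a time.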
> But, when I looked at the EventListeners I got the
> following idea: why not use one event listener for the map part and
> another for the reduce/aggregation part? The map listener can store
> its results in a separate bucket, which is then processed by the reduce
> listener. Of course, that depends on how the listeners are executed. If
> they are executed independently on each server, as I have come to believe,
> this might work. What do you think?
I'm not sure I understand how you would use event listeners. That said,
you could simply use active event listeners and raise an update action
every time a new document is inserted: the update action would then
update the document containing your result of interest.
Let me know if either of the two would work for you.
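To illustrate the second option, here is a minimal sketch of an insert triggering an update of a single result document. A dict stands in for the bucket and the listener wiring is simulated by a direct call; RESULT_KEY, update_result and running_mean are hypothetical names, not Terrastore API.

```python
RESULT_KEY = "__result__"  # hypothetical key of the document holding the result


def update_result(bucket, update_fn, new_doc):
    """Stand-in for an update action: rewrite the result document in place."""
    bucket[RESULT_KEY] = update_fn(bucket.get(RESULT_KEY), new_doc)


def running_mean(result, doc):
    """Example update function: running mean of a numeric 'value' field."""
    result = result or {"count": 0, "mean": 0.0}
    count = result["count"] + 1
    mean = result["mean"] + (doc["value"] - result["mean"]) / count
    return {"count": count, "mean": mean}


def insert(bucket, key, doc):
    """Insert a document, then do what the listener would do on that event."""
    bucket[key] = doc
    update_result(bucket, running_mean, doc)


bucket = {}
insert(bucket, "a", {"value": 2.0})
insert(bucket, "b", {"value": 4.0})
insert(bucket, "c", {"value": 6.0})
```

Each insert costs one small update instead of a full M/R pass, provided the result can be updated from the new document alone.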
Cheers,
Sergio B.
--
Sergio Bossa
http://www.linkedin.com/in/sergiob
One listener would react to the update/insertion of a document by
computing/updating only the part of the Euclidean distance that has
changed. That part is stored in a separate bucket containing the distance
between each attribute/feature of two documents.
A second listener would react to changes in the separate bucket and
compute/update the total sum of all attribute distances.
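A rough sketch of this two-listener idea, under stated assumptions: dicts stand in for the three buckets, every document has the same numeric attributes, and the second listener is called directly where the store would normally fire it. All function names are hypothetical.

```python
import math


def pair_key(a, b):
    """Order-independent key for a pair of document keys."""
    return "|".join(sorted((a, b)))


def on_parts_changed(parts, totals, pk):
    """Listener 2: sum the per-attribute parts into the Euclidean distance."""
    totals[pk] = math.sqrt(sum(parts[pk].values()))


def on_document_changed(docs, parts, totals, key, changed):
    """Listener 1: refresh only the changed attributes' squared differences."""
    for other, other_doc in docs.items():
        if other == key:
            continue
        pk = pair_key(key, other)
        entry = parts.setdefault(pk, {})
        for attr in changed:
            entry[attr] = (docs[key][attr] - other_doc[attr]) ** 2
        # In a real deployment the write to `parts` would trigger this.
        on_parts_changed(parts, totals, pk)


def put(docs, parts, totals, key, doc):
    """Store a document and fire listener 1 for the attributes that changed."""
    old = docs.get(key, {})
    changed = [a for a in doc if old.get(a) != doc[a]]
    docs[key] = doc
    on_document_changed(docs, parts, totals, key, changed)


docs, parts, totals = {}, {}, {}
put(docs, parts, totals, "a", {"x": 0.0, "y": 0.0})
put(docs, parts, totals, "b", {"x": 3.0, "y": 4.0})
put(docs, parts, totals, "b", {"x": 3.0, "y": 0.0})  # only "y" changes
```

On the last call only the "y" part of the pair is recomputed, and the stored "x" part is reused when the total is summed again.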
Would that work? Would each node process the updates in parallel, so
that I gain speed compared to computing this sequentially? Would the
overhead of distribution (communication) be an obstacle?
/Tomas
> One listener would react to the update/insertion of a document by
> computing/updating only the part of the Euclidean distance that has
> changed. That part is stored in a separate bucket containing the distance
> between each attribute/feature of two documents.
> A second listener would react to changes in the separate bucket and
> compute/update the total sum of all attribute distances.
Got it.
> Would that work? Would each node process the updates in parallel, so
> that I gain speed compared to computing this sequentially?
Yes: events caused by documents belonging to the same bucket will be
processed in FIFO order to preserve consistency, but events caused by
documents in different buckets, as in your case, will be processed
independently.
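The ordering described above can be modelled with one queue and one worker per bucket. This is a toy model of the semantics only, not how Terrastore implements it; the bucket names and the worker function are made up for the example.

```python
import queue
import threading


def worker(events, log):
    """Drain one bucket's queue in arrival (FIFO) order."""
    while True:
        event = events.get()
        if event is None:  # sentinel: no more events for this bucket
            break
        log.append(event)


buckets = ["documents", "distances"]
queues = {b: queue.Queue() for b in buckets}
logs = {b: [] for b in buckets}
threads = [
    threading.Thread(target=worker, args=(queues[b], logs[b])) for b in buckets
]
for t in threads:
    t.start()

# Interleave events for both buckets; each bucket has its own queue and worker.
for i in range(5):
    for b in buckets:
        queues[b].put((b, i))
for b in buckets:
    queues[b].put(None)
for t in threads:
    t.join()
```

Within each bucket the log always matches the order in which events were enqueued, while the relative order between the two buckets is left to the scheduler.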
> Would the overhead of
> distribution (communication) be an obstacle?
It shouldn't.
Let us know how it goes.
> An update action event seems to refer to the
> entire document, would it be possible to somehow get a reference to the part
> of the document that was updated?
Unfortunately it wouldn't, nor are there any plans to implement such a
feature in the near future (but contributions and patches are
obviously very welcome).
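One possible workaround is to compute the changed part yourself by comparing the previous and new versions of the document. The sketch below assumes both versions are available to your code as dicts (if the event does not give you the old one, you would have to keep a copy yourself); changed_fields is a hypothetical helper.

```python
def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    missing = object()  # marks a field that is absent on one side
    diff = {}
    for field in old.keys() | new.keys():
        before = old.get(field, missing)
        after = new.get(field, missing)
        if before != after:
            diff[field] = (
                None if before is missing else before,
                None if after is missing else after,
            )
    return diff


old = {"x": 1.0, "y": 2.0, "label": "a"}
new = {"x": 1.0, "y": 5.0, "z": 7.0}
diff = changed_fields(old, new)
```

Here diff holds entries for "y", "label" and "z" only, so a listener could recompute just those parts of the distance. Note that a removed field and a field explicitly set to None look the same in this simplified output.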