On Jun 26, 10:18 pm, "Brett Morgan" <
brett.mor...@gmail.com> wrote:
>
> So, the question for the group is, how to design a background processing api
> such that it is distributable, and works in with DataStore's transactions?
> What do we need to be able to do what we want to do. In fact, what is it
> that you guys want to achieve with background processing?
Well, generally speaking, what I'd like to do is chew through rather
large amounts of collected data (say, user ratings or user bookmarks)
and extract recommendations ('we think you'll like these items') of
various sorts. An extreme example that comes to mind (in terms of the
size of the corpus and the CPU-intensiveness of the iterated
algorithm) is PageRank, but even much smaller problems like movie
recommendations are quite large.
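To make the "CPU-intensive iterated algorithm" point concrete, here's a minimal power-iteration sketch of PageRank in plain Python. The tiny three-node graph and the function itself are purely illustrative (nothing GAE-specific); a real corpus would be far too large to iterate inside a request handler, which is the whole problem.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # every node starts each round with the un-damped baseline share
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, outgoing in links.items():
            if not outgoing:  # dangling node: spread its rank evenly
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
            else:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# toy graph, just to show the shape of the computation
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(graph)
```

Each full pass touches every edge in the graph, and you need many passes to converge, which is exactly the kind of work that wants an offline batch facility rather than a request thread.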
There are any number of data structures and algorithms that fit into
this general pattern, but from my perspective, what they have in
common is:
1) operating across multiple users' data
1a) a subset of these problems can be sharded into more manageable
groups of users, i.e., recommendations based only on my friends' (or
even friends-of-friends') data.
1b) a different subset cannot, i.e., finding the set of users whose
recommendations are similar to mine.
2) periodic (episodic?) offline processing, i.e., there's no need to
recalculate every time a user adds data, nor should the recalculation
block adding new data. This potentially implies operating on a
snapshot.
3) there are native-code libraries (such as NumPy or LIBSVM) that
would be *very* helpful, but not strictly necessary.
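As a sketch of point 1a, here's roughly what one shard of a friends-only recommendation job might look like. All the names and data shapes here (the ratings and friends dicts, the averaging scheme) are hypothetical, not any real GAE API; the point is only that the work for one user touches just that user's friend group, so the corpus can be cut along those lines and each shard processed offline against a snapshot.

```python
from collections import defaultdict

def recommend_from_friends(user, ratings, friends, top_n=3):
    """ratings: dict user -> {item: score}; friends: dict user -> set of users.
    Returns items the user's friends rated that the user hasn't,
    ordered by the friends' average score (highest first)."""
    seen = set(ratings.get(user, {}))
    totals, counts = defaultdict(float), defaultdict(int)
    # only this user's friends are consulted -- the shardable property
    for friend in friends.get(user, set()):
        for item, score in ratings.get(friend, {}).items():
            if item not in seen:
                totals[item] += score
                counts[item] += 1
    averaged = {item: totals[item] / counts[item] for item in totals}
    return sorted(averaged, key=averaged.get, reverse=True)[:top_n]

# toy snapshot of ratings data, purely illustrative
ratings = {"me": {"x": 5}, "f1": {"x": 4, "y": 5}, "f2": {"y": 3, "z": 4}}
friends = {"me": {"f1", "f2"}}
recs = recommend_from_friends("me", ratings, friends)
```

Problem 1b (finding users globally similar to me) has no such natural cut, which is why it lands in the "cannot be sharded" bucket.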
There is a completely different set of problems involving only
infrequent (even one-time) CPU-intensive transformation of a single
user's data, which you'd want to fire and forget without being bound
by an HTTP request's limits. However, most (if not quite all) of these
more or less *require* native-code libraries that aren't usable in GAE
right now (even for non-CPU-intensive purposes that *could* stay
within those limits) anyway.
- Michael