Streaming algorithm plugins

Doug Judd

unread,

Mar 26, 2008, 6:00:25 PM3/26/08

to hyperta...@googlegroups.com

Gordon and I had a brief conversation regarding Issue #82 (Add support for linear stats). He suggested including a top-k estimator and pointed me to the following paper that describes a similar algorithm: http://infolab.stanford.edu/~manku/papers/02vldb-freq.pdf. It describes a streaming algorithm for estimating frequency counts that exceed a user-specified threshold. It's a great paper (creative solutions to the problem) and seems like a good starting point for thinking about streaming algorithms in general. It got me thinking about how to generically support streaming algorithms in Hypertable. You could define an abstract base class called StreamingAlgorithm with the following methods:

initialize - for for initializing the algorithm (with parameters)
add - for adding updates
serialize - for serializing the state of the algorithm
deserialize - for de-serializing the state of the algorithm
combine - combines the state from another instance of this streaming algo with the current object
print - display summary to the user

A streaming algorithm could be implemented simply by deriving a class from the StreamingAlgorithm class and filling out the methods. It would be implemented under the hood as follows: A streaming algorithm could get installed on a table by having all RangeServers, that serve part of the table, create an instance of the algorithm class and adding it to a list of algorithms associated with the Table object. Whenever a batch of updates arrives for a table, they get pumped through all of the algorithms associated with the table via calls to the add() method. To query the current "summary data structure", the client library would issue a query to all of the Table's RangeServers (in parallel). The RangeServers would serialize the state and send it back to the client. The client could then deserialize all of the states and combine them together and then invoke the print() method to display the result to the user.

- Doug

Luke

unread,

Mar 26, 2008, 6:44:07 PM3/26/08

to Hypertable Development

Here is a paper on distributed top-k monitoring that also accounts for
the cost of network traffic, which cites the Manku paper, which
focused on memory footprint only.

http://infolab.stanford.edu/~olston/publications/topk.pdf

On Mar 26, 3:00 pm, "Doug Judd" <d...@zvents.com> wrote:
> Gordon and I had a brief conversation regarding Issue #82 (Add support for
> linear stats). He suggested including a top-k estimator and pointed me to

> the following paper that describes a similar algorithm:http://infolab.stanford.edu/~manku/papers/02vldb-freq.pdf<http://infolab.stanford.edu/%7Emanku/papers/02vldb-freq.pdf>.

Doug Judd

unread,

Mar 26, 2008, 7:12:29 PM3/26/08

to hyperta...@googlegroups.com

Looks interesting. I'm not sure about top-k, but any algorithm that is both associative and commutative could be implemented via my proposal, where the only network communication happens at query time, when all of the summary states are fetched and combined together. My guess is that you could design a top-k estimator algorithm that is both associative and commutative while remaining below some error threshold.

Reply all

Reply to author

Forward