Gordon and I had a brief conversation regarding Issue #82 (Add support for linear stats). He suggested including a top-k estimator and pointed me to the following paper that describes a similar algorithm:
http://infolab.stanford.edu/~manku/papers/02vldb-freq.pdf. It describes a streaming algorithm for estimating frequency counts that exceed a user-specified threshold. It's a great paper (creative solutions to the problem) and seems like a good starting point for thinking about streaming algorithms in general. It got me thinking about how to generically support streaming algorithms in Hypertable. You could define an abstract base class called StreamingAlgorithm with the following methods:
initialize - for for initializing the algorithm (with parameters)
add - for adding updates
serialize - for serializing the state of the algorithm
deserialize - for de-serializing the state of the algorithm
combine - combines the state from another instance of this streaming algo with the current object
print - display summary to the user
A streaming algorithm could be implemented simply by deriving a class from the StreamingAlgorithm class and filling out the methods. It would be implemented under the hood as follows: A streaming algorithm could get installed on a table by having all RangeServers, that serve part of the table, create an instance of the algorithm class and adding it to a list of algorithms associated with the Table object. Whenever a batch of updates arrives for a table, they get pumped through all of the algorithms associated with the table via calls to the add() method. To query the current "summary data structure", the client library would issue a query to all of the Table's RangeServers (in parallel). The RangeServers would serialize the state and send it back to the client. The client could then deserialize all of the states and combine them together and then invoke the print() method to display the result to the user.
- Doug