Hi
Thought some of you may be interested in this - I took the streaming Twitter example and amended it to include Twitter's Algebird library. Specifically the HyperLogLog as a monoid for approximate distinct object counting (userids in this case).
The example shows how easy it is to integrate Algebird's complex datastructures since they have nice neat apply and + methods for integration into Spark's map and reduce functions. Boom! approximate distincts for hundreds of millions of unique values with minimal memory footprint!
I also worked up a "heavy hitters" approximation with CountMinSketch but am struggling to get it to give results as the heavyHitters always seems empty. Still looking into it.
Nick