Spark Streaming meets Algebird

MLnick

unread,

Feb 12, 2013, 8:15:31 AM2/12/13

to spark...@googlegroups.com

Hi

Thought some of you may be interested in this - I took the streaming Twitter example and amended it to include Twitter's Algebird library. Specifically the HyperLogLog as a monoid for approximate distinct object counting (userids in this case).

https://gist.github.com/MLnick/4761966

The example shows how easy it is to integrate Algebird's complex datastructures since they have nice neat apply and + methods for integration into Spark's map and reduce functions. Boom! approximate distincts for hundreds of millions of unique values with minimal memory footprint!

I also worked up a "heavy hitters" approximation with CountMinSketch but am struggling to get it to give results as the heavyHitters always seems empty. Still looking into it.

Nick

Tathagata Das

unread,

Feb 12, 2013, 3:39:10 PM2/12/13

to spark...@googlegroups.com

Hi Nick,

This is great! The code is indeed short and concise for doing what its doing. Have you tested this on a cluster?

Let me know if you need help with the CountMinSketch example.

TD

--
You received this message because you are subscribed to the Google Groups "Spark Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Nick Pentreath

unread,

Feb 13, 2013, 10:11:38 AM2/13/13

to spark...@googlegroups.com

Thanks TD

I realised that the reason I was struggling with the CMS was my own error, since I set the threshold percentage quite high, so no userid met the threshold for a "heavy hitter". I adjusted it and it works fine.

I've put up a gist of a rough example: https://gist.github.com/MLnick/4945185

Actually for this small time window userids are not that interesting a usecase since the data is not very skewed at all (it would become more so over larger timescales), so I will perhaps amend to do say a hashtag count for using CMS for global trending topics etc.

Tathagata Das

unread,

Feb 13, 2013, 1:08:39 PM2/13/13

to spark...@googlegroups.com

Are you running using the default sample stream from Twitter. If so, then yes, it does not give enough data to do something useful in small windows. Note that doing a window operation over a very large window may be expensive unless the invertible reduceByKeyAndwindow is used (the one which takes a reduce and another inverseReduce function). So it would be nice if you can figure out a way to do this heavy hitter computation in an incremental fashion using the invertible reduceByKeyAndWindow

Reply all

Reply to author

Forward