memory efficiency of MinHasher (when used in Spark for example)


Prasad Chalasani

unread,
Oct 6, 2016, 10:12:02 AM10/6/16
to algebird
Say we want to write a Spark transformation that counts the number of unique ints in an rdd:

val targetThreshold = 0.8
val numHashes = 3
val numBands = MinHasher.pickBands(targetThreshold, numHashes)
implicit lazy val minHasher = new MinHasher32(numHashes, numBands)

val uids = Array(1,1,3,1,2,4,1,3,5,1,3,5,2,4,3,3,1,2,5,3)
val rddUids = sc.parallelize(uids)
val sig = rddUids.map(minHasher.init(_)).reduce(minHasher.plus)
val uniquesEst = minHasher.approxCount(sig.bytes)



The monoid structure of MinHasher requires creating a new MinHasher data structure (and hence allocating memory) for every new item encountered. Since there is no "update" method that takes a MinHashSignature and updates it in place with a single item, we cannot use combineByKey or mapPartitions to avoid this by creating the data structure only once per partition.

Am I right to be worried about this, and are there any workarounds? I've seen some blog posts discussing this issue, but I'm curious whether I'm missing something.
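For illustration only (this is not algebird's API, and every name in it is made up): a minimal sketch of what such an "update" could look like. One signature array is allocated per partition and mutated in place for each item, rather than allocating a fresh signature per element and combining with the monoid's `plus`:

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch -- not algebird code. It shows the kind of in-place
// "update" described above: one signature buffer per partition, mutated
// for each item instead of allocating a new signature per element.
object MinHashUpdateSketch {
  val numHashes = 3

  // Hash an item with the h-th hash function (MurmurHash3 seeded by h).
  def hash(item: Int, h: Int): Int = MurmurHash3.productHash(Tuple1(item), h)

  // Fold a partition's items into a single reusable signature array.
  def signature(items: Iterator[Int]): Array[Int] = {
    val sig = Array.fill(numHashes)(Int.MaxValue) // allocated once per partition
    items.foreach { item =>
      var h = 0
      while (h < numHashes) {
        val v = hash(item, h)
        if (v < sig(h)) sig(h) = v // in-place update, no new allocation
        h += 1
      }
    }
    sig
  }

  // Merging two signatures is element-wise min -- this is the monoid `plus`.
  def plus(a: Array[Int], b: Array[Int]): Array[Int] =
    a.zip(b).map { case (x, y) => math.min(x, y) }
}
```

Because element-wise min is associative and idempotent, merging the per-partition signatures with `plus` gives the same result as folding every item into one signature.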

Koert Kuipers

unread,
Oct 6, 2016, 11:30:29 AM10/6/16
to Prasad Chalasani, algebird

If I remember correctly, you only need to create a MinHashSignature for every item, which is just a tiny wrapper around a byte array. The MinHasher itself is a single shared monoid instance.


--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

P. Oscar Boykin

unread,
Oct 6, 2016, 3:17:21 PM10/6/16
to Koert Kuipers, Prasad Chalasani, algebird
You might want to look at `algebird-spark` which makes some of this easier.


For the case of aggregating an entire RDD, you can use `.map(minHasher.init(_)).algebird.sum`. This calls `sumOption` on the underlying Semigroup, which can be more efficient (it does not need to materialize any intermediate values).

That said, MinHasher does not implement `sumOption`.
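For illustration, a hypothetical sketch of what a `sumOption` for min-hash signatures could look like (not algebird's actual code; the object and method names here are invented): the whole collection of signatures is folded into one mutable buffer, so only a single intermediate array is allocated no matter how many signatures are summed, instead of one per `plus` call:

```scala
// Hypothetical sketch of the sumOption idea for min-hash signatures,
// here represented as plain Array[Int]. Folding into one mutable buffer
// avoids allocating an intermediate signature for every pairwise `plus`.
object SumOptionSketch {
  def sumOption(sigs: Iterable[Array[Int]]): Option[Array[Int]] = {
    var acc: Array[Int] = null
    sigs.foreach { s =>
      if (acc == null) acc = s.clone() // the only allocation for the whole sum
      else {
        var i = 0
        while (i < acc.length) {
          if (s(i) < acc(i)) acc(i) = s(i) // element-wise min, in place
          i += 1
        }
      }
    }
    Option(acc) // None for an empty input
  }
}
```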

As Koert said, the cost you are paying is making a (short-lived) copy of the internal byte array.

All in all, I would recommend you just run the code; if you find it is slower than you need, the code can be optimized. We have been happy to accept contributions in the past without a high performance bar, and this is one that (looking at the code) would be relatively easy to optimize. That said, in distributed systems like Spark and Hadoop, serialization is often the biggest cost, so such optimizations have not always been top of mind.

