val targetThreshold = 0.8
val numHashes = 3
val numBands = MinHasher.pickBands(targetThreshold, numHashes)
implicit lazy val minHasher = new MinHasher32(numHashes, numBands)
val uids = Array(1,1,3,1,2,4,1,3,5,1,3,5,2,4,3,3,1,2,5,3)
val rddUids = sc.parallelize(uids)
val sig = rddUids.map(minHasher.init(_)).reduce(minHasher.plus)
val uniquesEst = minHasher.approxCount(sig.bytes)
If I remember correctly you only need to create a MinHashSignature for every item, which is just a tiny wrapper around a byte array. The Minhasher itself is a shared monoid
--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+u...@googlegroups.com.