Understanding MinHasher parameters


Alex Zhicharevich

Jul 13, 2014, 5:31:39 AM
to alge...@googlegroups.com
Hi,

I'm using scalding and algebird for nearest neighbors computation of a large data set.
After representing each entry in my data as a set of features, I want to produce all potentially similar pairs, which I then reduce with the TopK monoid.
To produce the similar pairs I'm using algebird's MinHasher32. After assigning each entry a set of buckets, I do a flatMap to produce (bucketId, entry) pairs and then a self join on bucketId, to get all pairs of entries that share at least one bucket. I then pass on only those pairs above an internal similarity threshold to the next step, to reduce the size of the intermediate output.
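In code, the flatMap-then-self-join step looks roughly like this, with plain in-memory Scala collections standing in for the Scalding pipes (the types and the `pairsFromBuckets` helper are illustrative, not Scalding or algebird API):

```scala
// Sketch of the candidate-pair step on in-memory collections.
// In the real job the Map below would be a TypedPipe and the
// grouping would be a Scalding group/join.
object CandidatePairs {
  type EntryId = Int
  type BucketId = Long

  def pairsFromBuckets(buckets: Map[EntryId, Seq[BucketId]]): Set[(EntryId, EntryId)] = {
    // flatMap each entry to (bucketId, entry) pairs, then group by bucketId
    val byBucket: Map[BucketId, Seq[EntryId]] =
      buckets.toSeq
        .flatMap { case (entry, bs) => bs.map(b => (b, entry)) }
        .groupBy(_._1)
        .map { case (b, es) => (b, es.map(_._2)) }

    // "self join" on bucketId: all unordered pairs sharing at least one bucket
    byBucket.values.flatMap { entries =>
      for {
        a <- entries
        b <- entries
        if a < b
      } yield (a, b)
    }.toSet
  }
}
```

Note the quadratic blow-up inside each bucket in the last step — this is exactly why one oversized bucket dominates the run time.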
However, this step still takes too long to run. MinHasher's documentation says "The targetThreshold controls the desired level of similarity - the higher the threshold, the more efficiently you can find all the similar sets". I've increased it, but I don't want to push it too high, since I still want each entry to end up paired with all (or most) of its N closest entries. My understanding of the problem is that some buckets are too large: they contain so many pairs that computing all distances within a single bucket becomes too expensive. I assume that if each entry is assigned to more buckets, the buckets can be smaller while keeping the same probability that two similar entries end up in at least one common bucket (in other words, making it a bit more parallel).
There's an additional MinHasher parameter (maxBytes) that seems to affect the number of bands and rows. The documentation says "The more bytes in the signature, the more accurate all of the above will be", but I'm not sure I understand how this parameter affects the algorithm.
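For reference, the standard LSH banding analysis behind these parameters (this is the textbook formula; algebird picks its band/row counts internally from targetThreshold and maxBytes, so this is an approximation, not its exact internals): with b bands of r rows each, two sets with Jaccard similarity s land in at least one common bucket with probability 1 - (1 - s^r)^b, and the similarity where that probability crosses ~0.5 is roughly (1/b)^(1/r). A small sketch:

```scala
// Textbook LSH banding math; bands * rows = total number of hashes,
// which is what maxBytes bounds (4 bytes per hash for MinHasher32).
object Banding {
  // Probability that two sets with Jaccard similarity s collide in >= 1 band
  def collisionProb(s: Double, bands: Int, rows: Int): Double =
    1.0 - math.pow(1.0 - math.pow(s, rows), bands)

  // Similarity at which collisionProb crosses roughly 0.5 (the "threshold")
  def approxThreshold(bands: Int, rows: Int): Double =
    math.pow(1.0 / bands, 1.0 / rows)
}
```

The practical upshot: more total hashes let you use more rows per band at the same threshold, which sharpens the S-curve and makes accidental collisions of dissimilar sets (and thus giant buckets) rarer.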

Is there some combination of the two parameters of MinHasher to help speed it up?

Thanks,
Alex

Avi Bryant

Jul 13, 2014, 12:31:56 PM
to Alex Zhicharevich, alge...@googlegroups.com
Your understanding seems good - if you do end up with a large bucket, that could cause reducer skew and a very long run time. Increasing the bytes used for the signatures will increase the number of bands, but more importantly, for the same threshold will increase the size of each band (how many rows of hashes it includes), which will make it less likely that you get a large number of sets in the same bucket. On the other hand, it will also increase the total amount of data and computation in the system, so it may not have the performance impact you're hoping for.

Can you tell me what parameters you are using? Unless you're using a very low threshold or very small signatures, it should already be pretty unlikely that you get a large single bucket unless you have a large number of very similar sets (in which case changing the parameters won't help). This is especially likely in practice when you have a large number of empty sets (which are all identical), or a large number of sets with the same, single item (again, all identical). Those are all going to fall into the same bucket no matter what you do, and a better strategy is to filter them out right at the beginning if you can. (Yes, it would be nice to make the algorithm more robust to this in general, and I'd be curious to hear if anyone has any suggestions for that).
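A hedged sketch of that pre-filter (modeling feature sets as Set[String]; the singleton-frequency cutoff is a made-up illustration value, to be tuned against real data):

```scala
// Drop degenerate sets before minhashing: empty sets, and heavily
// duplicated single-item sets, all collide in the same bucket no
// matter what the MinHasher parameters are.
object PreFilter {
  def keepForHashing(
      features: Set[String],
      singletonCount: Map[String, Long], // how often each item occurs as a lone feature
      maxDup: Long                       // illustrative cutoff on duplicated singletons
  ): Boolean =
    features.size match {
      case 0 => false // empty sets are all identical
      case 1 => singletonCount.getOrElse(features.head, 0L) <= maxDup
      case _ => true
    }
}
```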

Avi


--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex Zhicharevich

Jul 15, 2014, 9:09:31 AM
to alge...@googlegroups.com
Thanks for your help, Avi.

I've computed some stats on the problem:
The number of sets is ~20M. There are no "very common" sets; from what I've seen, the most common set appears ~300 times. Another thing to note is that the sets encode text features, which makes them very sparse.
My original parameters were a threshold of 0.2 and maxBytes = 1024. This resulted in one bucket with 1.1M items and almost 100 buckets with over 100K items. I would like all buckets to be under 30K items, so that the corresponding reducer finishes in a reasonable time.
A different setting was a threshold of 0.5 and maxBytes = 12000, which resulted in a max bucket size of ~12K. That looks reasonable, but I think it produces a lot of buckets per item. Are those reasonable settings?

Thanks,
Alex