Q2 (set intersections) in Hw2


Yonatan

Dec 27, 2012, 10:05:59 AM
to algorithms-in-...@googlegroups.com
Hi,
I am having trouble with Q2 in the home assignment. If we are not allowed to iterate completely through all the sets, it means we must sample elements from them. Estimating the number of distinct elements in our sample can be done efficiently via the frequent-items algorithm, but I don't see a strong correlation between the result on the sample and the number of distinct elements in the original sets.

edo

Dec 27, 2012, 2:20:46 PM
to algorithms-in-...@googlegroups.com
Try thinking about frequency estimation, specifically f_0

Jonathan Avron

Dec 27, 2012, 2:27:31 PM
to algorithms-in-...@googlegroups.com
I agree that f0 gives the number of distinct elements, but only if I stream all the elements through the algorithm. That would require O(sumS) time, not o(sumS).
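
For concreteness, the naive exact computation we are trying to beat would look roughly like the sketch below (the function name is illustrative, not from the assignment):

def exact_union_size(sets):
    # Touches every element of every set, so it runs in Theta(sumS) time,
    # which is exactly the O(sumS) cost we are supposed to avoid.
    union = set()
    for s in sets:
        union |= s
    return len(union)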

Edo Liberty

Dec 28, 2012, 3:08:32 AM
to algorithms-in-...@googlegroups.com
I agree, but you should try changing the algorithm a little bit.
You are very close to the solution.

Ron Bigman

Dec 30, 2012, 12:47:57 PM
to algorithms-in-...@googlegroups.com
I still don't understand how we can reduce the time complexity from O(sum) to o(sum).
A tip would be very helpful at this point.
 
Thanks

On Friday, December 28, 2012 at 10:08:32 UTC+2, edo wrote:

Yonatan

Dec 30, 2012, 2:07:02 PM
to algorithms-in-...@googlegroups.com
The way I did it:
Recall that in the frequent-items algorithm, when estimating f0, we used a fixed number of hash functions and averaged the minimum value each one attained over all the items. So:
1) In preprocessing, I kept the minimum received by each hash function for each set.
2) In "estimate union size", I just took, for each hash function, the minimum over all of the sets.

In my understanding this works.
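
Roughly, in code, it could look like the sketch below (illustrative only: the names NUM_HASHES, hash_to_unit, preprocess and estimate_union_size are made up, and the 1/average - 1 estimator is just one common choice, not necessarily exactly what the lecture used):

import hashlib

NUM_HASHES = 100  # more hash functions -> lower variance

def hash_to_unit(seed, x):
    # Pseudo-random map of element x into [0, 1) using hash function number `seed`.
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def preprocess(s):
    # O(|S|) per set: for each hash function, keep the minimum hash value seen in S.
    return [min(hash_to_unit(j, x) for x in s) for j in range(NUM_HASHES)]

def estimate_union_size(sketches):
    # Query cost depends only on the number of sets and NUM_HASHES, not on sumS:
    # merge the sketches by coordinate-wise minimum, then invert
    # E[min of n uniform values] = 1/(n+1).
    merged = [min(sk[j] for sk in sketches) for j in range(NUM_HASHES)]
    avg_min = sum(merged) / NUM_HASHES
    return 1 / avg_min - 1

# Example: the union of A and B has 1500 distinct elements.
A, B = set(range(1000)), set(range(500, 1500))
print(estimate_union_size([preprocess(A), preprocess(B)]))  # roughly 1500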
-Yonatan.

Ron Bigman

Dec 30, 2012, 2:59:05 PM
to algorithms-in-...@googlegroups.com
Thanks
Ron
On Sunday, December 30, 2012 at 21:07:02 UTC+2, Yonatan wrote:

Ron Bigman

Dec 30, 2012, 3:17:31 PM
to algorithms-in-...@googlegroups.com
Thank you for your reply
 
I have another question.
If the Si are sets, then they have no repeating elements, so what is the point of running the frequent-items estimate for each set separately when it is clear that the expected result should simply be the size of the set?
 
Also, please explain the following example.
Suppose the input is composed of two sets, one very large (let's say 1,000,000 elements) and one very small (let's say one element), and let's assume s = 100.

The minimum for each hash will be taken from the small set, since the hash functions are expected to return minimal values (the expectation is around 1). So it seems like the large set will have no effect whatsoever (in fact, the larger it gets, the less chance it has to have an effect).
 
 
I must be missing something...
 
Thanks
Ron
 

On Sunday, December 30, 2012 at 21:59:05 UTC+2, Ron Bigman wrote:

Jonathan Avron

Dec 30, 2012, 3:25:40 PM
to algorithms-in-...@googlegroups.com
I don't agree that the minimum will come from the smaller set. For each hash function, the minimum comes at random from some single item, so I would expect most hash functions to actually get their minimum from the larger set.
Remember that in frequent items we passed the entire stream through all the hash functions, and that gave a good approximation. Here, in fact, we pass each part of the "stream" through the hash functions during the preprocessing.
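
For intuition: each element's hash value is uniform, so the overall minimum is equally likely to come from any single element of the union. With sets of size 1,000,000 and 1, the small set supplies the minimum with probability only about 1/1000001. A tiny simulation along these lines (hash_to_unit is the same illustrative helper as in the sketch above, and the set sizes are scaled down for speed):

import hashlib
from collections import Counter

def hash_to_unit(seed, x):
    # Pseudo-random map of element x into [0, 1).
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

large = set(range(1, 10001))   # stand-in for the very large set
small = {0}                    # the one-element set

wins = Counter()
for j in range(100):           # s = 100 hash functions
    min_large = min(hash_to_unit(j, x) for x in large)
    min_small = hash_to_unit(j, 0)
    wins["large" if min_large < min_small else "small"] += 1

print(wins)  # the large set supplies the minimum for (almost) every hash function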