tweet ID deletion

Ahmet Uyar

Jul 23, 2020, 11:21:44 AM

to Twister2

Hi Supun,

I implemented the tweet ID membership determination program.

It is at: https://github.com/DSC-SPIDAL/twister-perf/blob/master/src/main/java/iu/iuni/deletion/MembershipFinder3.java

First, we generate tweetID-date pairs and the tweetIDs to be deleted, and write both data sets to HDFS.

Second, we execute the membership finding program.

With N workers, we assume there are N HDFS files for each data set, so each worker reads one file of tweetID-date pairs and one file of delete tweetIDs.

Then, we partition both data sets with hashing.
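As a rough illustration of why hashing helps here: if both data sets are partitioned with the same hash function on the tweetID, a given tweetID lands on the same worker in both, so membership can be checked locally. The sketch below is plain Java and does not use the Twister2 partitioner; the class and method names (HashPartitionSketch, partitionOf, partition) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Twister2 API: hash partitioning of tweet IDs.
final class HashPartitionSketch {

    // Maps a tweet ID to a worker index in [0, numWorkers).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionOf(long tweetId, int numWorkers) {
        return Math.floorMod(Long.hashCode(tweetId), numWorkers);
    }

    // Splits a list of tweet IDs into numWorkers buckets, one per worker.
    static List<List<Long>> partition(List<Long> tweetIds, int numWorkers) {
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long id : tweetIds) {
            buckets.get(partitionOf(id, numWorkers)).add(id);
        }
        return buckets;
    }
}
```

Applying partitionOf with the same numWorkers to the tweetID-date pairs and to the delete tweetIDs sends matching IDs to the same bucket index.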

We persist both data sets after partitioning.

Then, when writing to HDFS, we write only the tweetIDs that exist in both TSets.

I am not sure about this last step. How should we check the membership? Should we use a join or an item-by-item comparison? Currently, I copy all delete tweetIDs into a TreeSet, iterate over the tweetID-date pairs, and check whether each tweetID exists in the TreeSet.
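To make the item-by-item approach concrete, here is a minimal sketch in plain Java. It is not the code in MembershipFinder3.java; the class name, the method name, and the choice of String for the date are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of the item-by-item check on one worker's partition.
final class TreeSetMembershipSketch {

    // Returns the tweet IDs from the pairs that also appear in deleteIds.
    // The date is modeled as a String here, which is an assumption.
    static List<Long> findMembers(List<Map.Entry<Long, String>> tweetIdDatePairs,
                                  List<Long> deleteIds) {
        // Copy the delete IDs into a sorted set; contains() is O(log n).
        TreeSet<Long> deleteSet = new TreeSet<>(deleteIds);

        List<Long> found = new ArrayList<>();
        for (Map.Entry<Long, String> pair : tweetIdDatePairs) {
            if (deleteSet.contains(pair.getKey())) {
                found.add(pair.getKey());
            }
        }
        return found;
    }
}
```

The result keeps the order of the tweetID-date pairs, and the whole delete list for the partition must fit in memory.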

I tested it on a small dataset and it works. But I am not sure whether it is the correct algorithm.

Ahmet

Supun Kamburugamuve

Jul 24, 2020, 2:01:26 AM

to Ahmet Uyar, Twister2

Hi Ahmet,

I think we can assume the second input is small and keep it in an in-memory hash table. As we go through the larger list, we can look up each tweetID in this hash table.
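A minimal sketch of this suggestion, assuming the second input is the list of delete tweetIDs and that it fits in memory. The larger list is consumed through an Iterator so it never has to be held in memory at once. The names HashLookupSketch and scan are hypothetical and nothing here is Twister2 API.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.function.LongConsumer;

// Hypothetical sketch: stream the large input and probe a small in-memory set.
final class HashLookupSketch {

    // Calls onMatch for every ID in the large input that is in deleteSet
    // and returns the number of matches. With a HashSet, each lookup is
    // expected O(1), so the scan is linear in the size of the large input.
    static long scan(Iterator<Long> largeTweetIds, Set<Long> deleteSet, LongConsumer onMatch) {
        long matches = 0;
        while (largeTweetIds.hasNext()) {
            long id = largeTweetIds.next();
            if (deleteSet.contains(id)) {
                onMatch.accept(id);
                matches++;
            }
        }
        return matches;
    }
}
```

Passing a java.util.HashSet as deleteSet gives the hash lookup described above; the onMatch callback stands in for whatever writes the matching tweetIDs out.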

Best,

Supun..


Supun Kamburugamuve, PhD

Digital Science Center, Indiana University

Member, Apache Software Foundation; http://www.apache.org
E-mail: supun@apache.org; Mobile: +1 812 219 2563
