tweet ID deletion


Ahmet Uyar

Jul 23, 2020, 11:21:44 AM
to Twister2
Hi Supun,

I implemented the tweet ID membership determination program. 

First, we generate tweetID-date pairs and tweetIDs to be deleted. We write these data to HDFS.

Second, we execute the membership finding program. 
When there are N workers, we assume there are N HDFS files for each of the two datasets (tweetID-date pairs and delete tweetIDs). So, each worker reads one HDFS file from each dataset.
Then, we partition both datasets by hashing on the tweetID.
We persist both datasets after partitioning.
Then, when writing to HDFS, we only write the tweetIDs that exist in both TSets.
I am not sure about this last step. How should we check membership? Should we use a join or an item-by-item comparison? Currently, I copy all the delete tweetIDs into a TreeSet, iterate over the tweetID-date pairs, and check whether each tweetID exists in the TreeSet.
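
The steps above can be sketched in plain Java, without the Twister2 API. This is only an illustration of the idea, not the actual program: the class name PartitionedMembership, the helper partitionOf, and the in-memory lists standing in for HDFS files and workers are all assumptions. The point it shows is that when both datasets are partitioned with the same hash function, a tweetID and its delete entry always land in the same partition, so each worker can check membership locally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java sketch; names and structure are hypothetical, not Twister2 code.
final class PartitionedMembership {

  // Assign a tweet ID to one of numWorkers partitions.
  // floorMod keeps the index non-negative even if the hash code is negative.
  static int partitionOf(long tweetId, int numWorkers) {
    return Math.floorMod(Long.hashCode(tweetId), numWorkers);
  }

  // Returns the tweetID-date pairs whose ID also appears in deleteIds,
  // computed partition by partition as each worker would do locally.
  static Map<Long, String> findMembers(Map<Long, String> idToDate,
                                       List<Long> deleteIds,
                                       int numWorkers) {
    // Simulated shuffle: group both inputs with the same partition function.
    List<Map<Long, String>> pairParts = new ArrayList<>();
    List<Set<Long>> deleteParts = new ArrayList<>();
    for (int i = 0; i < numWorkers; i++) {
      pairParts.add(new HashMap<>());
      deleteParts.add(new HashSet<>());
    }
    idToDate.forEach((id, date) ->
        pairParts.get(partitionOf(id, numWorkers)).put(id, date));
    for (long id : deleteIds) {
      deleteParts.get(partitionOf(id, numWorkers)).add(id);
    }

    // Local step: each "worker" only compares its own two partitions.
    Map<Long, String> result = new TreeMap<>();
    for (int w = 0; w < numWorkers; w++) {
      Set<Long> localDeletes = deleteParts.get(w);
      for (Map.Entry<Long, String> e : pairParts.get(w).entrySet()) {
        if (localDeletes.contains(e.getKey())) {
          result.put(e.getKey(), e.getValue());
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Long, String> pairs = new HashMap<>();
    pairs.put(101L, "2020-07-01");
    pairs.put(102L, "2020-07-02");
    pairs.put(103L, "2020-07-03");
    List<Long> deletes = List.of(102L, 999L);
    // Prints the pairs whose tweetID is also in the delete list.
    System.out.println(findMembers(pairs, deletes, 4));
  }
}
```

With this layout, no worker needs to see the other workers' delete tweetIDs, which matters only if the delete list is too large to hold on every worker.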

I tested it on a small dataset and it works. But I am not sure whether it is the correct algorithm.  

Ahmet

Supun Kamburugamuve

Jul 24, 2020, 2:01:26 AM
to Ahmet Uyar, Twister2
Hi Ahmet,

I think we can assume the second input (the delete tweetIDs) is small and keep it in an in-memory hash table. As we go through the larger list, we can look up each tweetID in this table.
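
A rough sketch of this lookup in plain Java, under the assumption that the delete tweetIDs fit in memory on each worker. The class HashLookupFilter, the TweetDate type, and the method name matching are made up for illustration and are not part of Twister2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; plain Java collections only.
final class HashLookupFilter {

  // Hypothetical holder for one tweetID-date pair.
  static final class TweetDate {
    final long tweetId;
    final String date;

    TweetDate(long tweetId, String date) {
      this.tweetId = tweetId;
      this.date = date;
    }
  }

  // Load the small input into a HashSet once, then stream over the
  // larger input and keep only the pairs whose ID is in the set.
  static List<TweetDate> matching(Iterable<TweetDate> pairs,
                                  Collection<Long> deleteIds) {
    Set<Long> lookup = new HashSet<>(deleteIds);
    List<TweetDate> hits = new ArrayList<>();
    for (TweetDate p : pairs) {
      if (lookup.contains(p.tweetId)) {
        hits.add(p);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<TweetDate> pairs = Arrays.asList(
        new TweetDate(1L, "2020-07-01"),
        new TweetDate(2L, "2020-07-02"),
        new TweetDate(3L, "2020-07-03"));
    for (TweetDate hit : matching(pairs, Arrays.asList(2L, 3L, 99L))) {
      System.out.println(hit.tweetId + " " + hit.date);
    }
  }
}
```

A HashSet gives expected constant-time lookups, while a TreeSet takes logarithmic time per lookup and keeps the IDs sorted, which this check does not need.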

Best,
Supun..

--
Supun Kamburugamuve, PhD
Digital Science Center, Indiana University
Member, Apache Software Foundation; http://www.apache.org
E-mail: supun@apache.org;  Mobile: +1 812 219 2563

