Hi Supun,
I implemented the tweet ID membership determination program.
First, we generate tweetID-date pairs and tweetIDs to be deleted. We write these data to HDFS.
Second, we execute the membership finding program.
With N workers, we assume there are N HDFS files for each data set (tweetID-date pairs and tweetIDs to delete). So each worker reads one HDFS file of tweetID-date pairs and one of tweetIDs to delete.
Then, we partition both data sets with hashing.
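The point of hashing both data sets the same way is that a given tweetID lands on the same worker in both, so each worker can check membership locally. Here is a minimal sketch of that idea in plain Java. It is not the Twister2 partitioner; the class and method names are made up for this email, and I am assuming the hash key is the tweetID as a long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only, not part of Twister2.
class HashPartitionSketch {

    // Maps a tweetID to a worker index in [0, numWorkers).
    // floorMod keeps the result non-negative even if the hash is negative.
    static int partitionOf(long tweetId, int numWorkers) {
        return Math.floorMod(Long.hashCode(tweetId), numWorkers);
    }

    // Splits a list of tweetIDs into one bucket per worker.
    static List<List<Long>> partition(List<Long> tweetIds, int numWorkers) {
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long id : tweetIds) {
            buckets.get(partitionOf(id, numWorkers)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Long> pairIds = Arrays.asList(1L, 2L, 5L, 8L);
        List<Long> deleteIds = Arrays.asList(5L, 8L, 11L);
        // Both data sets go through the same function, so an ID that appears
        // in both ends up in the bucket with the same index.
        System.out.println(partition(pairIds, 4));
        System.out.println(partition(deleteIds, 4));
    }
}
```

As long as both data sets use the same function and the same number of workers, no worker needs to look at another worker's partition for the membership check.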
We persist both data sets after partitioning.
Then, when writing to HDFS, we only write the tweetIDs that exist in both tsets.
I am not sure about this last step. How should we check the membership? Should we use a join or an item-by-item comparison? Currently, I copy all the tweetIDs to delete into a TreeSet, iterate over the tweetID-date pairs, and check whether each tweetID exists in the TreeSet.
I tested it on a small dataset and it works. But I am not sure whether it is the correct algorithm.
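To make the last step concrete, here is a simplified sketch of the check each worker does on its own partition. It uses plain Java collections rather than the actual tset code; the class and method names are made up for this email, and representing the date as a String is an assumption.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only, not the actual program.
class MembershipSketch {

    // Returns the tweetIDs from the tweetID-date pairs that also appear
    // in the list of tweetIDs to delete, in the order of the pairs.
    static List<Long> idsInBoth(List<Map.Entry<Long, String>> pairs,
                                List<Long> deleteIds) {
        // Copy the delete IDs into a sorted set for lookups.
        Set<Long> deleteSet = new TreeSet<>(deleteIds);
        List<Long> found = new ArrayList<>();
        for (Map.Entry<Long, String> pair : pairs) {
            if (deleteSet.contains(pair.getKey())) {
                found.add(pair.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleImmutableEntry<>(1L, "2020-01-01"));
        pairs.add(new SimpleImmutableEntry<>(2L, "2020-01-02"));
        pairs.add(new SimpleImmutableEntry<>(3L, "2020-01-03"));
        List<Long> deleteIds = Arrays.asList(3L, 2L, 9L);
        System.out.println(idsInBoth(pairs, deleteIds)); // prints [2, 3]
    }
}
```

Since the sorted order of the TreeSet is not used here, a HashSet would probably do the same job with constant expected lookup time instead of logarithmic. Either way this only works if the delete IDs of one partition fit in memory on the worker.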
Ahmet