Hello,
I have a question regarding how to sort the input data for the sweep operation. My understanding is that we need to scan through the adjacency list data in descending order of Personalized PageRank (PPR) that we compute using our PPR program for a particular seed vertex.
Necessarily, this operation would require us to resort the adjacency list, which could be very time consuming and would require a large amount of disk space (unless we'd want to overwrite the original data). Of course though, once re-sorted, we'd only have to pass through the data once to do the sweep.
Question: Can we skip all of this disk re-shuffling all together?
If we assume that the number of non-zero PPR vertices is small, then we could keep a set of these non-zero PPR vertices in memory. We could then scan through the adjacency list and collect the full vertex-neighbor information for each vertex in the non-zero PPR set. Then, we could re-order this significantly smaller subset of the data in main memory and then perform the sweep operation.
Is this acceptable?
- Malcolm