Google Groups

Calculating a Graph's Degree Distribution using R MapReduce over Hadoop

Marko A. Rodriguez Feb 3, 2012 12:02 AM
Posted in group: Gremlin-users
Howdy ho,

So no one answered my question from the email entitled "Calculating In-Degree using R MapReduce over Hadoop."

        "Can you tell me how to calculate a graph's degree distribution? --- HINT: its this MapReduce job composed with another."

As we saw before, this is how you calculate the in-degree for every vertex in the graph. Lets do it again, but this time, over a larger random graph -- 100 vertices and 10,000 edges.

> graph <- matrix(,10000,replace=TRUE),ncol=2)
> graph[1:5,]
     [,1] [,2]
[1,]   97    4
[2,]   87   63
[3,]    9   79
[4,]   92   62
[5,]   22   77
> x <- to.dfs(graph)
> y <- mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))

The variable y is a reference to an HDFS file that has the in-degree of every vertex. We can then use this result to produce a degree distribution. First, what is a degree distribution? It is the histogram/density of the various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of 4? etc.... This is the MapReduce job that does the trick.

> z <- mapreduce(y, map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k, length(v)))

        MAP: Take all the vertex degrees and emit each degree with the the value 1.
        REDUCE: Take all the degrees and their list of 1s and count their list -- that is the distribution for each degree.

Now lets plot this in R. First we load the distribution into memory (which is much smaller than the graph.

> dist <- from.dfs(z)
> length(dist)
[1] 27
> hist(matrix(c(as.numeric(keys(dist)),as.numeric(values(dist))),nrow=2),breaks=10,main="Random Graph Degree Distribution",xlab="in-degree")