Howdy ho,

So no one answered my question from the email entitled "Calculating In-Degree using R MapReduce over Hadoop."

"Can you tell me how to calculate a graph's degree distribution? --- HINT: its this MapReduce job composed with another."

As we saw before, this is how you calculate the in-degree for every vertex in the graph. Lets do it again, but this time, over a larger random graph -- 100 vertices and 10,000 edges.

> graph <- matrix(sample.int(100,10000,replace=TRUE),ncol=2)

> graph[1:5,]

[,1] [,2]

[1,] 97 4

[2,] 87 63

[3,] 9 79

[4,] 92 62

[5,] 22 77

> x <- to.dfs(graph)

> y <- mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))

The variable y is a reference to an HDFS file that has the in-degree of every vertex. We can then use this result to produce a degree distribution. First, what is a degree distribution? It is the histogram/density of the various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of 4? etc.... This is the MapReduce job that does the trick.

> z <- mapreduce(y, map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k, length(v)))

MAP: Take all the vertex degrees and emit each degree with the the value 1.

REDUCE: Take all the degrees and their list of 1s and count their list -- that is the distribution for each degree.

Now lets plot this in R. First we load the distribution into memory (which is much smaller than the graph.

> dist <- from.dfs(z)

> length(dist)

[1] 27

> hist(matrix(c(as.numeric(keys(dist)),as.numeric(values(dist))),nrow=2),breaks=10,main="Random Graph Degree Distribution",xlab="in-degree")