Calculating In-Degree using R MapReduce over Hadoop

Marko Rodriguez

unread,

Feb 2, 2012, 3:18:39 AM2/2/12

to gremli...@googlegroups.com

Hi,

I'm playing with RHadoop which is an R package that lets you do MapReduce jobs over Hadoop.

https://github.com/RevolutionAnalytics/RHadoop

I have a pseudo Hadoop cluster running on my local machine. I was playing around with it and all of a sudden thought: "Ah! Its easy to calculate vertex in-degree!" Want to see how it is done?

Lets create a random graph with 10 vertices and 100 edges and represent it as as a two-column table (i.e. an edge list).

> graph <- matrix(sample.int(10,100,replace=TRUE),50)

> graph[1:5,]

[,1] [,2]

[1,] 6 2

[2,] 2 2

[3,] 4 10

[4,] 9 10

[5,] 5 7

The first row says that vertex 6 is connected to vertex 2.

Now lets dump that data into HDFS:

> x <- to.dfs(graph)

This creates the variable x which is a reference to the file in HDFS. Lets look at one key/value pair in that file.

> from.dfs(x)[1]

[[1]]

[[1]]$key

[1] 6

[[1]]$val

[1] 6 2

attr(,"rmr.keyval")

[1] TRUE

We can ignore the key -- what matters is the value is the edge data that says that vertex 6 is connected to vertex 2.

Now lets calculate the in-degree of all vertices in the graph.

> y <- mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))

12/02/02 01:02:51 INFO mapred.FileInputFormat: Total input paths to process : 1

12/02/02 01:02:51 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-marko/mapred/local]

12/02/02 01:02:51 INFO streaming.StreamJob: Running job: job_201202012324_0008

12/02/02 01:02:51 INFO streaming.StreamJob: To kill this job, run:

12/02/02 01:02:51 INFO streaming.StreamJob: /Applications/hadoop/hadoop-0.20.203.0/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201202012324_0008

12/02/02 01:02:51 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201202012324_0008

12/02/02 01:02:52 INFO streaming.StreamJob: map 0% reduce 0%

12/02/02 01:03:06 INFO streaming.StreamJob: map 100% reduce 0%

12/02/02 01:03:18 INFO streaming.StreamJob: map 100% reduce 100%

12/02/02 01:03:24 INFO streaming.StreamJob: Job complete: job_201202012324_0008

12/02/02 01:03:24 INFO streaming.StreamJob: Output: /var/folders/W7/W7Tviv8iFv8r508qXHlEs++++TM/-Tmp-//Rtmpan4VFX/file3e1ae6f

The map job says, emit key/value pairs where the key is the head vertex of the edge (e.g. 6) and the value is just the number 1.

The reduce job says, emit key/value pairs where the key is the key (e.g. 6) and the value is the number of 1s in the value (i.e. the in-degree)

Now we can check our results which was written to the file reference in HDFS reference by y.

> from.dfs(y)[3]

[[1]]

[[1]]$key

[1] 6

[[1]]$val

[1] 2

The result says that 6 has an in-degree of 2. Lets check using standard R constructs.

> table(graph[,2])

1 2 3 4 5 6 7 8 9 10

3 6 5 4 6 2 9 2 4 9

Yep. 6 is seen 2 times in the second column.

Tada! In-Degree using R and Hadoop. Moreover, its just a one-liner in R:

mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))

Can you tell me how to calculate a graph's degree distribution? --- HINT: its this MapReduce job composed with another.

Hope that was enlightening,

Marko.

http://markorodriguez.com

Antonio Piccolboni

unread,

Feb 3, 2012, 2:40:52 AM2/3/12

to gremli...@googlegroups.com

Hi, I am the developer of the rmr package used here and besides being pleased with seeing it put to good use, since there is an interest in doing graph algorithms with it, let me point to you an implementation of connected components on my blog

http://blog.piccolboni.info/2011/09/connected-components-example-rewritten.html

Please do not hesitate with comments or questions.

Marko Rodriguez

unread,

Feb 3, 2012, 3:02:36 AM2/3/12

to gremli...@googlegroups.com

Howdy ho,

So no one answered my question from the email entitled "Calculating In-Degree using R MapReduce over Hadoop."

"Can you tell me how to calculate a graph's degree distribution? --- HINT: its this MapReduce job composed with another."

As we saw before, this is how you calculate the in-degree for every vertex in the graph. Lets do it again, but this time, over a larger random graph -- 100 vertices and 10,000 edges.

> graph <- matrix(sample.int(100,10000,replace=TRUE),ncol=2)

> graph[1:5,]
[,1] [,2]

[1,] 97 4
[2,] 87 63
[3,] 9 79
[4,] 92 62
[5,] 22 77
> x <- to.dfs(graph)

> y <- mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))

The variable y is a reference to an HDFS file that has the in-degree of every vertex. We can then use this result to produce a degree distribution. First, what is a degree distribution? It is the histogram/density of the various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of 4? etc.... This is the MapReduce job that does the trick.

> z <- mapreduce(y, map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k, length(v)))

MAP: Take all the vertex degrees and emit each degree with the the value 1.
REDUCE: Take all the degrees and their list of 1s and count their list -- that is the distribution for each degree.

Now lets plot this in R. First we load the distribution into memory (which is much smaller than the graph.

> dist <- from.dfs(z)
> length(dist)
[1] 27
> hist(matrix(c(as.numeric(keys(dist)),as.numeric(values(dist))),nrow=2),breaks=10,main="Random Graph Degree Distribution",xlab="in-degree")

Rplot.pdf

Peter Neubauer

unread,

Feb 3, 2012, 4:15:52 AM2/3/12

to gremli...@googlegroups.com

Marko,
Love this. It demands a screencast to replicate!

Send from a device with crappy keyboard and autocorrection.

/peter

Here is the algorithm in composite starting from an edge list and yielding an in-degree distribution:

mapreduce(
mapreduce(to.dfs(graph),
map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v))) ,
map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k, length(v)))))

Boo yea!,
Marko.

http://markorodriguez.com

Jack

unread,

Feb 3, 2012, 5:03:01 AM2/3/12

to Gremlin-users

Very interesting, but isn't there a way to accomplish this with pure
Gremlin?
I am struggling for awhile now with a similar problem of how to
calculate node's statistical deviation of edges. I know how to do it
using groovy but having it in pure Gremlin would be nicer, would be
happy to know if its impossible :)

(the statistical deviation function requires a [N:3] matrix containing
for each row [EdgeType, Degree group, Amount of nodes that contain
this degree level for the specific edge type])

> > hist(matrix(c(as.numeric(keys(dist)),as.numeric(values(dist))),nrow=2),breaks=10,main="Random Graph Degree Distribution",xlab="in-degree")
>
>
>
> Rplot.pdf
> 140Kהצגהורדה

>
>
>
> Here is the algorithm in composite starting from an edge list and yielding an in-degree distribution:
>
> mapreduce(
> mapreduce(to.dfs(graph),
> map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v))) ,
> map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k, length(v)))))
>
> Boo yea!,
> Marko.
>

> http://markorodriguez.com-הסתר טקסט מצוטט-
>
> -הראה טקסט מצוטט-

Pierre De Wilde

unread,

Feb 3, 2012, 5:34:56 AM2/3/12

to gremli...@googlegroups.com

Hi,

Very interesting, but isn't there a way to accomplish this with pure Gremlin?

Here is the Graph's Degree Distribution in Gremlin:

m = [:].withDefault{0};

g.V.sideEffect{m[it.both.count()]+=1}.iterate();

m.sort{a,b->a.key<=>b.key};

To get In-degree or out-degree distributions, replace it.both.count() by it.in.count() or it.out.count()

HTH,

Pierre

Jack

unread,

Feb 3, 2012, 10:44:38 AM2/3/12

to Gremlin-users

Wow! thanks. finally I also understand what the sideEffect is good
for.

> > > -הראה טקסט מצוטט-- Hide quoted text -
>
> - Show quoted text -

Reply all

Reply to author

Forward