I have a pseudo Hadoop cluster running on my local machine. I was playing around with it and all of a sudden thought: "Ah! Its easy to calculate vertex in-degree!" Want to see how it is done?
Lets create a random graph with 10 vertices and 100 edges and represent it as as a two-column table (i.e. an edge list).
We can ignore the key -- what matters is the value is the edge data that says that vertex 6 is connected to vertex 2.
Now lets calculate the in-degree of all vertices in the graph.
> y <- mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))
12/02/02 01:02:51 INFO mapred.FileInputFormat: Total input paths to process : 1 12/02/02 01:02:51 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-marko/mapred/local] 12/02/02 01:02:51 INFO streaming.StreamJob: Running job: job_201202012324_0008 12/02/02 01:02:51 INFO streaming.StreamJob: To kill this job, run: 12/02/02 01:02:51 INFO streaming.StreamJob: /Applications/hadoop/hadoop-0.20.203.0/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201202012324_0008 12/02/02 01:02:51 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201202012324_0008 12/02/02 01:02:52 INFO streaming.StreamJob: map 0% reduce 0% 12/02/02 01:03:06 INFO streaming.StreamJob: map 100% reduce 0% 12/02/02 01:03:18 INFO streaming.StreamJob: map 100% reduce 100% 12/02/02 01:03:24 INFO streaming.StreamJob: Job complete: job_201202012324_0008 12/02/02 01:03:24 INFO streaming.StreamJob: Output: /var/folders/W7/W7Tviv8iFv8r508qXHlEs++++TM/-Tmp-//Rtmpan4VFX/file3e1ae6f
The map job says, emit key/value pairs where the key is the head vertex of the edge (e.g. 6) and the value is just the number 1. The reduce job says, emit key/value pairs where the key is the key (e.g. 6) and the value is the number of 1s in the value (i.e. the in-degree)
Now we can check our results which was written to the file reference in HDFS reference by y.
> from.dfs(y)[3]
[[1]] [[1]]$key [1] 6 [[1]]$val [1] 2
The result says that 6 has an in-degree of 2. Lets check using standard R constructs.
> table(graph[,2])
1 2 3 4 5 6 7 8 9 10 3 6 5 4 6 2 9 2 4 9
Yep. 6 is seen 2 times in the second column.
Tada! In-Degree using R and Hadoop. Moreover, its just a one-liner in R:
Hi, I am the developer of the rmr package used here and besides being pleased with seeing it put to good use, since there is an interest in doing graph algorithms with it, let me point to you an implementation of connected components on my blog
So no one answered my question from the email entitled "Calculating In-Degree using R MapReduce over Hadoop."
"Can you tell me how to calculate a graph's degree distribution? --- HINT: its this MapReduce job composed with another."
As we saw before, this is how you calculate the in-degree for every vertex in the graph. Lets do it again, but this time, over a larger random graph -- 100 vertices and 10,000 edges.
> x <- to.dfs(graph) > y <- mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))
The variable y is a reference to an HDFS file that has the in-degree of every vertex. We can then use this result to produce a degree distribution. First, what is a degree distribution? It is the histogram/density of the various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of 4? etc.... This is the MapReduce job that does the trick.
> z <- mapreduce(y, map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k, length(v)))
MAP: Take all the vertex degrees and emit each degree with the the value 1. REDUCE: Take all the degrees and their list of 1s and count their list -- that is the distribution for each degree.
Now lets plot this in R. First we load the distribution into memory (which is much smaller than the graph.
> So no one answered my question from the email entitled "Calculating > In-Degree using R MapReduce over Hadoop."
> "Can you tell me how to calculate a graph's degree distribution? > --- HINT: its this MapReduce job composed with another."
> As we saw before, this is how you calculate the in-degree for every vertex > in the graph. Lets do it again, but this time, over a larger random graph > -- 100 vertices and 10,000 edges.
> The variable y is a reference to an HDFS file that has the in-degree of > every vertex. We can then use this result to produce a degree distribution. > First, what is a degree distribution? It is the histogram/density of the > various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of > 4? etc.... This is the MapReduce job that does the trick.
> MAP: Take all the vertex degrees and emit each degree with the the > value 1. > REDUCE: Take all the degrees and their list of 1s and count their > list -- that is the distribution for each degree.
> Now lets plot this in R. First we load the distribution into memory (which > is much smaller than the graph.
Very interesting, but isn't there a way to accomplish this with pure
Gremlin?
I am struggling for awhile now with a similar problem of how to
calculate node's statistical deviation of edges. I know how to do it
using groovy but having it in pure Gremlin would be nicer, would be
happy to know if its impossible :)
(the statistical deviation function requires a [N:3] matrix containing
for each row [EdgeType, Degree group, Amount of nodes that contain
this degree level for the specific edge type])
On 3 פברואר, 10:02, Marko Rodriguez <okramma...@gmail.com> wrote:
> So no one answered my question from the email entitled "Calculating In-Degree using R MapReduce over Hadoop."
> "Can you tell me how to calculate a graph's degree distribution? --- HINT: its this MapReduce job composed with another."
> As we saw before, this is how you calculate the in-degree for every vertex in the graph. Lets do it again, but this time, over a larger random graph -- 100 vertices and 10,000 edges.
> > x <- to.dfs(graph)
> > y <- mapreduce(x, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k, length(v)))
> The variable y is a reference to an HDFS file that has the in-degree of every vertex. We can then use this result to produce a degree distribution. First, what is a degree distribution? It is the histogram/density of the various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of 4? etc.... This is the MapReduce job that does the trick.
> > z <- mapreduce(y, map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k, length(v)))
> MAP: Take all the vertex degrees and emit each degree with the the value 1.
> REDUCE: Take all the degrees and their list of 1s and count their list -- that is the distribution for each degree.
> Now lets plot this in R. First we load the distribution into memory (which is much smaller than the graph.
> calculate node's statistical deviation of edges. I know how to do it > using groovy but having it in pure Gremlin would be nicer, would be > happy to know if its impossible :)
> (the statistical deviation function requires a [N:3] matrix containing > for each row [EdgeType, Degree group, Amount of nodes that contain > this degree level for the specific edge type])
> On 3 פברואר, 10:02, Marko Rodriguez <okramma...@gmail.com> wrote: > > Howdy ho,
> > So no one answered my question from the email entitled "Calculating > In-Degree using R MapReduce over Hadoop."
> > "Can you tell me how to calculate a graph's degree distribution? > --- HINT: its this MapReduce job composed with another."
> > As we saw before, this is how you calculate the in-degree for every > vertex in the graph. Lets do it again, but this time, over a larger random > graph -- 100 vertices and 10,000 edges.
> > > x <- to.dfs(graph) > > > y <- mapreduce(x, map=function(k,v) keyval(v[2],1), > reduce=function(k,v) keyval(k, length(v)))
> > The variable y is a reference to an HDFS file that has the in-degree of > every vertex. We can then use this result to produce a degree distribution. > First, what is a degree distribution? It is the histogram/density of the > various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of > 4? etc.... This is the MapReduce job that does the trick.
> > MAP: Take all the vertex degrees and emit each degree with the > the value 1. > > REDUCE: Take all the degrees and their list of 1s and count > their list -- that is the distribution for each degree.
> > Now lets plot this in R. First we load the distribution into memory > (which is much smaller than the graph.
> Very interesting, but isn't there a way to accomplish this with
> > pure Gremlin?
> Here is the Graph's Degree Distribution in Gremlin:
> m = [:].withDefault{0};
> g.V.sideEffect{m[it.both.count()]+=1}.iterate();
> m.sort{a,b->a.key<=>b.key};
> To get In-degree or out-degree distributions, replace it.both.count() by
> it.in.count() or it.out.count()
> HTH,
> Pierre
> I am struggling for awhile now with a similar problem of how to
> > calculate node's statistical deviation of edges. I know how to do it
> > using groovy but having it in pure Gremlin would be nicer, would be
> > happy to know if its impossible :)
> > (the statistical deviation function requires a [N:3] matrix containing
> > for each row [EdgeType, Degree group, Amount of nodes that contain
> > this degree level for the specific edge type])
> > On 3 פברואר, 10:02, Marko Rodriguez <okramma...@gmail.com> wrote:
> > > Howdy ho,
> > > So no one answered my question from the email entitled "Calculating
> > In-Degree using R MapReduce over Hadoop."
> > > "Can you tell me how to calculate a graph's degree distribution?
> > --- HINT: its this MapReduce job composed with another."
> > > As we saw before, this is how you calculate the in-degree for every
> > vertex in the graph. Lets do it again, but this time, over a larger random
> > graph -- 100 vertices and 10,000 edges.
> > > The variable y is a reference to an HDFS file that has the in-degree of
> > every vertex. We can then use this result to produce a degree distribution.
> > First, what is a degree distribution? It is the histogram/density of the
> > various degrees -- i.e. how many nodes have a degree of 1? of 2? of 3? of
> > 4? etc.... This is the MapReduce job that does the trick.
> > > MAP: Take all the vertex degrees and emit each degree with the
> > the value 1.
> > > REDUCE: Take all the degrees and their list of 1s and count
> > their list -- that is the distribution for each degree.
> > > Now lets plot this in R. First we load the distribution into memory
> > (which is much smaller than the graph.