pagerank input data

Joseph Wang

Feb 8, 2012, 6:01:25 PM
to DataFu
I have a question about the input data for PageRank. The documentation
states that each row should contain <int><int><weight>. What is the
<weight>? Is it the number of times the source page references a specific link?

Matt Hayes

Feb 8, 2012, 6:45:21 PM
to dat...@googlegroups.com
Hi Joseph,

It's up to you what you want the weight to represent. The short answer is to use 1 unless you want more fine-grained influence over the connection strength between nodes.

It's actually good that you brought this up.  I think we should simplify the API of the UDF so that the default is to not require a weight.  If you want a weight you should have to supply a config parameter.

The longer answer is that it basically affects how "PageRank juice" is distributed to outgoing nodes. For a standard PageRank implementation the weight would just be 1, which means that the PageRank juice is distributed evenly to the target nodes, so each outgoing node receives an equal amount. We have a weight in order to support biasing towards certain target nodes. To use a very simple example, suppose that you had a web page A which links to two pages B and C. With standard PageRank each link has a weight of 1 and so each page gets an equal share of PageRank juice. But suppose you decided that the link on page A to page B was more important than the link to C, and so you assign the link to B a weight of 2. This means that page B will get twice as much juice from A as C gets, and so page B will have a higher PageRank as a result. Why the link is more important is an application-specific question and it's up to you how you choose it.
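
The A, B, C example can be sketched in a few lines of Python. This is only an illustration of proportional weighting, assuming each link receives a share equal to its weight divided by the total outgoing weight; `split_rank` is a hypothetical helper, not part of DataFu.

```python
def split_rank(rank, weighted_links):
    """Split a node's rank across its outgoing links in proportion to weight.

    Hypothetical helper for illustration only; not DataFu code.
    """
    total = sum(weighted_links.values())
    return {target: rank * weight / total
            for target, weight in weighted_links.items()}


# Page A links to B with weight 2 and to C with weight 1.
shares = split_rank(1.0, {"B": 2.0, "C": 1.0})
print(shares)

# With equal weights the rank is split evenly, as in standard PageRank.
even_shares = split_rank(1.0, {"B": 1.0, "C": 1.0})
print(even_shares)
```

With weights 2 and 1, B receives two thirds of what A passes on and C receives one third.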

-Matt

Joseph Wang

Feb 10, 2012, 6:28:16 PM
to dat...@googlegroups.com
My input directory contains text files in the following format
<source id><tab><target id><tab><1.0>

When I run -check, I get an error saying the input format is not correct. Do you have an example of the input format?


$ pig -check pagerank.pig
2012-02-10 15:25:45,602 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/jwang/apache-nutch-1.4/runtime/local/pig_1328916345600.log
2012-02-10 15:25:45,919 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://hadooprsonn001.bo1.shopzilla.sea:8020/
2012-02-10 15:25:46,245 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: hadooprsojt001.bo1.shopzilla.sea:8021
2012-02-10 15:25:46,577 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Expected input bag to contain a TUPLE, but instead found int
Details at logfile: /home/jwang/apache-nutch-1.4/runtime/local/pig_1328916345600.log


Thanks in advance,

Joseph



Matt Hayes

Feb 10, 2012, 6:41:13 PM
to dat...@googlegroups.com
The UDF expects the input data to be grouped in a particular way (for efficiency reasons).  See the PageRank example here:


It has a sample input and output file that should work with the Pig script in the page.  This was prepared from the graph in the Wikipedia PageRank example.

-Matt

Joseph Wang

Feb 10, 2012, 6:46:58 PM
to dat...@googlegroups.com
Getting error

Not Found

The requested URL /common.php was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.


I also noticed the API change. It now takes (topic, source, dest, weight). Is topic = source in most use cases?


Matt Hayes

Feb 10, 2012, 6:57:31 PM
to dat...@googlegroups.com
This link doesn't work for you?


Weird, it is working for me.  Can you try again?

The topic is a label used to identify each graph to run PageRank on.  If you're only running on a single graph then you can just use something like 0 for the topic.  

Here is the script from the page above:
  define PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');
  
  topic_edges = LOAD 'input_edges' as (topic:INT,source:INT,dest:INT,weight:DOUBLE);
  
  topic_edges_grouped = GROUP topic_edges by (topic, source);
  topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
    group.topic as topic,
    group.source as source,
    topic_edges.(dest,weight) as edges;
  
  topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic; 
  
  topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
    group as topic,
    FLATTEN(PageRank(topic_edges_grouped.(source,edges))) as (source,rank);
  
  skill_ranks = FOREACH topic_ranks GENERATE
    topic, source, rank;

Sample input:
0	2	3	1.0
0	3	2	1.0
0	4	1	1.0
0	4	2	1.0
0	5	4	1.0
0	5	2	1.0
0	5	6	1.0
0	6	5	1.0
0	6	2	1.0
0	100	2	1.0
0	100	5	1.0
0	101	2	1.0
0	101	5	1.0
0	102	2	1.0
0	102	5	1.0
0	103	5	1.0
0	104	5	1.0

This is from the PageRank example graph in Wikipedia: http://en.wikipedia.org/wiki/File:PageRanks-Example.svg

Note that the topic is 0 since there is just a single graph.  Nodes A-F were given IDs 1-6, the remainder 100+.

The Pig script groups the input data into a form the UDF can use.
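
To make the grouped shape concrete, here is a rough Python analogue of the two GROUP steps, using the first four rows of the sample input. It only mirrors the nesting (topic, then source, then a list of (dest, weight) edges); `group_edges` is a hypothetical name, and the real UDF consumes Pig bags and tuples rather than Python dictionaries.

```python
from collections import defaultdict

# First rows of the sample input: (topic, source, dest, weight).
rows = [
    (0, 2, 3, 1.0),
    (0, 3, 2, 1.0),
    (0, 4, 1, 1.0),
    (0, 4, 2, 1.0),
]


def group_edges(rows):
    """Nest flat edge rows as topic -> source -> [(dest, weight), ...]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for topic, source, dest, weight in rows:
        grouped[topic][source].append((dest, weight))
    return grouped


grouped = group_edges(rows)
for topic, sources in grouped.items():
    for source, edges in sources.items():
        print(topic, source, edges)
```

Each topic ends up holding one entry per source node, and each source carries the list of its outgoing edges.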

Sample output:

0	104	0.016169477
0	1	0.03278149
0	103	0.016169477
0	5	0.08088569
0	100	0.016169477
0	102	0.016169477
0	6	0.03908709
0	3	0.34291038
0	2	0.38440096
0	4	0.03908709
0	101	0.016169477
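
For readers who want to sanity-check the numbers without a cluster, the following is a minimal Python sketch of weighted PageRank by power iteration on the sample graph. It is not the DataFu implementation. It assumes a damping factor of 0.85, 100 iterations, and that rank held by dangling nodes (nodes with no outgoing edges, such as node 1) is spread evenly over all nodes; those choices are assumptions, although under them the result appears to land close to the sample output above.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration (illustrative sketch only).

    Assumption: rank held by dangling nodes is spread evenly over all nodes.
    """
    nodes = sorted({node for s, d, _ in edges for node in (s, d)})
    out_weight = {node: 0.0 for node in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    count = len(nodes)
    ranks = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        dangling = sum(ranks[v] for v in nodes if out_weight[v] == 0.0)
        base = (1.0 - damping) / count + damping * dangling / count
        new_ranks = {node: base for node in nodes}
        for s, d, w in edges:
            # Each edge passes on a share proportional to its weight.
            new_ranks[d] += damping * ranks[s] * w / out_weight[s]
        ranks = new_ranks
    return ranks


# (source, dest, weight) taken from the sample input; the topic column is dropped.
EDGES = [
    (2, 3, 1.0), (3, 2, 1.0), (4, 1, 1.0), (4, 2, 1.0),
    (5, 4, 1.0), (5, 2, 1.0), (5, 6, 1.0), (6, 5, 1.0), (6, 2, 1.0),
    (100, 2, 1.0), (100, 5, 1.0), (101, 2, 1.0), (101, 5, 1.0),
    (102, 2, 1.0), (102, 5, 1.0), (103, 5, 1.0), (104, 5, 1.0),
]

ranks = pagerank(EDGES)
for node, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(rank, 4))
```

Changing a weight in `EDGES` (for example, giving the 5 to 2 edge a weight of 2.0) shifts rank toward that target, which is the biasing effect described earlier in the thread.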

Joseph Wang

Feb 10, 2012, 8:23:35 PM
to dat...@googlegroups.com
Yes, I'm getting 404 error.

Your input looks like what I have, where each line is
 <id><tab><id><tab><id><tab><weight>.


Joseph Wang

Feb 10, 2012, 9:55:00 PM
to dat...@googlegroups.com
I copied your sample data into an HDFS folder and I'm getting the same error. Here is my code.


register /home/jwang/datafu-0.0.4/dist/datafu-0.0.4.jar;

define PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');

topic_edges = LOAD 'workspace/test' as (topic:INT,source:INT,dest:INT,weight:DOUBLE);


topic_edges_grouped = GROUP topic_edges by (topic, source);
topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
    group.topic as topic,
    group.source as source,
    topic_edges.(dest,weight) as edges;

topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic;

topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
    group as topic,
    FLATTEN(PageRank(topic_edges_grouped.(source,edges))) as (source,rank);

skill_ranks = FOREACH skill_ranks GENERATE topic, source, rank;

STORE skill_ranks INTO 'workspace/pr';




Joseph Wang

Feb 10, 2012, 10:22:39 PM
to dat...@googlegroups.com
Could it be a Pig 0.8 issue? Is there a version that is compatible with Pig 0.8?

$ pig -version
Apache Pig version 0.8.1-cdh3u1 (rexported)


Matt Hayes

Feb 10, 2012, 10:27:54 PM
to dat...@googlegroups.com
Hmm, that could be a problem. We have only tested it on Pig 0.9.