Definition of a link?

Adam Dingle

Aug 11, 2023, 3:47:51 PM
to WikiRank
Your Open Wikipedia Ranking project is a great resource - thanks for making its data publicly available.

I've written my own code to compute the number of links between Wikipedia pages; this corresponds to your concept of "indegree".  However I'm seeing significant differences between my counts and yours.  Sometimes my count is higher than yours, and other times it is lower.  I'm working with a Wikipedia dump from August 1st (specifically enwiki-20230801-pages-articles.xml).  I know that you may have generated your 2023 data from an earlier dump, but I think that probably doesn't explain the large differences I'm seeing.

As one example of this, consider the Wikipedia page "Classification of finite simple groups" (https://en.wikipedia.org/wiki/Classification_of_finite_simple_groups).  I count 33 links to this page, but your 2023 data shows 114 links.  That doesn't seem possible, since only 54 lines of the XML file contain this text:

$ grep -c 'Classification of finite simple' enwiki-20230801-pages-articles.xml
54

Perhaps you are also counting links to pages that redirect to this page?  My own program isn't smart enough to take those into consideration.
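
To make the redirect question concrete, here is a minimal Python sketch of a redirect-aware indegree count. It is only an illustration under stated assumptions, not the WikiRank method: the `pages` dictionary (title to wikitext), the `redirects` dictionary (redirect title to target title), and all helper names are hypothetical, and it does not filter namespaces such as File: or Category:. Dropping self-links and counting each linking page once are also assumptions.

```python
import re
from collections import Counter

# Capture the target of [[Target]], [[Target|label]] or [[Target#Section]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")

def normalize(title):
    """Rough title normalization: underscores to spaces, first letter upper-case."""
    title = title.strip().replace("_", " ")
    return title[:1].upper() + title[1:]

def resolve(title, redirects, max_hops=5):
    """Follow redirects (hypothetical title -> target map), guarding against loops."""
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title

def indegree(pages, redirects):
    """Count, for each target, the number of distinct non-redirect pages linking to it."""
    counts = Counter()
    for source, text in pages.items():
        if source in redirects:
            continue  # assumption: redirect pages are not counted as linking pages
        targets = set()
        for raw in LINK_RE.findall(text):
            title = normalize(raw)
            if title:
                targets.add(resolve(title, redirects))
        targets.discard(source)  # assumption: self-links are ignored
        counts.update(targets)
    return counts

if __name__ == "__main__":
    pages = {
        "A": "See [[CFSG]] and [[classification of finite simple groups|the theorem]].",
        "B": "[[Classification_of_finite_simple_groups#History]]",
        "CFSG": "#REDIRECT [[Classification of finite simple groups]]",
    }
    redirects = {"CFSG": "Classification of finite simple groups"}
    print(indegree(pages, redirects))
```

With a redirect map, a link to "CFSG" is credited to the page it redirects to, which is one way counts can exceed the number of lines that mention the full title.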

Is your software that generates the Wikipedia link graph available as open source?  If so, I could look at it to try to understand these differences.  Your site says that "The conversion is performed using a combination of classes from MG4J and WebGraph.", however I briefly looked at the documentation of those projects and it's not obvious to me that the Wikipedia link crawling code is in one of them.

Or, if the software is not available, could you provide a precise definition of what counts as a link?  Your site says that "We do not consider links in infoboxes".  What about links to pages that redirect?  Thanks -

Adam

Sebastiano Vigna

Aug 12, 2023, 2:18:44 AM
to Adam Dingle, WikiRank


> On 11 Aug 2023, at 21:47, Adam Dingle <ad...@medovina.org> wrote:
>
> Your Open Wikipedia Ranking project is a great resource - thanks for making its data publicly available.
>
> Perhaps you are also counting links to pages that redirect to this page? My own program isn't smart enough to take those into consideration.

We are doing redirects, and that adds a lot of links.

> Is your software that generates the Wikipedia link graph available as open source? If so, I could look at it to try to understand these differences. Your site says that "The conversion is performed using a combination of classes from MG4J and WebGraph.", however I briefly looked at the documentation of those projects and it's not obvious to me that the Wikipedia link crawling code is in one of them.

Yes. All the software is public, but you need some scripting to pull everything together. The scripts are here:

https://vigna.di.unimi.it/wikirank.zip

The dump we use has this name: enwiki-20230301-pages-articles.xml

Read the documentation at the start of doall.sh: if everything is set up correctly, after downloading the data ./doall.sh is all you need. If you just need the graph, ./buildgraph.sh should be sufficient.

Ciao,

seba


Adam Dingle

Aug 16, 2023, 12:33:41 PM
to Sebastiano Vigna, WikiRank
Hi Sebastiano,

thanks very much for the quick and helpful reply!  And thanks for providing your source code - that will let me study your process in detail.

> We are doing redirects, and that adds a lot of links.

OK, that makes sense.

One more question: why are you not counting links in infoboxes?  Do you believe these are less interesting/relevant than links in text?  Or is it just harder to find/parse these links and you didn't want to bother?

Adam

Sebastiano Vigna

Aug 17, 2023, 11:24:03 AM
to wiki...@googlegroups.com
They made Linnaeus the most important man in history just because of the thousands of insects he classified, which point to him in infoboxes. It just didn't seem right...
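
As a rough illustration of what excluding infobox links can look like, here is a minimal Python sketch that removes template calls whose name starts with a given prefix before any links are extracted. This is an assumption about one possible approach, not necessarily what the WikiRank scripts do. The function name `strip_templates` and the default prefix list are hypothetical, and species pages may use other template names that would have to be added to `prefixes`.

```python
def strip_templates(text, prefixes=("infobox",)):
    """Remove {{...}} template calls whose name starts with one of `prefixes`.

    Nested templates inside a removed template are removed with it.
    If a template is never closed, the rest of the text is dropped.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if text.startswith("{{", i):
            # Peek at the template name (case-insensitive, leading spaces ignored).
            name = text[i + 2:i + 42].lstrip().lower()
            if name.startswith(prefixes):
                depth, j = 0, i
                while j < n:
                    if text.startswith("{{", j):
                        depth += 1
                        j += 2
                    elif text.startswith("}}", j):
                        depth -= 1
                        j += 2
                        if depth == 0:
                            break
                    else:
                        j += 1
                i = j  # skip the whole template
                continue
        out.append(text[i])
        i += 1
    return "".join(out)

if __name__ == "__main__":
    sample = ("{{Infobox person|name=[[Carl Linnaeus]]|x={{small|y}}}}"
              "Intro [[Botany]] {{cite|[[Z]]}}")
    print(strip_templates(sample))
```

Links inside the removed template, such as the one to Carl Linnaeus in the sample, no longer contribute to the link graph, while links in the running text and in other templates are kept.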

Adam Dingle

Aug 17, 2023, 11:31:59 AM
to Sebastiano Vigna, wiki...@googlegroups.com
Hi Sebastiano,

> They made Linnaeus the most important man in history just because of the thousands of insects he classified, which point to him in infoboxes. It just didn't seem right...

Fascinating. Thanks for the explanation!

Adam