Re: WSD and PageRank

Aitor Soroa

Feb 17, 2016, 3:04:53 AM2/17/16
to Gianluca Quercini, ukb...@googlegroups.com
Hi Gianluca,

(I'm forwarding the message to the ukb mailing list, as others may find
it useful)

See answers below.

On Tue, Feb 16, 2016 at 03:30:33PM +0100, Gianluca Quercini wrote:
> [...]
>
> I only have a couple of questions:
>
> 1) In the approach using personalized PageRank, how do you assign the
> initial weights to the nodes?
>
> If my understanding is correct, you assign to each node representing a
> word in the context a weight 1/|W|, where |W| is the number of words
> in the context, while the other nodes have a weight 0. Is that
> correct?

The exact initial weight of a node v, PV[v], is calculated as follows:

for each cw in context
  for each v such that the entry cw->v is in the dictionary:
    PV[v] += normalized_cw_w * e[cw->v] / Sum_{cw->u}(e[cw->u])

where:
- normalized_cw_w = weight(cw) / Sum_{w in context} weight(w)
- e[cw->v] is the weight of the word->synset relation (usually 1, but see below)

To calculate e[cw->v], you can use the --dict-weight option; the
frequencies present in the dictionary will then be used.
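For concreteness, here is a minimal Python sketch of the computation
above. It is not UKB's actual code; the function name and the data
structures (a context mapping words to weights, and a dictionary mapping
each word to its candidate synsets with their e[cw->v] weights) are
illustrative assumptions.

  def build_personalization_vector(context, dictionary):
      # context: {word: weight}, dictionary: {word: {synset: e}}
      total_w = sum(context.values())
      pv = {}
      for cw, w in context.items():
          candidates = dictionary.get(cw, {})
          e_sum = sum(candidates.values())
          if e_sum == 0:
              continue
          # normalized_cw_w = weight(cw) / Sum_{w in context} weight(w)
          normalized_cw_w = w / total_w
          for v, e in candidates.items():
              # PV[v] += normalized_cw_w * e[cw->v] / Sum_{cw->u}(e[cw->u])
              pv[v] = pv.get(v, 0.0) + normalized_cw_w * e / e_sum
      return pv

  # Example with two context words; synset names are made up:
  pv = build_personalization_vector(
      {"bank": 1.0, "river": 1.0},
      {"bank": {"bank#n#1": 3.0, "bank#n#2": 1.0},
       "river": {"river#n#1": 1.0}})
  # -> {"bank#n#1": 0.375, "bank#n#2": 0.125, "river#n#1": 0.5}

Note that when every word has weight 1 and every e is 1, this reduces to
the scheme you describe: each word contributes mass 1/|W|, split evenly
among its candidate synsets.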

> 2) I'd like to try your implementation that I found on GitHub on the latest version of Wikipedia.
>
> I found in your Github repository some scripts to convert WordNet to
> the format required by your algorithm, but I did not find any script
> to convert Wikipedia. Is there any?
>

Unfortunately, creating UKB graphs from Wikipedia is not
straightforward. However, you can download a prebuilt graph and
dictionary extracted from the English Wikipedia here:

http://ixa2.si.ehu.es/ukb/graphs/wikipedia_en_2013.tar.bz2
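For instance (assuming standard command-line tools):

  wget http://ixa2.si.ehu.es/ukb/graphs/wikipedia_en_2013.tar.bz2
  tar xjf wikipedia_en_2013.tar.bz2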

We also have some scripts for extracting the graph from Wikipedia XML
dumps; you can find them attached. There is no proper documentation, and
I'm afraid I won't be able to offer any support for these scripts.

The idea is to download a Wikipedia XML dump (usually called
enwiki-latest-pages-articles.xml.bz2), cd into the directory containing
the dump, and run the extractWikipediaData.pl script. This should create
several files, such as "page.csv".

Once the extraction is over, you can run "00-dict.pl" to create the
dictionary and "00-grA.pl" to create the graph relations.
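Putting the steps together, the assumed sequence is roughly the
following (untested; the dump URL is the usual Wikimedia location, and
the scripts may need extra arguments, so check them before running):

  wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
  perl extractWikipediaData.pl   # extracts page.csv and related files
  perl 00-dict.pl                # builds the dictionary
  perl 00-grA.pl                 # builds the graph relations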

hope this helps,

aitor
Attachments: ukbWikiExtractor.tar.bz2, 00-dict.pl
