Hi Gianluca,
(I'm forwarding the message to the ukb mailing list, as others may find
it useful)
See answers below.
On Tue, Feb 16, 2016 at 03:30:33PM +0100, Gianluca Quercini wrote:
> [...]
>
> I only have a couple of questions:
>
> 1) In the approach using personalized PageRank, how do you assign the
> initial weights to the nodes?
>
> If my understanding is correct, you assign to each node representing a
> word in the context a weight 1/|W|, where |W| is the number of words
> in the context, while the other nodes have a weight 0. Is that
> correct?
The exact initial weight of a node v, PV[v], is calculated as follows:
  for each cw in context
    for each v such that cw->v is in the dictionary:
      PV[v] += normalized_cw_w * e[cw->v] / Sum_{cw->u}(e[cw->u])
where:
  - normalized_cw_w = weight(cw) / Sum_{w in context} weight(w)
  - e[cw->v] is the weight of the word->synset relation (usually 1, but see below)
So your description is correct when all context words have the same
weight and each word points to a single synset; otherwise the 1/|W| mass
of a word is split among its synsets.
To calculate e[cw->v] you can use the --dict-weight option, and then the
frequencies present in the dictionary will be used.
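As an illustration only, here is a minimal Python sketch of the calculation described above. It is not the UKB code: the function name initial_pv, the use_dict_weight flag, and the dict-based layout of the context and the dictionary are assumptions made for the example. It also assumes that context words missing from the dictionary are simply skipped.

```python
from collections import defaultdict


def initial_pv(context, dictionary, use_dict_weight=False):
    """Sketch of the initial personalization vector.

    context:    {word: weight} for the words in the context (hypothetical layout)
    dictionary: {word: {synset: frequency}} word->synset relations (hypothetical layout)
    """
    pv = defaultdict(float)
    total_weight = sum(context.values())  # Sum_{w in context} weight(w)
    for cw, cw_weight in context.items():
        normalized_cw_w = cw_weight / total_weight
        # Assumption: words absent from the dictionary contribute nothing.
        synsets = dictionary.get(cw, {})
        # e[cw->v] is 1 unless dictionary frequencies are requested.
        edges = {v: (freq if use_dict_weight else 1.0) for v, freq in synsets.items()}
        edge_sum = sum(edges.values())  # Sum_{cw->u} e[cw->u]
        if edge_sum == 0:
            continue
        for v, e in edges.items():
            pv[v] += normalized_cw_w * e / edge_sum
    return dict(pv)


if __name__ == "__main__":
    context = {"bank": 1.0, "money": 1.0}
    dictionary = {
        "bank": {"bank#1": 3, "bank#2": 1},
        "money": {"money#1": 5},
    }
    print(initial_pv(context, dictionary))
    print(initial_pv(context, dictionary, use_dict_weight=True))
```

With uniform edge weights, each of the two context words gets 1/2 of the mass and "bank" splits its half evenly between its two synsets. With the frequencies enabled, the same half is split 3:1 instead.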
> 2) I'd like to try your implementation that I found on GitHub on the latest version of Wikipedia.
>
> I found in your Github repository some scripts to convert WordNet to
> the format required by your algorithm, but I did not find any script
> to convert Wikipedia. Is there any?
>
Unfortunately, creating UKB graphs from Wikipedia is not
straightforward. However, you can download the graph and dictionary
extracted from English Wikipedia here:
http://ixa2.si.ehu.es/ukb/graphs/wikipedia_en_2013.tar.bz2
We also have some scripts for extracting the graph from Wikipedia XML
dumps; you will find them attached. There is no proper documentation,
and I'm afraid that I won't be able to offer any support regarding those
scripts.
The idea is to download a Wikipedia XML dump (usually called
enwiki-latest-pages-articles.xml.bz2), cd into the directory, and run
the extractWikipediaData.pl script. This should create some files like
"page.csv".
Once the extraction is over, you can run "00-dict.pl" to create the
dictionary and "00-grA.pl" to create the graph relations.
Hope this helps,
aitor