Get Wikipedia Link with wikidata reconciliation service


richi...@googlemail.com

Nov 19, 2018, 11:15:07 AM
to OpenRefine
Dear Group Members,

I am trying to get the Wikipedia URLs of the articles about a list of "humans" in different languages (where an article exists), based on a variable I have already successfully reconciled with Wikidata. Since the Wikipedia links do not seem to be directly available as statements, I do not know how to get them.
Can someone help me?

Kind regards
Richard

 

Ettore Rizza

Nov 19, 2018, 4:25:27 PM
to OpenRefine
Hi Richard, 

I think this feature is not implemented yet (unless Antonin added it recently?)

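In the meantime, you can retrieve this information using the Wikidata API. For the English version of Wikipedia, the URL looks like this:

https://www.wikidata.org/w/api.php?action=wbgetentities&format=xml&props=sitelinks/urls&ids=Q5&sitefilter=enwiki
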
If you are not used to working with APIs in OpenRefine (https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine), you can try this Python script. Just add a new column based on the column that contains your names matched with Wikidata, change the scripting language to "Python / Jython" and paste this code in the window.

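import json
import urllib2

# Wikipedia edition to look up, e.g. "enwiki" or "dewiki"
site = "enwiki"

# Request the sitelinks of the item this cell was reconciled to
url = ("https://www.wikidata.org/w/api.php?action=wbgetentities&format=json"
       "&props=sitelinks/urls&ids=" + cell.recon.match.id + "&sitefilter=" + site)

response = urllib2.urlopen(url)

# Named "data" so the json module is not shadowed
data = json.loads(response.read())

# Only one id was requested, so return the URL from the first entity
for entity in data['entities'].values():
    return entity['sitelinks'][site]['url']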
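
For reference, the request above returns JSON keyed by item id, with the sitelinks nested under each entity. The sketch below isolates the extraction step so it can be tried outside OpenRefine. The helper name `sitelink_url` and the sample response are illustrative only (the sample is trimmed by hand to the fields used here); neither is part of OpenRefine or of the script above. Unlike that script, this version returns None instead of raising an error when the item has no article in the requested wiki.

```python
import json

# Hand-trimmed sample of a wbgetentities response for Q5 with
# sitefilter=enwiki (illustrative; a live response carries more fields).
SAMPLE = '''
{"entities": {"Q5": {"type": "item", "id": "Q5",
  "sitelinks": {"enwiki": {"site": "enwiki", "title": "Human",
                           "url": "https://en.wikipedia.org/wiki/Human"}}}},
 "success": 1}
'''

def sitelink_url(data, site):
    """Return the article URL for `site`, or None if there is no sitelink."""
    for entity in data.get("entities", {}).values():
        link = entity.get("sitelinks", {}).get(site)
        if link and "url" in link:
            return link["url"]
    return None

data = json.loads(SAMPLE)
print(sitelink_url(data, "enwiki"))  # the English article URL
print(sitelink_url(data, "dewiki"))  # None: no German sitelink in the sample
```

Returning None leaves the cell empty in OpenRefine rather than filling it with an error, which may be easier to facet on afterwards.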




Feel free to ask for clarification if it's not clear.

Ettore

Richard Fabio

Nov 20, 2018, 7:59:56 AM
to openr...@googlegroups.com
Hi Ettore,

Thank you for your perfect answer and helpful explanation. The process took some time, but worked well. I also managed to get the German Wikipedia links by replacing site = "enwiki" with site = "dewiki".

Kind regards
Richard

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ettore Rizza

Nov 20, 2018, 8:30:54 AM
to OpenRefine
> The process took some time, but worked well.

Great!

> I managed also to get the German Wikipedia links by replacing site = "enwiki" with site = "dewiki".

Here is a slightly different version of the script that tries several Wikipedia editions in order until it finds one with an article. You can also specify a single language, for example langs = ['de'].

import json
import urllib2

langs = ["en", "de", "fr", "xx"]  # ordered list of languages you want to try

value = cell.recon.match.id

for lang in langs:
    wiki = lang + "wiki"

    url = ("https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=sitelinks/urls&ids=" +
           value +
           "&sitefilter=" +
           wiki)

    response = urllib2.urlopen(url)

    if response:
        data = json.loads(response.read())

        for i in data['entities'].values():
            try:
                return i['sitelinks'][wiki]['url']
            except KeyError:
                # No article in this language: try the next one
                pass
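
Since the loop above costs one request per language and per row, a variant worth considering is to ask for all candidate wikis in a single call: `sitefilter` accepts several site ids separated by `|`. The following is a sketch under that assumption, not a tested replacement for the script above. `build_url` and `first_sitelink` are hypothetical helper names, and the sample data (including the item id Q0) is made up to show the fallback order.

```python
API = ("https://www.wikidata.org/w/api.php?action=wbgetentities"
       "&format=json&props=sitelinks/urls")

def build_url(qid, langs):
    """Build one request covering every wiki in `langs` (e.g. ["en", "de"])."""
    wikis = [lang + "wiki" for lang in langs]
    # "%7C" is the URL-encoded "|" that separates multiple sitefilter values
    return API + "&ids=" + qid + "&sitefilter=" + "%7C".join(wikis)

def first_sitelink(data, langs):
    """Return the URL for the first language in `langs` that has an article."""
    for entity in data.get("entities", {}).values():
        sitelinks = entity.get("sitelinks", {})
        for lang in langs:
            link = sitelinks.get(lang + "wiki")
            if link and "url" in link:
                return link["url"]
    return None

# Made-up parsed response: the item has German and French articles only
sample = {"entities": {"Q0": {"sitelinks": {
    "dewiki": {"url": "https://de.wikipedia.org/wiki/Beispiel"},
    "frwiki": {"url": "https://fr.wikipedia.org/wiki/Exemple"}}}}}

print(build_url("Q5", ["en", "de"]))
print(first_sitelink(sample, ["en", "de", "fr"]))  # falls back to the German URL
```

In OpenRefine, you would open `build_url(cell.recon.match.id, langs)` with `urllib2.urlopen`, parse the body with `json.loads`, and return `first_sitelink(data, langs)`.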

FYI, I see that Antonin added the extraction of sitelinks to his to do list: https://github.com/wetneb/openrefine-wikibase/issues/17#issuecomment-440266713

Best regards,

Ettore

Antonin Delpeuch (lists)

Nov 20, 2018, 2:00:18 PM
to openr...@googlegroups.com
Thanks Ettore for your excellent reply.
Richard, I have taken notice of your request here:
https://github.com/wetneb/openrefine-wikibase/issues/17

Antonin
