I'm having trouble parsing JSON

114 views
Skip to first unread message

Andrea Volpini

unread,
Jul 21, 2017, 4:53:33 PM7/21/17
to OpenRefine
Hi all, 

I've been using OpenRefine for some time now and I've successfully worked with parseJson() before, but this one really I can't really get it to work. 
My goal is to extract the URL from JSON files like this one. I can do it by using the following: 

forEach(value.parseJson().query.pages['34858492'].imageinfo, e, e.url).join("|")

this works just fine but I need to run the command on the entire column (I have already imported all the json files for all the images that I need to work on) but I don't have the id of pages.
I tried something like this but it does not work.
 
forEach(value.parseJson().query.pages[0].imageinfo, e, e.url).join("|")

Many thanks in advance

Ettore Rizza

unread,
Jul 22, 2017, 8:10:31 AM7/22/17
to OpenRefine
Ciao Andrea,

Json responses from Wikimedia api have a particular structure and are difficult to parse with GREL. Here is a method with Python/Jython :

import json

data = json.loads(value)

lista = []
for pageid, page in data["query"]["pages"].items():
     lista.append(page['imageinfo'][0]['url'])

return ":::".join(lista)

screenshot :




Hope this helps.

Ettore Rizza

unread,
Jul 22, 2017, 8:14:39 AM7/22/17
to OpenRefine
Important: the method only works in Open Refine 2.7 (with jython 2.7)

Andrea Volpini

unread,
Jul 24, 2017, 10:44:55 PM7/24/17
to OpenRefine
Grazie Ettore, it works perfectly! Now I'm trying to understand why I get so many errors from the Commons API. 

HTTP error 400 : Bad Request | <html> <head><title>400 Bad Request</title></head> <body bgcolor="white"> <center><h1>400 Bad Request</h1></center> <hr><center>nginx/1.11.13</center> </body> </html>

In my starting column, I have the titles expressed in this way (these have been extracted from Wikidata): File:Google 2015 logo.svg 

"https://commons.wikimedia.org/w/api.php?action=query&titles=File:"+ value+"&prop=imageinfo&iiprop=url&format=json" 

When I run it from the browser it works well. When I use Open Refine (regardless of the threshold I use) I get several of these Bad Requests.
Any idea?

Ettore Rizza

unread,
Jul 25, 2017, 3:49:04 AM7/25/17
to OpenRefine
Your values probably contain forbidden characters in a URL, such as blank spaces. A web browser transforms these characters under the hood, because it's his job, but Open Refine just sends HTTP requests. You can deal with these characters using the escape('url') function, like this:

"https://commons.wikimedia.org/w/api.php?action=query&titles=File:"+ value.escape('url') +"&prop=imageinfo&iiprop=url&format=json"

Another possibility of error: if your values already start with "File:", as in the example you give, you have to delete this word in the first part of the URL. Otherwise, this will return File:File:Google 2015 logo.svg

Andrea Volpini

unread,
Jul 25, 2017, 6:52:43 AM7/25/17
to OpenRefine
Super great! Grazie di nuovo Ettore.
Reply all
Reply to author
Forward
0 new messages