Need help Parsing this JSON file with duplicate Keys

2,165 views
Skip to first unread message

Edeline Gagnon

unread,
Jun 7, 2017, 2:10:16 PM6/7/17
to OpenRefine
Hi,
I'm relatively new to OpenRefine, and I have been attempting to parse the information "Vernacular name" in these type of files, which are repeated multiple times:


{"apiVersion":"0.1","lastUpdatedDate":"2017-03-07","results":[{"searchedTerm":"Acer negundo","numMatches":1,"matches":[{"taxonID":9206,"scientificName":"Acer negundo Linnaeus","scientificNameAuthorship":"Linnaeus","canonicalName":"Acer negundo","taxonRank":"species","taxonomicAssertions":[{"acceptedNameUsage":"Acer negundo Linnaeus","acceptedNameUsageID":9206,"nameAccordingTo":"FNA Editorial Committee. in prep. Flora of North America north of Mexico. Volume 13. Magnoliophyta: Geraniaceae to Apiaceae. Oxford University Press, New York.","nameAccordingToID":"","taxonomicStatus":"accepted","parentNameUsageID":769,"higherClassification":"Equisetopsida;Magnoliidae;Rosanae;Sapindales;Sapindaceae;Hippocastanoideae;Acereae;Acer"}],"vernacularNames":[{"vernacularName":"érable à Giguère","language":"fr","source":"Darbyshire S.J., M. Favreau & M. Murray (revu et augmenté par). 2000. Noms populaires et scientifiques des plantes nuisibles du Canada. Agriculture et Agroalimentaire Canada. Publication 1397. 132 pp.","preferredName":true},{"vernacularName":"Manitoba maple","language":"en","source":"Farrar, J.L. 1996. Les Arbres du Canada. Fides et Service canadien des forêts. 502 pp.","preferredName":true},{"vernacularName":"ash-leaved maple","language":"en","source":"Farrar, J.L. 1996. Les Arbres du Canada. Fides et Service canadien des forêts. 502 pp.","preferredName":false},{"vernacularName":"aulne-buis","language":"fr","source":"Louis-Marie, P. 1953. Flore-Manuel de la province de Québec. 2ième éd. Institut agricole d'Oka. 323 pp.","preferredName":false},{"vernacularName":"box-elder","language":"en","source":"Farrar, J.L. 1996. Les Arbres du Canada. Fides et Service canadien des forêts. 502 pp.","preferredName":false},{"vernacularName":"box-elder maple","language":"en","source":"ITIS. 2010. Integrated Taxonomic Information System (ITIS). http://www.itis.gov (consulted 2010)","preferredName":false},{"vernacularName":"California box-elder","language":"en","source":"ITIS. 2010. Integrated Taxonomic Information System (ITIS). http://www.itis.gov (consulted 2010)","preferredName":false},{"vernacularName":"érable à feuilles composées","language":"fr","source":"Farrar, J.L. 1996. Les Arbres du Canada. Fides et Service canadien des forêts. 502 pp.","preferredName":false},{"vernacularName":"érable à feuilles de Frêne","language":"fr","source":"Louis-Marie, P. 1953. Flore-Manuel de la province de Québec. 2ième éd. Institut agricole d'Oka. 323 pp.","preferredName":false},{"vernacularName":"érable argilière","language":"fr","source":"Marie-Victorin, Fr. 1995. Flore laurentienne. 3e éd. Mise à jour et annotée par L. Brouillet, S.G. Hay, I. Goulet, M. Blondeau, J. Cayouette et J. Labrecque. Gaétan Morin éditeur. 1093 pp.","preferredName":false},{"vernacularName":"érable négondo","language":"fr","source":"Marie-Victorin, Fr. 1995. Flore laurentienne. 3e éd. Mise à jour et annotée par L. Brouillet, S.G. Hay, I. Goulet, M. Blondeau, J. Cayouette et J. Labrecque. Gaétan Morin éditeur. 1093 pp.","preferredName":false},{"vernacularName":"érable négundo","language":"fr","source":"Grandtner, M.M. 2005. Elsevier's Dictionnary of Trees. Volume 1. North America. Elsevier, Amsterdam. 1493 pp.","preferredName":false},{"vernacularName":"plaine à Giguère","language":"fr","source":"Louis-Marie, P. 1953. Flore-Manuel de la province de Québec. 2ième éd. Institut agricole d'Oka. 323 pp.","preferredName":false},{"vernacularName":"three-leaved maple","language":"en","source":"ITIS. 2010. Integrated Taxonomic Information System (ITIS). http://www.itis.gov (consulted 2010)","preferredName":false},{"vernacularName":"western box-elder","language":"en","source":"ITIS. 2010. Integrated Taxonomic Information System (ITIS). http://www.itis.gov (consulted 2010)","preferredName":false}],"distribution":[{"locationID":"ISO 3166-2:CA-AB","locality":"AB","establishmentMeans":"native","occurrenceStatus":"native"},{"locationID":"ISO 3166-2:CA-MB","locality":"MB","establishmentMeans":"native","occurrenceStatus":"native"},{"locationID":"ISO 3166-2:CA-NT","locality":"NT","establishmentMeans":"introduced","occurrenceStatus":"introduced"},{"locationID":"ISO 3166-2:CA-NS","locality":"NS","establishmentMeans":"introduced","occurrenceStatus":"introduced"},{"locationID":"ISO 3166-2:CA-YT","locality":"YT","establishmentMeans":"introduced","occurrenceStatus":"introduced"},{"locationID":"ISO 3166-2:CA-QC","locality":"QC","establishmentMeans":"introduced","occurrenceStatus":"introduced"},{"locationID":"ISO 3166-2:CA-ON","locality":"ON","establishmentMeans":"native","occurrenceStatus":"native"},{"locationID":"ISO 3166-2:CA-BC","locality":"BC","establishmentMeans":"introduced","occurrenceStatus":"introduced"},{"locationID":"ISO 3166-2:CA-SK","locality":"SK","establishmentMeans":"native","occurrenceStatus":"native"},{"locationID":"ISO 3166-2:CA-PE","locality":"PE","establishmentMeans":"introduced","occurrenceStatus":"introduced"},{"locationID":"ISO 3166-2:CA-NB","locality":"NB","establishmentMeans":"introduced","occurrenceStatus":"introduced"}]}]}]}


I can retrieve all the vernacular names with the following expression: forEach(value.parseJson().results[0].matchs[0].vernacularNames,v,v.vernacularName).join(", ")

However, would it be possible to retrieve only the vernacularName who have the key "preferredName":true? Would it also be able to retrieve only the english(en) names with preferredNames=true, or only the french (fr) names?

I think it's with an IF expression or somesort of filter, but I'm not sure how it should be speficied with this JSON structured file.

Thanks for your help,
Edeline


Joe Wicentowski

unread,
Jun 7, 2017, 4:00:11 PM6/7/17
to OpenRefine
Hi Edeline,

I'm not sure how to do this with OpenRefine, but XQuery 3.1 lets you query and filter JSON data, as follows (the 3 lines of code between the GFM-style fenced code blocks):

```xquery
let $doc := json-doc('http://data.canadensys.net/vascan/api/0.1/search.json?q=Acer%20negundo')
return
    $doc?results?*?matches?*?vernacularNames?*[ ?preferredName eq true() ]?vernacularName
```

It's not the easiest syntax to read, but basically this query traverses the JSON, through this route:

(a) outermost "results" object 
(b) array 
(c) "matches" object
(d) array
(e) "vernacularNames" object
(f) array
(g) filter for objects containing a "preferredName" key with a value of "true"
(f) "vernacularName" entries' values

The result of the query is these two strings:

  "érable à Giguère", "Manitoba maple"

I ran this query in eXist (http://exist-db.org).  Happy to provide more info if you'd like to pursue this more.

Joe

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joe Wicentowski

unread,
Jun 7, 2017, 4:04:48 PM6/7/17
to OpenRefine
You also actually asked about filtering by language.  Here's the same query adapted to your first case, "only the english(en) names with preferredNames=true":

  $doc?results?*?matches?*?vernacularNames?*[?preferredName eq true() and ?language = 'en']?vernacularName

The result is "Manitoba maple".  Here it is adapted to the second case, "only the french (fr) names":

  $doc?results?*?matches?*?vernacularNames?*[?preferredName eq true() and ?language = 'fr']?vernacularName

The result is "érable à Giguère".



Thad Guidry

unread,
Jun 7, 2017, 4:29:03 PM6/7/17
to OpenRefine

forEach(value.parseJson().vernacularNames,v,v.vernacularName).join(":::")

You'll probably be better off just exporting that column as 'myJson.txt'...

Then import that myJson.txt file and choose JSON importer wizard.
Our preview window will give you a visual selector to help you choose the vernacularName entities.

Let us know if you need further help.  There are many tutorials out there to learn from: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources

-Thad

Ettore Rizza

unread,
Jun 7, 2017, 4:29:28 PM6/7/17
to OpenRefine
Hi Edeline,

Try using Jython/Python instead of Grel. I can't use Open Refine in this moment, but this script should work :

import json
data
= value

vernacular
= data['results'][0]['matches'][0]['vernacularNames']

liste
= []
for i in vernacular:
   
if i['language'] == "fr" and i['preferredName'] == True:
        liste
.append(i['vernacularName'])

return ", ".join(liste)

PS : make sure you're using Open Refine 2.7.

Ettore Rizza

unread,
Jun 7, 2017, 4:51:28 PM6/7/17
to OpenRefine
Sorry, I forgot an element in my answer. Here is the good script :

import json

data
= json.loads(value)

vernacular
= data['results'][0]['matches'][0]['vernacularNames']

for i in vernacular:
   
if i['language'] == "fr" and i['preferredName'] == True:

       
return i['vernacularName']


Owen Stephens

unread,
Jun 8, 2017, 4:27:47 AM6/8/17
to OpenRefine
Hi Edeline,

As usual there is more than one way of doing this with OpenRefine. Two GREL based options are:

filter(value.parseJson().results[0].matches[0].vernacularNames,v,v.preferredName==true)

This filters the array of vernacular names to only those with the preferredName equal to true. The output is an array of JSON objects from which you can then extract the actual names - e.g. with a forEach loop:

forEach(filter(value.parseJson().results[0].matches[0].vernacularNames,v,v.preferredName==true),w,w.parseJson().vernacularName).join("|")

Alternatively you can use a forEach with an 'if' statement like:

forEach(value.parseJson().results[0].matches[0].vernacularNames,v,if(v.preferredName==true,v.vernacularName,null)).join("|")

This second statement is slightly more compact and efficient I think, but both approaches will work

Owen

Ettore Rizza

unread,
Jun 8, 2017, 4:53:55 AM6/8/17
to OpenRefine
Thank you for this solution in GREL, Owen. The problem is that if you only want the Preferred Term in French, the formula becomes:

forEach(
value
.parseJson().results[0].matches[0].vernacularNames,
v
,
if(v.language=="fr",if(v.preferredName==true,v.vernacularName,null),null)
).join("|")


It's very confusing, which is why I prefer to switch to Python as soon as the operations become a bit complex.

Owen Stephens

unread,
Jun 8, 2017, 5:02:20 AM6/8/17
to OpenRefine
I can understand switching to Python - but I avoid that where possible :) (nothing against Python, just when I'm working in OpenRefine I tend to stick to GREL)

So GREL does let you deal with multiple criteria in the if/filter if you need to:

forEach(value.parseJson().results[0].matches[0].vernacularNames,v,if(and(v.preferredName==true,v.language=="fr"),v.vernacularName,null)).join("|")


Going beyond two is possible but gets ugly as the 'and' and 'or' functions can only take two arguments, so extending the conditions requires you to have nested boolean conditions.

Owen

Ettore Rizza

unread,
Jun 8, 2017, 5:13:16 AM6/8/17
to OpenRefine
Right, I forgot the and() and or() operators ! That said, I know few people except you who can easily handle this kind of scaffolding. :)

If you go on vacation, few people will be able to answer complicated questions, while Python users on StackOverflow are tens of thousands.

Thad Guidry

unread,
Jun 8, 2017, 9:04:00 AM6/8/17
to OpenRefine
Hmm...Since GREL does get a bit ugly for these special cases for JSON handling...

1. We can probably make some simpler GREL syntax for JSON handling all around.  Especially Nested JSON Objects / Arrays like this use case.
2. Probably we should make our JSON picker / previewer also available as a new Edit Column -> Split JSON into several columns .

# 2 would be quite a bit of work, but its doable actually and totally AWESOME.

# 1 is doable and simpler.  Looping through Arrays with a for-in loop is silly easy in Javascript as well  https://www.w3schools.com/js/js_json_arrays.asp  We should perform the burden for the user, and actually already do with # 2.  But I think easier GREL syntax for JSON handling and looping through Nested Arrays is what the community is really looking for.

If you could imagine some new GREL syntax for how that would look or work... what would that strawman syntax look like for this use case ?

-Thad



On Thu, Jun 8, 2017 at 4:13 AM Ettore Rizza <ettor...@gmail.com> wrote:
Right, I forgot the and() and or() operators ! That said, I know few people except you who can easily handle this kind of scaffolding. :)

If you go on vacation, few people will be able to answer complicated questions, while Python users on StackOverflow are tens of thousands.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Thad Guidry

unread,
Jun 8, 2017, 9:28:24 AM6/8/17
to OpenRefine
# 1 seems to be made much simpler with a slight tweak to GREL when asked to parseJson().

Rather than using the org.json.JsonTokener[1] that we currently do.. we could instead resort to using a Map like interface and that is done very easily with JsonSlurper() in Groovy language.  http://groovy-lang.org/json.html

Then we could get really nice GPath-like expressions with dot notation[2].  For instance,

value.parseJson().results.matches.vernacularNames.vernacularName


-Thad

Thad Guidry

unread,
Jun 8, 2017, 9:33:08 AM6/8/17
to OpenRefine
Owen,

You should try hacking on our # 1 case... Its not that hard, and you'd learn a bit more. :)  I think you can totally pull this off actually.

Just drop the latest Groovy jar into Web-Inf/libs and then import JsonSlurper into our ParseJson.java and comment out the import org.json lines and replace JsonTokener with uses of JsonSlurper, and then hack away.

Do this on a new branch from our master on your Github and then Jacky and I can review and help you out.

-Thad

Owen Stephens

unread,
Jun 8, 2017, 9:35:53 AM6/8/17
to OpenRefine
Thanks Thad - I'll take a look and see where I get to!

Owen

Thad Guidry

unread,
Jun 8, 2017, 9:42:05 AM6/8/17
to openr...@googlegroups.com
Thanks Owen.  Go get 'em !

BTW,  Groovy could also be dropped in as a new Expression Language for us !  :)  It runs on the JVM just like Clojure and Jython.

You can hack on that enhancement next :)

-Thad

Owen Stephens

unread,
Jun 8, 2017, 9:44:13 AM6/8/17
to openr...@googlegroups.com
Interestingly I had this suggested to me (dropping Groovy in as an expression language) by Steve who worked on the GOKb/OpenRefine extension - he was keen but I was less sure of the value - I’ll feed back that it has at least one supporter!

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: ow...@ostephens.com
Telephone: 0121 288 6936

Ettore Rizza

unread,
Jun 8, 2017, 1:03:45 PM6/8/17
to OpenRefine
But is creating new GREL functions really a priority when you can use Jython? This involves a little learning, but it is also the case for any new function. And in this case, this learning can be used elsewhere than in Open refine.

Jython/Python has the advantage of being widely used and documented, its modules and their functions have for the most part been extensively tested and debugged. I wonder if a library of GREL and Python snippets embedded in the interface would not be more useful. 

Thad Guidry

unread,
Jun 8, 2017, 1:11:09 PM6/8/17
to OpenRefine
OpenRefine has always been about providing easy ways to clean up messy data... without having to be much of a programmer.
Having smarter OpenRefine GREL functions or menu controls or dialogs, etc... is part of providing those easier ways.

Your welcome to look at what Owen and I are discussing on the developers mailing list thread here: https://groups.google.com/forum/#!topic/openrefine-dev/G9uD9l0rZO0

I love Python just as much as you Ettore...but I also appreciate other programming languages that help simplify tasks...and I don't even have to program and can just use something smart enough to 'read my mind'. :)

-Thad

On Thu, Jun 8, 2017 at 12:03 PM Ettore Rizza <ettor...@gmail.com> wrote:
But is creating new GREL functions really a priority when you can use Jython? This involves a little learning, but it is also the case for any new function. And in this case, this learning can be used elsewhere than in Open refine.

Jython/Python has the advantage of being widely used and documented, its modules and their functions have for the most part been extensively tested and debugged. I wonder if a library of GREL and Python snippets embedded in the interface would not be more useful. 

--

Thad Guidry

unread,
Jun 8, 2017, 1:15:51 PM6/8/17
to OpenRefine
The other point is that I want Owen to be more active with our development.  We desperately need more folks and Owen has already been vocal about wanting to dive in and learn more about OpenRefine's source code.  But we are always hiring for anyone that has the chops or interest :)

-Thad

Ettore Rizza

unread,
Jun 8, 2017, 2:37:21 PM6/8/17
to OpenRefine
I totally agree, Thad. But this kind of simple functions that read in the minds of users are just the most complicated to write. :/ If I started using Python instead of Grel in some cases, it is because the syntax of Python is much clearer in the cases in question. Which ?

- The for loops and the if conditions: I explained them just now to a colleague who wanted to learn GREL. She has a much better understanding of Python's way of doing things (perhaps because she has already worked with this language).

- Extracting patterns using regular expressions: value.match() is a pain in the ass compared to re.findall().

- Anything that requires "if" and "for" at the same time, or array manipulations, or which uses powerful functions already coded in the standard library of Python.

GREL contains great features. Fingerprint, for example, has been imitated in R and Python ! The dot notation is so intuitive that I regret that not all languages ​​use it.

But if I had to learn a language, I prefer to learn one that will reserve me elsewhere. One do not realize how many different functions a data analyst/data wrangler should remember. The main functions of Excel. Those of DAX if you want to switch to Power Pivot/Power BI. Those of "Tableau Public". A little bit of VBA. SQL of course. Unix command line. And so on.

Today there are many Open Refine competitors, such as Trifacta Wrangler, Talend Preparation, Dataiku ... They look much more modern than Open Refine, some have much more powerfull functons, but their problem is that everyone uses its owns.

That's why I recently fell in love with the Exploratory.io software, which uses R as its main language. 



I have the intuition that the future is there, in Grand Public softwares that leverage a great programming language in an user friendly GUI. That's why I keep using Open Refine. It is not perfect, but I know it and I know that no other software gives me - for the moment - the equivalent : simple functions for current data wrangling, such as those of Grel, AND Python for more complicated things. When I spend hours developing a complicated operation in Jython, I know that this time will not be lost even if Open Refine disappears tomorrow. I would of course be delighted to have new functions as intuitive as "replace()". But if Open refine lacks of developpers, I would prefer a hundred times that those who remain focus on the UI.

 My two cents to de debate. :)

Thad Guidry

unread,
Jun 8, 2017, 2:44:45 PM6/8/17
to OpenRefine
Isn't it great that we had the foresight to give you an OPTION of your preferred Expression Language ?

Sorry, no debate here.

You like Python.  Great.  Not everyone needs to learn it, or wants to.  So please don't push any agenda.  OpenRefine doesn't and neither should the community.

Regarding the focus of the UI.  Scan through the existing enhancement requests and give your Thumbs Up or down so that we can get a sense of what is more important.

Thanks,
-Thad

Ettore Rizza

unread,
Jun 8, 2017, 3:07:48 PM6/8/17
to OpenRefine
I do not push any agenda Ô_-  I give my assiduous user opinion in what appeared to be a cordial debate - Sorry if i was wrong - on anything other than the OP's request, namely how to parse a JSON with multiple conditions. I gave an answer to the question, as I did a hundred times here or on StackOverflow. But if my opinion does not matter, I can leave you alone. 

Thad Guidry

unread,
Jun 8, 2017, 3:48:53 PM6/8/17
to openr...@googlegroups.com
Ettore,

Thanks for your opinions.
-Thad

On Thu, Jun 8, 2017 at 2:07 PM Ettore Rizza <ettor...@gmail.com> wrote:
I do not push any agenda Ô_-  I give my assiduous user opinion in what appeared to be a cordial debate - Sorry if i was wrong - on anything other than the OP's request, namely how to parse a JSON with multiple conditions. I gave an answer to the question, as I did a hundred times here or on StackOverflow. But if my opinion does not matter, I can leave you alone. 

--

Ettore Rizza

unread,
Jun 8, 2017, 4:10:47 PM6/8/17
to OpenRefine
Thank you also for your opinions, Thad, I find them always interesting.

Ettore

qi cui

unread,
Jun 9, 2017, 6:37:06 PM6/9/17
to OpenRefine
The or and and take multiple arguments. They were improved and merged few month ago.

Ettore Rizza

unread,
Jun 10, 2017, 2:23:16 AM6/10/17
to OpenRefine
Oh yeah, right! 




I had never thought of using the development version. Is there a real risk to work with it in production?

Thad Guidry

unread,
Jun 10, 2017, 8:23:43 AM6/10/17
to openr...@googlegroups.com
Not much at all.  Its an extremely minor risk...more like just annoyances that MIGHT happen.  Its because we are not introducing major fundamental changes under the covers.  We're still in 2.xx world after 7 years, and not moved into 3.xx

I and many others like Data Journalists use OpenRefine trunk (dev) build in production all the time.

Reply all
Reply to author
Forward
0 new messages