Parse hierarchically deep XML retrieved from VIAF API

Marsha

May 15, 2013, 1:31:08 PM
to openr...@googlegroups.com
Hello, all,

First I should say that I'm a non-programmer (just learning GREL, or trying to). From a column of names in my OpenRefine project, I've set up a new column based on fetching data from VIAF, an international name authority file:
http://www.oclc.org/developer/services/viaf 
Although many national libraries have submitted, for each name indexed in their library catalog, the "authorized" form of that name, I really only want the "lc" (Library of Congress) authorized name and its identifier (e.g., n199298765). However, VIAF returns for each retrieved name every authorized and non-authorized name form from every participating library, as well as all kinds of additional information (such as publication titles associated with that name, biographical info, etc.). If there is more than one match on a name (from my OpenRefine column), additional VIAF records are returned. Results are in XML.

The XML hierarchy returned by VIAF is quite deep, e.g., record/viafID/source/mainHeading/data/text  for a name. I've seen a few recipes and articles on extracting the value of HTML elements and attributes fetched from a Web service (although HTML parsing can only be done using Jython rather than GREL? I don't know Java at all -- and will HTML parsing work on XML?), but the examples given are understandably simple, extracting data from HTML elements that are only two levels deep or so. In VIAF query results, moreover, a given element can be used several times at different hierarchical levels.

Has anyone ever parsed VIAF or equally lengthy and hierarchically deep query results in OpenRefine? Is there any hope that a non-programmer could do this?

Many thanks.

Marsha

Tom Morris

May 15, 2013, 6:00:27 PM
to openr...@googlegroups.com
Hi Marsha,


HTML parsing is available in GREL as well as Jython (and Clojure too). Also, VIAF can return results in JSON format as well as XML, if that's easier for you to deal with. The depth of the nesting isn't really that big a deal in terms of adding complexity, but you may want to get a more technical type to put together a little recipe for you. If you post an example of what you're trying to parse (or post a URL that returns it), someone on the list can probably help if you don't have a local resource.

Tom

Marsha

May 21, 2013, 3:07:02 PM
to openr...@googlegroups.com
Hello, Tom, and thanks for your reply. I wrote a long explanation of what I'm doing and how I've tried to do it, clicked "post," and watched it disappear. (So you see what you're dealing with here -- hopeless!).  :-)

First, can VIAF return results of SRU requests in JSON? It didn't seem that was possible from the Search Results description on http://www.oclc.org/developer/documentation/virtual-international-authority-file-viaf/search-results
I need to search on a column of names in OpenRefine; my search in GREL was:

"http://www.viaf.org/viaf/search?query=local.personalNames+all+%22"+escape(value, "url")+"%22&version=1.1&operation=searchRetrieve&recordSchema=http%3A%2F%2Fviaf.org%2FVIAFCluster&maximumRecords=10&startRecord=1&recordPacking=xml&sortKeys=holdingscount&httpAccept=text/xml"

The XML result, just for the name Vallee Rudy, was nearly 1500 lines long (viewing it in Oxygen). The XPath to the data I need is basically:

searchRetrieveResponse/records/record/recordData/ns2:VIAFCluster/ns2:mainHeadings/ns2:mainHeadingEl[2]/ns2:datafield/ns2:subfield[@code='a']  and  .../ns2:subfield[@code='d']  when .../ns2:sources/ns2:s has a value of "LC".
 
That gives me the name and dates of the LC authorized form of name. Given the same path, I also need the value of the sibling following ns2:sources, the value of ns2:id  (which is the LC ID number of the name).

The XML returned from VIAF for the pertinent ns2:mainHeadingEl is:

<ns2:mainHeadingEl>
    <ns2:datafield dtype="MARC21" ind1="1" ind2=" " tag="100">
        <ns2:subfield code="a">Valle&#x301;e, Rudy,</ns2:subfield>
        <ns2:subfield code="d">1901-1986</ns2:subfield>
    </ns2:datafield>
    <ns2:sources>
        <ns2:s>LC</ns2:s>
    </ns2:sources>
    <ns2:id>LC|n 82152282</ns2:id>

Is there a tutorial or description available that explains how to parse XML in OpenRefine? It looks as though parseHtml() may be used, and I can see that select() with an element name should extract an element value. Is it important to indicate the XPath to the desired elements and attributes? How is the value of an attribute extracted?

Actually, I'd prefer to reconcile my column of names against VIAF, but there's no way to do that at present, correct? Roderic Page established a reconciliation service for VIAF 
http://iphylo.blogspot.com/2013/04/reconciling-author-names-using-open.html 
but he says it's "fairly crude"; it seems it was mainly for testing purposes.

Thanks so much.

Marsha

Tom Morris

May 24, 2013, 12:57:31 PM
to openr...@googlegroups.com
Sorry for the delay in replying to this. I wanted to find a little time to play hands-on, but I should have answered the simple part first. Yes, parseHtml() should be able to parse your XML for you. You don't need to specify the full path if there's no ambiguity. For example, you could use:

value.parseHtml().select('ns2|mainHeadingEl')[2]

to get the third main heading (GREL arrays are zero-based, whereas XPath positions start at 1), instead of writing out searchRetrieveResponse/records/record/recordData/ns2:VIAFCluster/ns2:mainHeadings/ns2:mainHeadingEl[2]/

You can get the attributes using the htmlAttr() function.  htmlText and ownText are other relevant functions.

You could probably use the array iterators and the if() control to get everything you need in a single pass, but I might be tempted to extract the desired subfields and source code, concatenate them all together with something identifiable like a pipe (|) separator, and then do the filtering as a separate step in Refine.
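For anyone who wants to try the extract-then-filter idea outside OpenRefine, here is a minimal standalone Python sketch (not GREL). It rests on a few assumptions: the `ns2` prefix is taken to be bound to `http://viaf.org/viaf/terms#` (check the `xmlns:ns2` declaration in the real response), the sample wraps the fragment quoted earlier in the thread in a root element and closes it so that it parses, and the function name `heading_rows` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI for the ns2 prefix; verify against the real response.
NS = {"ns2": "http://viaf.org/viaf/terms#"}

# The fragment from the thread, wrapped and closed so it is well-formed XML.
SAMPLE = """<ns2:mainHeadings xmlns:ns2="http://viaf.org/viaf/terms#">
  <ns2:mainHeadingEl>
    <ns2:datafield dtype="MARC21" ind1="1" ind2=" " tag="100">
      <ns2:subfield code="a">Valle&#x301;e, Rudy,</ns2:subfield>
      <ns2:subfield code="d">1901-1986</ns2:subfield>
    </ns2:datafield>
    <ns2:sources>
      <ns2:s>LC</ns2:s>
    </ns2:sources>
    <ns2:id>LC|n 82152282</ns2:id>
  </ns2:mainHeadingEl>
</ns2:mainHeadings>"""


def heading_rows(xml_text):
    """Return one dict per mainHeadingEl with its subfields, sources and id."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() finds the element at any depth, so the full path is not needed.
    for el in root.iter("{%s}mainHeadingEl" % NS["ns2"]):
        # Map each subfield's "code" attribute to its text content.
        subfields = {
            sf.get("code"): (sf.text or "").strip()
            for sf in el.findall("ns2:datafield/ns2:subfield", NS)
        }
        rows.append({
            "name": subfields.get("a", ""),
            "dates": subfields.get("d", ""),
            "sources": [s.text for s in el.findall("ns2:sources/ns2:s", NS)],
            "id": el.findtext("ns2:id", default="", namespaces=NS),
        })
    return rows


# Step 1: extract everything. Step 2: filter for the LC source separately.
rows = heading_rows(SAMPLE)
lc_rows = [r for r in rows if "LC" in r["sources"]]
for r in lc_rows:
    print(" | ".join([r["name"], r["dates"], r["id"]]))
```

Keeping extraction and filtering as two steps mirrors the suggestion above: the extracted rows can be inspected first, and the "LC" test changed later without touching the parsing.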

It looks like specifying a content type of application/json doesn't work with the search API, which is unfortunate since that would make things a lot simpler. What about the application/rss-xml content type? It looks simpler, if it has the information that you need, perhaps with a separate query to fetch the target items.

Tom



Marsha

Jun 3, 2013, 1:46:28 PM
to openr...@googlegroups.com
Now I'm sorry for not replying sooner; I can only work on the VIAF project from time to time. With your help, Tom, I think I have something that will work now.

Because VIAF is not available as a source for reconciliation, I have to use the VIAF API. (Wouldn't it be nice if OCLC offered this? If not, I wonder whether there are linked data services we might expect to establish major data sets like VIAF as reliable and up-to-date reconciliation services.) The closest approximation to reconciliation might be the VIAF AutoSuggest request type, which offers several alternative name forms and source-based identifiers, and does so in JSON.

I still don't quite understand which characters need to be percent-encoded in GREL expressions that query APIs; some reserved characters seem OK as they are in a query string, while others need to be percent-encoded. Anyway, what worked here was simply:

"http://www.viaf.org/viaf/AutoSuggest?query="+escape(value, "url")  for a column of names in the format  Vallee, Rudy

The results in JSON were, e.g.:

{
    "query": "vallee, rudy",
    "result": [
        {
            "term": "Vallée, Rudy, 1901-1986",
            "lc": "n82152282",
            "dnb": "119472015",
            "bnf": "12404039",
            "bne": "xx4579281",
            "nla": "000035575290",
            "viafid": "61733278"
        },
        {
            "term": "Vallée, Rudy, 1901-",
            "lc": "n82152282",
            "dnb": "119472015",
            "bnf": "12404039",
            "bne": "xx4579281",
            "nla": "000035575290",
            "viafid": "61733278"
        },
        {
            "term": "Vallee, Rudy",
            "lc": "n82152282",
            "dnb": "119472015",
            "bnf": "12404039",
            "bne": "xx4579281",
            "nla": "000035575290",
            "viafid": "61733278"
        },
        {
            "term": "Vallee, Rudy",
            "lc": "no2008185615",
            "viafid": "53980359"
        },
        {
            "term": "Vallée, Rudy",
            "viafid": "284160151"
        }
    ]
}
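On the percent-encoding question raised above: characters such as `&`, `=`, `#`, `+` and the space have a special meaning inside a query string, so they need encoding whenever they are part of a value. Encoding the whole value is always safe, because letters, digits and a few marks such as `-`, `_` and `.` pass through unchanged. The standalone Python sketch below (not GREL) illustrates this with the AutoSuggest URL from this post; the helper name `autosuggest_url` and the "Simon & Garfunkel" value are made up for illustration.

```python
from urllib.parse import quote_plus

BASE = "http://www.viaf.org/viaf/AutoSuggest?query="


def autosuggest_url(name):
    """Percent-encode the whole name before appending it to the query string."""
    return BASE + quote_plus(name)


# The third name is a hypothetical value, included only to show how "&" is handled.
for name in ["Vallee, Rudy", "Vallée, Rudy", "Simon & Garfunkel"]:
    print(autosuggest_url(name))
```

In the output the comma becomes `%2C`, each space becomes `+`, the accented é becomes the two UTF-8 bytes `%C3%A9`, and `&` becomes `%26`, so it can no longer be mistaken for a parameter separator.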

With a little research, I discover that the heading I need is Vallee, Rudy, 1901-1986. To parse the JSON AutoSuggest results to get this heading and its LC identifier, the following GREL worked:

forEach(value.parseJson().result,v,v.term+" ;; "+v.lc)[0]

The first term/lc pairing is generally the one I need, but that won't always be the case, so I guess filtering for the existence of an "lc" element in the "result" array would be the best approach? I'll always need to evaluate (manually) the various headings returned for a given name, but that is similar to reconciliation in a way.
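As a rough illustration of that filtering idea, here is a standalone Python sketch (not GREL) that keeps only the results carrying an "lc" key. The sample is abridged from the AutoSuggest response shown above, and the function name `lc_candidates` is made up for illustration.

```python
import json

# Abridged from the AutoSuggest response shown earlier in this post.
SAMPLE = """{
  "query": "vallee, rudy",
  "result": [
    {"term": "Vallée, Rudy, 1901-1986", "lc": "n82152282", "viafid": "61733278"},
    {"term": "Vallee, Rudy", "lc": "no2008185615", "viafid": "53980359"},
    {"term": "Vallée, Rudy", "viafid": "284160151"}
  ]
}"""


def lc_candidates(json_text):
    """Return (term, lc) pairs for every result that has an LC identifier."""
    # "or []" guards against a missing or null "result" value.
    results = json.loads(json_text).get("result") or []
    return [(r["term"], r["lc"]) for r in results if r.get("lc")]


for term, lc in lc_candidates(SAMPLE):
    print(term + " ;; " + lc)
```

The entry without an "lc" key is dropped, while both LC-bearing candidates remain, so the choice between them is still left to manual review.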

Thank you so very much for all your help. This non-programmer really appreciates the time and effort you've taken to explain things in such detail.

Marsha