Reconciling with VIAF using OpenRefine

704 views
Skip to first unread message

kos...@gmail.com

unread,
Aug 16, 2020, 10:29:03 PM8/16/20
to OpenRefine

I've submitted a batch of new name authority records to VIAF earlier under the contributing organisation NLB and would now like to generate a list displaying the name authorities with the newly assigned VIAF ID. I tried to do this using the reconciliation service in OpenRefine, limiting to those contributed by NLB: http://refine.codefork.com/reconcile/viaf/NLB. However, the success rate of matching is low, with only 172 out of 686 name authorities being matched successfully. I tried to add the original ID of the name authorities (see column TTE UID below) as a property to match against MARC tag 001 in VIAF, but it doesn't seems to be working in OpenRefine. How do I improve the success rate of matching with VIAF?


Below is a sample of the Excel list I've uploaded to OpenRefine for reconciliation with VIAF. I'm using column VIAF Preferred Form for the reconciliation: http://127.0.0.1:3333/project?project=2519255424221&ui=%7B%22facets%22%3A%5B%5D%7D


TTE UIDTTE PTVIAF Preferred FormVIAF Category
18606982Abdul Hamid Haji JumatAbdul Hamid Haji JumatPersonal Names
18338826Agency for Integrated CareAgency for Integrated CareCorporate Names
18603820Alkaff MosqueAlkaff MosqueCorporate Names
18605517Alpha Gallery (Singapore)Alpha Gallery (Singapore)Corporate Names
18532899Alsagoff, Ali Redha, SyedAlsagoff, Ali Redha, SyedPersonal Names
18578460Ameerali R. JumabhoyAmeerali R. JumabhoyPersonal Names


Thank you.

Tom Morris

unread,
Aug 17, 2020, 2:37:07 PM8/17/20
to openr...@googlegroups.com
I'm not sure that the author of that reconciliation service follows this list, so you might do better creating an issue on their bug tracker here: https://github.com/codeforkjeff/conciliator/issues

It would be helpful if you provided a little more information such as the list of reconciled vs not reconciled names and/or the complete data set for people to experiment with.

A fews of things that I noticed in playing with this:
- choosing either organization or person reconciles *only* the entries for that type, so you'd need to do it twice
- choosing to reconcile against no type in particular doesn't match anything 
- the identifier (TTE UUID) can be used to reconcile, but it needs to be prefixed with "NLB|" in VIAF form e.g. NLB|18606982 (but the restrictions above still apply about only reconciling to a single type)

Another alternative is to just use the VIAF API directly by using the "Add column by fetching URL" with a URL of the form "http://www.viaf.org/viaf/sourceID/NLB%7C"+value+"/viaf.xml" and then parsing the result using something along the lines of value.parseXml().select('ns1|viafID')[0].xmlText() to extract the VIAF ID (or a similar expression if you want other values).

Hope that helps!

Tom

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/4bab23b4-e2e0-4194-bc1f-958357a77f62o%40googlegroups.com.

Owen Stephens

unread,
Aug 19, 2020, 7:52:43 AM8/19/20
to OpenRefine
On Monday, August 17, 2020 at 7:37:07 PM UTC+1 tfmo...@gmail.com wrote:
Another alternative is to just use the VIAF API directly by using the "Add column by fetching URL" with a URL of the form "http://www.viaf.org/viaf/sourceID/NLB%7C"+value+"/viaf.xml" and then parsing the result using something along the lines of value.parseXml().select('ns1|viafID')[0].xmlText() to extract the VIAF ID (or a similar expression if you want other values).


Just to +1 Tom's suggestion of using a "Add column by fetching URL" - this is almost certainly the most direct route to getting what you need - you already have unique identifiers (which unfortunately the particular reconciliation service doesn't work well with) so you can just go ahead and do a direct lookup.

Chow Yaw Huah

unread,
Aug 25, 2020, 4:03:14 AM8/25/20
to OpenRefine
I've tried to "add column by fetching URL" and key in the URL "http://www.viaf.org/viaf/sourceID/NLB%7C"+value+"/viaf.xml" in Expression using GREL. It doesn't seems to fetch any values. Am I missing something? And how do I parse the result?


Thank you.



Owen Stephens

unread,
Aug 25, 2020, 4:46:37 AM8/25/20
to OpenRefine
Hi

There's a page on the wiki which can help in this situation.


The first step is to check the option to "Store Error" (its just below where you set the column name when you use "add column by fetching URL"). This should store any errors generated when OpenRefine tries to retrieve the URL which will help in diagnosing what is going wrong here.

If you can't resolve the error please post back here with any additional information you've got from this and we can try to resolve

Best wishes

Owen

Chow Yaw Huah

unread,
Aug 26, 2020, 4:07:29 AM8/26/20
to OpenRefine
Hi Owen,

The error message: 
HTTP error 404 : Not Found | <html><head><title>Document not found</title></head><body><h2>404: Document not found</h2><hr/>OCLC SRW/SRU Server</body></html>

I notice there a change in the original UID of the name authorities in the preview after adding the URL "http://www.viaf.org/viaf/sourceID/NLB%7C"+value+"/viaf.xml" :

Is that causing the error?

Owen Stephens

unread,
Aug 26, 2020, 5:09:09 AM8/26/20
to OpenRefine
Yes - that's definitely the problem. It looks like you are storing the VIAF ID as a Number in OpenRefine (it will appear green in the data grid). When you use the GREL

"http://www.viaf.org/viaf/sourceID/NLB%7C"+value+"/viaf.xml"

The VIAF ID gets converted to a string - unfortunately the string representation chosen is "scientific notation" (1.8338826E7 means 1.8338826 * 10^7 or 1.8338826 * 10000000). That means what you are asking VIAF to look up isn't the ID - as far as VIAF is concerned you are searching for an ID that doesn't exist - and so you get a "Document not found" reply.

That the number value 18338826 gets converted to the string 1.8338826E7 I'd regard as a bug - this is completely non-intuitive and very unhelpful in this context - although generally I'd highly recommend only storing things as numbers if they can be treated as numbers in a meaningful way - e.g. if you could meaningfully add them together. In this case the VIAF ID isn't really a number - you can't do numerical operations like additional or multiplication on it in a meaningful way. Its really a bit of text that just happens to only contain the digits 0-9)

There are different ways to fix the problem, but I'd suggest that you first run the following GREL on the column that contains your VIAF ID:

value.floor().toString()

This should result in the VIAF ID being converted to a string in the correct format (i.e. not in scientific notation). Once this has been done you can use the "Add column by fetching URL" option again with the same formula and because you've already converted the VIAF ID to a string the formula should work as expected (i.e. you are trying to fetch URLs that look like http://www.viaf.org/viaf/sourceID/NLB%7C18338826/viaf.xml)

If you have any problems with this, let me know.

Owen

PS exactly why value.floor().toString() gives a different result to value.toString() I'm not 100% sure but I suspect value.floor() forces a change in how the number is stored behind the scenes and so when you convert to a string it doesn't use the scientific notation

Tom Morris

unread,
Aug 26, 2020, 11:53:42 AM8/26/20
to openr...@googlegroups.com
As Owen mentioned, identifiers should almost always be treated as strings, even if they consist of all digits, otherwise you end up dropping significant leading zeros, running into conversion problems, etc.

On Wed, Aug 26, 2020 at 5:09 AM Owen Stephens <ow...@ostephens.com> wrote:

PS exactly why value.floor().toString() gives a different result to value.toString() I'm not 100% sure but I suspect value.floor() forces a change in how the number is stored behind the scenes and so when you convert to a string it doesn't use the scientific notation

OpenRefine stores numbers internally in two different formats: Longs for integers and Doubles for numbers which can't be represented as integers. Since toNumber() will preferentially choose integers, these numbers probably got created through some other mechanism to have them be stored as doubles. The effect of floor() is easy to explain, it's specified to return an integer.

Don't forget that toString() accepts a format specifier that allows you complete control over how your numbers are formatted.

Tom

Thad Guidry

unread,
Aug 26, 2020, 2:30:58 PM8/26/20
to openr...@googlegroups.com
Ah darn!  We forgot to go back and update the Wiki docs for the toString() format argument!

Updating that now.
The Help tab on Expression Editor is already updated thankfully.



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Thad Guidry

unread,
Aug 26, 2020, 3:02:34 PM8/26/20
to openr...@googlegroups.com
Tom,

Where's the reference for the Number formatting?

I've also added a new issue to improve the error message returned.
https://github.com/OpenRefine/OpenRefine/issues/3133


Thad Guidry

unread,
Aug 26, 2020, 3:11:20 PM8/26/20
to openr...@googlegroups.com
Nevermind, I found it... it's basically this:  https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html

Which is the worst reference to place in front of our users...(sigh)

I like the way these 2 tables in this article provide a quick reference for the rather extremely verbose official reference.
https://dzone.com/articles/java-string-format-examples

Any suggestions on best approaches to document toString format argument options?  Do we just borrow those 2 tables and then provide a hyperlink to official reference for folks that want to dig deeper?


Chow Yaw Huah

unread,
Sep 3, 2020, 4:32:42 AM9/3/20
to OpenRefine
Thanks everyone! It's working perfectly fine now.

Jessica Breiman

unread,
Sep 29, 2020, 3:30:36 PM9/29/20
to OpenRefine
Hi Everyone, 

I am using the same Conciliator service to reconcile names against LCNAF. However, I would really like to grab the LCNAF ID number --using Conciliator I can only grab the VIAF ID. Any suggestions for how to get the LC ID?

Thanks!

Jessica

Reply all
Reply to author
Forward
0 new messages