Attempting to pull ULAN records with LCNAF & VIAF IDs


Nicholas Cipolla

Nov 12, 2020, 11:30:53 PM
to Getty Vocabularies as Linked Open Data
Hello,

We're attempting to pull and parse the ULAN dataset to get the following headers:
1. ULAN ID
2. Preferred Name
3. Non-Preferred Names (up to 5 in number)
4. Record Type  (Person or Corporate Body)
5. Preferred Nationality (code)
6. Preferred Birth Date
7. Preferred Death Date
8. LCNAF ID (Library of Congress)
9. VIAF ID

So far we've been using the XML version of the dataset, mainly because it is small enough to parse with a Python script that writes the relevant information to a CSV.

However, it seems that the XML version does not have all of the data of the web version and LOD versions (triples, JSON-LD). Specifically, not every record has the LCNAF or VIAF IDs that exist on the web or LOD version of the record. Furthermore, the LOD versions have a true URI link to the Source ID itself, not just the ID.

If this all makes sense, is it correct that the XML version is not as complete as the web/LOD versions of the record? If so, can a SPARQL query based on this thread (https://groups.google.com/g/gettyvocablod/c/wUNxdzrwP1c/m/EpJK4_lXAgAJ) produce the CSV that we're looking for? Or is there a repository of the LOD version that I'm unaware of? (The N-Triples version is 8 GB, and I cannot even open a file that large.)
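
On the file size: N-Triples holds one triple per line, so a file of any size can be read as a stream rather than opened whole. Below is a minimal Python sketch, assuming the dump carries the same skos:exactMatch links to id.loc.gov that appear on the LOD records (not checked against the actual download). The function name and the `needle` parameter are made up for illustration.

```python
# Predicate IRI for skos:exactMatch, written the way it appears in N-Triples.
SKOS_EXACT_MATCH = "<http://www.w3.org/2004/02/skos/core#exactMatch>"


def iter_exact_matches(lines, needle="id.loc.gov"):
    """Yield (subject_iri, object_iri) for skos:exactMatch triples
    whose object IRI contains `needle`.

    `lines` can be any iterable of strings, including an open file,
    so only one line is held in memory at a time.
    """
    for line in lines:
        # An N-Triples statement is: <subject> <predicate> <object> .
        parts = line.strip().split(None, 2)
        if len(parts) != 3 or parts[1] != SKOS_EXACT_MATCH:
            continue
        subject, _, rest = parts
        obj = rest.rstrip(" .")  # drop the terminating " ."
        # Keep IRI objects only; literals start with a double quote.
        if obj.startswith("<") and obj.endswith(">") and needle in obj:
            yield subject[1:-1], obj[1:-1]
```

Passing an open file handle (for example `open("ulan.nt", encoding="utf-8")`, where the file name is a placeholder) iterates lazily, and the pairs can be written straight to a CSV.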

many thanks in advance from a relative non-techie,
Nicholas Cipolla

Gregg Garcia

Nov 13, 2020, 2:07:51 PM
to Getty Vocabularies as Linked Open Data
Hi, Nicholas.

I'm not sure what you mean by "the XML version does not have all of the data of the web version and LOD versions."  Are you talking about the full XML download? The fields you describe are all available in both the full download and web services version of the XML. The XML downloads are in fact generated from the web services.

We are not currently tracking the VIAF ID in ULAN.

For the LCNAF ID, you can find it in the <Page> element of <Term_Source> or <Subject_Source> where the <Source_ID> is "2100149014/Library of Congress Authorities database (n.d.)".
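
A rough Python sketch of that lookup follows. It assumes each <Term_Source> or <Subject_Source> wraps a <Source><Source_ID> element with a sibling <Page>, which is inferred from fragments quoted in this thread rather than from the schema. The helper names are hypothetical, the <Page> value in the sample is a placeholder rather than a real LCNAF ID, and tags are matched by local name in case the export declares a namespace.

```python
import xml.etree.ElementTree as ET

LC_SOURCE_ID = "2100149014/Library of Congress Authorities database (n.d.)"


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def lcnaf_pages(subject_elem):
    """Return the <Page> text of every Term_Source/Subject_Source
    under `subject_elem` whose Source_ID is the Library of Congress entry."""
    pages = []
    for elem in subject_elem.iter():
        if local(elem.tag) not in ("Term_Source", "Subject_Source"):
            continue
        source_id = page = None
        for child in elem.iter():
            name = local(child.tag)
            if name == "Source_ID":
                source_id = (child.text or "").strip()
            elif name == "Page":
                page = (child.text or "").strip()
        if source_id == LC_SOURCE_ID and page:
            pages.append(page)
    return pages


# Placeholder record: the nesting is assumed and the Page value is invented.
SAMPLE = """
<Subject>
  <Term_Source>
    <Source><Source_ID>2100149014/Library of Congress Authorities database (n.d.)</Source_ID></Source>
    <Page>n 00000000</Page>
  </Term_Source>
  <Subject_Source>
    <Source><Source_ID>2100182972/Wikidata online (2000-)</Source_ID></Source>
    <Page>WKP Q15040624; accessed 31 January 2018</Page>
  </Subject_Source>
</Subject>
"""

if __name__ == "__main__":
    print(lcnaf_pages(ET.fromstring(SAMPLE)))
```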

Alternately, a SPARQL query gives the same information in URI form:

SELECT * WHERE {
  ?s skos:inScheme ulan: .
  ?s skos:exactMatch ?p FILTER(contains(str(?p), 'id.loc.gov'))
}

You can obtain the other fields you list using SPARQL queries like the ones highlighted in the discussion thread you mention, but the data is the same as in the XML.

Gregg Garcia
Software Architect
Getty Digital, J. Paul Getty Trust

Nicholas Cipolla

Nov 20, 2020, 3:29:19 PM
to Getty Vocabularies as Linked Open Data
Hi Gregg,

Thanks very much for your reply. We managed to craft a SPARQL query that gives us just about everything we needed. I'm including it here:

select ?x ?name ?type ?nationality ?birth ?died ?lcnaf {
  ?x gvp:broaderExtended ulan:500000002. # Persons, Artists
  optional {?x gvp:agentTypePreferred [gvp:prefLabelGVP [xl:literalForm ?type]]}
  optional {?x foaf:focus [gvp:nationalityPreferred [gvp:prefLabelGVP [xl:literalForm ?nationality]]]}
  optional {?x gvp:prefLabelGVP [xl:literalForm ?name]}
  optional {?x foaf:focus [gvp:biographyPreferred [gvp:estStart ?birth; gvp:estEnd ?died]]}
  optional {?x skos:exactMatch ?lcnaf FILTER(contains(str(?lcnaf), "id.loc.gov"))}
}

And a second one for alternative names:
select ?x ?altname {
  ?x gvp:broaderExtended ulan:500000002.
  optional {?x xl:altLabel [xl:literalForm ?altname]}
}

We also ran the same query for all five ULAN codes (persons, artists, corporate bodies, etc.). Because many alternate names can be attached to one ID, we pulled them separately in first normal form, one row per ID and name. Many thanks for the LCNAF ID query.
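
To get from those one-name-per-row results back to the "up to 5" non-preferred name columns, something like the following Python sketch could work. It assumes the rows have already been read from the result CSV as (ID, name) pairs; the function name and the `limit` parameter are made up for illustration.

```python
from collections import defaultdict


def pivot_alt_names(rows, limit=5):
    """Collapse (id, altname) pairs into {id: [name1, ..., name<limit>]}.

    Duplicate names are dropped, names beyond `limit` are ignored, and
    short lists are padded with empty strings so every record ends up
    with the same number of columns.
    """
    grouped = defaultdict(list)
    for record_id, name in rows:
        names = grouped[record_id]  # also registers IDs that have no alt name
        if name and name not in names and len(names) < limit:
            names.append(name)
    return {
        record_id: names + [""] * (limit - len(names))
        for record_id, names in grouped.items()
    }
```

The resulting lists can then be joined to the main result set on the ID column when writing the final CSV.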

A couple of things: I should have been clearer that the LOD version, by its very nature, uses URIs more widely than the XML version does. For example, using the TGN term for a nationality is more helpful to us than the old ULAN codes that began with "9". Apologies for implying that the XML version contained less data than the LOD version.
 
Also, I did see VIAF listed as a source for ULAN records. Here is what it looks like in the XML: <Source_ID>2100183299/VIAF: Virtual International Authority File [online] (2009-)</Source_ID>
Wikidata seems to be listed as well: <Source_ID>2100182972/Wikidata online (2000-)</Source_ID>

Occasionally, a VIAF ID is listed as well:
<Source_ID>2100183299/VIAF: Virtual International Authority File [online] (2009-)</Source_ID>
            </Source>
            <Page>29657151; accessed 29 July 2016</Page>
The URL https://viaf.org/viaf/29657151 will take you to the relevant record.
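
In Python, turning such a <Page> value into that URL could look like the sketch below. It assumes the VIAF ID is always the all-digit text before the first semicolon, which is based only on the one example above; the function name is hypothetical.

```python
def viaf_url_from_page(page_text):
    """Build a VIAF URL from a <Page> value such as
    '29657151; accessed 29 July 2016'.

    Returns None when the text before the first ';' is not all digits.
    """
    head = page_text.split(";", 1)[0].strip()
    if not head.isdigit():
        return None
    return f"https://viaf.org/viaf/{head}"
```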

Could you provide the queries for the VIAF and Wikidata IDs, if those exist in the LOD data? Or if I'm still misreading the data, please let me know!

Lastly, we're looking to grab the DisplayName, Birthplace, and Deathplace. Any aid on formulating these last three queries would be much appreciated!

--Nicholas

Nicholas Cipolla

Nov 20, 2020, 4:17:50 PM
to Getty Vocabularies as Linked Open Data
One addition: I found a Wikidata Q-ID in the XML: 
<Page>WKP Q15040624; accessed 31 January 2018</Page>
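
A small Python sketch for pulling the Q-ID out of a value like this and turning it into a Wikidata entity URI. It assumes the Q-ID can appear anywhere in the <Page> text and that the "WKP" prefix is not reliable enough to match on; the function name is hypothetical.

```python
import re

# A Wikidata item ID: "Q" followed by a positive integer.
QID = re.compile(r"\bQ[1-9]\d*\b")


def wikidata_uri_from_page(page_text):
    """Return the Wikidata entity URI for the first Q-ID found in a
    <Page> value, or None if the text contains no Q-ID."""
    match = QID.search(page_text)
    if match is None:
        return None
    return f"http://www.wikidata.org/entity/{match.group(0)}"
```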

Vladimir Alexiev

Feb 5, 2021, 4:48:45 AM
to Getty Vocabularies as Linked Open Data
You will have better luck pulling cross-references from Wikidata.
If you'd like some help with that, send me a personal email.
Cheers!

