Lucene FTS and ULAN prefName/altName

Matthew Lincoln

Jun 10, 2015, 3:45:36 PM
to gettyv...@googlegroups.com
I've been working on an R package for matching ULAN IDs to a list of artist names, optionally filtering by life dates: https://github.com/mdlincoln/ulanr

I'd like to search against BOTH prefLabel AND altLabel for each artist. Right now I'm using the following query, which takes a required argument NAME and the optional arguments EARLYDATE and LATEDATE (if these are not specified, the query is constructed with -9999 and 2099, respectively):

SELECT ?id ?pref_name ?startdate ?enddate ?gender ?nationality
WHERE {
  ?artist skos:inScheme ulan: ;
    luc:term NAME ;
    rdf:type gvp:PersonConcept ;
    dc:identifier ?id ;
    gvp:prefLabelGVP [gvp:term ?pref_name] .

  ?artist foaf:focus [gvp:biographyPreferred ?bio] .
  ?bio gvp:estStart ?startdate ;
    gvp:estEnd ?enddate .

  OPTIONAL {
    ?bio schema:gender [gvp:prefLabelGVP [gvp:term ?gender]] .
  }

  OPTIONAL {
    ?focus gvp:nationalityPreferred [gvp:prefLabelGVP [gvp:term ?nationality]] .
  }

  FILTER(?startdate >= EARLYDATE^^xsd:gYear && ?enddate <= LATEDATE^^xsd:gYear)
}
LIMIT 1
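
For illustration only, the placeholder substitution described above could look roughly like this Python sketch. It is not the ulanr code (ulanr is an R package); `build_query`, `sparql_literal`, and `gyear` are hypothetical names, the OPTIONAL clauses are left out for brevity, and it assumes NAME is sent as a quoted string literal and the dates as quoted xsd:gYear literals.

```python
from string import Template

# Shortened version of the query above; $name, $early and $late stand in
# for the NAME, EARLYDATE and LATEDATE placeholders.
QUERY = Template("""\
SELECT ?id ?pref_name ?startdate ?enddate
WHERE {
  ?artist skos:inScheme ulan: ;
    luc:term $name ;
    rdf:type gvp:PersonConcept ;
    dc:identifier ?id ;
    gvp:prefLabelGVP [gvp:term ?pref_name] .
  ?artist foaf:focus [gvp:biographyPreferred ?bio] .
  ?bio gvp:estStart ?startdate ;
    gvp:estEnd ?enddate .
  FILTER(?startdate >= $early && ?enddate <= $late)
}
LIMIT 1
""")


def sparql_literal(text: str) -> str:
    """Render text as a double-quoted SPARQL string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def gyear(year: int) -> str:
    """Render a year as an xsd:gYear literal with at least four digits."""
    sign = "-" if year < 0 else ""
    return f'"{sign}{abs(year):04d}"^^xsd:gYear'


def build_query(name: str, early_date: int = -9999, late_date: int = 2099) -> str:
    """Fill the template; the date defaults are the ones stated in the post."""
    return QUERY.substitute(
        name=sparql_literal(name),
        early=gyear(early_date),
        late=gyear(late_date),
    )


if __name__ == "__main__":
    print(build_query("Hendrik Hondius", 1500, 1700))
```

Escaping the name before substitution matters because whatever is passed to luc:term is interpreted as a Lucene query, so quotes inside the name must survive the SPARQL layer intact.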

The program returns a table containing all the bindings from the SPARQL query, along with a column for the originally submitted name. If I understand correctly, this search will include both prefLabels AND altLabels, and will return results ordered by score?

I find that the Lucene index seems to handle variant spellings fine, but it behaves unexpectedly when searching against names that include numbers. For example:

> ulan_data(c("Hendrik Hondius (I)", "Hendrick Hondius (I)", "Hendrik Hondius", "Hendrick Hondius"))
Source: local data frame [4 x 7]

                  name        id           pref_name birth_year death_year gender nationality
1  Hendrik Hondius (I) 500006788 Hondius, Hendrik, I       1573       1650   male       Dutch
2 Hendrick Hondius (I) 500006788 Hondius, Hendrik, I       1573       1650   male       Dutch
3      Hendrik Hondius 500116744    Hondius, Hendrik       1615       1677   male       Dutch
4     Hendrick Hondius 500006787     Hondius, Gerrit       1891       1970   male       Dutch

The fourth query result was a bit of a surprise, given that none of the prefLabels or altLabels for 500006787 (Hondius, Gerrit) contain Hendrick/Hendrik. I would at least have expected it to return 500116744 (Hondius, Hendrik). Any thoughts?

Vladimir Alexiev

Jun 11, 2015, 3:46:09 PM
to gettyv...@googlegroups.com, matthew....@gmail.com
Hi Matthew!
It's all about the Lucene query that you use: luc:term is a query against the corresponding index (which indeed includes prefLabel and altLabel), not a literal string.
A couple of days ago we added some examples at http://vocab.getty.edu/queries#Exact-Match_Full_Text_Search_Query.
You can use triple quotes to delimit the SPARQL string literal, so you won't have to escape a quote that you want to pass to Lucene.

It's not intuitive at all. Examples:
- """ "Hendrik Hondius" """ : 3 results. This is an exact phrase search.
- """Hendrik AND Hondius""" : 3 results. This allows the two names to be swapped, but GVP already covers that (the prefLabel is in "Indexing" form, "Last, First", and an altLabel is in "Display" form).
- """Hendrik Hondius""" : 337 results. This searches for EITHER word, ranking results that include both words first.
- """Hendrik and Hondius""" : 337 results. Lowercase "and" is an English stop word, so it is removed from the query, which then behaves like the previous one.
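
As a rough sketch of how a client could produce the first two forms above from a plain name, here is some Python. The helper names (`exact_phrase`, `all_words`, `long_literal`) are made up for illustration, and the handling of embedded quotes (they are simply dropped) is an assumption, not something the GVP endpoint requires.

```python
def exact_phrase(name: str) -> str:
    """Lucene phrase query: the words must appear together, in order."""
    # Assumption: drop embedded double quotes and collapse whitespace.
    cleaned = " ".join(name.replace('"', " ").split())
    return f'"{cleaned}"'


def all_words(name: str) -> str:
    """Lucene boolean query: every word must appear, in any order."""
    return " AND ".join(name.replace('"', " ").split())


def long_literal(lucene_query: str) -> str:
    """Wrap a Lucene query in a triple-quoted SPARQL string literal."""
    if '"""' in lucene_query or "\\" in lucene_query:
        raise ValueError("query needs escaping that this sketch does not do")
    # Pad with spaces when the query starts or ends with a quote, so the
    # inner quote does not run into the triple-quote delimiter.
    if lucene_query.startswith('"') or lucene_query.endswith('"'):
        lucene_query = f" {lucene_query} "
    return f'"""{lucene_query}"""'


if __name__ == "__main__":
    for build in (exact_phrase, all_words):
        print("luc:term", long_literal(build("Hendrik Hondius")), ";")
```

Note that AND has to be uppercase to act as an operator; as the last example above shows, a lowercase "and" is just a stop word.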