Querying en AND en-US


cbut...@colby.edu

Jun 16, 2015, 10:11:59 PM
to gettyv...@googlegroups.com
I have been using this SPARQL query in an OpenRefine reconciliation service that takes human-readable material labels for objects and returns the prefLabelGVP and AAT# for possible matches in the AAT. After breaking down the label to component parts (e.g., "Oil on canvas" to "oil," "canvas"), it searches the vocabularies for matches to each of those terms:

select distinct * {?entry (xl:prefLabel|xl:altLabel)/gvp:term "termQuery"@en .
?entry gvp:prefLabelGVP/xl:literalForm ?label}


This query has worked surprisingly well, returning a limited number of relevant results--well enough that I haven't needed to restrict the query to the materials or objects facets yet. One edge case, though, is objects of xl:prefLabel predicates that have variant US English and British English spellings. These statements (for example in watercolor, http://vocab.getty.edu/aat/300015045 or drypoints, http://vocab.getty.edu/aat/300041349) appear to have no terms bound to @en at all, and exclusively use @en-US and @en-GB. As a result, the query above returns no results for terms like these.

Is there a way to construct this query to search @en-US as well as @en gvp:terms? I was delighted that such a simple query could get me so far, but perhaps there is a better way. If I can, I'd like to avoid a Lucene index query, as it always seems to add unnecessary noise to my results (for this use case).
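
One possible Lucene-free direction, as a rough sketch only: list the tagged literals explicitly with VALUES, so the exact-match pattern is tried once per language tag. This assumes the store compares language tags exactly as written (so the tag casing may need adjusting to match the data), and that xl: and gvp: are predeclared by the endpoint as in the query above. "watercolor" stands in for termQuery.

```sparql
# Sketch only: try the same exact term under each language tag of interest.
select distinct ?entry ?label {
  values ?term {"watercolor"@en "watercolor"@en-US}
  ?entry (xl:prefLabel|xl:altLabel)/gvp:term ?term;
         gvp:prefLabelGVP/xl:literalForm ?label}
```

Each additional tag needs its own entry in the VALUES list, so this only covers the tags that are known in advance.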

Vladimir Alexiev

Jun 17, 2015, 4:09:32 AM
to gettyv...@googlegroups.com, cbut...@colby.edu
Hi! A very good question and very good suggestion to use the pure gvp:term.

If you're not sure about the language, you can check with str() equality. But if you do that over all labels, that will be very slow.
So you can first use Lucene to restrict to a small result set, THEN use equality to improve precision.
E.g., this finds just 3 entries (including "oiling", because that is "oil (process)"):

select * {
  ?entry luc:term "oil";
     gvp:prefLabelGVP/xl:literalForm ?label.
  filter exists {
    ?entry (xl:prefLabel|xl:altLabel)/gvp:term ?term.
    filter (str(?term)="oil")}}

I also tried to filter to EN with langMatches(?term,"en") (see langMatches()) but there's some bug in the SPARQL processor, I've posted it.

With your permission I'll add this example to "Sample queries". And when you're done with the reconciliation, please describe it in some detail and I'll add it to our list of Usages. Thanks!


cbut...@colby.edu

Jun 18, 2015, 2:10:04 PM
to gettyv...@googlegroups.com, cbut...@colby.edu
That query is exactly what I was looking for, thanks Vladimir! The Lucene index search solves another outstanding issue for this query--case insensitivity in searches for terms with acronyms (and thus forced-capitalized labels like "C-print"). This is what our query looks like now, with the addition of Lucene scores to help sort larger result sets:

select ?entry ?label ?score {
  ?entry luc:term "c-print";
     gvp:prefLabelGVP/xl:literalForm ?label;
     luc:score ?score.
  filter exists {
    ?entry (xl:prefLabel|xl:altLabel)/gvp:term ?term.
    filter (lcase(str(?term)) = "c-print")}}
order by desc(?score)

Feel free to add this as a sample query. If you're interested, I can also post the Python script that runs the reconciliation service once I've got it complete.

Vladimir Alexiev

Jun 19, 2015, 7:01:18 AM
to gettyv...@googlegroups.com
> case insensitivity in searches for terms with acronyms (and thus forced-capitalized labels like "C-print")

Good suggestion!

> addition of lucene scores to help sort larger result sets:

AFAIK if you don’t specify order, it orders by luc:score. Can you confirm or disconfirm this? I say it somewhere in the doc...

> Feel free to add this as a sample query.

Will appear in ver 3.2.
I also expanded the OpenRefine section and added a reference to your query since it's more appropriate for reconciliation.

> If you're interested, I can post the python script that runs the reconciliation service when I've got it complete, as well.

Better describe it in a bit more detail (e.g. in a blog) including background, and we’ll add it to Usage stories.
Cheers!

Vladimir Alexiev

Jun 19, 2015, 7:14:40 AM
to gettyv...@googlegroups.com, vlad...@sirma.bg, cbut...@colby.edu
> I also tried to filter to EN with langMatches(?term,"en") but there's some bug in the SPARQL processor

There is no bug. The right way to invoke it is: langMatches(lang(?term),"en")
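
For illustration, a sketch of how that call could slot into the Lucene-restricted query from earlier in the thread. Since langMatches with the range "en" accepts "en" as well as regional subtags such as "en-US" and "en-GB", this should cover the watercolor case from the original question. The search term is only an example, and the luc:, gvp: and xl: prefixes are assumed to be predeclared by the endpoint, as in the other queries here.

```sparql
# Sketch only: Lucene narrows the candidates, then the filter keeps
# entries with an English term (any regional variant) equal to the query.
select ?entry ?label {
  ?entry luc:term "watercolor";
     gvp:prefLabelGVP/xl:literalForm ?label.
  filter exists {
    ?entry (xl:prefLabel|xl:altLabel)/gvp:term ?term.
    filter (langMatches(lang(?term), "en") && lcase(str(?term)) = "watercolor")}}
```

The lcase() comparison keeps the case-insensitive matching used for "c-print" above; drop it if exact casing matters.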