Getting "relevant" answers from AAT


Vladimir Alexiev

Mar 29, 2015, 11:27:20 AM
to gettyv...@googlegroups.com
Cristiano Bianchi of Keep Thinking asks:
We use the query below to search AAT - but the results we get back are very random - as there does not seem to be a notion of hierarchy.
Ideally, if we search for 'boat' we'd like to get the most relevant results back, before delving back to banana boats...

SELECT DISTINCT ?subject ( COALESCE(?gvp_term, ?term, ?gvp_term, ?term) as ?label) ?parent ( COALESCE(?scope_note, ?scope_note_en) as ?scope_note)
WHERE {
  ?subject luc:term "boat"; skos:inScheme aat: ; a ?typ.
  ?typ rdfs:subClassOf gvp:Concept; rdfs:label ?type.
  OPTIONAL {?subject gvp:prefLabelGVP [skosxl:literalForm ?gvp_term]}
  OPTIONAL {?subject gvp:prefLabelGVP [dct:language gvp_lang:nl; skosxl:literalForm ?gvp_term_nl]}
  OPTIONAL {?subject skosxl:prefLabel|skosxl:altLabel [dct:language gvp_lang:nl; skosxl:literalForm ?term_nl]}
  OPTIONAL {?subject skosxl:prefLabel|skosxl:altLabel [skosxl:literalForm ?term]}
  OPTIONAL {?subject skos:scopeNote [dct:language gvp_lang:nl; rdf:value ?scope_note_nl]}
  OPTIONAL {?subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?scope_note_en]}
  OPTIONAL {?subject gvp:parentString ?parent}}
ORDER BY ?term


Hi Cristiano!

Before trying to answer your question, there are a few problems with this query:
  • COALESCE(?gvp_term, ?term, ?gvp_term, ?term) as ?label: there's no point in coalescing the same variable twice.
  • gvp:prefLabelGVP [dct:language gvp_lang:nl]: there's always exactly one gvp:prefLabelGVP (usually, but not always, in en), so there's no point in filtering gvp:prefLabelGVP by language.
  • You just need COALESCE(?term_nl, ?gvp_term) and don't need to bother with the other two.
  • "?subject a ?typ. ?typ rdfs:subClassOf gvp:Concept" can be stated simply as "?subject a gvp:Concept" because gvp:Concept has no subclasses.
  • Come to think of it, you can also skip "skos:inScheme aat:" because gvp:Concept (unlike skos:Concept) lives only in aat:
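
Putting these points together, the query might be simplified roughly as follows. This is only a sketch: it assumes the endpoint predefines the same prefixes as the original query, and it assumes the scope note was meant to fall back from Dutch to English (the original coalesced a variable that is never bound). ORDER BY is left out on purpose.

```sparql
# Prefixes (luc:, gvp:, gvp_lang:, skosxl:, skos:, dct:, rdf:) are assumed
# to be predefined by the endpoint, as in the original query.
SELECT DISTINCT ?subject (COALESCE(?term_nl, ?gvp_term) AS ?label) ?parent
       (COALESCE(?scope_note_nl, ?scope_note_en) AS ?scope_note)
WHERE {
  # gvp:Concept has no subclasses and lives only in aat:,
  # so no subclass check or skos:inScheme is needed.
  ?subject a gvp:Concept; luc:term "boat".
  # The single GVP-preferred label, whatever its language.
  OPTIONAL {?subject gvp:prefLabelGVP [skosxl:literalForm ?gvp_term]}
  # Any Dutch preferred or alternate label.
  OPTIONAL {?subject skosxl:prefLabel|skosxl:altLabel [dct:language gvp_lang:nl; skosxl:literalForm ?term_nl]}
  # Assumption: prefer the Dutch scope note, fall back to English.
  OPTIONAL {?subject skos:scopeNote [dct:language gvp_lang:nl; rdf:value ?scope_note_nl]}
  OPTIONAL {?subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?scope_note_en]}
  OPTIONAL {?subject gvp:parentString ?parent}
}
```

A concept with several Dutch labels still produces one row per label, just as in the original query.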

When I run the query, I see what you mean. But since you order by ?term, you get what you asked for :-)
Let's try some queries against http://vocab-beta.getty.edu/sparql
  • Lucene has a notion of relevance score. Its use is described here: https://confluence.ontotext.com/display/OWLIMv54/OWLIM-SE+Full-text+Search. It is the default ordering. You can see the values with the query below. The result set and score are sensitive to the keyword: "boat", "boats" and "boat*" all produce different results (the wildcard makes all scores equal; I guess that's a peculiarity of Lucene).
SELECT ?score ?Subject ?Term ?Parents ?ScopeNote {
  ?Subject a skos:Concept; luc:term "boats"; skos:inScheme aat: ;
     gvp:prefLabelGVP [skosxl:literalForm ?Term]; luc:score ?score.
  optional {?Subject gvp:parentStringAbbrev ?Parents}
  optional {?Subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
}
  • Not sure what you mean by hierarchy, but I assume you want to get the higher-level concepts first. Here's a query for that: it counts the commas in parentString (yes, a hack). Remember that the result set and score are sensitive to the keyword.
SELECT ?level ?score ?Subject ?Term ?Parents ?ScopeNote {
  ?Subject a skos:Concept; luc:term "boats"; skos:inScheme aat: ;
     gvp:prefLabelGVP [skosxl:literalForm ?Term]; luc:score ?score.
  optional {?Subject gvp:parentString ?Parents}
  bind(strlen(replace(?Parents,"[^,]+","")) as ?level)
  optional {?Subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
} order by ?level desc(?score)
  • If you want only objects (not "boating" nor "boatmen"), you should limit to the Objects Facet:
SELECT ?level ?score ?Subject ?Term ?Parents ?ScopeNote {
  ?Subject a gvp:Concept; luc:term "boat";
     gvp:broaderExtended [rdfs:label "Objects Facet"@en];
     gvp:prefLabelGVP [skosxl:literalForm ?Term]; luc:score ?score.
  optional {?Subject gvp:parentString ?Parents}
  bind(strlen(replace(?Parents,"[^,]+","")) as ?level)
  optional {?Subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
} order by ?level desc(?score)
  • Let's try to look at the prefLabelGVP and put first those concepts where the keyword is the first word of the prefLabel:
SELECT ?prefix ?level ?score ?Subject ?Term ?Parents ?ScopeNote {
  ?Subject a gvp:Concept; luc:term "boat";
     gvp:broaderExtended [rdfs:label "Objects Facet"@en];
     gvp:prefLabelGVP [skosxl:literalForm ?Term]; luc:score ?score.
  optional {?Subject gvp:parentString ?Parents}
  bind(strlen(replace(?Parents,"[^,]+","")) as ?level)
  bind(regex(?Term,"^boat","i") as ?prefix)
  optional {?Subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
} order by desc(?prefix) ?level desc(?score)
  • You could also filter to only those concepts where the keyword appears in any prefLabel:
SELECT ?prefix ?level ?score ?Subject ?Term ?Parents ?ScopeNote {
  ?Subject a gvp:Concept; luc:term "boat";
     gvp:broaderExtended [rdfs:label "Objects Facet"@en];
     gvp:prefLabelGVP [skosxl:literalForm ?Term]; luc:score ?score.
  optional {?Subject gvp:parentString ?Parents}
  bind(strlen(replace(?Parents,"[^,]+","")) as ?level)
  bind(regex(?Term,"^boat","i") as ?prefix)
  filter exists{?Subject skos:prefLabel ?prefLabel. filter(regex(?prefLabel,"boat","i"))}
  optional {?Subject skos:scopeNote [dct:language gvp_lang:en; rdf:value ?ScopeNote]}
} order by desc(?prefix) ?level desc(?score)
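
The comma-counting hack used for ?level in the queries above can be checked in isolation. A minimal sketch, assuming parentString is a comma-separated list of ancestors; the sample strings are made up for illustration and are not real AAT data:

```sparql
SELECT ?Parents ?level {
  # Hypothetical parent strings with 0, 1 and 2 commas.
  VALUES ?Parents { "Top" "Top, Middle" "Top, Middle, Leaf" }
  # Delete every run of non-comma characters, leaving only the commas,
  # then count what is left.
  BIND(strlen(replace(?Parents, "[^,]+", "")) AS ?level)
}
```

This should yield levels 0, 1 and 2 respectively. In the queries above, ?Parents comes from an OPTIONAL, so a concept without a parentString gets no ?level at all; SPARQL sorts such unbound values first in ascending order.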


But none of these queries is quite what we need, and they are hacky. I would expect the Lucene score to be higher if the keyword appears at the beginning of a preferred label. I think the problem is that luc:term includes all labels, in no particular order, and the more terms match the keyword, the higher the score.

I think that making an index by prefLabels only would resolve most problems. But is this what you need? E.g., it won't find "frostbiting" aka "frostbite boating", see http://vocab-beta.getty.edu/aat/300262618?inference=all

In other words, please define what you mean by "relevant".

Vladimir Alexiev

Mar 29, 2015, 11:50:49 AM
to gettyv...@googlegroups.com
luc:term includes all labels; in no particular order

All labels are inferred into the field rdfs:label. I think the concatenation order for the luc:term index (the so-called FTS "molecule") would be the same as the order listed for rdfs:label. E.g., for the first match of the first query, http://vocab-beta.getty.edu/aat/300232754?inference=all, the order is:

boats, bushwack@en
buschwack boat@es
buschwack boats@es
bushwack boat@nl
bushwack boat@en
bushwack boats@nl
bushwack boats@es,bushwack boats@en ...

And you can see why its score is high: the random alt label placed first starts with the keyword (the pref label comes later), and there are many alt labels containing the keyword (including English ones propagated to other languages), i.e. many occurrences. These problems would be largely resolved if we make an index by prefLabels only.
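
To inspect which labels feed the index for a given concept, a query like the following could be used. It is a sketch that assumes the endpoint predefines aat: as <http://vocab.getty.edu/aat/> and includes the inferred rdfs:label statements; the order in which a SELECT returns rows is not guaranteed to match the concatenation order in the index.

```sparql
# List every label of the concept together with its language tag.
SELECT ?label ?lang {
  aat:300232754 rdfs:label ?label
  BIND(lang(?label) AS ?lang)
}
```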

BTW the Getty site doesn't do better for ordering.

Vladimir Alexiev

May 13, 2015, 7:57:48 AM
to gettyv...@googlegroups.com, vlad...@sirma.bg
So, does anyone need a Lucene index by prefLabels only?