Stop words in search terms


Rolf Blijleven

May 5, 2018, 5:27:48 PM
to Getty Vocabularies as Linked Open Data
Hello again, 

Still on the subject of an interface service that queries the endpoint and returns Adlib XML, I'm looking into stop word removal, as suggested here, specifically:

 Lucene doesn’t index stop words, you should [..] remove them from the query, [otherwise], queries like [...] “arts and crafts” won’t find anything.

So I tried: 

    SELECT * WHERE {
        ?subject luc:term "arts* and crafts*"; xl:prefLabel [dct:language gvp_lang:en; xl:literalForm ?term]
    }

    SELECT * WHERE {
        ?subject luc:term "arts* crafts*"; xl:prefLabel [dct:language gvp_lang:en; xl:literalForm ?term]
    }

    SELECT * WHERE {
        ?subject luc:term "arts and crafts*"; xl:prefLabel [dct:language gvp_lang:en; xl:literalForm ?term]
    }

Somewhat to my surprise, the first two yield 116 hits and the third 112; all three return 'Arts and Crafts (movement)@en' as the topmost result. The same happens with "Library of congress*" as the search term. By contrast, "arts* AND crafts*" yields only 3 hits. So 'and' is a stop word, while 'AND' is an operator.

In itself, the result is perfectly usable, but it's not what I would expect from the documentation. Even when the capitalised stop word acts as an operator, the result is still OK.

I'm now trying to think of an example where stop word removal is necessary. I haven't found one yet.

Any comments? 

thanks, 
Rolf 


 

Vladimir Alexiev

May 10, 2018, 10:07:05 AM
to Getty Vocabularies as Linked Open Data
You're right; I've made a note to update the documentation. I guess the latest update of GraphDB comes with an update of Lucene that handles this better.

You get the most precise results with a phrase search
'  "arts crafts"  '

or a conjunction
' arts AND crafts '
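
As a sketch only, the two suggestions might be written as full queries in the same shape as the ones earlier in the thread. This assumes the same `luc:term` predicate and that the endpoint predeclares the `luc:`, `xl:`, `dct:` and `gvp_lang:` prefixes, as the original queries do; hit counts are not shown because they were not checked here.

```sparql
# Run each query separately.

# Phrase search: the inner double quotes belong to the Lucene query string,
# so the SPARQL literal itself is delimited with single quotes.
SELECT * WHERE {
    ?subject luc:term '"arts crafts"'; xl:prefLabel [dct:language gvp_lang:en; xl:literalForm ?term]
}

# Conjunction: the upper-case AND is the Lucene operator, not the stop word 'and'.
SELECT * WHERE {
    ?subject luc:term "arts AND crafts"; xl:prefLabel [dct:language gvp_lang:en; xl:literalForm ?term]
}
```

Using a single-quoted SPARQL literal for the phrase search avoids having to escape the inner double quotes.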

