Search on individual words + multi word on the beginning of label ?

Thomas Francart

Apr 26, 2018, 7:23:04 AM
to Skosmos Users
Hello

I happily configured search indexes as described at https://github.com/NatLibFi/Skosmos/wiki/TextAnalysisConfiguration to search on individual words anywhere in the labels.

So a search on "orientation" matches "accompagnement à l'orientation", and that is good.

However, doing so I lose the ability to search on more than one word. A search on "accompagnement à" does not match "accompagnement à l'orientation" and, more strangely, a search on "accompagnement à l'orientation" (the exact same label) does not match either.

Could I configure the search index so that individual words are split _and_ the full label is also inserted in the index as a single token, so that a multi-word search on the beginning of the label, or on the exact label, would work?

I copy my search config below.

Cheers
Thomas

<#indexLucene> a text:TextIndexLucene ;
    text:queryParser text:AnalyzingQueryParser;
    text:directory <file:/var/lib/fuseki/lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    text:storeValues true ; ## required for Skosmos 1.4
    .

# Text index configuration for Skosmos 1.4 and above (requires Fuseki 1.3.0+ or 2.3.0+)
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ; ## enable graph-specific indexing
    text:defaultField     "pref" ; ## Must be defined in the text:map
    text:uidField         "uid" ; ## recommended for Skosmos 1.4+
    text:langField        "lang" ; ## required for Skosmos 1.4
    text:map (
         # skos:prefLabel
         [ text:field "pref" ;
           text:predicate skos:prefLabel ;
           text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer text:LetterTokenizer ;
                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ]
         ]
         # skos:altLabel
         [ text:field "alt" ;
           text:predicate skos:altLabel ;
           text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer text:LetterTokenizer ;
                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ]
         ]
         # skos:hiddenLabel
         [ text:field "hidden" ;
           text:predicate skos:hiddenLabel ;
           text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer text:LetterTokenizer ;
                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ]
         ]
     ) .
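
To make the behaviour described above easier to reason about, here is a minimal Python sketch that approximates the analysis chain in this configuration (LetterTokenizer, then ASCIIFoldingFilter and LowerCaseFilter). It is only an approximation written for illustration, not Lucene's implementation, and the helper names `fold_ascii` and `letter_analyze` are made up here. The last check assumes that a multi-word query reaches the index as one single term, which is not confirmed anywhere in this thread.

```python
import re
import unicodedata


def fold_ascii(text: str) -> str:
    """Approximate ASCIIFoldingFilter: strip combining accents (à -> a)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_analyze(text: str) -> list[str]:
    """Approximate LetterTokenizer + ASCIIFoldingFilter + LowerCaseFilter.

    Tokens are maximal runs of letters; everything else (spaces,
    apostrophes, digits) acts as a separator.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    return [fold_ascii(tok).lower() for tok in tokens]


label = "accompagnement à l'orientation"
tokens = letter_analyze(label)
print(tokens)  # ['accompagnement', 'a', 'l', 'orientation']

# A one-word query matches because it equals one of the indexed tokens.
print("orientation" in tokens)  # True

# ASSUMPTION: the multi-word query arrives as ONE term. Tokens never
# contain spaces, so no indexed token can equal it.
print("accompagnement a" in tokens)  # False
```

If that assumption holds, it would explain why even the exact label finds nothing: the label is never stored as a single term.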


--

Thomas Francart - SPARNA
Web de données | Architecture de l'information | Accès aux connaissances
blog : blog.sparna.fr, site : sparna.fr, linkedin : fr.linkedin.com/in/thomasfrancart
tel : +33 (0)6.71.11.25.97, skype : francartthomas

Bruno P. Kinoshita

Apr 26, 2018, 10:05:18 PM
to Thomas Francart, Skosmos Users
Hi Thomas,

Could it be that "à" is being transformed into "a" due to the ASCIIFoldingFilter?

Even if that's the case, and you still get no hits for

accompagnement a


then perhaps your queries are using some different analyzer? The way I troubleshot queries against JenaText recently was by looking directly at the Lucene index that was created.

For that I opened the directory in my JenaText configuration for Lucene (/var/lib/fuseki/lucene in your case I think) with Luke.

Then I tested the filters/analyzers and the query analyzer to confirm that everything was working all right at the Lucene/search level.

Hope that helps,
Bruno





Thomas Francart

Apr 27, 2018, 4:12:28 AM
to Bruno P. Kinoshita, Skosmos Users
Hello Bruno

Thanks for your answer

2018-04-27 4:05 GMT+02:00 Bruno P. Kinoshita <brunod...@yahoo.com.br>:
> Hi Thomas,
>
> Could it be that "à" is being transformed into "a" due to the ASCIIFoldingFilter?

Yes, it is being transformed, but this is intended! (see https://github.com/NatLibFi/Skosmos/wiki/TextAnalysisConfiguration#accent-folding-ie-matching-regardless-of-diacritics). Searching regardless of diacritics works fine.

> Even if that's the case, and you still get no hits for
>
> accompagnement a

I still get no hits for "accompagnement a". With or without diacritics makes no difference; it is the whitespace that makes the difference.

> then perhaps your queries are using some different analyzer?

See my config in the previous mail. The query is using an "AnalyzingQueryParser", as described in the documentation.

> The way I troubleshot queries against JenaText recently was by looking directly at the Lucene index that was created.
>
> For that I opened the directory in my JenaText configuration for Lucene (/var/lib/fuseki/lucene in your case I think) with Luke.
>
> Then I tested the filters/analyzers and the query analyzer to confirm that everything was working all right at the Lucene/search level.

Thanks for the tip on how to debug the content of the index.
Actually, I could rephrase my question like so : "Is there a way to combine a LetterTokenizer that splits on individual words, _plus_ a KeywordTokenizer that would keep the full label untouched, and have both individual words and full labels stored in the index ?"

I have tried duplicating the entries in the EntityMap so that skos:prefLabel and other properties are duplicated, once with LetterTokenizer and once with KeywordTokenizer :

# Text index configuration for Skosmos 1.4 and above (requires Fuseki 1.3.0+ or 2.3.0+)
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ; ## enable graph-specific indexing
    text:defaultField     "pref" ; ## Must be defined in the text:map
    text:uidField         "uid" ; ## recommended for Skosmos 1.4+
    text:langField        "lang" ; ## required for Skosmos 1.4
    text:map (
         # skos:prefLabel
         [ text:field "pref" ;
           text:predicate skos:prefLabel ;
           text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer text:LetterTokenizer ;
                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ]
         ]
         # skos:prefLabel with KeywordTokenizer
         [ text:field "pref" ;
           text:predicate skos:prefLabel ;
           text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer text:KeywordTokenizer ;
                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ]
         ]
         # same with skos:altLabel and skos:hiddenLabel
     ) .
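
The outcome this duplicated mapping is aiming for can be pictured with a small Python sketch: index both the per-word tokens and the whole folded label as one extra token, then test whether a query is a prefix of any indexed term. This illustrates the intended result only. It does not show that jena-text accepts two entries for the same field, and the function names (`normalize`, `combined_terms`, `prefix_match`) are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Approximate ASCIIFoldingFilter + LowerCaseFilter."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def combined_terms(label: str) -> set[str]:
    """Terms from a LetterTokenizer-like split plus one KeywordTokenizer-like term."""
    whole = normalize(label)
    words = set(re.findall(r"[^\W\d_]+", whole))
    return words | {whole}


def prefix_match(query: str, terms: set[str]) -> bool:
    """True if the normalized query is a prefix of any indexed term."""
    q = normalize(query)
    return any(term.startswith(q) for term in terms)


terms = combined_terms("accompagnement à l'orientation")
print(prefix_match("orientation", terms))                     # True (word token)
print(prefix_match("accompagnement à", terms))                # True (whole-label token)
print(prefix_match("accompagnement à l'orientation", terms))  # True (exact label)
print(prefix_match("à l'orientation", terms))                 # False (not at the start)
```

In this model a multi-word search only succeeds at the beginning of the label, which is the behaviour asked for in the original question.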

 
I have restarted and reloaded the data, with no errors. I suspected that keeping an AnalyzingQueryParser at the query level was still preventing multi-word queries from working, but even commenting out the AnalyzingQueryParser does not help.
Maybe the way Skosmos builds the search string in the SPARQL query does not allow multi-word queries to work?

Any idea or comment on the approach above ?

Cheers
Thomas


Bruno P. Kinoshita

Apr 27, 2018, 4:20:14 AM
to Thomas Francart, Skosmos Users


I'm not sure I can be of much more help, as I don't have much experience with Skosmos + JenaText. But I included my configuration in the comments of this pull request; perhaps it will be of some help?

https://github.com/apache/jena/pull/395/

I used the LowerCaseTokenizer, and explicitly defined analyzer and queryAnalyzer, and no analyzers in the terms (simply because I read in the docs somewhere that one analyzer would be reused somewhere... but I wanted to confirm how the code worked and didn't have time).
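
Why it can matter to pair the index analyzer explicitly with a query analyzer can be shown with a toy Python model: a term is only found if the query-side analysis produces exactly the form that the index-side analysis stored. The two analyzer functions below are simplified stand-ins invented for this sketch, not Lucene or jena-text code.

```python
import unicodedata


def index_analyzer(text: str) -> list[str]:
    """Stand-in index-time analysis: fold accents, lowercase, split on whitespace."""
    folded = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return folded.lower().split()


def raw_query_analyzer(text: str) -> list[str]:
    """Stand-in for a mismatched query-time analysis: whitespace split only."""
    return text.split()


indexed = set(index_analyzer("Accès aux connaissances"))


def all_terms_found(query_terms: list[str]) -> bool:
    """True if every analyzed query term is present in the toy index."""
    return all(term in indexed for term in query_terms)


# Same analysis on both sides: the query terms line up with the index.
print(all_terms_found(index_analyzer("Accès")))      # True
# Different analysis on the query side: "Accès" != "acces", so no hit.
print(all_terms_found(raw_query_analyzer("Accès")))  # False
```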

The only other suggestion I have, I think, is to try to debug the problem in parts, starting from Lucene. If you are able to open the Lucene index in Luke, then a simple query will show whether the tokenizer/analyzers are working OK. If so, my next try would be a simple Jena Text query (i.e. some SPARQL-fu to craft a query with text:query).

Only then would I dig into what's going on in Skosmos. For my pull request above, I made a few mistakes in the assembler configuration (bad combination of some parameters, misspellings, etc.), so doing it that way saved me some time. Of course I also had Eclipse with Jena, and used some breakpoints to see what was happening in the code... but that's not really necessary.
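
As a sketch of the middle step (a plain Jena Text query, bypassing Skosmos), the following Python builds SPARQL queries that use the `text:query` property function, so they can be pasted into the Fuseki UI or sent with `run_query`. The endpoint URL and the helper names are assumptions made for this example; adjust them to your own dataset. No results are shown because they depend entirely on the index.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace with the SPARQL endpoint of your own Fuseki dataset.
ENDPOINT = "http://localhost:3030/skosmos/sparql"


def build_text_query(search: str, limit: int = 10) -> str:
    """Build a SPARQL query that calls the jena-text text:query property function."""
    escaped = search.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX text: <http://jena.apache.org/text#>\n"
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept ?label WHERE {\n"
        f'  ?concept text:query (skos:prefLabel "{escaped}" {limit}) ;\n'
        "           skos:prefLabel ?label .\n"
        "}"
    )


def run_query(query: str, endpoint: str = ENDPOINT) -> dict:
    """POST the query using the SPARQL protocol and return the JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=data, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Print the queries to try, from the working one-word case to the multi-word cases.
for search in ["orientation", "accompagnement a", "accompagnement a*"]:
    print(build_text_query(search))
    print()
```

Comparing what these queries return with what the Skosmos search box returns for the same input should show whether the problem lies in the index or in the query string Skosmos generates.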

Hope that helps,
Bruno



Thomas Francart

Jan 11, 2021, 10:51:22 AM
to Bruno P. Kinoshita, Skosmos Users
Hello

I know there has been some progress on full-text search at the Jena level in recent months, so I am reviving this two-year-old thread, as I am facing the same issue: can I configure a search index so that individual words are split _and_ I can do a multi-word search on the beginning of the label, or on the exact label? Currently, as soon as you type a space in the search box, you don't get any results.

Thanks
Thomas

PBS Stac

Jan 12, 2021, 8:13:36 AM
to Thomas Francart, Bruno P. Kinoshita, Skosmos Users
Thomas, thanks for posting this. We have the same problem when searching for terms in buildvoc. When we search for ‘alarm’, no results are found, but when we search for ‘fire alarm systems’ (http://buildvoc.co.uk/resource/c_7804ad6c), results are found.

Any improvements to search would be greatly appreciated and well received.
Phil



Osma Suominen

Feb 17, 2021, 10:21:10 AM
to skosmo...@googlegroups.com
Hello Phil and Thomas,

Skosmos relies on the jena-text index so how the search works depends a
lot on the index configuration. Have you checked out the configurations
here in the wiki:
https://github.com/NatLibFi/Skosmos/wiki/TextAnalysisConfiguration

There are recipes for configuring the index so it matches individual words.

Please report back if you can get it working the way you like! This is
something we are also interested in for Finto - the default way is OK
for traditional topic-oriented thesauri (at least has been so far) but
matching individual words would be better for e.g. name authority lists
such as the recently published KANTO/FINAF: http://finto.fi/finaf/en/

-Osma

PS. Sorry for the long wait...

PBS Stac wrote on 12.1.2021 at 15.13:
> Thomas, thanks for posting this. We have the same problem when searching
> for terms in buildvoc. When we search for ‘alarm’, no results are found,
> but when we search for ‘fire alarm systems’
> (http://buildvoc.co.uk/resource/c_7804ad6c), results are found.
>
> Any improvements to search would be greatly appreciated and well
> received.
> Phil

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi