Search using diacritics does not match word with diacritics

Thomas Francart

Nov 14, 2017, 8:13:07 AM
to Skosmos Users
Hello

As suggested in the (excellent) documentation at https://github.com/NatLibFi/Skosmos/wiki/TextAnalysisConfiguration, I am using this Jena analyzer configuration to perform diacritic-insensitive searches:

text:analyzer [
    a text:ConfigurableAnalyzer ;
    text:tokenizer text:LetterTokenizer ;
    text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
]


With this configuration, a search for "deja" does match "déjà"; however, a search for "déjà" does _not_ match "déjà".
It looks like the same analyzer is not used at index time and at query time (?). Or does the trailing "*" character in JenaText SPARQL queries prevent the analyzer from being used at query time?

Is it possible to have both "deja" _and_ "déjà" match "déjà"?

Thanks
Thomas

--

Thomas Francart - SPARNA
Web de données | Architecture de l'information | Accès aux connaissances
blog : blog.sparna.fr, site : sparna.fr, linkedin : fr.linkedin.com/in/thomasfrancart
tel : +33 (0)6.71.11.25.97, skype : francartthomas

Osma Suominen

Nov 14, 2017, 9:53:50 AM
to skosmo...@googlegroups.com
Hi Thomas!

Have you set

<#indexLucene> a text:TextIndexLucene ;
    text:queryParser text:AnalyzingQueryParser ;

as suggested on the TextAnalysisConfiguration wiki page, right above the
ConfigurableAnalyzer configuration you use? It's crucial, because
without this setting the trailing * in queries (which Skosmos adds
implicitly) causes jena-text to skip the analyzer. That would
explain your problem.

-Osma
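
To make the behaviour described above concrete, here is a toy Python model. It is not jena-text or Lucene code: it simply assumes that the index stores folded, lowercased terms and that a query ending in * is matched as a raw prefix unless the query parser analyzes it first. The names fold, analyze and prefix_search are made up for this sketch, and the folding only approximates text:ASCIIFoldingFilter (it strips combining marks and covers fewer characters).

```python
import unicodedata


def fold(token):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def analyze(text):
    """Toy analyzer chain: split on non-letters, fold diacritics, lowercase."""
    tokens, current = [], []
    for ch in unicodedata.normalize("NFC", text):
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [fold(t).lower() for t in tokens]


def prefix_search(index_terms, query, analyze_query):
    """Match a single-word query with a trailing "*" against indexed terms."""
    prefix = query.rstrip("*")
    if analyze_query:
        # Assumed AnalyzingQueryParser-like behaviour: analyze the prefix too.
        analyzed = analyze(prefix)
        prefix = analyzed[0] if analyzed else ""
    return sorted(t for t in index_terms if t.startswith(prefix))


# Index time: the analyzer always runs, so "déjà" is stored as "deja".
index_terms = set(analyze("déjà vu"))

for query in ("deja*", "déjà*"):
    print(query,
          prefix_search(index_terms, query, analyze_query=False),
          prefix_search(index_terms, query, analyze_query=True))
```

In this model the unanalyzed query "déjà*" finds nothing, because the raw prefix "déjà" is compared against the stored term "deja", while "deja*" matches either way. Once the prefix is analyzed as well, both queries return the same term.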


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Osma Suominen

Nov 14, 2017, 10:00:24 AM
to skosmo...@googlegroups.com
I updated the wiki to make this a bit more explicit in the last section
which uses LetterTokenizer.

-Osma

Thomas Francart

Nov 14, 2017, 3:47:15 PM
to Osma Suominen, Skosmos Users
That was indeed my problem!
Works like a charm now.

Thanks for the help


"Dan Michael O. Heggø"

Nov 16, 2017, 5:25:12 AM
to Skosmos Users
Cool, I didn't know this could be configured. Is it also possible to use
a ConfigurableAnalyzer setup to normalize some diacritics, but not all
non-ASCII characters? For instance, in Norwegian it's important not to
normalize Å to A :) What configuration do you use at finto.fi?

Dan Michael

Osma Suominen

Nov 16, 2017, 5:39:25 AM
to skosmo...@googlegroups.com
We don't use this feature at Finto.fi, for the same reason you mention:
we would need to leave some diacritics such as ÅÄÖ alone while folding
others, and AFAIK no Lucene analyzer can currently do that.

It shouldn't be hard to implement, but it would have to be done either in
Jena (which already has some special Lucene analyzers, some of them
written by yours truly...) or directly in Lucene. If you want to try, I
can help ;)

-Osma
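
As a rough illustration of the selective folding discussed above, here is a short Python sketch. It is not a Lucene or Jena analyzer, only the character-level idea: fold diacritics everywhere except for a configurable set of letters. The PRESERVE set and the fold_selectively name are hypothetical, and a real implementation would have to be written as a Lucene token filter.

```python
import unicodedata

# Hypothetical set of letters to leave untouched; adjust per language.
PRESERVE = frozenset("ÅÄÖåäö")


def fold_selectively(text, preserve=PRESERVE):
    """Strip combining marks from every character except those in `preserve`."""
    out = []
    # Normalize to NFC first so that decomposed input such as "A" + U+030A
    # is recognised as the single preserved letter "Å".
    for ch in unicodedata.normalize("NFC", text):
        if ch in preserve:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


print(fold_selectively("déjà"))
print(fold_selectively("Åse Ärger Öl"))
```

With the default set, "déjà" becomes "deja" while "Åse Ärger Öl" is returned unchanged. Letters without a canonical decomposition, such as Ø and Æ, are also left as they are by this approach.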

