Facet filter matching fails for diacritics (â, special chars) in sidebar

Muhep Atasoy

2:50 AM (11 hours ago)
to DSpace Technical Support

Hi all,

I'm running into a problem with sidebar facet filtering (browse-by-value / filter list) in a DSpace 8.x (Solr-backed) installation. The UI displays facet values correctly (with diacritics), but the small search box in the sidebar facet behaves like a strict exact match: when the user types an ASCII version of a value (without diacritics), facet values that differ only by diacritics do not match. Example:

  • Stored/display value: Hamzazâde Esad

  • If the user types hamzazade in the facet search box, it does not return the expected facet value or matching results.

What I found and tried

  • The dynamic field for sidebar facets is *_filter:

<dynamicField name="*_filter" type="keywordFilter" indexed="true" stored="true" multiValued="true" omitNorms="true" />

  • The original keywordFilter fieldType:

<fieldType name="keywordFilter" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
    </analyzer>
</fieldType>

This keeps the stored/display value exact (good), but facet search behaves like an exact match because the only indexed token is the full original string.

  • I attempted to add Turkish/ICU folding to the analyzer. When I added folding to the index analyzer, the displayed facet strings came back lowercased and stripped of diacritics (bad for presentation). So I tried to split the behavior between the index and query analyzers:

<fieldType name="keywordFilter" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

This leaves the displayed values untouched, and query normalization runs, but the problematic letter â (and some other special characters) still does not reliably match when the user types ASCII a. I tested the analyzer output on the Solr Admin Analysis page: the query token for kâmil becomes kamil, but the facets still don't return the expected values.
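To make the asymmetry in the split configuration concrete, here is a minimal Python sketch that imitates the two analyzer chains and compares their tokens. It is an approximation under stated assumptions: `fold` is a rough stand-in for TurkishLowerCaseFilter plus ICUFoldingFilter (it is neither Turkish-aware nor the real ICU algorithm), and the function names are hypothetical, not Solr or DSpace APIs.

```python
import unicodedata

def fold(text: str) -> str:
    """Rough stand-in for lowercasing + ICU folding: drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_tokens(value: str) -> set[str]:
    # KeywordTokenizer + TrimFilter: the whole trimmed value is one token.
    return {value.strip()}

def query_tokens(text: str, min_gram: int = 3, max_gram: int = 15) -> set[str]:
    # KeywordTokenizer, then fold, then edge n-grams of the single token.
    folded = fold(text)
    return {folded[:n] for n in range(min_gram, min(max_gram, len(folded)) + 1)}

stored = "Hamzazâde Esad"
typed = "hamzazade"

# The index side keeps "Hamzazâde Esad" unfolded, so none of the folded
# query tokens can be equal to it.
print(index_tokens(stored) & query_tokens(typed))
print(fold(stored))
```

In this simplified model the intersection is empty: the query side is folded and n-grammed while the index side still holds the original, unfolded string, so normalizing only the query cannot produce an equal term.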

Important constraints:

  • DSpace code (the UI / server) expects author_filter-style fields, and currently both search and display use the same *_filter field for facets. I cannot easily change DSpace to use a separate *_search field.

  • I cannot use WhitespaceTokenizer at query time because the index uses KeywordTokenizer (index tokens are whole values), and the mismatch produces no hits.
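The tokenizer mismatch in the second constraint can be illustrated with a short Python sketch. The two functions are simplified, hypothetical imitations of KeywordTokenizer and WhitespaceTokenizer, assuming plain term-equality matching:

```python
def keyword_tokenize(value: str) -> list[str]:
    # Like KeywordTokenizer: the entire input is a single token.
    return [value]

def whitespace_tokenize(value: str) -> list[str]:
    # Like WhitespaceTokenizer: split on runs of whitespace.
    return value.split()

indexed = keyword_tokenize("Hamzazâde Esad")
queried = whitespace_tokenize("Hamzazâde Esad")

# A query term only hits if it equals an indexed term exactly.
matches = [term for term in queried if term in indexed]
print(matches)
```

Even with identical input, the query side yields two short terms while the index holds one long term, so no term is equal on both sides.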

Questions / requests for help

  1. Does the DSpace sidebar facet search use the same keywordFilter fieldType defined above, or does DSpace apply additional query-time processing before facet matching? (Which fieldType or query parameter does the small facet search box use?)

  2. Am I missing a Solr parameter that controls how facet search (the small value search) is executed so I can inject folding/normalization? (e.g. use of facet.contains, facet.prefix, facet.method, facet.contains.ignoreCase or special params?)

  3. Has anyone solved diacritics matching in the sidebar facets without losing the displayed original strings? Which pattern is considered best practice: copyField, a multi-field approach, a mapping char filter, or a client-side solution?

  4. If multi-field (search+display) is the recommended approach but DSpace insists on *_filter, is there a recommended DSpace config or XSL/template hook to let facet UI show stored display value while facet search works against a different indexed token?
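For reference, the facet parameters named in question 2 would be passed to Solr roughly as in the Python sketch below. The core name `search`, the field `author_filter`, and the typed value are assumptions for illustration; whether DSpace actually sends these parameters is exactly what the question asks.

```python
from urllib.parse import urlencode

# Per-field facet parameters use the f.<field>.<param> form.
params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("facet", "true"),
    ("facet.field", "author_filter"),
    ("f.author_filter.facet.contains", "hamzazade"),
    ("f.author_filter.facet.contains.ignoreCase", "true"),
]

query_string = urlencode(params)
# Hypothetical core name; adjust to the actual Solr core.
print("/solr/search/select?" + query_string)
```

As far as I understand the Solr documentation, facet.contains and facet.prefix are plain string tests against the indexed terms, and facet.contains.ignoreCase only relaxes case, so none of them would fold â to a on their own.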

What I can provide if helpful:

  • sample schema.xml snippets

  • Solr analysis outputs (index vs query) for sample values like kâmil, kâmil\n|||\nKâmil etc.

  • steps I used to test in Solr Admin (analysis page) and example queries that fail.

Thanks in advance. Any pointers, config snippets, or DSpace-specific guidance would be appreciated.

Muhep
