[nrs4lib] support for icu's Normalizer2 added to lucene

Robert Muir

Apr 17, 2010, 1:54:14 PM
to nonromans...@googlegroups.com
Hey guys,

Just wanted to let you know that today I added support for ICU's new normalization architecture (Normalizer2) to Lucene as a TokenFilter.
You can find more details at https://issues.apache.org/jira/browse/LUCENE-2399

In short, you can simply use this filter instead of LowerCaseFilter and get much better international behavior: width differences such as CJK full-width numerics are folded, and case folding also handles cases like German sharp-S and Greek final sigma. This is because the default implements the new "NFKC_Casefold" normalization form.
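To make the folding behavior concrete, here is a minimal sketch in Python using only the standard library. It is not the Lucene filter and not ICU's NFKC_Casefold data. It is an approximation (NFKC, then case folding, then NFKC again), and it does not remove default-ignorable code points the way the real form does. The function name `approx_nfkc_casefold` is made up for illustration.

```python
import unicodedata


def approx_nfkc_casefold(text: str) -> str:
    """Approximate NFKC_Casefold with stdlib calls (illustrative only).

    Assumption: NFKC + str.casefold() + NFKC is close enough to show the
    idea. Unlike the real NFKC_Casefold, this does NOT strip
    default-ignorable code points such as the soft hyphen.
    """
    # Compatibility-normalize first, so full-width forms become ASCII.
    compat = unicodedata.normalize("NFKC", text)
    # Full Unicode case folding (sharp-S -> "ss", final sigma -> sigma).
    folded = compat.casefold()
    # Case folding can produce sequences that are no longer normalized.
    return unicodedata.normalize("NFKC", folded)


if __name__ == "__main__":
    print(approx_nfkc_casefold("１２３"))   # full-width digits
    print(approx_nfkc_casefold("Straße"))  # German sharp-S
    print(approx_nfkc_casefold("ΣΑΣ") == approx_nfkc_casefold("σας"))
```

A plain lowercase pass would leave the full-width digits and the sharp-S untouched, so "Straße" and "STRASSE" would not match. That is the gap the filter is meant to close.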

The next step, I think, is to add support for custom normalization, particularly for folks who want to do 'search term folding' in an international fashion, such as ignoring accents and differences such as Traditional/Simplified Chinese.
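As a rough illustration of what "ignoring accents" means, the following Python sketch decomposes text, drops combining marks, and recomposes. This is only one kind of folding, done with the standard library. It is not the custom normalization data file discussed in this thread, and the function name `fold_accents` is hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks so accented and unaccented forms compare equal.

    Illustrative sketch only: a real search-term folding would cover many
    more cases than canonical decomposition can reach.
    """
    # Canonical decomposition splits "é" into "e" + COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(fold_accents("café"))
    print(fold_accents("Zürich"))
```

Letters that have no canonical decomposition (for example "ø" or "ł") pass through unchanged here, which is one reason a dedicated folding table is attractive.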

This has been brought up in a Lucene issue, https://issues.apache.org/jira/browse/LUCENE-1343, and I think we can implement this nicely by producing a custom normalization data file that supports all the foldings in UTR#30.

While UTR#30 is a withdrawn report, it's basically the only one that attempts to address normalization for search term folding, so I think it would be a great way to implement that idea. You can see more details at http://www.unicode.org/reports/tr30/tr30-4.html

If you guys have any recommendations beyond this, please let me know! But I think this is a clear win: it's much simpler than having a bunch of separate filters with their own hairy implementations. You can always make your own custom mappings if you want, but I think NFKC + CaseFold + Identifier Ignorable as a default, with UTR#30 as an additional option, will suit most people well.

Once LUCENE-1343 is resolved I will be looking at exposing this in Solr.

--
Robert Muir
rcm...@gmail.com

Rob Casson

Apr 17, 2010, 3:10:52 PM
to nonromans...@googlegroups.com
very nice! the ICU stuff has always seemed like some really powerful
magic, so hooking it to lucene/solr is very exciting.

thanks,
rob

Daniel Lovins

Apr 19, 2010, 11:35:06 AM
to nonromans...@googlegroups.com
This looks really promising, Robert. Thanks for sharing the news with
the nrs4lib list.

Also, thanks for citing UTR #30. I read through the document this
morning, and, even though it's been withdrawn, it's still a great
analysis and comparison of character folding, normalization, and
Unicode Collation Algorithm techniques.

/ Daniel

Robert Muir

Apr 19, 2010, 12:20:13 PM
to nonromans...@googlegroups.com
On Mon, Apr 19, 2010 at 11:35 AM, Daniel Lovins <daniel...@gmail.com> wrote:
> I read through the document this
> morning, and, even though it's been withdrawn, it's still a great
> analysis and comparison of character folding, normalization, and
> unicode collation algorithm techniques.

I agree, it's good to address "search term folding", and I also agree this can't really be an official standard, since a lot depends on your application. The document is still useful.

FYI: I implemented this as a custom Unicode normalization form, available at https://issues.apache.org/jira/browse/LUCENE-1343

I did omit Traditional-Simplified and Katakana-Hiragana. Not that they couldn't be done, but I think it's just more appropriate to use Transliterator for these, since it can better take context into account if you want this behavior.
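For readers unfamiliar with the distinction, here is a deliberately naive, context-free Katakana-to-Hiragana mapping in Python. It is a sketch of the kind of one-to-one table a normalization form could hold, not ICU's Transliterator and not anything from the Lucene patch. The function name and constants are hypothetical, chosen for illustration.

```python
# Katakana letters U+30A1..U+30F6 sit exactly 0x60 above their Hiragana
# counterparts U+3041..U+3096, so a fixed offset covers the basic letters.
KATAKANA_FIRST = 0x30A1  # ァ
KATAKANA_LAST = 0x30F6   # ヶ
KANA_OFFSET = 0x60


def naive_katakana_to_hiragana(text: str) -> str:
    """Map each Katakana letter to Hiragana, one code point at a time.

    Context-free by construction: characters outside the range, such as
    the prolonged sound mark U+30FC, are passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if KATAKANA_FIRST <= cp <= KATAKANA_LAST:
            out.append(chr(cp - KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(naive_katakana_to_hiragana("カタカナ"))
    print(naive_katakana_to_hiragana("ラーメン"))
```

A per-character table like this cannot look at neighboring characters, which is the limitation that makes a rule-based tool a better fit when context matters.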


--
Robert Muir
rcm...@gmail.com