Hey guys,
Just wanted to let you know: today I added support for ICU's new normalization architecture (Normalizer2) as a TokenFilter in Lucene.
In short, you can simply use this filter instead of LowerCaseFilter and get much better international behavior: width differences such as CJK full-width numerics are folded, in addition to case differences like German sharp-S and Greek final sigma. This is because the default normalizer implements the new "NFKC_Casefold" form.
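For example, wiring it into an analyzer in place of LowerCaseFilter might look like this (a minimal sketch: the class name ICUNormalizer2Filter and its no-arg default of nfkc_cf are assumptions based on the description above, and WhitespaceTokenizer is just a stand-in for whatever tokenizer you use):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class FoldingAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String field, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        // previously: stream = new LowerCaseFilter(stream);
        // the no-arg constructor is assumed to default to NFKC_Casefold
        return new ICUNormalizer2Filter(stream);
      }
    }

The explicit equivalent would be to pass the ICU normalizer yourself, e.g. new ICUNormalizer2Filter(stream, Normalizer2.getInstance(null, "nfkc_cf", Normalizer2.Mode.COMPOSE)).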
The next step, I think, is to add support for custom normalization, particularly for folks who want to do 'search term folding' in an international fashion, e.g. ignoring accents or Traditional/Simplified Chinese differences.
While UTR#30 is a withdrawn report, it's basically the only one that attempts to address normalization for search term folding, so I think it would be a great way to implement that idea. You can see more details at
http://www.unicode.org/reports/tr30/tr30-4.html
If you guys have any recommendations beyond this, please let me know! But I think this is a clear win: it's much simpler than having a bunch of separate filters with their own hairy implementations. You can always make your own custom mappings if you want, but I think NFKC + CaseFold + Identifier Ignorable as a default, with UTR#30 as an additional option, will suit most people well.
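To make the custom-mapping idea concrete: ICU lets you compile your own folding rules into a binary data file with its gennorm2 tool and load them through the same Normalizer2 API. A rough sketch under that assumption, where utr30.nrm is a hypothetical data file you would build yourself from UTR#30-style foldings:

    import java.io.InputStream;
    import com.ibm.icu.text.Normalizer2;
    import org.apache.lucene.analysis.TokenStream;

    public class CustomFolding {
      // "utr30.nrm" is hypothetical: custom rules (accent removal, etc.)
      // compiled with ICU's gennorm2 tool and shipped on the classpath.
      static Normalizer2 utr30Normalizer() {
        InputStream data = CustomFolding.class.getResourceAsStream("utr30.nrm");
        return Normalizer2.getInstance(data, "utr30", Normalizer2.Mode.COMPOSE);
      }

      // plug the custom normalizer into the same filter as before
      // (ICUNormalizer2Filter is the assumed class name from above)
      static TokenStream fold(TokenStream stream) {
        return new ICUNormalizer2Filter(stream, utr30Normalizer());
      }
    }

This way the 'additional option' is just a different data file, not a separate filter implementation.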
Once LUCENE-1343 is resolved I will be looking at exposing this in Solr.
--
Robert Muir
rcm...@gmail.com