Hi Dan,
We needed to normalize Unicode strings before
indexing because our data providers use both
forms of diacritics: combining sequences (like e followed by
a combining ^) and precomposed characters (like ê).
More on Unicode normalization:
http://en.wikipedia.org/wiki/Unicode_equivalence

To accomplish that I wrote a "UnicodeNormalizationFilter.java"
(attached) and applied it just after LowerCaseFilter in
XTFTextAnalyzer.
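
A minimal sketch of the normalization step, assuming NFC as the target form (the attached filter may use a different form). The class and method names are illustrative and not taken from the attachment or from XTF. It relies only on java.text.Normalizer and leaves out the Lucene TokenFilter wiring:

```java
import java.text.Normalizer;

// Hypothetical helper, not the attached UnicodeNormalizationFilter.java.
class NormalizationSketch {

    // Returns the text in NFC (an assumed choice of form), so that combining
    // sequences and precomposed characters end up as the same indexed term.
    static String normalize(String text) {
        // Skip the conversion when the text is already in the target form.
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text;
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String combining = "e\u0302";   // 'e' followed by U+0302 COMBINING CIRCUMFLEX ACCENT
        String precomposed = "\u00EA";  // U+00EA, precomposed e with circumflex
        System.out.println(normalize(combining).equals(precomposed)); // prints "true"
    }
}
```

In a token filter, a method like this would be applied to each token's text before the token is passed on.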
In case you want to add it to XTF, there is one issue:
the java.text.Normalizer class is available only
since Java 1.6.
Cheers,
Marcos Fragomeni
Systems Analyst
Federal Senate of Brazil