I strongly recommend going to the ICU stuff - you'll get top notch support from the Lucene community should it not live up to your needs.
How about someone takes some of your non-English text examples and runs them through Solr's analysis.jsp view using the UnicodeNormalizationFilter, then runs the same text through a Solr 3.x analyzer configured with ICU, to see what the diffs are, if any?
Michael - why go to 3.1 when 3.3 is now the latest? Just jump there. Use the ICU stuff. Then see if any users complain :)
Bob - thanks for your efforts with this normalization stuff over the years. Your contributions/feedback to the Lucene project factored into these improvements being made part of Lucene itself.
We still have some work to do to tie all this stuff together nicely out of the box with Solr, though. More on that in my next reply.
Solr (3.3 for example here) ships with apache-solr-analysis-extras-3.3.0.jar in the dist/ directory of the binary distro. This JAR file contains the Solr "factories" to wire Solr to the underlying Lucene libraries.
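To make the "factories" part concrete, here is a minimal sketch of what a schema.xml field type using them might look like. The field type name "text_icu" is made up for illustration, and the filter chain is only an assumption about what you'd want; adjust it to match whatever your current normalization does.

```xml
<!-- Hypothetical field type; "text_icu" is an illustrative name, not something Solr ships with. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization provided by the ICU contrib -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Normalization plus case and accent folding in one filter -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

If you only want normalization without the folding, solr.ICUNormalizer2FilterFactory is likely the closer replacement for what you have today.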
As Bob mentioned, you'll also need a couple of additional JAR files. These can be found in a binary distribution of Lucene (again, using 3.3 as an example), under contrib/icu. There's lucene-icu-3.3.0.jar (the actual analyzers that the above factories instantiate) and lib/icu4j-4_8.jar.
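One way to get those JARs onto Solr's classpath is a lib directive in solrconfig.xml. This is just a sketch: the directory below is a placeholder I made up, and it assumes you've copied all three JARs (the analysis-extras one plus the two from Lucene's contrib/icu) into it.

```xml
<!-- Placeholder path: point this at wherever you copied
     apache-solr-analysis-extras-3.3.0.jar, lucene-icu-3.3.0.jar and icu4j-4_8.jar -->
<lib dir="/path/to/icu-jars" regex=".*\.jar" />
```

Dropping the JARs into a lib/ directory under your Solr home should work as well, if you'd rather not touch solrconfig.xml.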
I strongly recommend keeping versions in sync. Solr and Lucene are versioned identically now, so just stick with the same 3.x release (again, 3.3 is recommended at this point) for both sides of things.