We should take the German and Chinese claims with a grain of salt, since the named entity recognizer is unlikely to
have been trained, used and tested as extensively as the English one. The quality of results will have to be evaluated
when we have a real corpus with either of these languages.
An interesting advantage for us is that since we're in the domain of email, we have accurate names for correspondents.
At a minimum, these can be used as a "gazetteer" of names to recognize, even if the named entity recognizer fails
or is non-existent for a language.
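
To make the gazetteer idea concrete, here is a minimal sketch in Python, not the project's actual code. It assumes correspondent names are available as raw From/To/Cc header strings. The function names `build_gazetteer` and `find_names`, the sample headers and the sample sentence are all made up for illustration.

```python
import re
from email.utils import getaddresses


def build_gazetteer(header_values):
    """Collect display names from raw From/To/Cc header values (hypothetical helper)."""
    names = set()
    for display_name, _address in getaddresses(header_values):
        # Normalize internal whitespace; skip addresses that carry no display name.
        name = " ".join(display_name.split())
        if name:
            names.add(name)
    return names


def find_names(text, gazetteer):
    """Return (matched_name, start_offset) for each gazetteer name found in text."""
    if not gazetteer:
        return []
    # Try longer names first so a full name wins over a shorter overlapping entry.
    alternatives = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]


# Illustrative data only, not taken from a real corpus.
headers = ['"Anna Schmidt" <anna@example.org>', "Wei Zhang <wei@example.org>"]
gazetteer = build_gazetteer(headers)
hits = find_names("Gestern habe ich mit Anna Schmidt und Wei Zhang gesprochen.", gazetteer)
```

Note that the word-boundary check relies on the text having separators around names. For a language like Chinese that is written without spaces, the lookarounds would likely have to be dropped or replaced with a proper segmenter, so treat this only as a starting point.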
Btw, Lucene analyzer language support means that a somewhat sensible analyzer (the part that does tokenization
and stemming) is available for that language. However, you should still be able to search for virtually all Unicode
strings -- for example, I can correctly search full words in Indic languages such as Kannada, which is not on the
Lucene analyzers page.
[s]