Erik pointed me at this link, and I just wanted to mention that I've
been working on integrating ICU and Lucene's analysis for a while.
I would love to know what challenges you guys are facing in the
library space, what features you want, what languages you care about,
basically any feedback I can get.
No code has been committed yet to support this in Lucene (except for
Unicode Collation), but I've been working this issue on/off for about
a year now. More details at https://issues.apache.org/jira/browse/LUCENE-1488
I've been rounding out the code in my local tree while I wait for the
ICU 4.4 release, as it's a goal of mine to have solid Unicode 5.2
support as a contrib module in the next release of Lucene...
Thanks for introducing yourself.
Here are some of the issues that we're studying at the Yale library:
- language identification at index and query time;
- language-specific tokenizing, stemming, relevancy ranking, spelling
suggestions, etc.;
- ability to detect and trigger right-to-left display of languages,
scripts, and page elements;
- mapping of "variant" Chinese characters (e.g., searching for Mao Zedong
in simplified characters 毛泽东, in traditional characters 毛澤東, or in
modern Kanji 毛沢東, and getting the same results).
One of our major constraints (or opportunities) is that the Yale
Library collection has over 8 million items in over 600 languages (and
potentially dozens of scripts). We know that we can't optimize
indexing and retrieval for every language at once, and we've
historically paid the most attention to what we call the JACKPHY
languages: Japanese, Arabic, Chinese, Korean, Persian, Hebrew, and
Yiddish, which are routinely assigned both original script and
Romanized metadata in our catalog. We're looking at supplying other
scripts, e.g., Cyrillic, systematically as well.
/ Daniel
On Mar 11, 12:55 pm, rcmuir <rcm...@gmail.com> wrote:
> Hello,
>
> Erik pointed me at this link, and I just wanted to mention that I've
> been working on integrating ICU and Lucene's analysis for a while.
>
> I would love to know what challenges you guys are facing in the
> library space, what features you want, what languages you care about,
> basically any feedback I can get.
>
> No code has been committed yet to support this in Lucene (except for
> Unicode Collation), but I've been working this issue on/off for about
> a year now. More details at https://issues.apache.org/jira/browse/LUCENE-1488
This is great feedback! Warning: very long post follows...
The more we know in the Lucene world, the more we can try to help.
I'll explain a little about the direction I've been heading with ICU
support; of course we can change this direction if it makes sense, so
don't read too much into it. It's mostly just me relying upon my own
experience, given the minimal feedback I've had from others so far.
In my opinion, "stemming and stopwords" is just a hack for improved
relevance: I would rather we not intentionally remove information, but
instead account for morphological variation at query-time, and have
support for scoring algorithms that don't need stopwords removal to
give good relevance.
Instead, I would rather the analysis part of lucene focus on just
breaking text into tokens (the "features" that should be indexed).
This removes the requirement for index-time language identification
and stemming.
I would rather that, at query time, query terms be expanded to handle
variation: for example, if you search on "walked", the search can be
expanded to include "walk". I am attempting to address this expansion
via other avenues, such as exploiting the morphological information
present in spell-checking dictionaries:
http://code.google.com/p/lucene-hunspell/. In the future I think we
can use this information, combined with a generative approach, to
produce a finite-state query that accounts for variation. I'm
relevance-testing some of these approaches on a number of languages,
but it's my understanding that this is what modern search engines such
as Google already do.
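To make the query-time expansion concrete, here is a rough sketch
using plain Lucene BooleanQuery/TermQuery. The expandTerm() lookup and
the "body" field name are made-up placeholders for whatever
morphological source (e.g. a hunspell dictionary) supplies the
variants; they are not existing Lucene APIs.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Hypothetical expansion: "walked" -> {"walked", "walk"}.
// expandTerm() stands in for a morphological lookup, it is not a real API.
BooleanQuery expanded = new BooleanQuery();
for (String variant : expandTerm("walked")) {
  // matching any one of the variants is enough (OR semantics)
  expanded.add(new TermQuery(new Term("body", variant)), Occur.SHOULD);
}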
In my opinion, the index-time tokenization can be mostly insensitive
to language by following UAX#29 (Unicode Text Segmentation). This
algorithm performs well for many languages, yet has problems for a
few: CJK, Thai, Lao, Myanmar, Khmer, and other languages that do not
have explicit word breaks.
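For reference, here is a minimal sketch of what UAX#29 word
segmentation looks like with ICU4J's BreakIterator. This is not the
tokenizer code from the patch, just an illustration of the algorithm
it follows: break on word boundaries and keep spans that contain at
least one letter or digit.

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

String text = "The quick (\"brown\") fox can't jump 32.3 feet.";
BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
words.setText(text);
int start = words.first();
for (int end = words.next(); end != BreakIterator.DONE;
     start = end, end = words.next()) {
  String candidate = text.substring(start, end);
  // drop whitespace/punctuation runs: keep only spans with a letter or digit
  boolean keep = false;
  for (int i = 0; i < candidate.length();
       i += Character.charCount(candidate.codePointAt(i))) {
    if (UCharacter.isLetterOrDigit(candidate.codePointAt(i))) {
      keep = true;
      break;
    }
  }
  if (keep) {
    System.out.println(candidate); // The, quick, brown, fox, can't, jump, 32.3, feet
  }
}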
The way I attempt to deal with this in the Lucene integration is to
use UAX#24 (Scripts) to first divide text at script boundaries, and
then tokenization takes place within an individual script run. This
way you can override the Unicode defaults: for example, Thai text is
automatically run through the dictionary-based Thai word segmentation
supplied by ICU. In the patch I've written several algorithms to
segment Khmer, Myanmar, and Lao into syllables, which, while perhaps
not as good as words, are still useful tokens. These can simply be
written as RBBI rules, or hopefully, in the future, ICU's
BreakIterator support improves and we can just steal more of their work.
In my opinion, such functionality belongs in ICU itself and we should
just make use of their improvements. For example, there exist tickets
for both Myanmar (http://bugs.icu-project.org/trac/ticket/6780) and
CJK (http://bugs.icu-project.org/trac/ticket/2229).
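In case it helps to picture the script-run idea, here is a rough
sketch using ICU's UScript property. It is not the actual code from
the patch, and the handling of COMMON/INHERITED characters is
deliberately simplified.

import com.ibm.icu.lang.UScript;

// Walk the string and split it wherever the UAX#24 script value changes,
// so a per-script tokenizer can then be chosen for each run.
String text = "latin ไทย 日本語";
int runStart = 0;
int runScript = UScript.getScript(text.codePointAt(0));
for (int i = 0; i < text.length();
     i += Character.charCount(text.codePointAt(i))) {
  int script = UScript.getScript(text.codePointAt(i));
  // COMMON (spaces, punctuation) and INHERITED (combining marks) are
  // folded into the surrounding run here; the real integration is more careful.
  if (script != runScript && script != UScript.COMMON
      && script != UScript.INHERITED) {
    System.out.println(UScript.getName(runScript) + ": "
        + text.substring(runStart, i));
    runStart = i;
    runScript = script;
  }
}
System.out.println(UScript.getName(runScript) + ": " + text.substring(runStart));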
As far as variation goes, I do feel this belongs in the analysis
pipeline. In the patch so far, case folding, Unicode normalization,
etc. are implemented. I also provide a TokenFilter that allows you to
easily use any Transliterator (e.g. "Traditional-Simplified") to deal
with some of the sorts of variation you might see. While it's nice
that Trad-Simp is built-in, I've had good luck writing my own rulesets
to manipulate text too.
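To give a flavor of the custom-ruleset side, here is a small sketch.
The rule text and the "Custom-Folding" ID are invented for
illustration (they fold Arabic teh marbuta and alef maksura to their
common search forms), but createFromRules() and the rule syntax are
standard ICU.

import com.ibm.icu.text.Transliterator;

// Compile a tiny hand-written ruleset at runtime.
Transliterator custom = Transliterator.createFromRules(
    "Custom-Folding",
    "\u0629 > \u0647;" +   // teh marbuta -> heh
    " \u0649 > \u064A;",   // alef maksura -> yeh
    Transliterator.FORWARD);

// Plug it into the same TokenFilter as any built-in transliterator.
TokenStream ts = new ICUTransformFilter(yourTokenStream, custom);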
I know this won't solve all, or even many, of your problems; it's
just an explanation of where I am now. I definitely want your ideas on
other (perhaps completely different) approaches to make this kind of
thing easier.
Certainly, if there is functionality exposed via Unicode properties
or ICU that I haven't thought of, I'm really excited to hear about it,
as I want to improve the "language-independent" case as much as
possible.
The Traditional-Simplified ruleset comes with ICU (it is maintained
by the Unicode CLDR project, along with many other built-ins).
So in Lucene, you would just do:
Transliterator tradSimp = Transliterator.getInstance("Traditional-Simplified");
TokenStream ts = new ICUTransformFilter(yourTokenStream, tradSimp);
In Solr, I would later propose a factory that lets you specify one of
these "system" transliterators, such as Traditional-Simplified, or
point to a text file containing your own custom rules, which I find
very useful as well.
The MusicBrainz project is using some of this functionality, for an
example see their analyzers here:
http://svn.musicbrainz.org/search_server/trunk/index/src/main/java/org/musicbrainz/search/analysis/
They are normalizing Katakana to Hiragana and applying
Traditional-Simplified:
StandardTokenizer tokenStream =
    new StandardTokenizer(Version.LUCENE_CURRENT, mappingCharFilter);
TokenStream result = new ICUTransformFilter(tokenStream,
    Transliterator.getInstance("[ー[:Script=Katakana:]]Katakana-Hiragana"));
result = new ICUTransformFilter(result,
    Transliterator.getInstance("Traditional-Simplified"));
The reason for the weird stuff in front of the Katakana one is just
to specify a UnicodeFilter. These filters are helpful for good
performance, so you don't waste time applying the rules to English or
Arabic data or something silly.
In many cases these can be determined automagically from the rules
themselves, but you can see my notes/warnings about this in the code.
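By the way, roughly the same effect as the inline "[...]" prefix can
be had by attaching a UnicodeSet filter to the transliterator
yourself; a quick sketch:

import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UnicodeSet;

// Restrict the transliterator so its rules only ever see Katakana
// (plus the long vowel mark) and skip everything else.
Transliterator kataHira = Transliterator.getInstance("Katakana-Hiragana");
kataHira.setFilter(new UnicodeSet("[ー[:Script=Katakana:]]"));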
--
Robert Muir
rcm...@gmail.com