Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory


Michael Levy

Aug 29, 2011, 3:06:17 PM
to Blacklight Development
Hi all,

We're moving servers and want to move to Solr 3.1. I am having an
issue using Blacklight and Solr 3.1. There is an existing thread on
the topic:
http://groups.google.com/group/blacklight-development/browse_thread/thread/7a5ccc50d5378631

In 2008 I acquired an older version of
UnicodeNormalizationFilterFactory.jar directly from Robert Haschart
(source code dated around 2008-06-30), and I have continued to use it
with Solr 1.4.1. Now that we are moving to Solr 3.1, I have tried both
that older version of UnicodeNormalizationFilterFactory.jar and a newer
one I acquired from here, along with normalizer.jar:
https://github.com/projectblacklight/blacklight-jetty/tree/master/solr/lib/
I can start the Solr admin app, but when I try to run any query I see
this error:

java.lang.AbstractMethodError: org.apache.lucene.analysis.TokenStream.incrementToken()Z
    at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:48)
    at org.apache.solr.analysis.WordDelimiterFilter.incrementToken(WordDelimiterFilter.java:338)
    at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
    at org.apache.lucene.analysis.KeywordMarkerFilter.incrementToken(KeywordMarkerFilter.java:73)
    at org.apache.lucene.analysis.snowball.SnowballFilter.incrementToken(SnowballFilter.java:76)
    ...

That error seems to be similar to those documented here:
http://lucene.472066.n3.nabble.com/K-Stemmer-for-Solr-3-1-td2929892.html
and here:
http://search.lucidimagination.com/search/document/ddce3a95ce8d7172/kstemmer_for_solr_3_x

At the same time I see there has been quite a bit of discussion of
UnicodeNormalizationFilterFactory versus ICUTokenizerFactory and
ICUFoldingFilterFactory.

And I note Chris Beer's work using the ICU approach:
https://github.com/projectblacklight/blacklight-jetty/blob/solr-4/solr/development-core/conf/schema.xml

I don't know enough to prefer UnicodeNormalizationFilterFactory
over ICUTokenizerFactory, but I would generally like to keep up with
the Blacklight community. If someone has run
UnicodeNormalizationFilterFactory with Solr 3.1, that would probably
be the easiest path for me.

I am indexing data both from a MARC .mrc export from Voyager and
from other cataloging systems (which is for me the #1 reason I love
Blacklight -- it was easy to do). So I'll need both SolrMarc and
plain-old XML paths to index data.

Thanks in advance for any help!


Erik Hatcher

Aug 29, 2011, 3:16:36 PM
to blacklight-...@googlegroups.com
It would require a fairly involved rewrite of the UnicodeNormalizationFilter to get it to work with the newer version of Lucene in Solr.

I strongly recommend going to the ICU stuff - you'll get top notch support from the Lucene community should it not live up to your needs.

How about someone take some of your non-English text examples, and run them through Solr's analysis.jsp view using the UnicodeNormalizationFilter and then also run it through a Solr 3.x ICU configured analyzer and see what the diffs, if any, are?

Michael - why go to 3.1 when 3.3 is now the latest? Just jump there. Use the ICU stuff. Then see if any users complain :)

Erik


Chris Beer

Aug 29, 2011, 3:24:01 PM
to blacklight-...@googlegroups.com
I'd echo Erik's comments -- go with ICU. One of the hang-ups I ran into in preparing a blacklight-jetty running Solr 3.x was trying to determine whether there are significant differences in the normalized output between UnicodeNormalizationFilterFactory and the ICU filters. If you find anything, I'd like to know about it.

Chris

Michael Levy

Aug 29, 2011, 5:00:42 PM
to Blacklight Development
Erik, Chris,

Thank you very much for your prompt responses. Sounds quite clear:
ICU here we come.

Re 3.1 versus 3.3, we just haven't kept up since May, but we will.

I've been pretty quiet on the listserv but we have rolled out an
internal Blacklight implementation at USHMM and are working on a plan
to roll out a version for the web. I'll keep the list updated when we
get close to rolling it out.

Robert Haschart

Aug 29, 2011, 5:48:19 PM
to blacklight-...@googlegroups.com
Even though I don't believe that rewriting the UnicodeNormalizationFilter code would be a major effort, since it is mostly a boilerplate token filter factory that calls functions from the ICU libraries to do the actual work, I still think it is probably time to retire the UnicodeNormalizationFilter code in favor of the solr.ICUFoldingFilterFactory code that is in Solr 3.1. The UnicodeNormalizationFilter was only written because the previously existing filter for processing accented characters, ISOLatin1AccentFilterFactory, was abysmally bad.

Now that a supported filter that uses the ICU libraries is available, the pro tem filter, UnicodeNormalizationFilter, should be retired and replaced.

I believe that removing the two jar files (normalizer.jar and UnicodeNormalizeFilter.jar) from the lib directory and replacing the line(s) in schema.xml
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
with
        <filter class="solr.ICUFoldingFilterFactory" />

should achieve largely the same results. (I think you'll need apache-solr-analysis-extras.3.x.jar, lucene-icu-3.x.jar, and icu4j-4_6.jar in the Solr lib directory.)
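
For context, here is a minimal sketch of how that filter might sit inside a field type in schema.xml. The field type name text_icu and the choice of a whitespace tokenizer are illustrative assumptions, not taken from the blacklight-jetty schema; adapt the analyzer chain to whatever your existing text field already uses.

```xml
<!-- Hypothetical field type: the name "text_icu" and the tokenizer choice
     are assumptions for illustration only. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Takes the place of the old schema.UnicodeNormalizationFilterFactory line -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the analyzer changes how terms are indexed, a full reindex would be needed after swapping the filter.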

-Bob Haschart

Michael Levy

Aug 29, 2011, 5:59:12 PM
to Blacklight Development
I pretty much followed this schema (adding in a few mods I'd made
previously):
https://github.com/projectblacklight/blacklight-jetty/blob/master/solr/conf/schema.xml
and I picked up the three jars mentioned by Bob from here
(icu4j-4_6.jar,
apache-solr-analysis-extras-4.0-2011-03-26_08-06-09.jar, and
lucene-analyzers-icu-4.0-2011-03-26_08-06-09.jar):
https://github.com/projectblacklight/blacklight-jetty/tree/solr-4/solr/test-core/lib
It pretty much works. I've done a bit of preliminary testing (for
example, searching for Lodz and for Łódź should return the same
results), which at first glance seems to indicate the two methods
return the same results.

It would seem I'm mixing Solr 3.1 with some 4.0 jars, and I might try
to get other versions, but so far so good.

Again, thanks to all.

Chris Beer

Aug 29, 2011, 6:19:10 PM
to blacklight-...@googlegroups.com
There's also a Solr 3.3 branch of blacklight-jetty at https://github.com/projectblacklight/blacklight-jetty/tree/solr-3.3 which is probably what you want to use as a reference copy. I believe the outstanding issues with blacklight-jetty using Solr 3.3 were outlined on this list earlier this month.




Chris Beer

Aug 29, 2011, 6:22:17 PM
to blacklight-...@googlegroups.com
Thanks Bob, it's great to hear they are largely compatible with each other. 

Just as a reminder, Tom raised some issues with using CJK and the ICUTokenizer on this list earlier [1] that we should probably keep in mind for documenting our future use of the ICU packages.

Thanks,
Chris

Erik Hatcher

Aug 30, 2011, 5:40:14 AM
to blacklight-...@googlegroups.com
As for the effort involved: Lucene's analysis APIs have changed a fair bit since 2.x, which is why I made that comment. It's trickier stuff under the covers than ever before, to achieve reusable token streams and leverage "attributes" and so on. Certainly not a major undertaking, but hopefully an unnecessary one, since the new ICU filters should do the trick.

Bob - thanks for your efforts with this normalization stuff over the years. Your contributions/feedback to the Lucene project factored into these improvements being made part of Lucene itself.

We still have some work to do to tie all this stuff together nicely out of the box with Solr, though. More on that in my next reply.

Erik

Erik Hatcher

Aug 30, 2011, 6:02:35 AM
to blacklight-...@googlegroups.com
Don't mix and match Lucene/Solr 3.x with 4.x. Very different stuff under the covers and results could be bad.

Solr (3.3 for example here) ships with apache-solr-analysis-extras-3.3.0.jar in the dist/ directory of the binary distro. This JAR file contains the Solr "factories" to wire Solr to the underlying Lucene libraries.

As Bob mentioned, you'll also need a couple of additional JAR files. These can be found in a binary distribution of Lucene (again, using 3.3 as an example), under contrib/icu. There's lucene-icu-3.3.0.jar (the actual analyzers that the above factories instantiate) and lib/icu4j-4_8.jar.

I strongly recommend keeping versions in sync. Solr and Lucene are versioned identically now, so just stick with the same 3_x release (again, 3.3 is recommended at this point) for both sides of things.
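
One way to make those JARs visible to a core is a lib directive in solrconfig.xml. This is only a sketch: the directory path below is a placeholder, and it assumes the three 3.3.0 JARs named above have all been copied into that single directory.

```xml
<!-- solrconfig.xml: the dir value is a hypothetical placeholder path.
     It is assumed to hold apache-solr-analysis-extras-3.3.0.jar,
     lucene-icu-3.3.0.jar, and icu4j-4_8.jar, all from the same 3.3.0 release. -->
<lib dir="/path/to/solr-3.3-extra-jars" />
```

Dropping the same three files into the core's own lib directory should work as well; either way, the point is that every JAR comes from the same release.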

Erik
