Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
From: Michael Levy <michaelrl...@gmail.com>
Date: Mon, 29 Aug 2011 12:06:17 -0700 (PDT)
Local: Mon, Aug 29 2011 3:06 pm
Subject: Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
Hi all,

We're moving servers and want to move to Solr 3.1.  I am having an
issue using Blacklight and Solr 3.1.  There is an existing thread on
the topic:
http://groups.google.com/group/blacklight-development/browse_thread/t...

In 2008 I had acquired an older version of
UnicodeNormalizationFilterFactory.jar directly from Robert Haschart
and was using that (source code was dated around 2008-06-30) and I
have continued to use that with 1.4.1.  Now that we are moving to Solr
3.1, I have tried the older version of UnicodeNormalizationFilterFactory.jar
and a newer one I acquired from here along with normalizer.jar:
https://github.com/projectblacklight/blacklight-jetty/tree/master/sol...
I can start the Solr admin app, but when I try to do any query I see
this error:

java.lang.AbstractMethodError: org.apache.lucene.analysis.TokenStream.incrementToken()Z
        at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:48)
        at org.apache.solr.analysis.WordDelimiterFilter.incrementToken(WordDelimiterFilter.java:338)
        at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
        at org.apache.lucene.analysis.KeywordMarkerFilter.incrementToken(KeywordMarkerFilter.java:73)
        at org.apache.lucene.analysis.snowball.SnowballFilter.incrementToken(SnowballFilter.java:76)
...

That error seems to be similar to those documented here:
http://lucene.472066.n3.nabble.com/K-Stemmer-for-Solr-3-1-td2929892.html
and here:
http://search.lucidimagination.com/search/document/ddce3a95ce8d7172/k...

At the same time I see there has been quite a bit of discussion of
UnicodeNormalizationFilterFactory versus ICUTokenizerFactory and
ICUFoldingFilterFactory.

And I note Chris Beer's work using the ICU approach:
https://github.com/projectblacklight/blacklight-jetty/blob/solr-4/sol...

I don't know enough to choose between UnicodeNormalizationFilterFactory
and ICUTokenizerFactory, but would generally like to keep up with
the Blacklight community.  If someone has run
UnicodeNormalizationFilterFactory with Solr 3.1, that would probably
be the easiest for me.

I am indexing data both from a MARC .mrc export from Voyager and
from other cataloging systems (which is for me the #1
reason I love Blacklight -- it was easy to do).  So I'll need SolrMarc
and plain-old XML paths to index data.

Thanks in advance for any help!


 
From: Erik Hatcher <erikhatc...@mac.com>
Date: Mon, 29 Aug 2011 15:16:36 -0400
Local: Mon, Aug 29 2011 3:16 pm
Subject: Re: [Blacklight-development] Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
It would require a fairly involved rewrite of the UnicodeNormalizationFilter to get it to work with the newer version of Lucene in Solr.

I strongly recommend going to the ICU stuff - you'll get top notch support from the Lucene community should it not live up to your needs.

How about someone take some of your non-English text examples, and run them through Solr's analysis.jsp view using the UnicodeNormalizationFilter and then also run it through a Solr 3.x ICU configured analyzer and see what the diffs, if any, are?
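
As a rough illustration of that kind of side-by-side comparison, here is a minimal Python sketch. It does not call Solr or either filter: `strip_marks_nfd` and `strip_marks_nfkd` are hypothetical stand-ins built on the standard library, and `diff_foldings` is an assumed helper name. In practice the two callables would be replaced by whatever produces the analyzed output of each configuration (for example, values copied out of analysis.jsp).

```python
import unicodedata


def strip_marks_nfd(text):
    """Stand-in folding A: canonical decomposition, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def strip_marks_nfkd(text):
    """Stand-in folding B: compatibility decomposition, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def diff_foldings(samples, fold_a, fold_b):
    """Return (sample, a_output, b_output) for each sample the two foldings disagree on."""
    diffs = []
    for sample in samples:
        a_out, b_out = fold_a(sample), fold_b(sample)
        if a_out != b_out:
            diffs.append((sample, a_out, b_out))
    return diffs


if __name__ == "__main__":
    # Illustrative samples only; "\ufb01nal" begins with the U+FB01 "fi" ligature.
    samples = ["caf\u00e9", "Z\u00fcrich", "\ufb01nal"]
    for sample, a_out, b_out in diff_foldings(samples, strip_marks_nfd, strip_marks_nfkd):
        print(f"{sample!r}: A={a_out!r} B={b_out!r}")
```

Only the samples on which the two foldings disagree are reported, which keeps the output short when the sample list is long.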

Michael - why go to 3.1 when 3.3 is now the latest?  Just jump there.  Use the ICU stuff.  Then see if any users complain :)

        Erik

On Aug 29, 2011, at 15:06 , Michael Levy wrote:


 
From: Chris Beer <chris_b...@wgbh.org>
Date: Mon, 29 Aug 2011 15:24:01 -0400
Local: Mon, Aug 29 2011 3:24 pm
Subject: Re: [Blacklight-development] Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
I'd echo Erik's comments -- go with ICU. One of the hang-ups I ran into in preparing a blacklight-jetty running Solr 3.x was trying to determine if there are significant differences in the normalized output between UnicodeNormalizationFilterFactory and the ICU filters.  If you find anything, I'd like to know about it.

Chris

On Aug 29, 2011, at 3:16 PM, Erik Hatcher wrote:


 
From: Michael Levy <michaelrl...@gmail.com>
Date: Mon, 29 Aug 2011 14:00:42 -0700 (PDT)
Local: Mon, Aug 29 2011 5:00 pm
Subject: Re: Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
Erik, Chris,

Thank you very much for your prompt responses.  Sounds quite clear:
ICU here we come.

Re 3.1 versus 3.3, we just haven't kept up since May, but we will.

I've been pretty quiet on the listserv but we have rolled out an
internal Blacklight implementation at USHMM and are working on a plan
to roll out a version for the web.  I'll keep the list updated when we
get close to rolling it out.


 
From: Robert Haschart <rh...@virginia.edu>
Date: Mon, 29 Aug 2011 17:48:19 -0400
Local: Mon, Aug 29 2011 5:48 pm
Subject: Re: [Blacklight-development] Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory

Even though I don't believe that rewriting the
UnicodeNormalizationFilter code would be a major effort, since it is
mostly a boilerplate token filter factory that calls functions from the
ICU libraries to do the actual work, I still think it is probably time
to retire the UnicodeNormalizationFilter code in favor of the
solr.ICUFoldingFilterFactory code that is in Solr 3.1.  The
UnicodeNormalizationFilter was only written because the previously
existing filter for processing accented characters,
ISOLatin1AccentFilterFactory, was abysmally bad.

Now that a supported filter that uses the ICU libraries is
available, the pro tem filter, UnicodeNormalizationFilter, should be
retired and replaced.

I believe that removing the two jar files (normalizer.jar and
UnicodeNormalizeFilter.jar) from the lib directory and replacing the
line(s) in schema.xml
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
with
        <filter class="solr.ICUFoldingFilterFactory" />

should achieve largely the same results. (I think you'll need
apache-solr-analysis-extras.3.x.jar, lucene-icu-3.x.jar, and
icu4j-4_6.jar in the solr lib directory.)
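
For a schema with several such lines, the replacement described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `swap_normalization_filter` is a hypothetical helper, and the surrounding fieldType, tokenizer, and lowercase filter in `SAMPLE` are invented to give the filter line some context. The two filter class names and the old filter's attributes are the ones quoted in the message.

```python
import xml.etree.ElementTree as ET

OLD_CLASS = "schema.UnicodeNormalizationFilterFactory"
NEW_CLASS = "solr.ICUFoldingFilterFactory"

# Illustrative fragment; only the UnicodeNormalizationFilterFactory line comes from the thread.
SAMPLE = """<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>"""


def swap_normalization_filter(schema_xml):
    """Return (new_xml, count): the rewritten XML and the number of filters replaced."""
    root = ET.fromstring(schema_xml)
    count = 0
    for flt in root.iter("filter"):
        if flt.get("class") == OLD_CLASS:
            # Drop the old filter's options; the replacement line in the thread takes none.
            flt.attrib.clear()
            flt.set("class", NEW_CLASS)
            count += 1
    return ET.tostring(root, encoding="unicode"), count


if __name__ == "__main__":
    new_xml, count = swap_normalization_filter(SAMPLE)
    print(f"replaced {count} filter(s)")
    print(new_xml)
```

ElementTree discards XML comments when it parses, so for a real, heavily commented schema.xml it may be safer to use the count as a check and make the edit by hand.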

-Bob Haschart


 
From: Michael Levy <michaelrl...@gmail.com>
Date: Mon, 29 Aug 2011 14:59:12 -0700 (PDT)
Local: Mon, Aug 29 2011 5:59 pm
Subject: Re: Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
I pretty much followed this schema (adding in a few mods I'd made
previously):
https://github.com/projectblacklight/blacklight-jetty/blob/master/sol...
and I picked up the three jars mentioned by Bob (icu4j-4_6.jar,
apache-solr-analysis-extras-4.0-2011-03-26_08-06-09.jar, and
lucene-analyzers-icu-4.0-2011-03-26_08-06-09.jar) from here:
https://github.com/projectblacklight/blacklight-jetty/tree/solr-4/sol...
It pretty much works.  I've done a bit of preliminary testing (for
example, searching for Lodz and for Łódź should return the same
results), which at first glance seems to indicate the two methods
return the same results.
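
The Lodz / Łódź pair is a useful test case because Ł and ł have no canonical Unicode decomposition, so a folding that only decomposes and strips combining marks leaves them untouched. The Python sketch below shows that effect with the standard library only. It does not reproduce either Solr filter; `fold_decomposition_only`, `fold_with_stroke_mapping`, and the two-entry `STROKE_LETTERS` table are hypothetical names used for illustration.

```python
import unicodedata

# Letters with a stroke have no canonical decomposition, so they need an
# explicit mapping. This two-entry table is illustrative, not exhaustive.
STROKE_LETTERS = str.maketrans({"\u0141": "L", "\u0142": "l"})  # Ł, ł


def fold_decomposition_only(text):
    """Decompose (NFD), drop combining marks, lowercase. Leaves Ł/ł in place."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def fold_with_stroke_mapping(text):
    """Same as above, but map the stroke letters to plain L/l first."""
    return fold_decomposition_only(text.translate(STROKE_LETTERS))


if __name__ == "__main__":
    lodz_polish = "\u0141\u00f3d\u017a"  # Łódź
    print(fold_decomposition_only(lodz_polish) == fold_decomposition_only("Lodz"))
    print(fold_with_stroke_mapping(lodz_polish) == fold_with_stroke_mapping("Lodz"))
```

A pair of search terms like this therefore exercises more than accent stripping, which makes it a reasonable smoke test when comparing two folding configurations.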

It would seem I'm mixing Solr 3.1 with some 4.0 jars, and I might try
to get other versions, but so far so good.

Again, thanks to all.


 
From: Chris Beer <chris_b...@wgbh.org>
Date: Mon, 29 Aug 2011 18:19:10 -0400
Local: Mon, Aug 29 2011 6:19 pm
Subject: Re: [Blacklight-development] Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory

There's also a Solr 3.3 branch of blacklight-jetty at https://github.com/projectblacklight/blacklight-jetty/tree/solr-3.3 which is probably what you want to use as a reference copy. I believe the outstanding issues with blacklight-jetty using Solr 3.3 were outlined on this list earlier this month.

On Aug 29, 2011, at 5:59 PM, Michael Levy wrote:


--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-development@googlegroups.com.
To unsubscribe from this group, send email to blacklight-development+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.


 
From: Chris Beer <chris_b...@wgbh.org>
Date: Mon, 29 Aug 2011 18:22:17 -0400
Local: Mon, Aug 29 2011 6:22 pm
Subject: Re: [Blacklight-development] Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory

Thanks Bob, it's great to hear they are largely compatible with each other.

Just as a reminder, Tom raised some issues with using CJK and the ICUTokenizer on this list earlier [1] that we should probably keep in mind for documenting our future use of the ICU packages.

Thanks,
Chris

[1] http://groups.google.com/group/blacklight-development/browse_thread/t...

On Aug 29, 2011, at 5:48 PM, Robert Haschart wrote:

From: Erik Hatcher <erikhatc...@mac.com>
Date: Tue, 30 Aug 2011 05:40:14 -0400
Local: Tues, Aug 30 2011 5:40 am
Subject: Re: [Blacklight-development] Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
As for the effort involved - Lucene's analysis APIs have changed a fair bit since 2.x, which is why I made that comment.  It's trickier stuff under the covers than ever before, to achieve reusable token streams and leverage "attributes" and so on.  Certainly not a major undertaking, but hopefully an unnecessary one since the new ICU filters should do the trick.

Bob - thanks for your efforts with this normalization stuff over the years.  Your contributions/feedback to the Lucene project factored into these improvements being made part of Lucene itself.

We still have some work to do to tie all this stuff together nicely out of the box with Solr, though.  More on that in my next reply.

        Erik

On Aug 29, 2011, at 17:48 , Robert Haschart wrote:


 
From: Erik Hatcher <erikhatc...@mac.com>
Date: Tue, 30 Aug 2011 06:02:35 -0400
Local: Tues, Aug 30 2011 6:02 am
Subject: Re: [Blacklight-development] Re: Blacklight and Solr 3.1, problems with UnicodeNormalizationFilterFactory
Don't mix and match Lucene/Solr 3.x with 4.x.  Very different stuff under the covers and results could be bad.

Solr (3.3 for example here) ships with apache-solr-analysis-extras-3.3.0.jar in the dist/ directory of the binary distro.  This JAR file contains the Solr "factories" to wire Solr to the underlying Lucene libraries.

As Bob mentioned, you'll also need a couple of additional JAR files.  These can be found in a binary distribution of Lucene (again, using 3.3 as an example), under contrib/icu.  There's lucene-icu-3.3.0.jar (the actual analyzers that the above factories instantiate) and lib/icu4j-4_8.jar.

I strongly recommend keeping versions in sync.  Solr and Lucene are versioned identically now, so just stick with the same 3.x release (again, 3.3 is recommended at this point) for both sides of things.
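
One way to catch a mixed set of jars before starting Solr is a small filename check, sketched here in Python. It assumes the jars follow the `name-MAJOR.MINOR...` naming seen in this thread; `solr_lucene_versions` and `distinct_versions` are hypothetical helpers, and `apache-solr-core-3.1.0.jar` is an assumed filename included only to represent a 3.1 install. The icu4j jar is skipped on purpose because it carries its own version numbering.

```python
import re

# Matches e.g. "lucene-icu-3.3.0.jar" or
# "apache-solr-analysis-extras-4.0-2011-03-26_08-06-09.jar".
JAR_VERSION = re.compile(r"^(?:apache-solr|lucene)-[a-z-]+?-(?P<version>\d+\.\d+)")


def solr_lucene_versions(jar_names):
    """Map each Solr/Lucene jar name to its major.minor version; other jars are skipped."""
    versions = {}
    for jar in jar_names:
        match = JAR_VERSION.match(jar)
        if match:
            versions[jar] = match.group("version")
    return versions


def distinct_versions(jar_names):
    """Return the sorted distinct major.minor versions; more than one means a mix."""
    return sorted(set(solr_lucene_versions(jar_names).values()))


if __name__ == "__main__":
    mixed = [
        "apache-solr-core-3.1.0.jar",  # assumed name, stands in for the 3.1 install
        "apache-solr-analysis-extras-4.0-2011-03-26_08-06-09.jar",
        "lucene-analyzers-icu-4.0-2011-03-26_08-06-09.jar",
        "icu4j-4_6.jar",
    ]
    in_sync = [
        "apache-solr-analysis-extras-3.3.0.jar",
        "lucene-icu-3.3.0.jar",
        "icu4j-4_8.jar",
    ]
    print(distinct_versions(mixed))
    print(distinct_versions(in_sync))
```

A result with a single entry suggests the Solr and Lucene jars come from the same release; anything longer is worth investigating.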

        Erik

On Aug 29, 2011, at 17:59 , Michael Levy wrote:


 