Implementers with non-roman script needs

Antonio Barrera

Mar 15, 2010, 11:11:15 AM
to blacklight-...@googlegroups.com

Hello all,

 

I’m looking for institutions that have implemented or plan to implement Blacklight and need to index, search and display non-Roman script.  I primarily want one of our people to test your interface and, potentially, discuss the trial and error you encountered during implementation.

 

Thanks,

Antonio Barrera

Princeton University Library

 

Tom Cramer

Mar 15, 2010, 11:28:34 AM
to blacklight-...@googlegroups.com, Tom Cramer, Naomi Dushay
Antonio,

Stanford has Blacklight in production (http://searchworks.stanford.edu), with a considerable number of records in, and an emphasis on, CJK, Hebrew and Arabic. We'd be interested in working with your staff to evaluate the current state of vernacular support (especially if it produces a roadmap for enhancements for the whole community), and also in describing our indexing & implementation approach.

My colleague, Naomi Dushay, is our technical lead on both our indexing and our vernacular script support, and is also on this list. 

I'd also be interested in getting a sense of who else in the BL and solr community is concerned with improving vernacular script search, sort and display functionality; it seems to me that there are enough of us now that it may make sense to coordinate our efforts.

- Tom

  | Tom Cramer
  | Associate Director, Digital Library Systems & Services
  | Stanford University Libraries & Academic Information Resources
  | Stanford University





-- 
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-...@googlegroups.com.
To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.

Lovins, Daniel

Mar 15, 2010, 11:28:54 AM
to blacklight-...@googlegroups.com, vufin...@lists.sourceforge.net, nonromans...@googlegroups.com, Lovins, Daniel

Hi Antonio.

 

You’re welcome to test our interface, though the non-Roman script capacity is still fairly primitive: http://yufind.library.yale.edu . You can search in any Unicode script, retrieve records if the script is present, and anything in the MARC 880 fields will display at the bottom of the Record view (but not yet in Result Lists).

 

BTW, you might be interested in joining this Google Group I set up for discussion of non-Roman scripts in Next-Gen discovery tools: http://groups.google.com/group/nonromanscripts4lib

 

Daniel

--

Jonathan Rochkind

Mar 15, 2010, 11:34:56 AM
to blacklight-...@googlegroups.com
Our local prototype DOES index and display non-roman script. However, we
haven't done anything special for tokenization or stemming of non-roman
(or other non-English) script, it's just going through the same Solr
filters as our English script. I'm sure there are current problems
because of that. I also noticed not-yet-diagnosed problems with display
of certain Korean script in Firefox.

But you can see our (very rough, development, in-progress, demo)
prototype at:

https://blacklight.mse.jhu.edu/demo

Here is a non-roman example record with Chinese text:
https://blacklight.mse.jhu.edu/demo/catalog/bib_474146

Antonio Barrera wrote:
> Hello all,
>

> I'm looking for institutions that have implemented or plan to implement Blacklight and need to index, search and display non-Roman script. I primarily want one of our people to test your interface and, potentially, discuss the trial and error you encountered during implementation.


>
> Thanks,
> Antonio Barrera
> Princeton University Library
>
>

Bill Dueber

Mar 15, 2010, 12:15:27 PM
to Demian Katz, vufin...@lists.sourceforge.net, blacklight-...@googlegroups.com
[copied to blacklight-dev, too]

We're very interested in something like this here at UMich/HathiTrust. We have a ton of non-Roman script material which we dutifully index, but we're not doing anything smart yet to tokenize CJK -- not even using the CJKAnalyzer as of yet (which just produces character bigrams, but that may get us close enough).

So we, too, would love to hear from anyone doing anything useful with CJK.

On Mon, Mar 15, 2010 at 10:25 AM, Demian Katz <demia...@villanova.edu> wrote:

I don't think there are a lot of these yet.  I know that the University of Tsukuba in Japan was playing with VuFind at one point, but their demo instance  (http://price6.tulips.tsukuba.ac.jp/vufind/) no longer seems to be active.  Beyond that, your best bet may be talking to someone at the HathiTrust (http://www.hathitrust.org/) since their catalog encompasses many languages, and I'm sure they've dealt with some non-Roman script issues.  I believe the Yale team is also interested in this area, though I'm not sure how far they have gotten.  At least some of the people involved are on this list, so hopefully you'll get further information from more informed parties!

 

Good luck!

 

- Demian

 

From: Antonio Barrera [mailto:abar...@Princeton.EDU]

Sent: Monday, March 15, 2010 11:11 AM
To: vufin...@lists.sourceforge.net
Subject: [VuFind-Tech] Implementers with non-roman script needs

 

Hello all,

 

I’m looking for institutions that have implemented or plan to implement VuFind and need to index, search and display non-Roman script.  I primarily want one of our people to test your interface and, potentially, discuss the trial and error you encountered during implementation.

 

Thanks,

Antonio Barrera

Princeton University Library




--
Bill Dueber
Library Systems Programmer
University of Michigan Library

Robert Haschart

Mar 15, 2010, 1:14:14 PM
to blacklight-...@googlegroups.com
Antonio,

At UVa our implementation of Blacklight stores and indexes non-Roman
scripts extracted from the 880 fields of the MARC records. In the
tokenization phase of indexing searchable fields, we use a
custom-written CJK tokenizer that (if I recall correctly) looks for
characters whose Unicode code points define them as Chinese, Japanese
or Korean characters and then splits those characters into words using
a very simplistic one "character" = one word tokenization scheme.
This probably wouldn't make sense for searching an index where the
bulk of the material is in CJK characters. But in our implementation,
where the data is mostly Roman characters, this scheme seems to provide
a good trade-off: reasonable searching power without needing to
implement a CJK word boundary detector.
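
For readers who want to see the idea concretely, here is a minimal Python sketch of a one character = one word scheme. It is not the UVa tokenizer (that is Java code shipped in the jar discussed later in this thread); the function names and the exact code point ranges are assumptions chosen for illustration, and a real implementation would likely cover more blocks.

```python
def is_cjk(ch: str) -> bool:
    """Rough check based on code point ranges (illustrative, not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


def naive_cjk_tokens(text: str) -> list:
    """Split on whitespace, then emit each CJK character as its own token.

    Runs of non-CJK characters inside a word are kept together, so Roman
    script text is tokenized on whitespace exactly as before.
    """
    tokens = []
    for word in text.split():
        buffer = ""
        for ch in word:
            if is_cjk(ch):
                if buffer:              # flush any pending non-CJK run
                    tokens.append(buffer)
                    buffer = ""
                tokens.append(ch)       # one character = one token
            else:
                buffer += ch
        if buffer:
            tokens.append(buffer)
    return tokens
```

With this sketch, `naive_cjk_tokens("Beijing 北京大学 press")` yields the Roman words unchanged and one token per Chinese character, which is why a query for part of a CJK string can still match.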

and Jonathan,

We have what might be the same problem with Korean script here at UVa,
and we are fairly certain that it is a font issue for the
browser/machine in question, i.e. a record that displays incorrectly
in one browser on one machine displays perfectly fine on a different
machine. Furthermore, if you enlarge the display on the machine that
displays incorrectly, the character appears as a box containing four
hexadecimal digits, which correspond to the Unicode code point for the
character that ought to be displayed.
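
As a small aid for this kind of diagnosis, the four hexadecimal digits in such a box can be turned back into the intended character. This is only a sketch using Python's standard library; the function name is made up for illustration.

```python
import unicodedata


def identify_boxed_character(hex_digits: str):
    """Map the hex digits shown in a missing-glyph box to (character, name)."""
    ch = chr(int(hex_digits, 16))
    # Fall back to a placeholder if the code point has no assigned name.
    return ch, unicodedata.name(ch, "<unnamed>")
```

For example, `identify_boxed_character("D55C")` returns the Hangul syllable 한 together with its Unicode name, which confirms that the data is intact and only the glyph is missing from the font.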

-Bob Haschart

Jonathan Rochkind

Mar 15, 2010, 1:17:51 PM
to blacklight-...@googlegroups.com
Robert Haschart wrote:
> and Jonathan,
>
> We have what might be the same problem with Korean script here at UVa,
> and we are fairly certain that it is a font issue for the
> browser/machine in question.

This is interesting, but the weird thing for me is that Korean
characters DO show fine in the same Firefox browser looking at my legacy
OPAC. So why is that same browser not able to display them when they
come out of BL? Somehow the Unicode being output by my legacy OPAC is
different, and displayable by the browser, but the Unicode coming out of
my BL is not? Very odd.
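
One thing that could be checked here (a possibility, not a diagnosis of this particular problem) is whether the two systems emit the same code points. The same visible Korean text can be encoded either as precomposed Hangul syllables or as sequences of conjoining jamo, and some fonts handle one form better than the other. A minimal Python sketch for comparing the two forms, with a made-up helper name:

```python
import unicodedata


def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


sample = "한"
precomposed = unicodedata.normalize("NFC", sample)   # single syllable code point
decomposed = unicodedata.normalize("NFD", sample)    # three conjoining jamo
```

Running `code_points` over text copied from each interface would show whether the outputs really differ at the code point level.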

Jonathan Rochkind

Mar 15, 2010, 5:07:29 PM
to blacklight-...@googlegroups.com
Robert Haschart wrote:
> In the
> tokenization phase of indexing searchable fields, we use a custom
> written CJK tokenizer, that (if I recall correctly) looks for characters
> whose unicode code points define them as Chinese, Japanese or Korean
> characters and will then split those characters into words using a very
> simplistic one "character" = one word tokenization scheme.
>
Hey Bob and UVA... would you be willing to share this custom naive CJK
tokenizer code? And the solrconfig code you use to, presumably, pass all
your updates and queries through tokenizer chains that will naively
tokenize CJK like this, while still letting you do other more
appropriate tokenization for other scripts?

Maybe put it in a public repo somewhere?

While simplistic, it does sound like your approach is a reasonable first
step, better than the even more simplistic things the rest of us are
doing. I'd love to try it out without having to re-invent that wheel.

Jonathan

Robert Haschart

Mar 15, 2010, 5:31:53 PM
to blacklight-...@googlegroups.com
Jonathan,

You may already have the code and not know it. If you have and use the
UnicodeNormalizeFilter.jar, which is placed in the lib sub-directory
under the Solr home, that file also contains the CJKFilterFactory code.

The bit in schema.xml to use the CJK filter factory is:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- INDEX -->
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="schema.CJKFilterFactory" bigrams="false"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         enablePositionIncrements=true ensures that a 'gap' is left to
         allow for accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- QUERY -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="schema.CJKFilterFactory" bigrams="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Jonathan Rochkind

Mar 15, 2010, 6:14:54 PM
to blacklight-...@googlegroups.com
Nice! So the UnicodeNormalizeFilter is already doing your guys' naive CJK
approach? Great.

I will check my schema.xml to make it work according to your
instructions, if it's not already from the stock/example Blacklight
schema.xml.

You don't have any synonyms for CJK, right? So your synonym stuff just
applies to the other stuff; not sure what the benefit of only doing
synonyms at query time is. The Solr docs I've read seem to _in general_
suggest only doing synonyms at index time, instead. But I can work that
out for myself; I'm just making sure there aren't any special
considerations with the CJK stuff regarding synonyms and whether you do
them at index or query time. There aren't?

Some day (soon?) we should put the UnicodeNormalizeFilter in a public
repo, so people can submit patches (or fork if needed). There's
something weird going on with phrase searches in my BL, and I wonder if
it's related to any of this code. It would be really good to be able to
see/test the source (that's what open source is for, right?).

Jonathan

Robert Haschart

Mar 15, 2010, 6:43:15 PM
to blacklight-...@googlegroups.com
I looked into submitting it to the Solr project, and someone told me
"No no. It should be a part of the lucene project instead." and when I
tried broaching the subject there, I got questions about why I made it
work the way it does, assertions that a Polish L with a bar through it
is completely different from an L, so folding them together is
"incorrect", discussions about how in German the u with umlaut is
essentially an accented u, whereas in other languages they are separate
distinct entities, like the Polish L bar, (or maybe the other way around)

I didn't feel like slogging through all of the philosophical linguistic
arguments and justifying the design choices I made, where essentially the
justification was "It helps users find what they might be looking for"
and "It seems to work for us." So I gave up trying to get it included
as a part of the official release of Lucene or Solr and just included it
as an optional part of Blacklight.
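
To make the folding debate concrete, here is a rough Python sketch of diacritic folding using only the standard library. It is not the filter's actual logic (that is Java and uses ICU4J according to the schema above); the `EXTRA_FOLDS` table and the function name are assumptions for illustration. It shows why the Polish L with a bar is a special case: an umlauted u decomposes into a base letter plus a combining mark that can be stripped, but the barred L has no such decomposition and has to be mapped explicitly.

```python
import unicodedata

# Hypothetical explicit mappings for letters that do NOT decompose under NFD.
EXTRA_FOLDS = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O"}


def fold(text: str) -> str:
    """Strip combining marks after NFD, then apply the explicit folds."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)
```

Under this sketch a search for "Lodz" would match "Łódź" and "uber" would match "über", which is the "helps users find what they might be looking for" trade-off described above.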

-Bob Haschart

Jonathan Rochkind

Mar 15, 2010, 6:49:49 PM
to blacklight-...@googlegroups.com
Okay, I can totally understand not wanting to deal with the bureaucracy of
a much larger open source project.

But we can still put the source in a public repo somewhere of our own.
Is it in the Blacklight repo somewhere I'm not seeing? That's what I'm
asking, I've got the .jar file yeah, but is there some public repo I can
see the _source_ (ideally with an ant buildfile that is capable of
building it) for the plugin?

If there is, I don't know it!

And if there isn't... can you put it somewhere? You can put it
somewhere like github without dealing with Solr or Lucene projects at
all. Then, if it's open source, someone ELSE can deal with trying to
get it into Solr or Lucene if they want to spend time doing it. If you
don't want to, I'm totally sympathetic (not my favorite way to spend
time either), but you can still share the source another way!

Jonathan

Jonathan Rochkind

Mar 15, 2010, 7:33:49 PM
to blacklight-...@googlegroups.com
PS: For anyone playing the home game, if you downloaded the 'demo'
Blacklight app back when it existed the sample jetty DID include the
UnicodeNormalizeFilter.jar, but the sample schema did NOT actually use
the CJKFilterFactory.

I'm not sure if the current Blacklight optional jetty-via-template
install is the same -- probably.

I have modified my schema.xml to include the CJKFilterFactory, and
things do work MUCH better now, after re-indexing. At least I can find
things in the index on a partial match of CJK characters -- without the
CJKFilterFactory, you'll only get a CJK match if you enter EVERY
adjacent CJK char in the source material together. Which is of course
the point of the CJKFilterFactory. Nice, thanks for that, it is an
improvement. (Would still REALLY like the actual source to be available
somewhere, so I can look at it to understand, for instance, what
"bigrams=false" or true does.)
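
For anyone unsure why partial matches start working, here is a toy Python sketch comparing three ways of tokenizing an unbroken CJK run: leaving it whole, splitting into single characters (unigrams), and splitting into overlapping character pairs (bigrams). It uses the general meaning of those terms; what the `bigrams` attribute actually does inside CJKFilterFactory is not confirmed here, and the matching function ignores token positions, which a real Solr phrase query would not.

```python
def whole_run(text: str) -> list:
    """No CJK-aware splitting: an unbroken run stays a single token."""
    return [text]


def unigrams(text: str) -> list:
    """One token per character."""
    return list(text)


def bigrams(text: str) -> list:
    """Overlapping two-character tokens."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(tokenize, query: str, document: str) -> bool:
    """True if every query token also appears among the document tokens."""
    doc_tokens = set(tokenize(document))
    return all(tok in doc_tokens for tok in tokenize(query))
```

With the document "北京大学" and the query "大学", `matches(whole_run, ...)` is false while `matches(unigrams, ...)` and `matches(bigrams, ...)` are both true, which mirrors the before and after behavior described above.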

So if anyone else has a jetty schema.xml that originated from the
Blacklight sample, I'd suggest you follow Bob's instructions below and
put a CJKFilterFactory in there. And whoever understands how the
template stuff works (I do not) might want to include that in the sample
schema.xml it gives you too.

Jonathan

Jonathan Rochkind

Mar 15, 2010, 8:36:05 PM
to blacklight-...@googlegroups.com
Although I'd add that I _think_ adding the CJK tokenizer significantly
slowed down my indexing. Which is to be expected, to some extent,
although maybe I'm misinterpreting my slowdown (or maybe the CJK
tokenizer could use some optimization, which I or someone else could
look at when we had time, but only if the code is actually public, if
it's actually open source!).

I wonder if the reason a lot of us have slower than "typical" solr
indexing is just because we're doing SO MUCH analysis on index of a
great many fields.

Bill Dueber

Mar 15, 2010, 8:41:27 PM
to blacklight-...@googlegroups.com
Are you running everything through the CJK tokenizer? Or just the fields you have reason to believe are CJK?

Jonathan Rochkind

Mar 15, 2010, 8:45:13 PM
to blacklight-...@googlegroups.com
Everything that's a "text" field, because pretty much any field that's a
text field could have CJK in it, the way I have things set up now.

I could re-configure my index mapping and request handler setup to put,
for instance, 880s in separate Solr fields. Right now they're not; the
"title" field, for instance, has any title info in it, whether it came
from an 880 (and might be CJK) or not.

I understand (and a brief test on a few records demonstrated) that the
CJK tokenizer is smart enough to handle this, but you're right that
maybe there are ways to rejigger it for performance. But there's a
tradeoff of having twice as many fields to be indexed and searched on
every search too. And it would take time I don't have right now to
reconfigure my Solr setup, whereas it took almost none of my own time
just to add the CJK tokenizer in there. I don't really know. If anyone
tries it various ways and benchmarks it, do let us know. :)

Jonathan

Bill Dueber wrote:
> Are you running everything through the CJK tokenizer? Or just the fields you have reason to believe are CJK?
>

Robert Haschart

Mar 16, 2010, 10:24:20 AM
to blacklight-...@googlegroups.com, solrma...@googlegroups.com
Jonathan,

I realize that the CJKFilterFactory stuff (as well as the
UnicodeFilterFactory stuff) should be in an open-source repository
somewhere. However, given that you have the UnicodeNormalizeFilter.jar,
you have the source code. If you open the UnicodeNormalizeFilter.jar
file with a zip file reader you will see that the source code for the
CJKFilterFactory (as well as the UnicodeFilterFactory) is right there
beside the .class files.
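
A .jar file is an ordinary zip archive, so the check can also be scripted. This is a small Python sketch using the standard `zipfile` module; the function name is made up, and the path you pass in depends on where the jar lives in your own Solr home.

```python
import zipfile


def java_sources_in_jar(jar_path: str) -> list:
    """Return the sorted names of all .java entries inside a jar (zip) file."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(name for name in jar.namelist() if name.endswith(".java"))
```

Calling it on your local copy of UnicodeNormalizeFilter.jar should list the bundled source files, if they are present in the copy you have.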

-Bob Haschart

Jonathan Rochkind

Mar 16, 2010, 11:01:28 AM
to blacklight-...@googlegroups.com, solrma...@googlegroups.com
Okay, great! I'm new to Java; I didn't realize that .java source files
(not just compiled class files) were (sometimes?) in a .jar.

So, when someone (could be me or could be someone else) gets to the
point where we want to hack on that code, you don't mind us putting it
in a public repo somewhere? It is (or could be) open source licensed?
