Hello all,
I’m looking for institutions that have implemented Blacklight, or are planning to, and that need to index, search, and display non-Roman scripts. I’d primarily like to have one of our people test your interface, and potentially discuss the trial and error you encountered during implementation.
Thanks,
Antonio Barrera
Princeton University Library
--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-...@googlegroups.com.
To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.
Hi Antonio.
You’re welcome to test our interface, though the non-Roman script capacity is still fairly primitive: http://yufind.library.yale.edu . You can search in any Unicode script, retrieve records if the script is present, and anything in the MARC 880 fields will display at the bottom of the Record view (but not yet in Result Lists).
BTW, you might be interested in joining this Google Group I set up for discussion of non-Roman scripts in Next-Gen discovery tools: http://groups.google.com/group/nonromanscripts4lib
Daniel
--
But you can see our (very rough, development, in-progress, demo)
prototype at:
https://blacklight.mse.jhu.edu/demo
Here is a non-roman example record with Chinese text:
https://blacklight.mse.jhu.edu/demo/catalog/bib_474146
I don't think there are a lot of these yet. I know that the University of Tsukuba in Japan was playing with VuFind at one point, but their demo instance (http://price6.tulips.tsukuba.ac.jp/vufind/) no longer seems to be active. Beyond that, your best bet may be talking to someone at the HathiTrust (http://www.hathitrust.org/) since their catalog encompasses many languages, and I'm sure they've dealt with some non-Roman script issues. I believe the Yale team is also interested in this area, though I'm not sure how far they have gotten. At least some of the people involved are on this list, so hopefully you'll get further information from more informed parties!
Good luck!
- Demian
From: Antonio Barrera [mailto:abar...@Princeton.EDU]
Sent: Monday, March 15, 2010 11:11 AM
To: vufin...@lists.sourceforge.net
Subject: [VuFind-Tech] Implementers with non-roman script needs
Hello all,
I’m looking for institutions that have implemented VuFind, or are planning to, and that need to index, search, and display non-Roman scripts. I’d primarily like to have one of our people test your interface, and potentially discuss the trial and error you encountered during implementation.
Thanks,
Antonio Barrera
Princeton University Library
_______________________________________________
Vufind-tech mailing list
Vufin...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/vufind-tech
At UVa, our implementation of Blacklight stores and indexes non-Roman
scripts extracted from the 880 fields of the MARC records. In the
tokenization phase of indexing searchable fields, we use a
custom-written CJK tokenizer that (if I recall correctly) looks for
characters whose Unicode code points define them as Chinese, Japanese,
or Korean, and then splits those characters into words using a very
simplistic one-"character"-equals-one-word tokenization scheme.
While this probably wouldn't make sense for searching an index where
the bulk of the material is in CJK characters, in our implementation,
where the data is mostly Roman characters, this scheme seems to provide
a good trade-off: reasonable searching power without needing to
implement a CJK word-boundary detector.
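The one-character-per-token scheme described above can be sketched roughly as follows. This is my own illustration, not the actual Blacklight tokenizer; the code-point ranges and function names are assumptions:

```python
# Rough sketch of one-character-per-token CJK splitting.
# The ranges below cover common CJK Unified Ideographs,
# Hiragana/Katakana, and Hangul Syllables; the real tokenizer
# may use different ranges.

CJK_RANGES = [
    (0x3040, 0x30FF),   # Hiragana and Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0xAC00, 0xD7AF),   # Hangul Syllables
]

def is_cjk(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def tokenize(text):
    """Split CJK runs into single-character tokens; keep
    whitespace-delimited tokens for everything else."""
    tokens = []
    for word in text.split():
        buf = ""
        for ch in word:
            if is_cjk(ch):
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)  # one CJK character = one token
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens

print(tokenize("history 中国歷史"))  # → ['history', '中', '国', '歷', '史']
```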
Jonathan,
We have what might be the same problem with Korean script here at UVa,
and we are fairly certain that it is a font issue for the
browser/machine in question. i.e., a record that displays incorrectly
in one browser on one machine displays perfectly fine on a different
machine. Furthermore, if you enlarge the display on the machine that
renders it incorrectly, the character appears as a box containing four
hexadecimal digits, which correspond to the Unicode code point for the
character that ought to be displayed.
-Bob Haschart
This is interesting, but the weird thing for me is that Korean
characters DO show fine in the same Firefox browser looking at my legacy
OPAC. So why is that same browser not able to display them when they
come out of BL? Somehow the Unicode being output by my legacy OPAC is
different, and displayable by the browser, but the Unicode coming out of
my BL is not? Very odd.
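One possible explanation (my speculation, not something confirmed in this thread) is Unicode normalization: the legacy OPAC might emit precomposed Hangul syllables (NFC) while the other pipeline emits decomposed jamo (NFD). The two forms represent the same syllable but compare unequal, and some font/browser combinations render decomposed jamo poorly. A quick way to check the difference, sketched in Python:

```python
import unicodedata

# The Hangul syllable "han" (U+D55C), precomposed vs decomposed.
nfc = unicodedata.normalize("NFC", "\ud55c")
nfd = unicodedata.normalize("NFD", "\ud55c")

print([hex(ord(c)) for c in nfc])  # one precomposed code point
print([hex(ord(c)) for c in nfd])  # three combining jamo code points

# Same syllable, but the strings compare unequal:
print(nfc == nfd)  # → False
```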
Maybe put it in a public repo somewhere?
While simplistic, it does sound like your approach is a reasonable first
step, better than the even more simplistic things the rest of us are
doing. I'd love to try it out without having to re-invent that wheel.
Jonathan
You may already have the code and not know it, if you have and use the
UnicodeNormalizeFilter.jar which is placed in the lib sub-directory
under the Solr home. That file also contains the CJKFilterFactory code.
The bit in schema.xml to use the CJK filter factory is:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- INDEX -->
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="schema.CJKFilterFactory" bigrams="false"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case-insensitive stop word removal.
         enablePositionIncrements=true ensures that a 'gap' is left to
         allow for accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- QUERY -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="schema.CJKFilterFactory" bigrams="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I will check my schema.xml and make it work according to your
instructions, if it isn't already set up that way from the
stock/example Blacklight schema.xml.
You don't have any synonyms for CJK, right? So your synonym stuff just
applies to the other content; I'm not sure what the benefit of only
doing synonyms at query time is? The Solr docs I've read seem _in
general_ to suggest only doing synonyms at index time instead. But I
can work that out for myself; I'm just making sure there aren't any
special considerations with the CJK stuff regarding whether you do
synonyms at index or query time. There aren't?
Some day (soon?) we should put the UnicodeNormalizeFilter in a public
repo, so people can submit patches (or fork if needed). There's
something weird going on with phrase searches in my BL, and I wonder if
it's related to any of this code; it would be really good to be able to
see/test the source (that's what open source is for, right?).
Jonathan
I didn't feel like slogging through all of the philosophical linguistic
arguments and justifying the design choices I made, where essentially the
justification was "It helps users find what they might be looking for"
and "It seems to work for us." So I gave up trying to get it included
as part of the official release of Lucene or Solr and just included it
as an optional part of Blacklight.
-Bob Haschart
But we can still put the source in a public repo somewhere of our own.
Is it in the Blacklight repo somewhere I'm not seeing? That's what I'm
asking, I've got the .jar file yeah, but is there some public repo I can
see the _source_ (ideally with an ant buildfile that is capable of
building it) for the plugin?
If there is, I don't know it!
And if there isn't... can you put it somewhere? You can put it
somewhere like github without dealing with Solr or Lucene projects at
all. Then, if it's open source, someone ELSE can deal with trying to
get it into Solr or Lucene if they want to spend time doing it. If you
don't want to, I'm totally sympathetic (not my favorite way to spend
time either), but you can still share the source another way!
Jonathan
I'm not sure if the current Blacklight optional jetty-via-template
install is the same -- probably.
I have modified my schema.xml to include the CJKFilterFactory, and
things do work MUCH better now, after re-indexing. At least I can find
things in the index on a partial match of CJK characters -- without the
CJKFilterFactory, you'll only get a CJK match if you enter EVERY
adjacent CJK char in the source material together. Which is of course
the point of the CJKFilterFactory. Nice, thanks for that, it is an
improvement. (I would still REALLY like the actual source to be
available somewhere, so I can look at it to understand, for instance,
what "bigrams=false" or true does.)
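For what it's worth, a common alternative to single-character CJK tokens is overlapping character bigrams (this is what Lucene's standard CJK analysis produces); presumably that is what a "bigrams" switch toggles, though without the source that's a guess. The difference, sketched:

```python
# Sketch of unigram vs. overlapping-bigram tokenization of a CJK
# string -- my guess at what a bigrams=true/false switch selects
# between, not code from the actual filter.

def unigrams(text):
    # One token per character.
    return list(text)

def bigrams(text):
    # Overlapping pairs of adjacent characters.
    return [text[i:i + 2] for i in range(len(text) - 1)]

s = "中国歷史"
print(unigrams(s))  # → ['中', '国', '歷', '史']
print(bigrams(s))   # → ['中国', '国歷', '歷史']
```

Bigrams tend to reduce false matches (each token carries more context) at the cost of a larger index.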
So if anyone else has a jetty schema.xml that originated from the
Blacklight sample, I'd suggest you follow Bob's instructions below and
put a CJKFilterFactory in there. And whoever understands how the
template stuff works (I do not) might want to include that in the sample
schema.xml it gives you too.
Jonathan
I wonder if the reason a lot of us have slower than "typical" Solr
indexing is just because we're doing SO MUCH analysis at index time on
a great many fields.
I could re-configure my index mapping and request handler setup to put,
for instance, 880s in separate Solr fields. Right now they're not; the
"title" field, for instance, has any title info in it, whether it came
from an 880 (and might be CJK) or not.
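A separate-field arrangement like that might look something like this in schema.xml (the field names here are hypothetical, just to illustrate the shape):

```
<!-- Hypothetical: keep vernacular (880-derived) title data in its own
     field instead of merging it into "title" -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_vernacular" type="text" indexed="true" stored="true"/>
```

The request handler would then need to query both fields, which is the trade-off of having twice as many fields to search.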
I understand (and a brief test on a few records demonstrated) that the
CJK tokenizer is smart enough to handle this, but you're right that
maybe there are ways to rejigger it for performance. But there's a
trade-off of having twice as many fields to be indexed and searched on
every search, too. And it would take me time I don't have right now to
reconfigure my Solr setup -- it took no time at all just to add the CJK
tokenizer in there. I don't really know. If anyone tries it various
ways and benchmarks it, do let us know. :)
Jonathan
Bill Dueber wrote:
> Are you running everything through the CJK tokenizer? Or just the fields you have reason to believe are CJK?
>
I realize that the CJKFilterFactory stuff (as well as the
UnicodeFilterFactory stuff) should be in an open-source repository
somewhere. However, given that you have the UnicodeNormalizeFilter.jar,
you have the source code. If you open the UnicodeNormalizeFilter.jar
file with a zip file reader, you will see that the source code for the
CJKFilterFactory (as well as the UnicodeFilterFactory) is right there
beside the .class files.
-Bob Haschart
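Since a jar is just a zip archive, pulling those bundled sources out can be scripted. A small sketch with Python's zipfile module (the jar filename is the one from the thread; the function name is my own):

```python
import zipfile

def extract_java_sources(jar_path, dest="extracted-src"):
    """List and extract the .java files bundled beside the .class
    files in a jar (a jar is just a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        sources = [n for n in jar.namelist() if n.endswith(".java")]
        jar.extractall(members=sources, path=dest)
        return sources

# e.g. extract_java_sources("UnicodeNormalizeFilter.jar")
```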
So, when someone (could be me or could be someone else) gets to the
point where we want to hack on that code, you don't mind us putting it
in a public repo somewhere? It is (or could be) open source licensed?