Chinese searching


Mark Jordan

Jan 18, 2016, 7:42:51 PM
to islandora, island...@googlegroups.com
Hello,

Has anyone configured Solr to search Chinese text? Using the Solr configuration files from https://github.com/discoverygarden/basic-solr-config, our tests show that single-character searches (e.g., 蘇) work; multiple-character searches with no spaces between the characters (e.g., 蘇美) don't work; and searches with spaces between the characters (e.g., 蘇 美) work. These characters were copied from the OCR text that was indexed in Solr on ingest, and we performed our tests using the default simple search form.

If anyone has any suggestions for making "phrase" searching work on Chinese text, I'd love to hear them. The OCR transcripts contain mainly Traditional Chinese text, with much smaller quantities of English (the Chinese text is the full text of newspaper pages; the English text is the ads on those pages).

Mark





Diego Pino

Jan 19, 2016, 7:44:13 AM
to islandora, island...@googlegroups.com
Hi Mark,

I remember the Islandora Conf 2015 presentation that Louisa Lam and Jeff Liu from the Chinese University of Hong Kong gave. They talked about how they configured Solr, OCR, and other related things. I think this presentation ran in parallel with yours! I would recommend you get in touch with them; they are the specialists!


Also, in my experience it's all about the language analyzer/tokenizer mix you use. Overlapping n-grams (bigrams and trigrams) are something you should consider, since Chinese words aren't separated by spaces. This also means your index grows a lot, since you are basically indexing many overlapping combinations. There is <tokenizer class="solr.CJKTokenizerFactory"/> for Chinese/Japanese that does the work for you, but you can also do it manually using, for example, <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="3"/>.
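
To make the overlapping n-gram idea concrete, here is a minimal Python sketch of what such a tokenizer or filter emits. It is not Solr code; char_ngrams is a hypothetical helper, and the sample string 蘇美爾文明 is an assumption chosen for illustration. It only shows how many terms an n-gram setup with minGramSize="1" and maxGramSize="3" would produce for a short run of characters, and why a two-character query can then be found as a single indexed term.

```python
def char_ngrams(text, min_n=1, max_n=3):
    """Return all overlapping character n-grams of each whitespace-separated chunk."""
    grams = []
    for chunk in text.split():
        for n in range(min_n, max_n + 1):
            # Slide a window of width n across the chunk, one character at a time.
            for i in range(len(chunk) - n + 1):
                grams.append(chunk[i:i + n])
    return grams


indexed_text = "蘇美爾文明"  # hypothetical sample text, not taken from the thread

# Bigrams only: roughly what a CJK bigram tokenizer would index.
bigrams = char_ngrams(indexed_text, 2, 2)

# Sizes 1 to 3: the equivalent of minGramSize="1" maxGramSize="3".
all_grams = char_ngrams(indexed_text, 1, 3)

print(bigrams)
print(len(indexed_text), "characters ->", len(all_grams), "indexed terms")
print("蘇美" in set(all_grams))
```

Five characters turn into twelve terms here, which is the index growth mentioned above. As far as I know, recent Solr releases achieve the bigram behaviour by pairing solr.StandardTokenizerFactory with solr.CJKBigramFilterFactory rather than using solr.CJKTokenizerFactory, so check the documentation for the Solr version you run.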

Some more info: 


Best!

Diego Pino N

Melissa Anez

Jan 19, 2016, 8:51:19 AM
to islandora, island...@googlegroups.com

Jeff Liu

Jan 19, 2016, 10:16:55 AM
to islandora, island...@googlegroups.com
Hi all,

DiscoveryGarden has implemented this feature for us; you may refer to the settings at https://github.com/discoverygarden/cuhk-basic-solr-config
It enables searching by Chinese phrase and by character, and also searching across Traditional Chinese, Simplified Chinese, and their variants. (Chinese characters are quite complicated...)
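
For readers wondering what matching across Traditional and Simplified forms involves, here is a rough Python sketch of one common approach: fold both the indexed text and the query to a single script before comparing. This is an assumption about the general technique, not a description of what the CUHK configuration actually does. TRAD_TO_SIMP is a tiny hypothetical table with two entries; a real mapping covers thousands of characters and variants.

```python
# Hypothetical two-entry table for illustration only.
TRAD_TO_SIMP = {"蘇": "苏", "爾": "尔"}


def normalize(text):
    """Fold Traditional characters to Simplified using the illustrative table."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def matches(indexed_text, query):
    """Compare query and text after folding both to the same script."""
    return normalize(query) in normalize(indexed_text)


print(normalize("蘇美爾"))
print(matches("蘇美爾文明", "苏美"))  # Simplified query against Traditional text
print(matches("蘇美爾文明", "蘇美"))  # Traditional query against Traditional text
```

In Solr itself this kind of folding is usually done in the analysis chain at both index and query time, for example with solr.ICUTransformFilterFactory and id="Traditional-Simplified" from the analysis-extras contrib, if that is available in your install.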

I know you may be very interested to see how it works on our site, but sorry, we are still very busy fine-tuning and ingesting more items into our instance before going live around mid-February.

Thanks,
Jeff

On Tuesday, January 19, 2016 at 9:51:19 PM UTC+8, Melissa Anez wrote:

Diego Pino

Jan 19, 2016, 10:49:25 AM
to islandora, island...@googlegroups.com
jeff++, thanks for sharing this

Mark Jordan

Jan 19, 2016, 10:53:01 AM
to isla...@googlegroups.com, island...@googlegroups.com

Indeed, thanks very much everyone,

Mark


--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+...@googlegroups.com.
Visit this group at https://groups.google.com/group/islandora.
To view this discussion on the web visit https://groups.google.com/d/msgid/islandora/bdaf6504-18be-4d67-ae77-ae61711a7776%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark Jordan

Feb 2, 2016, 6:31:53 PM
to isla...@googlegroups.com, island...@googlegroups.com
Hi everyone,

I want to give a big shout-out to Jeff Liu, who spent some time helping us get the DGI Solr configs for searching in Chinese working. Now that we have searching in one additional language ready to go, getting it to work in the other three languages we have full text for in our collections should be a lot easier.

Thanks Jeff!

Mark



Jeff Liu

Feb 4, 2016, 9:22:47 PM
to islandora, island...@googlegroups.com
Glad that we can help each other improve Islandora. This group is a great place to share our different use cases (some of them may seem very unique at first, but we always find them helpful and applicable to our own instance later on).
I am sure every one of us has benefited from this group!

Next week is our Chinese New Year; I hope you will all be healthy and lucky in the "Year of the Monkey"! (http://www.travelchinaguide.com/intro/social_customs/zodiac/monkey.htm)

Jeff

p.s. I am still a newbie learning Islandora, Solr and XSLT every day.


On Wednesday, February 3, 2016 at 7:31:53 AM UTC+8, Mark Jordan wrote: