Indexing UTF-8 sites


raven

Jun 25, 2008, 5:54:32 AM
to hounder
Is it possible to index and search sites containing UTF-8 Unicode?
For testing purposes I indexed our own website and found that in the
Russian version all Cyrillic characters are replaced by question
marks,
and if I send a query to the web search in Cyrillic letters, an error
occurs.

Alejandro Jorge Pérez

Jun 25, 2008, 2:52:11 PM
to hou...@googlegroups.com, sch...@gmail.com


2008/6/25 raven <sch...@gmail.com>:

Both the indexer and the searcher are 100% compatible with Unicode. In the past we had trouble with the crawler misinterpreting the page's encoding, and sometimes with the web interface.

First, check that the default encoding of the machine where you're running the Java processes is UTF-8.
If it is, then use the "idx" script (you can find it under the "indexer" directory in your installation) to check that the index stored the text with the right encoding (use "search" and then maybe "terms").
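
For the first check, a minimal sketch in plain Java (not part of Hounder; the class name EncodingCheck and the sample word are made up for illustration, and it assumes Java 7 or later for StandardCharsets). It prints the JVM's default charset and shows how Cyrillic text turns into question marks when it passes through a charset that cannot represent it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hounder.
public class EncodingCheck {

    // Encode the text to bytes in the given charset, then decode it back.
    // Characters the charset cannot represent are replaced on encoding.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Charset the JVM falls back to when code does not name one.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Russian sample word written with escapes, so the result does not
        // depend on the encoding of this source file.
        String sample = "\u043f\u0440\u0438\u0432\u0435\u0442";

        // UTF-8 can represent Cyrillic, so the text survives the round trip.
        System.out.println("UTF-8 keeps text: "
                + roundTrip(sample, StandardCharsets.UTF_8).equals(sample));

        // ISO-8859-1 cannot, so every letter becomes '?'.
        System.out.println("ISO-8859-1 gives: "
                + roundTrip(sample, StandardCharsets.ISO_8859_1));
    }
}
```

If the default turns out not to be UTF-8, a common way to override it is to start the JVM with -Dfile.encoding=UTF-8. Whether the Hounder start scripts pass that option through is not covered in this thread.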

Drop me a line when you're done with these tests.


Spike.

raven

Jun 26, 2008, 5:10:51 AM
to hounder
> If it is, then use the "idx" script (you can find it under the "indexer"
> directory in your installation) to check that the index stored the text with
> the right encoding (use "search" and then maybe "terms").
> Drop me a line when you're done with these tests.

In the index the text is stored in the right encoding. I think this
problem is caused by Jetty. I made the mistake of downloading only the
binary package, so I don't have much documentation and can't get a
real overview of how things work together. I will now integrate the
search application into Tomcat, which is already installed and configured
on my server. I think this will solve the problem.
A suggestion: you should give users who only download the binaries a bit
more documentation, or at least a hint to download the sources to get more
information. You could also distribute the documentation as a separate
download.
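
To illustrate how a web container could garble a Cyrillic query even when the index is fine, here is a small sketch in plain Java. It is only an assumption about the kind of failure involved, not a confirmed diagnosis of Jetty or of the Hounder web interface; the class name QueryDecoding and the sample query are made up. It percent-encodes a query as UTF-8, the way a browser typically would, and then decodes it with the right and the wrong charset:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical illustration, not part of Hounder, Jetty or Tomcat.
public class QueryDecoding {

    // Decode a percent-encoded query parameter using the named charset.
    public static String decode(String raw, String charsetName)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(raw, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Russian word for "search", written with escapes.
        String query = "\u043f\u043e\u0438\u0441\u043a";

        // Each Cyrillic letter becomes two percent-encoded UTF-8 bytes.
        String encoded = URLEncoder.encode(query, "UTF-8");
        System.out.println("encoded: " + encoded);

        // Decoding with the charset used for encoding restores the query.
        System.out.println("UTF-8 decode matches:      "
                + decode(encoded, "UTF-8").equals(query));

        // Decoding with ISO-8859-1 turns every byte into its own character,
        // so the searcher would receive ten unrelated characters instead of
        // five Cyrillic letters.
        System.out.println("ISO-8859-1 decode matches: "
                + decode(encoded, "ISO-8859-1").equals(query));
    }
}
```

If this is what happens, the fix belongs in the container or web application configuration (the charset used to decode request parameters), not in the index.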

Jorge Handl

Jun 26, 2008, 9:17:26 AM
to sch...@gmail.com, hou...@googlegroups.com
> You could also distribute documentation as a separate download

Raven, I was thinking exactly the same thing. Will do that today. Thanks!