Support of foreign languages with htmlparser in lucene????

Code Master

Sep 27, 2002, 5:35:02 AM

Hi,

I want to include the support for Danish in the HTMLParser of Lucene.
Workflow:
1) In the HTMLParser.jj I have added this to a token: < #LET:
["A"-"Z","a"-"z","0"-"9","æ","å","Ø","ø","Å","Æ"] >

2) In the StandardTokenizer.jj I have added "\u0080"-"\u00ff",
"\u0100"-"\u017f", "\u0180"-"\u024f", "\u00c0"-"\u00d6" to the
#LETTER: tag

When I search (after compiling and indexing) for words containing special
characters, the search engine can't find them. For example: when
I search for "civilingeniør", I get no results. When I write out
the comment string of a result (after searching for another word
nearby), I see this in my browser: "civilingeniÃ¸r". So somehow the
parser is mapping the characters wrongly. What am I doing wrong?

Thx..

Arnaud Clère

Oct 21, 2002, 7:56:01 AM

Wow, I guess there are a lot of places where it can go wrong... I can just
mention a few of them:
- Check that your .jj file is saved with the right encoding (utf-8?). If
it's not, your editor could display ø correctly, but JavaCC may read
something totally different! (You could use a hex editor to check that, or
look at the generated lexical analyzer.) Another way to ensure that would be
to use unicode escapes in HTMLParser.jj as you did in
StandardTokenizer.jj.
- Use the UNICODE_INPUT option and create or ReInit the Parser with a Reader
using the appropriate encoding (Cp277 for Denmark?) instead of an
InputStream.
- Check that i18n.jar is in your bootclasspath (it should be if you don't
specify otherwise), so that the reader can correctly encode/decode
characters with the specified encoding.
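
To illustrate the Reader point: the "Ã¸" you see is consistent with the two
UTF-8 bytes of ø being decoded as ISO-8859-1, which suggests a charset
mismatch somewhere in the chain. Below is a minimal sketch of the difference
an explicit Reader encoding makes. The class and method names (EncodingCheck,
decode) are made up for illustration, the charset names are only assumptions
about your setup, and handing the Reader to the parser is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class, not part of Lucene or JavaCC.
class EncodingCheck {

    // Reads the whole stream, decoding its bytes with the named charset.
    public static String decode(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // \u00f8 is the Unicode escape for ø, so this source file is
        // safe whatever encoding it is saved in.
        String word = "civilingeni\u00f8r";

        // Assumption: the HTML file on disk is UTF-8 encoded.
        byte[] utf8Bytes = word.getBytes("UTF-8");

        // Mismatched charset: the bytes 0xC3 0xB8 of ø are decoded as
        // two separate characters.
        String garbled = decode(new ByteArrayInputStream(utf8Bytes), "ISO-8859-1");

        // Matching charset: the word survives the round trip.
        String intact = decode(new ByteArrayInputStream(utf8Bytes), "UTF-8");

        System.out.println("garbled equals original: " + garbled.equals(word));
        System.out.println("intact equals original:  " + intact.equals(word));
    }
}
```

If the charset passed to the InputStreamReader matches the file, the same
Reader could then be given to the parser in place of the raw InputStream.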

Hope it helps

Arnaud

"Code Master" <gal...@x-cago.com> wrote in message
news:be49b900.02092...@posting.google.com...