Indexing Arabic documents


amirsadig

Aug 2, 2008, 9:02:14 AM
to inverted-index-discuss
First, I want to thank you for this implementation.

I have a UTF-8 database. I created your test tables (the article table),
added the procedures and functions, and loaded the data. The index
works very well, without any problems.

The documents I actually want to index are in Arabic,
which is why I use UTF-8 encoding.

But the indexing doesn't work on them, because the 'nextWord' procedure
does not seem to return the correct words.

It finds only a small portion of the text.
My guess is that it can't find the spaces between words.

Do you have any idea how I can fix this?


thanks

Tom Gidden

Aug 2, 2008, 9:46:19 AM
to inverted-in...@googlegroups.com
Hi,

I must admit I know nothing about the way Arabic is stored in UTF8 or
handled in MySQL, so all I can suggest is modifying the two REGEXPs in
sp_nextWord.sql.

You'll need to find a suitable replacement for [[:alnum:]] to match a
"word character", in such a way that a word is a string of only word
characters, i.e. not whitespace, control characters or punctuation.

http://dev.mysql.com/doc/refman/5.1/en/regexp.html

You'll probably need to change sp_sanitizeWord.sql as well.
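
One way to try this without touching the procedures is to test candidate character classes directly in a query. The sketch below is only an assumption about what a replacement might look like: it defines a word character by exclusion (anything that is not whitespace, punctuation or a control character) instead of by [[:alnum:]]. The MySQL 5.1 manual warns that REGEXP works byte-wise and is not multibyte safe, so results on UTF-8 text may differ between server versions and should be checked on your own server. The sample word is a hex literal so that it does not depend on the connection character set.

```sql
-- Hypothetical test word: the Arabic word for "book", given as UTF-8 bytes.
SET @word = _utf8 X'D983D8AAD8A7D8A8';

-- Compare the class currently used with a class defined by exclusion.
SELECT
  @word REGEXP '^[[:alnum:]]+$'                      AS matches_alnum,
  @word REGEXP '^[^[:space:][:punct:][:cntrl:]]+$'   AS matches_by_exclusion;
```

If the second column is 1 where the first is 0, the exclusion class is a candidate for both REGEXPs.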

Failing that, MySQL have a forum on Unicode and UTF8, so you might
want to check there:

http://forums.mysql.com/list.php?103

Please let us know how you get on!

Regards,

--
Tom Gidden
http://gidden.net/tom

Stig Palmquist

Aug 2, 2008, 10:05:06 AM
to inverted-in...@googlegroups.com
Dear Amir

Please also feel free to send us any patches :-)

Best,
Stig.


--
Stig Palmquist <sti...@gmail.com> [pgp:0x01D1D208]

amirsadig

Aug 2, 2008, 10:49:01 AM
to inverted-index-discuss
Hi Tom and Stig,
Thanks for the quick answer. I will experiment with the REGEXPs and
post a solution here as soon as I find one.

-Amir

amirsadig

Aug 2, 2008, 1:03:38 PM
to inverted-index-discuss

When I execute the same statement in phpMyAdmin, with both variables
replaced by the values above, like

SELECT SUBSTRING(_sentence, _incr);

it returns the correct substring.


When I declare a variable of type TEXT, say "_temp", and change
the statement to the following

SET _temp = SUBSTRING(_sentence, _incr);
SET _sentence = _temp;

then it works like a charm.
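
For reference, the workaround can be tried in isolation. The procedure below is a hypothetical stand-in, not the actual sp_nextWord code: the name demo_advance, its parameters and the utf8 declarations are assumptions made only for this sketch. DELIMITER is a command of the mysql client; other tools set the delimiter differently.

```sql
DELIMITER //

-- Hypothetical helper: drops everything before character position _incr.
CREATE PROCEDURE demo_advance(
  INOUT _sentence TEXT CHARACTER SET utf8,
  IN _incr INT
)
BEGIN
  DECLARE _temp TEXT CHARACTER SET utf8;

  -- Assign via a temporary instead of writing the result of SUBSTRING
  -- straight back into its own argument.
  SET _temp = SUBSTRING(_sentence, _incr);
  SET _sentence = _temp;
END //

DELIMITER ;

-- Usage: start from the third character of a four-character Arabic word.
SET @s = _utf8 X'D983D8AAD8A7D8A8';
CALL demo_advance(@s, 3);
SELECT @s, CHAR_LENGTH(@s);
```

Comparing this against a version that assigns SUBSTRING directly to _sentence would show whether the temporary variable makes a difference on a given server.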


I don't know why the original code works with English text but not
with Arabic!

amirsadig

Aug 2, 2008, 1:04:34 PM
to inverted-index-discuss
I have found a solution, but I don't fully understand it; I arrived at
it just by testing.
To track the problem down, I added a debug table that the nextWord
procedure inserts text into. That showed me, first, that this line
returns the wrong length:

SET _len = LENGTH(_sentence);

LENGTH returns the number of bytes, not characters, so I used
CHAR_LENGTH instead, which returns the correct length.
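
The difference is easy to reproduce outside the procedure. A minimal sketch, assuming a four-letter Arabic word written as a UTF-8 hex literal (each of these letters takes two bytes in UTF-8):

```sql
-- The Arabic word for "book": 4 characters, 8 bytes in UTF-8.
SET @word = _utf8 X'D983D8AAD8A7D8A8';

SELECT
  LENGTH(@word)      AS byte_length,   -- 8
  CHAR_LENGTH(@word) AS char_length;   -- 4
```

For plain ASCII text the two functions return the same number, which would explain why the original code works with English. SUBSTRING counts positions in characters, so a loop bounded by LENGTH can run past the end of a multibyte string.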

Then everything goes well until the last line:

SET _sentence = SUBSTRING(_sentence, _incr);

Here SUBSTRING always returns a string of length 2 whose contents are
not valid characters (I don't know what they are).