
MySQL 5.0, FULL-TEXT Indexing and Search Arabic Data, Unicode


jrs_...@yahoo.com
Jun 14, 2006, 8:43:28 AM
Hello All,

This post is essentially a reply to a previous post/thread
on the mailing.database.myodbc group titled:

MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

[This version has a couple of subtle edits from the original
I posted on mailing.database.myodbc - I'm cross-posting here
on this topic-related newsgroup]

I was wondering whether anybody has experienced the same
challenges I'm facing, which I'll describe shortly. Once
they are resolved, some fascinating and powerful
multilingual apps incorporating non-English/Latin character
sets can be realized by many developers.

I have a Unicode utf8 English - Arabic - Hebrew - Greek (and
several other languages) database in Microsoft Excel. I KNOW
that it is Unicode utf8 data because MySQL tells me it
recognizes the encoding as such, but not in the context I want.

Allow me to explain ...

I can search the Unicode utf8 data with no problem in
Excel. While in Excel, I highlight a complete Arabic word
or a partial string of one and copy it to the clipboard
(i.e. memory). I then do a Find, and the result is just as
successful as if it were an English string.

MySQL 5.0 is supposed to handle Unicode utf8.

I created a MySQL database I named: languages

CREATE DATABASE languages;

and I implemented the following command on a MySQL
command prompt:

ALTER DATABASE languages DEFAULT CHARACTER SET utf8;

No problem (so far): MySQL seemingly recognized utf8 and
accepted it. My understanding is that with this ALTER
command, the tables I create in languages will default to utf8.
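
(For what it's worth, my understanding is that the database
default can be double-checked with the statement below, and
that it only applies to tables created after the ALTER, not
to any tables that already exist:)

mysql>SHOW CREATE DATABASE languages;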

I then created a table named mainlang, denoting that it
will be the main table for my languages.

mysql>CREATE TABLE mainlang
->(
->langNumID varchar(30),
->colB varchar(30),
->colC varchar(30),
->primary key (langNumID, colB)
->);

Again, so far no problem: the table was successfully created.
My third column, 'colC', is where the Unicode data
will be stored.
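
(For the record, I believe the character set can also be
spelled out explicitly when the table is created - I have
not tried this variant yet, so treat it as a sketch - so
that colC is unambiguously utf8 regardless of the database
default:)

mysql>CREATE TABLE mainlang
->(
->langNumID varchar(30),
->colB varchar(30),
->colC varchar(30) CHARACTER SET utf8,
->primary key (langNumID, colB)
->) DEFAULT CHARACTER SET utf8;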

I now attempt to import the database from my
Excel file into my MySQL database as follows:

mysql>load data infile 'c:\\arabicdictionary.csv'
->into table mainlang
->fields terminated by ','
->lines terminated by '\n'
->(langNumID, colB, colC);
ERROR 1406 (22001): Data too long for 'colC' at row 1
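
(In case it matters to anyone trying to reproduce this: my
understanding is that LOAD DATA interprets the file using
the database character set, and the character set settings
in effect on my connection can be listed with:)

mysql>SHOW VARIABLES LIKE 'character_set%';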

So what to do? I did a search and found that other
people had seemingly run into the same problem, and
someone suggested:

ALTER DATABASE languages DEFAULT CHARACTER SET cp1250;

I dropped mainlang, recreated it, redid the load and,
lo and behold ... it seemed to work. No 'Data too long'
error occurred, and when I did the following query:

mysql>select langNumID, colB, colC
->from mainlang
->where colB = '4994';

I see that langNumID has a correct numeric value, colB a
correct numeric value (4994), and colC a string of
unintelligible characters with diacritical marks,
umlauts, etc., which I know is the cp1250 encoding's
interpretation of the Unicode utf8 data, which is
similarly unintelligible in its own regard.

Now what I try is this: I copy the obscure colC
cp1250 character string to the clipboard/memory
and then make the following tweak to the original
select statement to see if I can search on the
(now) cp1250 character string:

mysql>select langNumID, colB, colC
->from mainlang
->where colC = 'paste of the cp1250 character string';

The computer would not allow a paste unless I pressed
the escape key. On initiating this select command,
I got an empty set (no match).
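
(One diagnostic I can offer, in case it helps: I believe the
bytes actually stored in colC can be inspected with HEX(),
which sidesteps any clipboard or console encoding issues:)

mysql>select langNumID, colB, HEX(colC)
->from mainlang
->where colB = '4994';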

My questions are:

Has anyone been successful in creating a Unicode utf8
MySQL database that accepts Arabic?

If yes, how did you get around (or avoid encountering)
the 'Data too long' issue?

Have you tried the cp1250 (or cp1251 - same mechanics,
same results) workaround as I have? Are you
able to search the cp1250 character string (my colC)?
If yes, how did you manage to do it?

Lastly, if I take the cp1250-encoded string and paste
it into Excel ... I can string-search the cp1250
encoding with no problem.

Also, here's how I know my Unicode utf8 data is
correct, apart from my own manual cross-referencing
and it being recognized by MySQL in some respect:

When I copy the Unicode utf8 encoding and try to
paste it into the select command to see what would
happen, I get the following error:

ERROR 1257 (HY000): Illegal mix of collations
(cp1250_general_ci, IMPLICIT) and
(utf8_general_ci, COERCIBLE) for operation '='

So what I have here is a situation where MySQL
recognizes Unicode utf8 encoding, but not when it
comes to actually loading it into a table!

Go Figure ...
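
My guess - and it is only a guess - is that the collation
error could be silenced by converting the pasted utf8
literal to the column's character set, along these lines:

mysql>select langNumID, colB, colC
->from mainlang
->where colC = CONVERT(_utf8'pasted utf8 string' USING cp1250);

Though since cp1250 has no Arabic letters, I suspect the
conversion would just produce question marks and still no
match.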

Any insight or solution anyone is willing to share would
be GREATLY appreciated. I promise that if I find a
solution, I will share it.

Thank you Very Much, Shukran Jiddan, Todah Rabah,
Muchas Gracias ...

Joel S
(585) 255-0997
jrs_14618 at yahoo.com


Timothy Partridge
Jun 15, 2006, 3:54:34 PM
In article <e6qdc3$6m3$1...@reader2.panix.com>,

I don't know MySQL, but I speculate that because UTF-8 is a variable-length
encoding, specifying varchar(30) reserves 30 bytes. (I presume that your
column is 30 characters long in Excel.) If you are using the standard Unicode
block for Arabic, it needs two bytes per letter. If you are using the
compatibility characters (the ones that are preshaped) or Greek with accents,
it will take 3 bytes per character.

I suggest you try the UTF-8 character set but make colC varchar(60) or varchar(90).
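
Something along these lines, perhaps (untested, since as I
say I don't know MySQL):

ALTER TABLE mainlang MODIFY colC varchar(90) CHARACTER SET utf8;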

Tim

--
Tim Partridge. Any opinions expressed are mine only and not those of my employer
