IBM-864

Putman, Harold

Dec 13, 2001, 11:47:48 AM
to icu-ch...@www-126.southbury.usf.ibm.com
Hello,
I am a naive user trying to convert a file from UTF-8 into the Arabic OS/2 codepage 864. I built uconv.exe from icuapps and it appears to work fine. I can convert the file into codepage 1256 (Windows Arabic) and it looks OK. However, when I try to convert to ibm-864, I get just 7F everywhere an Arabic character appears.
I am running on a Windows 2000 system with ICU 1.8 installed. Am I doing something wrong?
 
Thanks for any help.
 
Harold Putman
 
Harold PUTMAN : Diebold Incorporated
              : 5995 Mayfair Road
              : North Canton, OH 44720
              : U.S.A.
              : +1(330) 490-4723
              : fax +1(330) 490-4508
              : put...@diebold.com
         
 

George Rhoten

Dec 13, 2001, 7:08:48 PM
to Putman, Harold, icu-ch...@www-126.southbury.usf.ibm.com
Since you didn't say which Unicode Arabic characters weren't converted, it
is difficult to say what went wrong.

My guess is that you are using Unicode characters that can be converted to
windows-1256, but can't be converted to ibm-864. These two codepages have
different character set mappings to/from Unicode. Many Unicode characters
cannot be represented in every codepage. If you don't want to lose any data, my
suggestion is to keep the data in Unicode.

If you really want to, you can see the details of the Unicode mappings for
these codepages in the icu/data directory. If you know what the Unicode
characters are, you can see how a given Unicode character is mapped
to and from a codepage.

George Rhoten
IBM Globalization Center of Competency/ICU Cupertino, CA, USA


Putman, Harold

Dec 14, 2001, 1:09:41 PM
to George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
Thanks for pointing me in the right direction...
After digging up the source for the character conversion tables, I discovered the problem has to do with character shaping. The conversion table has mappings for the presentation forms of the characters (U+FE7D-U+FEFC) but not for all the nominal characters (U+060C-U+064A). The document I am trying to convert contains only nominal characters.
My first reaction was to add code to shape the text as I read it in and convert the shaped characters. But the way codepage 864 is used on OS/2 (my document is an OS/2 resource file), characters are stored in their isolated form and then shaped by the operating system when they are displayed. So I don't really want to store them shaped; it would make it hard to compare the output with older versions of these documents.
So, what makes sense to me is for entries to be added to the 864 conversion table that would convert the nominal forms to isolated forms.
I can't figure out from a casual reading of Unicode Technical Report #22 (describing the mapping table) whether it is OK to have two Unicode characters map to the same legacy character. For example, for the letter ALEF I would want:
<a u="FE8D" b="C7" /> <!-- existing mapping-->
<a u="0627" b="C7" /> <!-- new mapping -->
I think the existing mapping should go in the fallback section, and the new mapping would be the default. If I were mapping from 864 to Unicode I would want C7 converted to the nominal form 0627 to get the same behavior, even though technically C7 refers to a glyph which is the ISOLATED ALEF.
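
As a rough illustration of the nominal-to-isolated idea, here is a minimal Python sketch. It does not use ICU or any 864 table; it only reads the compatibility decompositions in the Unicode Character Database through the standard `unicodedata` module (for example, U+FE8D is recorded there as `<isolated> 0627`). The names `build_isolated_map` and `to_isolated` are hypothetical and exist only for this example.

```python
import unicodedata


def build_isolated_map():
    """Map each nominal Arabic code point to its isolated presentation form."""
    table = {}
    # Scan the Arabic Presentation Forms-B block (U+FE70..U+FEFF).
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter isolated forms such as '<isolated> 0627';
        # ligatures and unassigned code points are skipped.
        if len(parts) == 2 and parts[0] == "<isolated>":
            table.setdefault(int(parts[1], 16), cp)
    return table


ISOLATED = build_isolated_map()


def to_isolated(text):
    """Replace nominal letters with isolated forms; leave everything else alone."""
    return text.translate(ISOLATED)


print(hex(ord(to_isolated("\u0627"))))  # ALEF -> 0xfe8d
```

Text pre-processed this way could then be handed to a converter whose table only knows the presentation forms, provided that table actually contains the isolated form of each letter, which is not guaranteed for every letter.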
I'm sure some people more familiar with Arabic than I am have discussed this, or maybe everyone uses codepage 1256 which doesn't have any presentation forms defined in the mapping or the codepage.

In any case, I guess I will now dig in to find out how to create a conversion table and make it work in a way that accomplishes my goal. I am willing to share the results.

Regards,

Harold Putman

Markus Scherer

Dec 14, 2001, 4:33:12 PM
to Putman, Harold, George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
Hello,

(I am in Europe, so someone else might have responded again already.)

You can make your own mapping table, starting from ibm-864.ucm. For
fallbacks in .ucm files you need |1 in the third column where roundtrip
mappings have |0.
Alternatively, you can try the u_shapeArabic() API in ICU, which Egyptian
colleagues wrote, to convert "nominal" character codes into shaped ones, or
back.

Good luck,
markus

Markus Scherer IBM GCoC-Unicode/ICU Cupertino, CA
markus....@us.ibm.com (also for SameTime)

_______________________________________________
icu-charsets mailing list
icu-ch...@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu-charsets



Putman, Harold

Dec 17, 2001, 6:05:28 PM
to jim_snyd...@us.ibm.com, Putman, Harold, George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com, Markus Scherer
Thanks for the advice. I had to modify the code in uconv.cpp to use the C-style interface in order to turn fallbacks on, and that mostly worked. It raised a conceptual question, though: what is the philosophy for converting text from Unicode into a lower-fidelity encoding? Should the mappings silently discard extra accents and the like? To use an English-speaking analogy, if there were a codepage that contained only CAPITAL letters, and you mapped Unicode text into this codepage, should 'a' be converted to 'A' or to the substitution character?

I had a problem with ARABIC LETTER ALEF WITH HAMZA BELOW <U0625>. This character is not in the IBM-864 codepage. By comparing with our previous translations I found that it had been translated as a plain ALEF. I cannot read Arabic, but I believe that 'adornments' like HAMZAs are mostly for pronunciation and that the text is still readable without them. So I added a rule to convert U+0625 into a plain ALEF.

The other problem I found was with <U064F> and <U064B>. These are combining characters used to add adornments onto the preceding letter. I fixed this by removing these characters manually from the original UTF-8 files. Is there any way in the UCM file to specify that a certain Unicode character translates to "nothing" in the target codepage? I could just filter out the substitution character, but if I know that I can just "ignore" combining characters, that should be different from encountering an untranslatable character.

Regards,

Harold Putman



-----Original Message-----
From: jim_snyd...@us.ibm.com [mailto:jim_snyd...@us.ibm.com]
Sent: Friday, December 14, 2001 3:31 PM
To: Putman, Harold
Cc: 'George Rhoten'; 'icu-ch...@oss.software.ibm.com'
Subject: RE: IBM-864


Hi Harold:

This is easier than you think. These extra mappings ('fallbacks' in ICU
nomenclature) are already there - you just have to turn them on. Have your
application call
ucnv_setFallback(cnv, 1) to turn on fallback mappings for any converters
you want this behavior for.

You can see this in the UCM file for this code page. Check
icu\data\ibm-864.ucm & notice these lines:

<U0627> \xC7 |1
...
<UFE8D> \xC7 |0

This means that both FE8D and 0627 will map to C7 (and that C7 will map
back to FE8D).
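
To make the |0 and |1 flags concrete, here is a minimal Python sketch of how a converter might treat the two lines quoted above. It is not ICU code and does not read the real ibm-864.ucm; the function names are hypothetical, only single-byte mappings are handled, any other flag values are ignored, and the substitution byte 0x7F is simply the value Harold reported seeing.

```python
import re

# Matches lines such as:  <U0627> \xC7 |1
UCM_LINE = re.compile(r"<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s+\|([0-3])")


def parse_ucm(lines):
    """Return (roundtrip, fallback) dicts mapping code point -> byte value."""
    roundtrip, fallback = {}, {}
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if not m:
            continue  # skip headers, comments, and '...' elisions
        cp, byte, flag = int(m.group(1), 16), int(m.group(2), 16), m.group(3)
        if flag == "0":
            roundtrip[cp] = byte
        elif flag == "1":
            fallback[cp] = byte
    return roundtrip, fallback


def encode(text, roundtrip, fallback, use_fallback=False, sub=0x7F):
    """Encode text; fallback mappings are consulted only when switched on."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in roundtrip:
            out.append(roundtrip[cp])
        elif use_fallback and cp in fallback:
            out.append(fallback[cp])
        else:
            out.append(sub)
    return bytes(out)


def decode(data, roundtrip):
    """Decode bytes using roundtrip mappings only; unknown bytes become U+FFFD."""
    reverse = {b: cp for cp, b in roundtrip.items()}
    return "".join(chr(reverse.get(b, 0xFFFD)) for b in data)


SAMPLE = [r"<U0627> \xC7 |1", "...", r"<UFE8D> \xC7 |0"]
rt, fb = parse_ucm(SAMPLE)
print(encode("\u0627", rt, fb))                     # b'\x7f'
print(encode("\u0627", rt, fb, use_fallback=True))  # b'\xc7'
```

With fallbacks off, the nominal ALEF becomes the substitution byte; with them on, it becomes C7, and decoding C7 always yields the roundtrip partner of that byte.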

(PS: Be sure you really want 864 - I've switched my app over to 17248 for
these cases, which is the IBM designation for 864 supplemented with a Euro
mapping.)

-Jim Snyder-Grant
Lotus Development Corp

George Rhoten

Dec 17, 2001, 7:11:37 PM
to Putman, Harold, icu-ch...@www-126.southbury.usf.ibm.com, i...@www-126.southbury.usf.ibm.com
Harold,

You have several questions that are addressed by different ICU APIs.

There is the normalization API (unorm_* C API and Normalizer C++ API),
which is best described here
http://oss.software.ibm.com/icu/userguide/normalization.html . This API
can be used to do a best-effort normalization to change decomposed
characters to composed characters or vice versa.
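
A small sketch of what normalization does to the character discussed in this thread, written with Python's standard `unicodedata` module instead of the ICU normalization API (an assumption made only to keep the example self-contained). The helper `base_letters` is a hypothetical name.

```python
import unicodedata

ALEF_HAMZA_BELOW = "\u0625"

# Canonical decomposition splits the letter into ALEF plus a combining mark.
decomposed = unicodedata.normalize("NFD", ALEF_HAMZA_BELOW)
# Canonical composition puts the two code points back together.
recomposed = unicodedata.normalize("NFC", decomposed)


def base_letters(text):
    """Decompose, then drop nonspacing marks (general category Mn)."""
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


print([hex(ord(ch)) for ch in decomposed])  # ['0x627', '0x655']
```

Note that a blanket decomposition also splits characters that the target codepage may be able to represent directly, so a step like this is probably best applied only to characters that fail to convert.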

There is the shaping API (u_shapeArabic() in ushape.h), which is best described here
http://oss.software.ibm.com/icu/apiref/ushape_8h.html . This is very
helpful for Arabic shaping, and for changing between the presentation forms of
Arabic.

There is the bidi API (ubidi_* C API), which is best described here
http://oss.software.ibm.com/icu/apiref/ubidi_8h.html and here
http://oss.software.ibm.com/icu/userguide/bidi.html . This is useful for
encodings that traditionally have a visual ordering instead of a logical
ordering.

If you want to skip invalid characters for a specific conversion, I
suggest you use the skip callbacks (UCNV_FROM_U_CALLBACK_SKIP,
UCNV_TO_U_CALLBACK_SKIP), which are best described here
http://oss.software.ibm.com/icu/userguide/codepageConverters.html and here
http://oss.software.ibm.com/icu/apiref/ucnv_8h.html . There are other
callbacks that can be used instead of these two.
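
The callback idea can be sketched outside ICU as well. The Python example below is an analogy, not the ICU callback mechanism: it registers a custom encode-error handler that silently drops combining marks but still substitutes a visible '?' for any other unconvertible character, which is the distinction Harold asked about. The handler name "drop-marks" and the function names are hypothetical.

```python
import codecs
import unicodedata


def drop_marks_or_substitute(exc):
    """Encode-error handler: drop nonspacing marks, substitute '?' otherwise."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(
        "" if unicodedata.category(ch) == "Mn" else "?" for ch in bad
    )
    # Resume encoding right after the unconvertible run.
    return replacement, exc.end


# The handler name is arbitrary; encode() refers to it through errors=...
codecs.register_error("drop-marks", drop_marks_or_substitute)


def encode_lossy(text, encoding):
    """Encode text, ignoring combining marks the target cannot hold."""
    return text.encode(encoding, errors="drop-marks")


print(encode_lossy("a\u0301b\u4e2d", "ascii"))  # b'ab?'
```

The ASCII codec is used in the demonstration only because its behavior is easy to verify; the same handler can be passed to any other encoder.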

More information about codepage conversion fallbacks is available here
http://oss.software.ibm.com/icu/userguide/conversion-data.html (towards
the middle of the page).

If you wish to do a more specific "transliteration" to remove or change
certain Unicode characters before a conversion, I suggest you take a look
at the transliterator API. We don't recommend that you change the UCM
files for simple problems like these, since your changes have a good
chance to make the converter do strange and incompatible conversions. The
transliterator API (utrans_* C API and Transliterator C++ API), is best
described here
http://oss.software.ibm.com/icu/userguide/Transliteration.html and here
http://oss.software.ibm.com/icu/apiref/classTransliterator.html .
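
As a sketch of the "change the text before conversion" approach, the Python fragment below applies the three substitutions mentioned earlier in the thread with a plain translation table. It stands in for a transliterator rule set and is not ICU code; `PREFILTER` and `prefilter` are hypothetical names.

```python
# Characters to rewrite or remove before handing text to the converter.
PREFILTER = str.maketrans({
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> plain ALEF
    "\u064B": None,      # FATHATAN: remove
    "\u064F": None,      # DAMMA: remove
})


def prefilter(text):
    """Apply the substitutions above; all other characters pass through."""
    return text.translate(PREFILTER)


cleaned = prefilter("\u0625\u0628\u064F\u064B")
print([hex(ord(ch)) for ch in cleaned])  # ['0x627', '0x628']
```

Keeping these rules in a separate pre-processing step leaves the conversion table untouched, so the converter stays compatible with everyone else's ibm-864 data.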

Overall the User's Guide
(http://oss.software.ibm.com/icu/userguide/index.html) and the API
reference (http://oss.software.ibm.com/icu/apiref/index.html) have good
documentation on these topics. After saying that, I would also like to
mention that there is always room for improvement. So if there are some
problems that you find in the documentation, feel free to send us an
e-mail about how to improve the documentation.

We also find that the Unicode site (www.unicode.org) is a very helpful source
of information about Unicode manipulation in general.

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA
