Harold,
You have several questions that are addressed by different ICU APIs.
There is the normalization API (unorm_* C API and Normalizer C++ API),
which is best described here
http://oss.software.ibm.com/icu/userguide/normalization.html . This API
performs best-effort normalization, converting decomposed characters to
composed characters or vice versa.
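To make the composed/decomposed distinction concrete, here is a minimal sketch. It deliberately uses Python's standard unicodedata module rather than ICU, so the function names compose and decompose are illustrative only and are not ICU calls. The normalization forms themselves (NFC, NFD) are the standard Unicode ones, and an ICU-based program would request the same forms through the normalization API.

```python
import unicodedata


def compose(text: str) -> str:
    """Fold base + combining sequences into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


def decompose(text: str) -> str:
    """Split precomposed characters into base + combining marks (NFD)."""
    return unicodedata.normalize("NFD", text)


# 'e' followed by COMBINING ACUTE ACCENT (U+0301): two code points.
decomposed_e = "e\u0301"
# LATIN SMALL LETTER E WITH ACUTE (U+00E9): one code point.
composed_e = "\u00e9"

# "Best effort": Unicode has no precomposed 'q' with acute accent,
# so composing this sequence leaves it as two code points.
uncomposable = "q\u0301"
```

Composition is "best effort" in the sense that a sequence with no precomposed equivalent is simply left as it is.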
There is the shaping API (ushape_* C API), which is best described here
http://oss.software.ibm.com/icu/apiref/ushape_8h.html . This is very
helpful for Arabic shaping and for converting between the presentation
forms of Arabic.
There is the bidi API (ubidi_* C API), which is best described here
http://oss.software.ibm.com/icu/apiref/ubidi_8h.html and here
http://oss.software.ibm.com/icu/userguide/bidi.html . This is useful for
encodings that traditionally have a visual ordering instead of a logical
ordering.
If you want to skip invalid characters for a specific conversion, I
suggest you use the skip callbacks (UCNV_FROM_U_CALLBACK_SKIP,
UCNV_TO_U_CALLBACK_SKIP), which are best described here
http://oss.software.ibm.com/icu/userguide/codepageConverters.html and here
http://oss.software.ibm.com/icu/apiref/ucnv_8h.html . There are other
callbacks that can be used instead of these two.
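The effect of a skip callback can be illustrated without ICU. The sketch below uses Python's built-in codecs, where errors="ignore" drops characters or bytes that cannot be converted. This is only an analogy for what UCNV_FROM_U_CALLBACK_SKIP and UCNV_TO_U_CALLBACK_SKIP do, not ICU code, and the helper names are hypothetical.

```python
def to_latin1_skipping(text: str) -> bytes:
    """Unicode -> ISO-8859-1, silently dropping unmappable characters.

    Roughly analogous to installing a "from Unicode" skip callback.
    """
    return text.encode("iso-8859-1", errors="ignore")


def from_ascii_skipping(data: bytes) -> str:
    """ASCII bytes -> Unicode, silently dropping invalid bytes.

    Roughly analogous to installing a "to Unicode" skip callback.
    """
    return data.decode("ascii", errors="ignore")


# U+03A9 (GREEK CAPITAL LETTER OMEGA) has no ISO-8859-1 mapping and is skipped;
# U+00E9 does have one and is converted normally.
sample_text = "caf\u00e9 \u03a9!"

# 0xFF is not a valid ASCII byte and is skipped.
sample_bytes = b"ab\xffc"
```

Skipping loses data without any warning, so it fits best when the dropped characters are known to be unimportant for the target encoding.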
More information about codepage conversion fallbacks is available here
http://oss.software.ibm.com/icu/userguide/conversion-data.html (towards
the middle of the page).
If you wish to do a more specific "transliteration" to remove or change
certain Unicode characters before a conversion, I suggest you take a look
at the transliterator API. We don't recommend changing the UCM
files for simple problems like these, since such changes are likely
to make the converter do strange and incompatible conversions. The
transliterator API (utrans_* C API and Transliterator C++ API) is best
described here
http://oss.software.ibm.com/icu/userguide/Transliteration.html and here
http://oss.software.ibm.com/icu/apiref/classTransliterator.html .
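The general idea of cleaning up text before conversion, instead of editing the converter's mapping table, can be sketched as follows. This uses Python's str.translate as a stand-in and is not the ICU transliterator API. The mapping table and the function name prepare_for_conversion are hypothetical and chosen only for illustration.

```python
# Hypothetical pre-conversion rules: change some characters, remove others.
REPLACEMENTS = str.maketrans({
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK  -> apostrophe
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK -> apostrophe
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK  -> straight quote
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK -> straight quote
    "\u2014": "-",    # EM DASH                     -> hyphen-minus
    "\u200b": None,   # ZERO WIDTH SPACE            -> removed
})


def prepare_for_conversion(text: str) -> str:
    """Apply the replacement rules so a later strict conversion can succeed."""
    return text.translate(REPLACEMENTS)


raw = "\u201cHi\u201d\u200b \u2014 ok"
cleaned = prepare_for_conversion(raw)

# The converter itself stays untouched; a strict ASCII encode now works.
encoded = cleaned.encode("ascii")
```

Keeping the rules outside the converter means the conversion table stays compatible with everyone else's, which is the concern raised above about editing UCM files.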
Overall, the User's Guide
(http://oss.software.ibm.com/icu/userguide/index.html) and the API
reference (http://oss.software.ibm.com/icu/apiref/index.html) have good
documentation on these topics. That said, there is always room for
improvement, so if you find problems in the documentation, feel free to
send us an e-mail about how to improve it.
We also find that the Unicode site (www.unicode.org) is a very helpful
source of information about Unicode manipulation in general.
George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA
To: Jim Snyder-Grant/CAM/Lotus@LOTUS, "Putman, Harold" <Put...@diebold.com>
cc: George Rhoten/Cupertino/IBM@IBMUS, "'icu-ch...@oss.software.ibm.com'"
<icu-ch...@www-126.southbury.usf.ibm.com>, Markus Scherer/Cupertino/IBM@IBMUS