1A - 1C - 7F


Joseph Boyle

Jun 2, 2003, 6:23:14 PM
to icu-ch...@www-126.southbury.usf.ibm.com

IBM 943 has these mappings in the control range:

<subchar>                     \x1A
<U001A> \x7F |0
<U001C> \x1A |0
<U007F> \x1C |0

I've been asked whether these should really be included in the Siebel conversion table for the IBM-style Japanese codepage, since none of the other ASCII-based charsets we use do anything with the 00 - 7F range except map each byte to the same value, or possibly to a substitution character.

I'd be interested in hearing the story behind these mappings. Thanks, Joseph


Markus Scherer

Jun 2, 2003, 8:03:06 PM
to Joseph Boyle, icu-ch...@www-126.southbury.usf.ibm.com
IBM conversion tables usually do this if they convert between a "DOS" (or
Windows or OS/2) codepage and one for a different platform.

The idea is that ISO defines 0x1A to be the SUB character, but on
DOS/Windows/OS/2 systems, 0x1A==^Z is used in text files as "end of file".
Therefore, IBM decided for DOS codepages to "rotate" several control
codes. Instead of a simple swap of two codes, three codes were rotated.

ISO    semantics    IBM DOS
1A     SUB          7F
1C     IS4          1A
7F     DEL          1C

I am actually not aware of any software that _interprets_ byte values this
way, except for 1A="end of file" and except for IBM conversion tables. I
have not seen DEL=1C used in any actual code...
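
The rotation in the table above can be written out as a small lookup. This is only a sketch of the three-code permutation for the 00-7F range, not how ICU or any IBM table is implemented; the names `VSUB_ROTATION`, `to_dos_byte`, and `from_dos_byte` are made up for illustration.

```python
# Sketch of the three-way control-code rotation (hypothetical names).
# Keys are Unicode code points (ISO semantics); values are the bytes
# used by the IBM DOS-style table.
VSUB_ROTATION = {
    0x1A: 0x7F,  # SUB -> 7F
    0x1C: 0x1A,  # IS4 -> 1A
    0x7F: 0x1C,  # DEL -> 1C
}

# Inverse mapping for the bytes-to-Unicode direction.
VSUB_ROTATION_INVERSE = {byte: cp for cp, byte in VSUB_ROTATION.items()}


def to_dos_byte(code_point: int) -> int:
    """Map a code point in 00-7F to its byte, applying the rotation."""
    if not 0 <= code_point <= 0x7F:
        raise ValueError("this sketch only covers the 00-7F range")
    return VSUB_ROTATION.get(code_point, code_point)


def from_dos_byte(byte: int) -> int:
    """Map a byte in 00-7F back to its code point, undoing the rotation."""
    if not 0 <= byte <= 0x7F:
        raise ValueError("this sketch only covers the 00-7F range")
    return VSUB_ROTATION_INVERSE.get(byte, byte)
```

Because three codes are rotated rather than two swapped, the mapping is still a permutation of 00-7F, so every value round-trips.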

(Please don't kill the messenger!)

The <subchar> should therefore be 7F, not 1A, if you go by the book.
Note the line
<icu:alias> "ibm-943_VASCII_VSUB_VPUA"

in ibm-943_P130-1999.ucm - the VSUB part indicates the control code
rotation. (This is present in files generated by one of our tools.)

Note that some conversion tables do map 00-7F differently from ASCII,
especially 0x5C to the Yen character. This is common for East Asian
codepages and used to be common for European countries in the 70s/80s
before 8-bit ISO 8859 codepages became prevalent. In such cases (as for
CCSID 943), there are often multiple IBM conversion tables between Unicode
and the CCSID, such that one of them maps ASCII "as itself" and the other
one maps as the national standard (JIS) defines. The IBM CCSID alone will
not tell you this difference.
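
To make the 0x5C point concrete, here is a toy sketch of two tables for the same codepage that differ only in how they treat that one byte. The variant names and the `decode_low_range` helper are hypothetical, and only the 00-7F range is modeled; real tables are identified by their full names, not by the CCSID alone.

```python
# Two hypothetical variants of "the same" codepage, differing only at 0x5C.
ASCII_VARIANT = {}            # no overrides: 0x5C -> U+005C REVERSE SOLIDUS
JIS_VARIANT = {0x5C: 0x00A5}  # 0x5C -> U+00A5 YEN SIGN


def decode_low_range(data: bytes, overrides: dict) -> str:
    """Decode bytes in 00-7F, applying the per-variant overrides."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("this sketch only covers the 00-7F range")
        chars.append(chr(overrides.get(byte, byte)))
    return "".join(chars)


path = b"C:\\temp"
as_ascii = decode_low_range(path, ASCII_VARIANT)  # backslash preserved
as_jis = decode_low_range(path, JIS_VARIANT)      # backslash becomes a yen sign
```

The same bytes yield two different Unicode strings, which is why the choice of table has to be made explicitly.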

Best regards,
markus

Markus Scherer マルクス IBM GCoC-Unicode/ICU San José, CA
markus....@us.ibm.com


Joseph Boyle

Jun 3, 2003, 12:02:16 AM
to Markus Scherer, icu-ch...@www-126.southbury.usf.ibm.com
Checking convrtrs.txt it looks like:
U+001A to 7F: CJK Windows, all DOS ("PC") codepages
U+001A to 1A: non-CJK Windows and ISO, and Apple codepages
U+001A to 3F: EBCDIC, some "Host" codepages, and some IBM codepages with no other info that I don't recognize
Of course in EBCDIC almost everything moves anyway.
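
Two of these three cases can be spot-checked with Python's bundled codecs. Those codecs are not ICU's tables, so this is only an illustration under that assumption; as far as I know the standard library has no table with the DOS-style rotation, so the 7F case is not shown. The helper name `sub_byte` is made up.

```python
# Where does U+001A land in a few of Python's bundled single-byte codecs?
# Assumption: these codecs stand in for the ISO/Windows and EBCDIC families;
# they are not the ICU or IBM tables discussed in this thread.
SAMPLE_CODECS = ["latin-1", "cp1252", "cp037", "cp500"]


def sub_byte(codec_name: str) -> int:
    """Return the byte that U+001A encodes to in the given codec."""
    return "\x1a".encode(codec_name)[0]


for name in SAMPLE_CODECS:
    print(f"{name}: U+001A -> {sub_byte(name):02X}")
```

The ISO and Windows codecs should report 1A, and the two EBCDIC ones 3F.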

From our point of view the anomaly is that on CJK Windows codepages it suddenly appears.

A more general point is that while the European and Middle Eastern codepages have a clear division among Windows, DOS, and ISO, the CJK codepages do not. They are usually ostensibly the same national standard, but with individual differences that are not systematic. (The 1A - 7F is a DOS-Windows difference though, so probably IBM's CJK pages started as DOS and separate ones were not issued for Windows.)

We find ourselves answering a lot of CJK compatibility questions, especially for Shift-JIS, and are now going to resort to providing conversions for different vendors' versions of the same codepage. ICU was not sufficient for this out of the box, mainly because of the lack of the Oracle version.

Oracle has had to face similar questions from its users (sometimes also our users) and has issued the JA16SJISTILDE charset in response. (They already had a JA16SJISYEN.)

I think ICU could be a universal solution to charset conversion by:
- getting all the major vendors to allow publication of their tables in ICU. For enterprise apps this has to include the database vendors. Although the vendors could submit tables, it looks like you are moving towards getting them out of the platform programmatically, which is more reliable. Oracle tables can be exported from Oracle Locale Builder, although I haven't been able to eliminate one manual copying step.
- continuing to publish tools for working with conversion tables
- paying attention to minor issues like the 1A. We've had other problems with 1A when it is created as a subchar in DB2 conversion, then is rejected by XML processors that don't regard any controls but CR LF TAB as legal characters. 1A might be an IBM issue or a DB2 team issue instead of primarily an ICU issue, not sure how you divide them up.
- handling minor variations with less space cost by allowing a conversion to be defined by a main table plus a small supplementary table
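
The XML problem mentioned in the list above can be reproduced with a short check. The sketch below tests characters against the XML 1.0 `Char` production and shows that Python's expat-based parser rejects a literal U+001A. The helper names are hypothetical, and replacing illegal characters with U+FFFD is just one possible policy, not something any of the products mentioned here is known to do.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def replace_illegal_xml_chars(text: str, replacement: str = "\uFFFD") -> str:
    """Replace characters that XML 1.0 forbids (one possible policy)."""
    return "".join(ch if is_xml10_char(ch) else replacement for ch in text)


def parses_as_xml(text: str) -> bool:
    """Wrap plain text (no markup characters) in an element and try to parse it."""
    try:
        ET.fromstring(f"<v>{text}</v>")
    except ET.ParseError:
        return False
    return True


converted = "price\x1a100"  # pretend a conversion emitted SUB for an unmappable character
ok_before = parses_as_xml(converted)
ok_after = parses_as_xml(replace_illegal_xml_chars(converted))
```

Here `ok_before` is false and `ok_after` is true, so a SUB produced during conversion has to be dealt with before the text reaches an XML processor.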

Markus Scherer

Jun 3, 2003, 12:05:28 PM
to Joseph Boyle, icu-ch...@www-126.southbury.usf.ibm.com, Mark Davis, Helena S Chapman, George Rhoten
Joseph, welcome to the nightmarish world of codepages...

First, a few short answers:

- I have limited knowledge about the ins and outs of why existing
conversion tables are the way they are.
- We are planning to collect and publish more vendor tables, but the
collection and review process takes a lot of time; we need to balance it
with more forward-looking I18N/Unicode functionality.
- At least, ICU avoids creating new conversion tables where possible [we
had to create a few for stateful encodings like ISO 2022 where we could
not find any reliable-looking ones].
- We already recommend using the best-matching conversion tables,
and are trying to educate our users about the many variants.
- SUB characters are easy to change via the ICU API.
- The data structure for ICU conversion tables is a compromise between
size and speed. With the current data structure, it is expensive to add
more tables.
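
The size-versus-speed trade-off can be pictured with the "main table plus small supplementary table" idea raised earlier in this thread. The sketch below is not ICU's data structure; it is a toy model using plain dicts, with a hypothetical identity table for 00-7F and a three-entry delta, to show that a variant can share the main table at the cost of an extra lookup step.

```python
from collections import ChainMap

# Toy "main" table: identity mapping for 00-7F (hypothetical stand-in
# for a full conversion table).
MAIN_TABLE = {cp: cp for cp in range(0x80)}

# Small supplementary table holding only the entries that differ.
VSUB_DELTA = {0x1A: 0x7F, 0x1C: 0x1A, 0x7F: 0x1C}


def make_variant(main: dict, delta: dict) -> ChainMap:
    """Build a variant view: delta entries win, everything else falls through."""
    return ChainMap(delta, main)


variant = make_variant(MAIN_TABLE, VSUB_DELTA)
rotated = variant[0x1A]    # taken from the delta
unchanged = variant[0x41]  # falls through to the main table
```

The variant stores only three extra entries and leaves the main table untouched, but each lookup may probe two tables instead of one.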

Next, a very personal comment:

A few years ago, when I redesigned some of the ICU conversion data
structures and code, I was hoping that the use of legacy codepages would
go down over time, and eventually we would be able to reduce the number of
conversion tables.
This was naive. I did not realize that with more data exchange with legacy
systems and legacy data sets, it would become more important to match
conversion behavior, leading to more and more variant tables.

Finally, an invitation:

You share a lot of concerns and user requirements with ICU. I suggest that
we get together and discuss the issues and possible ways to deal with
them.
What do you think?

markus

Markus Scherer マルクス IBM GCoC-Unicode/ICU San José, CA
markus....@us.ibm.com