Korean charset coverage

Kurosaka, Teruhiko

May 2, 2003, 12:31:09 PM
to icu-ch...@www-126.southbury.usf.ibm.com, Kurosaka, Teruhiko
We are trying to implement the Korean code sets listed in the OSF Code Set Registry
ftp://ftp.opengroup.org/pub/code_set_registry/code_set_registry1.2g.txt
using ICU4C, but many of them do not appear in source/data/mappings/convrtrs.txt.

The table below compares the code set registries and implementations:

OSF         IANA     Java     ICU      Notes
0x00040001  EUC-KR?  EUC-KR?  ibm-970  KS C5601:1987; questionable mapping
0x00040002  -        -        -        KS C5657:1991
0x0004000a  EUC-KR   EUC-KR   EUC-KR   Via link to 0x00040001 (KS C5601), which is questionable
0x10020341  Cp833    Cp833#   -        Korean Host Extended SBCS
0x10020342  Cp834    Cp834#   -        Korean Host DBCS incl 1227 UDC
0x1002037b  Cp891    Cp891#   -        Korean PC Data SBCS
0x1002039e  Cp926    Cp926#   -        Korean PC Data DBCS incl 1880 UDC
0x100203a5  -        -        ibm-933  Korean Host Extended SBCS (IBM CP933 = CP833 + CP834, according to comments in OSF registry)
0x100203a6  -        -        -        Korean PC Data Mixed (IBM CP934, subset of IBM CP944)
0x100203b5  cp949    cp949    cp949    IBM KS PC Data Mixed
0x100203b7  cp951    cp951#   -        IBM KS PC Data DBCS incl 1880 UDC
0x100203ca  cp970    cp970    ibm-970  Korean EUC
0x10020410  cp1040   cp1040#  -        Korean PC Data Extended SBCS
0x10020440  cp1088   cp1088#  -        IBM KS Code PC Data SBCS
0x100213cb  cp5067   cp5067#  -        Korean Hangul and Hanja; superset of KS C5601:1987
0x10022341  cp833    cp833#   -        Korean Host SBCS
0x10022342  cp834    cp834#   -        Korean Host DBCS incl 1880 UDC
0x100223a5  cp9125   cp9125#  -        Korean Host Mixed incl 1880 UDC (CP833 + CP834)
Key: # = the JDK 1.4.1 implementation has them, but its list of supported encodings does not include them.
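
In the table above, the low 16 bits of each 0x1002xxxx ID, read as a decimal number, match the IBM code page number in nearly every row (0x03a5 is 933, 0x03ca is 970). The sketch below uses that pattern to produce candidate "ibm-NNN" names to search for. It rests on two assumptions taken only from the table, not from the registry text: that the 0x1002 prefix marks the IBM rows, and that the low half is the code page number. The helper name ccsid_from_osf is made up for this illustration.

```python
from typing import Optional

IBM_PREFIX = 0x1002  # assumption: upper 16 bits shared by every IBM row in the table


def ccsid_from_osf(osf_id: int) -> Optional[int]:
    """Return the low 16 bits as a decimal number for 0x1002xxxx IDs, else None."""
    if osf_id >> 16 != IBM_PREFIX:
        return None
    return osf_id & 0xFFFF


# A few (OSF ID, name listed in the table) pairs copied from the table.
ROWS = [
    (0x00040001, "EUC-KR?"),
    (0x100203a5, "-"),
    (0x100203ca, "cp970"),
    (0x100213cb, "cp5067"),
    (0x10022341, "cp833"),
    (0x100223a5, "cp9125"),
]

for osf_id, listed in ROWS:
    number = ccsid_from_osf(osf_id)
    candidate = f"ibm-{number}" if number is not None else "(not a 0x1002 row)"
    print(f"0x{osf_id:08x}  listed={listed:8}  candidate={candidate}")
```

By this arithmetic 0x10022341 and 0x10022342 come out as 9025 and 9026, while the table lists cp833 and cp834 for them, so those two rows deserve a second look against the registry.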

At least the following OSF code sets seem to be missing:

0x00040002 (KS C5657:1991)
0x10020341 (Cp833; Korean Host Extended SBCS)
0x10020342 (Cp834; Korean Host DBCS incl 1227 UDC)
0x1002037b (Cp891; Korean PC Data SBCS)
0x1002039e (Cp926; Korean PC Data DBCS incl 1880 UDC)
0x100203a6 (Korean PC Data Mixed (IBM CP934, subset of IBM CP944))
0x100203b7 (cp951; IBM KS PC Data DBCS incl 1880 UDC)
0x10020410 (cp1040; Korean PC Data Extended SBCS)
0x10020440 (cp1088; IBM KS Code PC Data SBCS)
0x100213cb (cp5067; Korean Hangul and Hanja; superset of KS C5601:1987)
0x10022341 (cp833; Korean Host SBCS)
0x10022342 (cp834; Korean Host DBCS incl 1880 UDC)
0x100223a5 (cp9125; Korean Host Mixed incl 1880 UDC (CP833 + CP834))

Could you confirm that these are indeed missing, or whether some of them are
supported under alternative names?
Thank you in advance.

T. "Kuro" Kurosaka
Internationalization Architect
teruhiko...@iona.com
-------------------------------------------------------
IONA Technologies
2350 Mission College Blvd. Suite 650
Santa Clara, CA 95054
Tel: (408) 350 9684/9500
Fax: (408) 350 9501
-------------------------------------------------------
Making Software Work Together TM

George Rhoten

May 2, 2003, 1:26:10 PM
to Kurosaka, Teruhiko, icu-ch...@www-126.southbury.usf.ibm.com
Please be careful with the cp aliases and names. Any alias that starts
with "cp" is ambiguous because it can also refer to a Windows code page. If you
mean IBM, please say ibm. IBM and Microsoft have been known to have different
Unicode mapping tables for various charsets.
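
One way to see which vendor a "cp" name points to on a given platform is to list the other aliases that platform files under the same converter. The sketch below does this with Python's standard library, which is not part of this thread and is used only as a neutral illustration; the helper name aliases_of is made up. Python registers cp949 alongside ms949 and uhc, that is, as the Microsoft table.

```python
import codecs
from encodings.aliases import aliases


def aliases_of(name: str) -> set:
    """Return every alias Python's alias table maps to the same codec as `name`."""
    canonical = codecs.lookup(name).name
    return {alias for alias, target in aliases.items() if target == canonical}


# Show what else this platform files under the bare "cp949" name.
print(codecs.lookup("cp949").name, sorted(aliases_of("cp949")))
```

The same question on an ICU build would be answered by the alias table in convrtrs.txt rather than by this code.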

You should know that many of those SBCS and DBCS code pages are combined to make
MBCS Coded Character Sets. I'm not sure why OSF has registered these CCSIDs.
They are usually not used by themselves, and they are usually not portable.
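
To illustrate how an SBCS and a DBCS are combined in a host mixed code page, here is a minimal sketch. It assumes the usual EBCDIC convention, in which the shift-out byte 0x0E switches to the double-byte set and the shift-in byte 0x0F switches back. The function name and the sample bytes are hypothetical, and no real mapping table is consulted.

```python
SO = 0x0E  # shift-out: following bytes belong to the double-byte set
SI = 0x0F  # shift-in: following bytes belong to the single-byte set


def split_mixed(data: bytes) -> list:
    """Split a stateful mixed stream into ('sbcs', bytes) and ('dbcs', bytes) runs."""
    runs, mode, current = [], "sbcs", bytearray()
    for byte in data:
        if (mode, byte) in (("sbcs", SO), ("dbcs", SI)):
            # A shift byte ends the current run and flips the mode.
            if current:
                runs.append((mode, bytes(current)))
                current = bytearray()
            mode = "dbcs" if byte == SO else "sbcs"
        else:
            current.append(byte)
    if current:
        runs.append((mode, bytes(current)))
    return runs


# Hypothetical sample: one single-byte char, one double-byte char, one single-byte char.
sample = bytes([0xC1, SO, 0x88, 0x41, SI, 0xC2])
print(split_mixed(sample))
```

Each "sbcs" run would then go through the single-byte table and each "dbcs" run through the double-byte table, which is why the component code pages are rarely useful on their own.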

If you would like to add mapping tables to your copy of ICU, you can visit
our Charset page at "http://oss.software.ibm.com/icu/charset/index.html".
Instructions for adding ICU converters are located here
"http://oss.software.ibm.com/icu/userguide/icudata.html#custom_data_library".
Except for ksc5657, all of those mappings are in our Charset Repository.

FYI the alias names EUC-KR and ksc5601 have been known to be ambiguous on
several platforms. For example, Solaris and Java make ksc5601 equal to
EUC-KR, but many other platforms consider them different.
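
Before relying on either name, it is worth checking whether a platform treats the two as the same converter. The sketch below runs that check against Python's standard codec registry, which is only one more platform to compare and says nothing about what ICU does. The helper name same_codec is made up for this illustration.

```python
import codecs


def same_codec(a: str, b: str) -> bool:
    """True if both names resolve to the same codec in Python's registry."""
    return codecs.lookup(a).name == codecs.lookup(b).name


for pair in [("ksc5601", "euc-kr"), ("euc-kr", "cp949")]:
    print(pair, same_codec(*pair))
```

Python sides with Solaris and Java here: it resolves ksc5601 to its EUC-KR codec, and keeps cp949 as a separate one.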

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA




"Kurosaka, Teruhiko" <Teruhiko...@iona.com>
Sent by: icu-chars...@oss.software.ibm.com
05/02/2003 09:31 AM


To: <icu-ch...@oss.software.ibm.com>
cc: "Kurosaka, Teruhiko" <teruhiko...@iona.com>
Subject: Korean charset coverage
_______________________________________________
icu-charsets mailing list
icu-ch...@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu-charsets

