Charset roundtrip info


Yves Arrouye

Feb 11, 2002, 5:12:18 PM
to icu-ch...@www-126.southbury.usf.ibm.com
> http://oss.software.ibm.com/icu/charset/roundtripIndex.html,

I have a few simple suggestions for when you next regenerate that page:

1. At the beginning of every table, generate an anchor, so one can refer
someone to
http://oss.software.ibm.com/icu/charset/roundtripIndex.html#windows-932-2000
for example (we could even have one anchor per line then, as in
#windows-932-2000xsolaris-PCK-2.7)

2. Change the column headings of the tables so that one can search for the
windows-932-2000 table itself, rather than hitting every line where it is
compared with another charset in that charset's own table.

3. Have links from the tables to the .ucm and .xml files.
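
Suggestion 1 could look roughly like the sketch below. This is an illustration only, not the script that generates the page: the helper names (`table_anchor`, `row_anchor`, `render_table`) and the HTML layout are hypothetical, and only the anchor naming follows the examples given above.

```python
from html import escape


def table_anchor(charset: str) -> str:
    """Anchor name for the table of one charset, e.g. 'windows-932-2000'."""
    return charset


def row_anchor(charset: str, other: str) -> str:
    """Anchor name for one comparison row, e.g. 'windows-932-2000xsolaris-PCK-2.7'."""
    return f"{charset}x{other}"


def render_table(charset: str, others: list[str]) -> str:
    """Render a skeleton table with one anchor for the table and one per row."""
    lines = [
        f'<a name="{escape(table_anchor(charset), quote=True)}"></a>',
        f"<h2>{escape(charset)}</h2>",
        "<table>",
    ]
    for other in others:
        name = escape(row_anchor(charset, other), quote=True)
        lines.append(f'<tr><td><a name="{name}"></a>{escape(other)}</td></tr>')
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table("windows-932-2000", ["solaris-PCK-2.7"]))
```

One anchor per table adds a single short line per charset; one anchor per row adds a line per comparison, which is where the page size would grow.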

I also have a question re: which of these tables are complete / directly
usable or not? I am not sure.

Thanks,
YA
--
Sailing is harder than flying. It's amazing that man learned how to sail
first. -- Burt Rutan.

George Rhoten

Feb 11, 2002, 5:39:17 PM
to Yves Arrouye, icu-ch...@www-126.southbury.usf.ibm.com
You're obviously talking to me. :-)

None of the tables are really useable "straight out of the box." This is
mentioned in the readme.txt in the CVS repository. Most of the UCM files
still require some tweaking of the header information, especially where
the state information is concerned. Some of the XML files have a similar
problem.

I don't want to put links to the UCM/XML files yet because of the reasons
mentioned above, but I will do this eventually.

I'll consider doing your suggestion #1 and #2. It depends on how big the
file becomes. If it gets too large, like a few megabytes, I'll have to
skip that suggestion. Displaying a one megabyte HTML file over the
Internet is difficult as it is. I doubt it will increase the file size
that much.

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA




Yves Arrouye <yv...@realnames.com>
Sent by: icu-chars...@www-124.southbury.usf.ibm.com
02/11/2002 02:12 PM


To: "'icu-ch...@oss.software.ibm.com'"
<icu-ch...@www-126.southbury.usf.ibm.com>
cc:
Subject: Charset roundtrip info
_______________________________________________
icu-charsets mailing list
icu-ch...@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu-charsets



Yves Arrouye

Feb 11, 2002, 5:53:24 PM
to George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
> None of the tables are really useable "straight out of the box." This is
> mentioned in the readme.txt in the CVS repository. Most of the UCM files
> still require some tweaking of the header information, especially where
> the state information is concerned. Some of the XML files have a similar
> problem.

OK. It would be nice to have that disclaimer on
http://oss.software.ibm.com/icu/charset/ itself (or a link to it). Any
timetable on when:

* The mappings for the stateful and MBCS encodings may not be correct for the
non ibm-*.ucm files. The program for collecting such mappings is still not
complete.
* Some of the MBCS and stateful character sets are currently missing the
<icu:state> and <uconv_class> tags.

Will be addressed? Anything people outside of IBM can help with? I am
actually only interested in Windows mappings right now. If I look at
http://oss.software.ibm.com/cvs/icu/charset/data/ucm/windows-936-2000.ucm?rev=1.1&content-type=text/x-cvsweb-markup
I can see:

<code_set_name> "windows-936-2000"
<mb_cur_max> 2
<mb_cur_min> 1
<uconv_class> "MBCS"
<subchar> \x3F
<lead_bytes> \x81


Now I can also see:

<U00A4> \xA1\xE8 |0
<U00A5> \xA3\xA4 |1
<U2161> \xA2\xF2 |0

and others, which makes me think (according to
http://oss.software.ibm.com/icu/userguide/conversion-data.html) that there
should be some <icu:state> declaration like:

<icu:state> 81:1, a1-a3:1, etc...

At the same time, the user guide does not document <lead_bytes>, nor is it
used by the 197 .ucm files in the current ICU HEAD. Does that mean that the
tool that generates the files on the Web is old, or that the files are not
up to date? And what prevents <icu:state> from being generated from all the data
we can see in the file? (If it's just time, I'll be glad to change the tool
myself. Actually, even if it's not just time :-))
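
To make the idea concrete, here is a minimal sketch of collecting the lead and trail bytes that actually occur in the mapping lines of a .ucm file. It is not one of the existing collection tools; the names `parse_mapping` and `observed_bytes` are hypothetical, and it only understands mapping lines of the form shown above. It can only report bytes that are assigned in the file, so it gives a lower bound on the state table rather than the table itself.

```python
import re

# A mapping line looks like: <U00A4> \xA1\xE8 |0
MAPPING = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])"
)


def parse_mapping(line):
    """Return (code_point, byte_sequence, flag), or None for non-mapping lines."""
    m = MAPPING.match(line.strip())
    if m is None:
        return None
    code_point = int(m.group(1), 16)
    raw = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2)))
    return code_point, raw, int(m.group(3))


def observed_bytes(lines):
    """Collect single bytes, lead bytes and trail bytes seen in the mappings."""
    single, lead, trail = set(), set(), set()
    for line in lines:
        parsed = parse_mapping(line)
        if parsed is None:
            continue  # header lines such as <subchar> \x3F are skipped
        _, raw, _ = parsed
        if len(raw) == 1:
            single.add(raw[0])
        elif len(raw) == 2:
            lead.add(raw[0])
            trail.add(raw[1])
    return single, lead, trail


if __name__ == "__main__":
    sample = [
        '<subchar> \\x3F',
        '<U00A4> \\xA1\\xE8 |0',
        '<U00A5> \\xA3\\xA4 |1',
        '<U2161> \\xA2\\xF2 |0',
    ]
    single, lead, trail = observed_bytes(sample)
    print(sorted(hex(b) for b in lead), sorted(hex(b) for b in trail))
```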

(Another dumb question: does it matter if the class is "MBCS" or "DBCS"? Is
there some special optimization for a DBCS vs MBCS?)

> I don't want to put links to the UCM/XML files yet because of the reasons
> mentioned above, but I will do this eventually.

Fair enough.

> I'll consider doing your suggestion #1 and #2. It depends on how big the
> file becomes. If it gets too large, like a few megabytes, I'll have to
> skip that suggestion. Displaying a one megabyte HTML file over the
> Internet is difficult as it is. I doubt it will increase the file size
> that much.

It may get large if you do one anchor per line but not if you only do one
per table.

Thanks,
YA

George Rhoten

Feb 11, 2002, 6:27:18 PM
to Yves Arrouye, icu-ch...@www-126.southbury.usf.ibm.com
Right after I finish the locale data verification, I'll be working on this
stuff full time with the alias information. More help is always
appreciated :-)

The <uconv_class> has a very important distinction between MBCS and DBCS.
The DBCS has an implied state table, unless overridden. See User's Guide
for details.

All the ibm-* tables are generated directly from the IBM official Unicode
charset table repository. The other tables are generated by various tools
that collect the data directly from the platform. Most of the tools have small
flaws that I keep on stumbling over. I try to fix them when I encounter a
bug. Unfortunately, those tools create some other tables that I can't put
into the CVS repository because I get really weird data. For example,
none of the tools can handle iso-2022, EBCDIC stateful or some other
platform converter that has a bizarre behavior.

Generating the correct state table does take some time. Part of the
problem is that there are so many tools to do this, and they all do it
differently. Some try to figure out the state table; others parse the
convrtrs.txt file to find the alias for a converter and then put in a
precanned state table. The various tools need to be integrated a bit
more. The previous authors had the very wrong assumption that the tools
would only be used once.

The <lead_bytes> tag is ignored. It was put in accidentally by a previous
author's UCM generation tool for Windows. Oops.

Generating the reverse fallbacks from Windows is still very difficult,
especially since Windows is inconsistent on their C and COM APIs. I tried
to file a bug report to Microsoft, but no one would listen at Microsoft.
So I gave up on the bug report.

There are many other issues for creating correct information for each
platform. If you have any other issues with my tools, you can ask me
directly.

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA





Yves Arrouye

Feb 11, 2002, 7:12:16 PM
to George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
Markus said:

> State tables for "MBCS" are not trivial. In general, it is not possible to
> guess complete state tables because you would miss unassigned but valid
> byte combinations.

Couldn't we just try all of them? FF x FF isn't that many tries for your
typical 2-byte MBCS, and it will actually be fewer (just start with
the identified lead bytes). If brute force works for this, and given that
the collection tools do not need to be run in real-time, it should be okay.
Then the problem will just be to reduce these big arrays to a small set of
states (where's my set arithmetic algorithm book?).
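
A minimal sketch of that brute-force probe, using Python's built-in `gbk` codec purely as a stand-in for a platform converter (an assumption; it is not the Windows converter and may differ from it). The names `probe_two_byte` and `to_ranges` are hypothetical. Note that the probe cannot tell a valid-but-unassigned byte pair from an illegal one, since both simply fail to decode, which is the limitation Markus describes.

```python
def probe_two_byte(codec):
    """Try every 1-byte and 2-byte sequence against a strict decoder.

    Returns (singles, leads): the bytes that decode on their own, and a
    dict mapping each lead byte to the trail bytes that complete it.
    """
    singles, leads = [], {}
    for b in range(0x100):
        try:
            bytes([b]).decode(codec)
            singles.append(b)
            continue
        except UnicodeDecodeError:
            pass  # not a complete character by itself; try it as a lead byte
        trails = []
        for t in range(0x100):
            try:
                if len(bytes([b, t]).decode(codec)) == 1:
                    trails.append(t)
            except UnicodeDecodeError:
                pass
        if trails:
            leads[b] = trails
    return singles, leads


def to_ranges(values):
    """Collapse a collection of ints into sorted inclusive (low, high) ranges."""
    ranges = []
    for v in sorted(values):
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)
        else:
            ranges.append((v, v))
    return ranges


if __name__ == "__main__":
    singles, leads = probe_two_byte("gbk")
    print("single bytes:", [(hex(a), hex(b)) for a, b in to_ranges(singles)])
    print("lead bytes:  ", [(hex(a), hex(b)) for a, b in to_ranges(leads)])
```

Collapsing the observed bytes into ranges is the easy part; deciding which gaps in those ranges are unassigned rather than illegal is what the probe cannot answer.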

George wrote:

> Generating the correct state table does take some time. Part of the
> problem is that there are so many tools to do this, and they all do it
> differently. [...]

It'd be easier with a common framework, sure, so that one could quickly add
data-gathering routines for a new platform and have the same set of
algorithms manipulate the data to produce the .ucm. Later :)

> Generating the reverse fallbacks from Windows is still very difficult,
> especially since Windows is inconsistent on their C and COM APIs. I tried
> to file a bug report to Microsoft, but no one would listen at Microsoft.
> So I gave up on the bug report.

Again, could that be achieved by brute-force attempts?

Lastly, I guess tables like
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT are
not very useful to that process because of fallbacks and reverse fallbacks,
right?
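
For reference, a sketch of what such a table can give: assuming the usual layout of those vendor files (a hex code, a hex Unicode value, then a `#` comment), each row can be turned into a roundtrip (`|0`) line in .ucm syntax. The function name `txt_to_ucm` and the sample rows are illustrative, not taken from any existing tool. Every line comes out as `|0` because the source format has no way to mark fallbacks or reverse fallbacks.

```python
def txt_to_ucm(lines):
    """Convert rows like '0x8140<TAB>0x4E02<TAB>#comment' to UCM mapping lines."""
    out = []
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment, blank line, or a row with no Unicode value
        code = int(fields[0], 16)
        ucs = int(fields[1], 16)
        raw = code.to_bytes(1 if code <= 0xFF else 2, "big")
        hex_bytes = "".join(f"\\x{b:02X}" for b in raw)
        out.append(f"<U{ucs:04X}> {hex_bytes} |0")
    return out


if __name__ == "__main__":
    sample = [
        "# sample rows in the layout of the vendor mapping files",
        "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
        "0x81\t\t#DBCS LEAD BYTE",
        "0x8140\t0x4E02\t#CJK UNIFIED IDEOGRAPH",
    ]
    print("\n".join(txt_to_ucm(sample)))
```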

YA


George Rhoten

Feb 11, 2002, 7:41:43 PM
to Yves Arrouye, icu-ch...@www-126.southbury.usf.ibm.com
Brute force only works for simple converters. The multi-state converters
can't be done by brute force, especially for the 4 byte encodings.
Sometimes the platform will try to outsmart you, and sometimes it's just
buggy. From experience, creating the state table by brute force never
works accurately. There are many types of state tables, and you always
miss certain valid states. The precanned versions of the state tables have
always worked accurately.

For example, the Windows COM and C APIs convert differently. It will give
you different reverse fallback substitution characters. It's either a
middle dot, a question mark or something else. I'm still in the process
of figuring it out. Sometimes the API will try to convert bad data to
something valid.

Every conversion API is incompatible with every other conversion API, even
on the same platform. I guess this is why ICU looks so nice because you
can get consistent behavior. It may not be correct behavior yet, but at
least we're consistent.

ICU is the only place where you can get reverse fallback information. I'm
not aware of any other vendor that publishes the fallback or reverse
fallback information for the Unicode charset tables.

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA







Yves Arrouye

Feb 11, 2002, 7:44:53 PM
to George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
> Brute force only works for simple converters. The multi-state converters
> can't be done by brute force, especially for the 4 byte encodings.

I am thinking about 2-byte encodings right now. I have a very simple
problem right now, which is that we reject data that is labeled gb2312 and
was generated using Microsoft's cp936 converter, and so any way to get an
ICU mapping for cp936 that is correct is something I'm considering...
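
The rejection is easy to reproduce in miniature. The sketch below uses Python's built-in `gbk` and `gb2312` codecs as stand-ins for the Microsoft cp936 converter and for a strict GB 2312 decoder (an assumption; neither is the ICU or Windows implementation). The helper name `decodes_as` is hypothetical.

```python
def decodes_as(data: bytes, codec: str) -> bool:
    """True if the byte string decodes cleanly with the given strict codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # U+554A is in GB 2312; U+4E02 is only in the GBK extension.
    for ch in ("\u554a", "\u4e02"):
        data = ch.encode("gbk")
        print(
            f"U+{ord(ch):04X}", data.hex(),
            "gbk:", decodes_as(data, "gbk"),
            "gb2312:", decodes_as(data, "gb2312"),
        )
```

Text that stays within the GB 2312 repertoire produces the same bytes either way; only characters from the GBK extension produce bytes that a strict GB 2312 decoder refuses.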

YA

George Rhoten

Feb 11, 2002, 7:58:27 PM
to Yves Arrouye, icu-ch...@www-126.southbury.usf.ibm.com
As long as you don't mind massaging the header information, the answer is
'yes, you can use our mapping tables from the charset repository in CVS.'

You probably just need similar header information from ibm-1383. Try to
add this to
http://oss.software.ibm.com/cvs/icu/charset/data/ucm/windows-936-2000.ucm
and see if it works.

<code_set_name> "windows-936"
<mb_cur_max> 2
<mb_cur_min> 1
<uconv_class> "MBCS"
<subchar> \xA1\xA1
# Windows substitutes to ? by default
<subchar1> \x3F
<icu:state> 0-9f, a1-fe:1
<icu:state> a1-fe
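
To see what these two <icu:state> rows accept, here is a small sketch that walks byte sequences through them. It is not makeconv, and it models only a simplified reading of the syntax, which is an assumption: ranges are hex, `:n` means "continue in state n", an entry with no `:n` completes a character, and any byte not listed is illegal. Any further action syntax the real parser supports is not modeled. The names `parse_state_row` and `check` are hypothetical.

```python
def parse_state_row(row):
    """Parse one row such as '0-9f, a1-fe:1' into {byte: next_state_or_None}.

    None means the byte completes a character in this state.
    """
    table = {}
    for entry in row.split(","):
        entry = entry.strip()
        if not entry:
            continue
        span, _, nxt = entry.partition(":")
        lo, _, hi = span.partition("-")
        first = int(lo, 16)
        last = int(hi, 16) if hi else first
        for b in range(first, last + 1):
            table[b] = int(nxt, 16) if nxt else None
    return table


def check(states, seq):
    """Classify a byte sequence that is supposed to encode exactly one character."""
    state = 0
    for i, b in enumerate(seq):
        if b not in states[state]:
            return "illegal byte"
        nxt = states[state][b]
        if nxt is None:
            return "ok" if i == len(seq) - 1 else "trailing bytes"
        state = nxt
    return "incomplete"


if __name__ == "__main__":
    states = [parse_state_row("0-9f, a1-fe:1"), parse_state_row("a1-fe")]
    for seq in (b"\xa1\xe8", b"\x80", b"\xff", b"\x81\x40"):
        print(seq.hex(), "->", check(states, seq))
```

Under this reading, 0xA1 0xE8 is accepted, while a two-byte mapping starting with 0x81 is not, because 0x81 falls in the single-byte range 0-9f. That suggests the header fits an a1-fe/a1-fe layout better than a full GBK byte range.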

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA







Yves Arrouye

Feb 11, 2002, 8:01:26 PM
to George Rhoten, icu-ch...@www-126.southbury.usf.ibm.com
> You probably just need similar header information from ibm-1383. Try to
> add this to
> http://oss.software.ibm.com/cvs/icu/charset/data/ucm/windows-936-2000.ucm
> and see if it works.

Thanks. Somehow, I always consider writing a tool a better solution than
copy and paste :(

YA

George Rhoten

Feb 11, 2002, 8:04:53 PM
to Yves Arrouye, icu-ch...@www-126.southbury.usf.ibm.com
That's why I want to integrate all the tools. :-)

George Rhoten
IBM Globalization Center of Competency/ICU San Jose, CA, USA







Yves Arrouye

Feb 14, 2002, 1:00:18 AM
to Markus Scherer, icu-ch...@www-126.southbury.usf.ibm.com
> Yves, copy the state table from ibm-1386.ucm. This and windows-936 are
> both GBK (as you can see from convrtrs.txt where we have "cp936" as an
> alias for ibm-1386).

And you have gb2312 as an alias for ibm-1383. What is the difference between
the two (since for MSFT cp936 == gb2312)?

Thanks,
YA

Yves Arrouye

Feb 14, 2002, 1:14:53 AM
to Yves Arrouye, Markus Scherer, icu-ch...@www-126.southbury.usf.ibm.com
Hmmm, also:

ICU_DATA=. \
LD_LIBRARY_PATH=../../common:../../i18n:../../tools/toolutil:../../extra/ustdio:../../tools/ctestfw:../../data:../../stubdata/:$LD_LIBRARY_PATH \
../../tools/makeconv/makeconv -c -d . ../../../data/windows-936-2000.ucm
error: byte sequence ends in illegal state at U+20ac<->0x80
error: byte sequence ends in illegal state at U+f8f5<->0xff

I guess more probing is needed... Any ideas welcome.

YA

Yves Arrouye

Feb 14, 2002, 1:20:26 AM
to Yves Arrouye, Markus Scherer, icu-ch...@www-126.southbury.usf.ibm.com
> ICU_DATA=. \
> LD_LIBRARY_PATH=../../common:../../i18n:../../tools/toolutil:../../extra/ustdio:../../tools/ctestfw:../../data:../../stubdata/:$LD_LIBRARY_PATH \
> ../../tools/makeconv/makeconv -c -d . ../../../data/windows-936-2000.ucm
> error: byte sequence ends in illegal state at U+20ac<->0x80
> error: byte sequence ends in illegal state at U+f8f5<->0xff
>
> I guess more probing is needed... Any ideas welcome.

(BTW I know 0x80 is the Euro as in all modern Windows encodings---especially
after seeing it is U+20AC :)). But U+F8F5 is unassigned... Is this one an
artifact of the collection tool? (Could it be U+2F8F5 CJK COMPATIBILITY
IDEOGRAPH-2F8F5 clipped on 16 bits?)

YA

Yves Arrouye

Feb 14, 2002, 2:11:38 AM
to icu-ch...@www-126.southbury.usf.ibm.com
> (BTW I know 0x80 is the Euro as in all modern Windows encodings---
> especially after seeing it is U+20AC :)). But U+F8F5 is unassigned... Is
> this one an artifact of the collection tool? (Could it be U+2F8F5 CJK
> COMPATIBILITY IDEOGRAPH-2F8F5 clipped on 16 bits?)

I promise I am going to sleep now so I can think straight. U+F8F5 is a PUA char,
and the Windows collecting tool only converts UCS-2. No idea what this
U+F8F5 is in Windows. Does anybody know?
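
The code point facts above can be checked against the Unicode character database that ships with Python; this is only a lookup sketch (the helper name `describe` is hypothetical) and says nothing about what Windows itself maps to U+F8F5.

```python
import unicodedata


def describe(code_point: int):
    """Return (general category, character name) for a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")


if __name__ == "__main__":
    for cp in (0x20AC, 0xF8F5, 0x2F8F5):
        print(f"U+{cp:04X}", *describe(cp))
    # Truncating U+2F8F5 to 16 bits does give F8F5, so the earlier guess
    # was numerically plausible even though the tool only handles UCS-2.
    print(hex(0x2F8F5 & 0xFFFF))
```

Category `Co` is the private-use category, so U+F8F5 has no standard meaning and whatever a platform maps there is its own convention.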

YA
