Joseph, welcome to the nightmarish world of codepages...
First, a few short answers:
- I have limited knowledge about the ins and outs of why existing
conversion tables are the way they are.
- We are planning to collect and publish more vendor tables, but the
collection and review process takes a lot of time; we need to balance it
with more forward-looking I18N/Unicode functionality.
- Whenever possible, ICU avoids creating new conversion tables [we
had to create a few for stateful encodings like ISO 2022 where we could
not find any reliable-looking ones].
- We already recommend using the best-matching conversion table for each
case, and we try to educate our users about the many variants.
- SUB characters are easy to change via the ICU API.
- The data structure for ICU conversion tables is a compromise between
size and speed. With the current data structure, it is expensive to add
more tables.
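
To illustrate the point about SUB characters: in ICU's C API the
substitution bytes of an open converter can be changed with
ucnv_setSubstChars(). The sketch below is not ICU. It is a rough analogue
using only Python's standard codecs module, so that it runs without ICU
installed. The helper name make_sub_handler and the handler names "sub-1a"
and "sub-7f" are made up for this example; the values 0x1A and 0x7F are
taken from the subject line of this thread.

```python
import codecs


def make_sub_handler(name, sub_char):
    """Register an encode error handler that writes sub_char for
    every code point the target charset cannot represent.

    Hypothetical helper for illustration; it is not part of ICU.
    """
    def handler(exc):
        # This sketch only covers the Unicode-to-codepage direction.
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        # exc.start..exc.end may cover a run of unmappable characters;
        # emit one substitution character for each of them.
        count = exc.end - exc.start
        return (sub_char * count, exc.end)

    codecs.register_error(name, handler)
    return name


# Two SUB variants, selectable per conversion call.
make_sub_handler("sub-1a", "\x1a")
make_sub_handler("sub-7f", "\x7f")

text = "caf\u00e9 \u30de"

# ASCII cannot represent U+00E9 or U+30DE, so both become 0x1A.
as_ascii = text.encode("ascii", errors="sub-1a")

# Latin-1 can represent U+00E9 but not U+30DE, so only the latter
# is replaced, this time with 0x7F.
as_latin1 = text.encode("latin-1", errors="sub-7f")

print(as_ascii)
print(as_latin1)
```

The point of the sketch is only that the substitution character is a
property of the conversion call (or, in ICU, of the converter object), not
of the mapping table itself, so changing it does not require a new table.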
Next, a very personal comment:
A few years ago, when I redesigned some of the ICU conversion data
structures and code, I was hoping that the use of legacy codepages would
go down over time, and eventually we would be able to reduce the number of
conversion tables.
This was naive. I did not realize that with more data exchange with legacy
systems and legacy data sets, it would become more important to match
conversion behavior, leading to more and more variant tables.
Finally, an invitation:
You share a lot of concerns and user requirements with ICU. I suggest that
we get together and discuss the issues and possible ways to deal with
them.
What do you think?
markus
Markus Scherer マルクス IBM GCoC-Unicode/ICU San José, CA
markus....@us.ibm.com
"Joseph Boyle" <Joseph...@Siebel.com>
2003-06-02 21:02
To: Markus Scherer/Cupertino/IBM@IBMUS
cc: icu-ch...@www-124.southbury.usf.ibm.com
Subject: RE: 1A - 1C - 7F