Unicode support in C++ 17


stg

Aug 12, 2015, 2:10:09 PM



I would really like to see improved Unicode support in C++17. After
reading the following discussion, I thought I might be able to
participate:
[https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!searchin/
std-proposals/unicode/std-proposals/SGFtQkKE0bU/overview]

Everything in this document reflects my best understanding
about Unicode, and C++. I would be delighted to have that
understanding improved or corrected.

I was hoping the knowledgeable folk in this newsgroup might help me
evaluate some ideas. Please find my thoughts below, and be both
critical and kind:


1.2 Desired functionality
~~~~~~~~~~~~~~~~~~~~~~~~~

1. composed-character awareness -- a single display character may be
composed of multiple codepoints, or may be built from ligatures.
2. multi-byte codepoint awareness.
3. char_t indexing -- This is the current default behavior, and I
suppose we must keep it for the sake of backward compatibility, and
for the implementation of 1 & 2.

Currently 3 is the default, but we can get 1 & 2 compliant behavior for
much string handling by specifying a locale. We can steer the default
behavior by setting the global locale, and a great deal of work has
been done to improve C++'s locale handling (see boost::locale).

I consider that 1 is in fact the usual use-case, and 2 and 3 are
typically only of interest to library implementers.


1.3 Current behavior
~~~~~~~~~~~~~~~~~~~~

Let's consider a concrete example which is likely to be a very common
use case in the future: migrating legacy code from latin1 to utf-8, or
a developer who is used to thinking in terms of ascii wants to write a
new application as a utf-8 application. I think this specific example
generalizes (e.g. to utf-16 or 32) in a trivial way, but I welcome
further insight.

The developer may start by setting the global locale. If she wants
numbers to behave like the c-locale, except when given specific
context instructions, she might use a boost::locale, or perhaps she
rolls her own locale, composing it from existing facets that suit
her needs. The relevant detail is that the locale specifies that she
will be working with a utf-8 character set.

If there is a legacy application being modernized or replaced, she'll
have to convert data sources and sinks to utf-8, but that's likely to
be a pretty trivial task.

Streaming operations will work as expected, so she won't have to
modify the std::iostream and std::stringstream stuff.

std::string will work fine as a container. That's where the good news
ends.


1.3.1 sorting
-------------

To use std::sort she would have to pass a std::locale object, whose
operator() compares strings according to the locale's collation:

,----
| std::sort(strs.begin(), strs.end(), std::locale());
`----

The default sort uses the numeric < operator --
i.e. it's a byte-order sort that is efficient, but not humanly
meaningful. The above code works but isn't parsimonious.


1.3.2 find and substr
---------------------

Consider:
,----
| auto pos1 = foo.find(someChar);
| // sanity check...
| auto bar = foo.substr(pos1, 3);
`----

The determination of pos1 can fail because it might find a match
inside a composite character. The determination of bar will
fail whenever there's a composite or multi-byte character within
the next three positions.


1.4 My naive proposal:
~~~~~~~~~~~~~~~~~~~~~~

- A std::basic_string has a locale awareness: either NONE (default,
current implementation), CODEPOINT (mainly for library implementers
who want to investigate codepoints, not composed characters), or
COMPOSITE (alternatively DISPLAY, or CHARACTER -- a displayable
character).
- std::locale gets a cc_iterator (composed-character iterator --
iterates over displayable characters).
- std::locale gets a cp_iterator (codepoint iterator -- iterates over
codepoints; for utf-32 locales this is just the byte
operator).
- std::string methods use the locale-aware iterators if the string is
locale-aware. So size() returns the number of displayable characters
for a std::string<COMPOSITE>, the number of codepoints for a
std::string<CODEPOINT>, and the number of bytes for a
std::string<NONE>.

For a locale-aware string, the following behavior would change:
- std::sort would use the locale's () operator by default. Maps with
a la_string key would work in a locale-aware way; maps with a
std::string would work with the old byte <.
- integer positional arguments would refer to *composed characters*.
So s.substr(pos,3) would give the 3 display
characters starting at pos, regardless of whether or not they are
ligatures, composed, or simply 1-byte ascii codepoints. That would
apply to str[i] and str.size() as well.
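As a rough illustration of what size() could mean at the CODEPOINT level, counting UTF-8 codepoints only requires skipping continuation bytes. The helper name here is hypothetical, not proposed wording:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the count a std::string<CODEPOINT>::size() might
// return for a utf-8 locale. In UTF-8, continuation bytes match the bit
// pattern 10xxxxxx, so counting only lead/ASCII bytes counts codepoints.
std::size_t codepoint_count(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

A COMPOSITE-level size() would need considerably more machinery (grapheme-cluster segmentation), which is part of why the proposal routes it through the locale.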


1.4.1 Pros
----------

- updating legacy code should be almost-trivial -- change the
string construction to create locale-aware strings, and everything
should work as desired.

- Minimal language pollution. Seems consistent with current language
design.


1.4.2 Cons
----------

- What to do when comparing std::string<locale_aware==false> with a
std::string<locale_aware==true>? I suggest the default behavior is
byte comparison, but compilers should generate a warning. We may need
to introduce a cast operation to avoid the warning.
- I don't see a way to prevent a developer from setting an
incompatible locale, and using an incompatible string. I suppose
this would have to throw an exception.
- std::string<locale_aware> or std::la_string is clunky.


1.5 Questions
~~~~~~~~~~~~~

- change locale awareness via typecasting?


--
[ comp.std.c++ is moderated. To submit articles, try posting with your ]
[ newsreader. If that fails, use mailto:std-cpp...@vandevoorde.com ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]

Jakob Bohm

Aug 14, 2015, 2:40:06 PM

On 12/08/2015 21:00, stg wrote:
>
>
>
> I would really like to see improved unicode support in C++ 17.After
> reading the following discussion, I thought maybe I might be able to
> participate in the discussion:
> [https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!searchin/
> std-proposals/unicode/std-proposals/SGFtQkKE0bU/overview]
>
> Everything in this document reflects my best understanding
> about Unicode, and C++. I would be delighted to have that
> understanding improved or corrected.
>
> I was hoping the knowledgeable folk in this newsgroup might help me
> evaluate some ideas. Please find my thoughts below, and be both
> critical and kind:
>
>
> 1.2 Desired functionality
> ~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 1. composed-character awareness -- single display character may be
> composed of multiple codepoints, or may be comprised of ligatures.

The subset of programs which care about this consists
mostly of those programs which do additional text
formatting (e.g. columns, word line breaks etc.)
and/or control cursor navigation in text input (like
a C++ equivalent of GNU readline etc.). Such programs
are generally more concerned with whether a sequence of
codepoints represents a single screen location (and how
big that is) on the actual output device in use, not
whether a hypothetical mega-implementation of all Unicode
formatting features would treat it as one.

For instance, some display systems will artificially
cause multi-codepoint (and sometimes even multi-char_t)
characters to occupy as much space as their encoding,
while others will not. Some display systems will do
the right-to-left vs. left-to-right direction shifts
automatically, while others expect applications to
reorder displayed characters before output.

One thing that is of general interest, but is too
big/slow to be done implicitly during every string
operation is to convert Unicode strings to one of
the official normalization forms (NFC, NFD, NFKC, NFKD,
plus any future standard form that prevents visually
equivalent display strings from having different encodings,
to avoid security attacks that depend on fooling humans
into accepting a made-up name that looks just like
a different name they trust, such as 0bama vs.
Obama or V1adimir vs. Vladimir). These things are
already available in libraries such as IBM's ICU.

> 2. multi-byte codepoint awareness.

This is important for UTF-8 and the higher codepoints in
UTF-16, and has always been important for non-Unicode
encodings of East Asian alphabets. Thus where possible,
standard library features for this should be done as
natural extensions / bugfixes for the existing library
functions that have always done this for traditional
encodings.

For UTF-8 and to a lesser degree UTF-16, the Unicode
standard designers did extra work to ensure that things
like sorting and searching would work in most cases
when naively using routines that only use char_t
values, specifically:

1. No UTF-8 or UTF-16 encoding of a codepoint will
match at half-character locations when using a char_t
based string search algorithm.

2. Comparing the UTF-8, UTF-32 or plain UCS-4 encodings
of two strings using code that treats them simply as
arrays of unsigned char or unsigned char32_t will get
the same result and ordering as comparing those strings
codepoint by codepoint using the equivalent codepoint
numbers in the Unicode standard.

3. Comparing the UTF-16 encodings of two strings using
code that treats them simply as arrays of unsigned
char16_t values will get the same result as codepoint
by codepoint comparisons, except that codepoints
U+0000E000 to U+0000FFFF sort after U+10FFFF rather
than between U+0000DFFF and U+00010000 . However
this odd result is often needed for compatibility
with existing systems that were originally designed
for UCS-2 where that was the correct algorithm due
to the historic non-existence of codepoints above
U+0000FFFF .

These nice properties do not hold for traditional East
Asian encodings, though some of those encodings may
happen to match some locale specific lexicographic
orderings in a similar way.

> 3. char_t indexing -- This is the current default behavior, and I
> suppose we must keep it for the sake of backward compatibility, and
> for the implementation of 1 & 2.

Also because this is the most relevant form in the following
cases:

1. When processing text strings for purposes of storage or
transmission, since most storage/transmission systems
stores/transmits bits and bytes, not abstract characters.

2. When using the string class as an efficient and convenient
container for arrays of non-text bytes, such code often gains
great benefits from the ways string classes differ from
vectors/lists of bytes, but would fail horribly if the string
classes started having opinions on what bytes can be stored
there.

The computer industry has a long history of the insane costs
imposed when interfaces are defined to process characters (in
any character set) rather than sequences of bits and bytes.
For instance because the Internet e-mail protocols were
historically defined to operate on sequences of human
readable English-characters from a common subset of ASCII and
EBCDIC, even though actual transmission was always ASCII bytes,
every e-mail containing attachments, pictures or non-English
text needs to be transmitted using clunky Base64 and Hex
encodings just in case some mail gateway on the way might
temporarily process the e-mail using arcane character
representations (e.g. on older IBM operating systems). And
this is just one instance of how such a decision in the past
has come back to haunt us.

Thus it is best if most standard library classes, methods,
types and functions are defined to be what some people call
"8-bit clean", meaning that they won't mangle or damage
arbitrary binary data given to them, if at all possible
(the classic std::strxxx() and std::wcsxxx() functions
obviously need to treat a char_t value of 0 specially as
per their definitions, but must refrain from mistreating
other values).



>
> Currently 3 is the default, but we can get 1&2 compliant behavior for
> much string handling by specifying a locale. We can steer the default
> behavior by setting the global locale, and a great deal of work has
> been done to improve C++'s locale handling (see boost::locale).
>
> I consider that 1 is in fact the usual use-case, and 2 and 3 are
> typically only of interest to library implementers.
>

In my experience, 3 is the most common use case where strings
are not treated as opaque blobs (when they are, there is no
difference), the one exception being country-specific lexicographic
ordering, which is never the same as any sorting done purely for
computational efficiency.

Real world situations that truly care about codepoints or display
characters often also care about words and sentences. For
instance in many locales a list sorted for human consumption
should ideally go like this:

has one
hás one
hat on
hât on
have not

Which requires processing at the word and sentence level, not
just the code point level. Such rules tend to reflect the way
written text is usually pronounced (and thus memorized) amongst
native speakers in that culture/language combination.

I have heard rumors that some schools teach computing the other
way round, but that is mostly an artifact of those educators
lacking experience and/or deeper technical understanding before
overconfidently instilling superficial misunderstandings into
their pupils.

This depends on the purpose of the sort:

If the sort is used for a purpose where an ASCII application
would be happy to sort lowercase a after uppercase Z, then
sorting by (32 bit) Unicode code point is the natural
equivalent, and utf-8 was specifically designed (this is
explicitly stated in the original standards) such that the
naive byte comparison will yield the correct result with no
extra effort.

If the sort is used for a purpose where an ASCII application
would want upper and lower case A/a to sort in close proximity,
then the application will already need to use a more
intelligent string comparison function. For ASCII a simple
case-insensitive string compare function would do the trick,
while for anything else, the application would need a highly
locale-sensitive non-trivial comparison function such as the
parametrized string comparison function from the Unicode
standard (that function takes a bunch of parameters
specifying most of the commonly occurring locale
oddities, such as rules for the treatment of accents,
uppercase/lowercase, multiple spaces and even punctuation),
or more practically a truly locale specific comparison
function that can take into account locale-specific issues
not covered by such a generic function. In practice this
would simply involve delegating the comparison operation
to a virtual method of the locale object, of which there
can be several depending on usage context, for instance
some locales have different rules for sorting dictionaries
versus phone books.
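Delegating the comparison to the locale object, as described, is already possible today through the std::collate facet. A sketch (which named locales exist is platform-dependent, so the test uses the classic locale):

```cpp
#include <locale>
#include <string>

// Locale-sensitive "less than" via the locale's collate<char> facet.
bool collate_less(const std::string& a, const std::string& b,
                  const std::locale& loc) {
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}
```

A locale supporting dictionary versus phone-book sorting would, in this scheme, simply install a different collate facet for each usage context.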

>
>
> 1.3.2 find and substr
> ---------------------
>
> Consider:
> ,----
> | auto pos1 = foo.find(someChar);
> | // sanity check...
> | auto bar = foo.substr(pos1, 3);
> `----
>
> The determination of pos1 can fail because it might find a match
> inside a composite character. The determination of bar will
> fail whenever there's a composite or multi-byte character within
> the next three positions.

For all the standard Unicode encodings (except UTF-7, a
victim of the e-mail design mistake previously mentioned),
the encoding has been designed to guarantee that
searching for a valid encoding of a string or character
in a valid encoding of a string will not result in false
matches.

However for any encoding that uses multiple char_t-s to
represent a single code point, code point operations
must be treated as substring operations, never as
character operations.

In your example above, if someChar is of type char_t,
then it can only be a single-char_t codepoint, if it
is a codepoint at all. If someChar is of type string,
then extracting text where it was found should already
account for someChar.length(), whatever unit that
function measures its result in. pos1 can use any unit
of measurement: Inches of paper, microliters of ink,
count of codepoints etc., but a count of char_t-s is
just as useful when such values are treated simply as
opaque, non-iterable position markers.

As for the second step of extracting a known character
plus the next two characters, then such an operation
makes sense only when the context makes clear why
exactly two extra characters are requested, and if
that reason refers to two display characters, two
codepoints or two char_t-s. This semantic problem
cannot be defaulted away without leading to lots of
malfunctioning applications (namely those that
needed either of the other two semantics in that
particular code line, unrelated to what the rest
of the application needs in unrelated code lines).

For instance if we are looking for a marker sign
followed by a two-letter abbreviation in some
human-originated convention, then one must look at
that convention to see if these abbreviations are
defined to consist of two display characters, two
codepoints or two char_t-s, taking into account
that many real world human-written documents will
use those words to refer to any of the other two
meanings.

If the relevant specification is unclear, then
the conversion of this program from ASCII to
utf-8 is the perfect time to settle that ambiguity
before failing to interoperate with another
application whose author would otherwise have
interpreted the convention differently.

If on the other hand we are looking to display the
beginning of a text in a narrow indicator field,
then we obviously want 3 display character cells,
using whichever definition of that concept matches
the actual properties of the intended output device,
we might even want to change this to the first "3em"
of the text using a specific font such as
"Helvetica" or the first 3 6-point cells in braille.
Only if that is the desired behavior, which often it is not
once one starts looking at the code details.

> - Minimal language pollution. Seems consistent with current language
> design.
>
>
> 1.4.2 Cons
> ----------
>
> - What to do when comparing std::string<locale_aware==false> with a
> std::string<locale_aware==true>? I suggest default behavior is
> byte comparison, but compilers should generate a warning. May need
> to introduce a cast operation to avoid the warning.
> - I don't see a way to prevent a developer from setting an
> incompatible locale, and using an incompatible string. I suppose
> this would have to throw an exception.
> - std::string<locale_aware> or std::la_string is clunky.
>
>
> 1.5 Questions
> ~~~~~~~~~~~~~
>
> - change locale awareness via typecasting?
>

Having all that locale-aware code in std::basic_string will
seriously bloat any application wanting only the non-locale
aware form.

It is thus better to have std::basic_lstring as a subclass
of std::basic_string, such that all the extra code will not
be linked into statically linked utility programs that don't
need this extra library code.

Making std::basic_string a protected base class of
std::basic_lstring will have additional benefits:

- accidentally mixing string and lstring types will cause type
errors except where std::basic_lstring provides overloaded
operations to handle the combination.

- functions that need to be much more complex in
std::basic_lstring can do this without forcing their
simpler cousins in std::basic_string to be virtual and
incur the resulting call overhead, which may easily
exceed the low cost of the trivial non-locale
implementations.
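A skeleton of that shape (all names here are hypothetical, not proposed wording) shows how protected inheritance blocks silent mixing while selected cheap operations stay public:

```cpp
#include <locale>
#include <string>
#include <utility>

// Sketch: protected inheritance means a basic_lstring does not implicitly
// convert to basic_string, so mixing the two types is a compile-time error
// unless basic_lstring provides an overload that explicitly allows it.
template <class CharT>
class basic_lstring : protected std::basic_string<CharT> {
    using base = std::basic_string<CharT>;
    std::locale loc_;
public:
    explicit basic_lstring(base s, std::locale loc = std::locale())
        : base(std::move(s)), loc_(std::move(loc)) {}

    using base::size;   // cheap char_t length, re-exported unchanged
    const std::locale& getloc() const { return loc_; }
};

using lstring = basic_lstring<char>;
```

The locale-aware operations (collating comparison, codepoint iteration and so on) would be added as non-virtual members of basic_lstring, leaving basic_string untouched.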

As an alternative to hiding the basic_string properties of a
basic_lstring, one could use different names for the non-basic
operations while keeping the basic operations from the base
class available. For example

size_t length() const; // Length in char_t units, usually
// quick, inherited from basic_string
size_t vlength() const; // Number of codepoints in string.
// often expensive and charset
// dependent, but may be cached
// for speed.
size_t tlength() const; // Text length in ideal screen
// character cells, assuming an
// semi-ideal display which merges
// all accents etc. into the main
// cell and uses no space for any
// occurrence of formatting specials
// such as the BOM.
// Expensive
size_t hlength() const; // Text length in ideal screen
// character halfwidth cells,
// assuming an ideal East Asian display
// which merges all accents etc. into
// the main cell, treats western
// characters as half-width unless
// explicitly marked full-width in the
// character standard. Also counts no
// space for non-spacing and formatting
// characters.
// Expensive
size_t flength() const; // Text length in ideal screen
// character fullwidth cells,
// assuming an ideal East Asian display
// which merges all accents etc. into
// the main cell, treats western
// characters as full-width unless
// explicitly marked half-width in the
// character standard. Also counts no
// space for non-spacing and formatting
// characters.
// Expensive

Similarly for the various substring and indexing operations.


P.S.

In the above document I distinguish explicitly between:

UCS-4: 4-byte/31-bit char32_t encoding of the full potential of the
Unicode Character Set, allowing codepoints from U+00000000 to
U+7FFFFFFF Note that the sign bit is still reserved, just as
it was in 1-byte/7-bit ASCII.

UTF-32: 4-byte char32_t encoding of the subset of the Unicode
Character Set which can be encoded using the current UTF-16
encoding, i.e. the codepoints U+00000000 to U+0010FFFF
inclusive. This is the subset that will be assigned meanings
first, just as the codepoints from 0 to 127 were the first to
be assigned in ASCII-derived character sets.

UCS-2: Historic 2-byte/16-bit char16_t encoding of the first 64K
code points in the Unicode Character set. More than
20 years ago some believed this, not UCS-4, would become
the final standard, and designed protocols and systems
accordingly; this includes the designs of Java, Microsoft
Windows, and the mobile text messaging (SMS) standards of
160 7-bit chars or 70 16-bit chars.

UTF-16: An encoding of the first about 1 million Unicode codepoints
which is the same as UCS-2 for the common codepoints and a
special char16_t[2] encoding of codepoints from U+00010000 to
U+0010FFFF . This is mostly used when retrofitting UCS-2
systems to support a larger number of Unicode codepoints.

UTF-8: An encoding of the full Unicode character range from
U+00000000 to U+7FFFFFFF using a variable number of 8-bit
chars such that the ASCII subset U+00000000 to U+0000007F
encodes as itself and having many other practical properties.
Many official documents have changed the original UTF-8
definition to formally prohibit the encoding of codepoints
that cannot be encoded using UTF-16, but I view this as
short sighted and potentially subject to future reversal.



Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded