Unicode character names in universal-character-names

Eelis

Sep 15, 2013, 8:25:03 AM
to std-pr...@isocpp.org
Wouldn't it be nice to be able to use Unicode character names in
universal-character-names?

std::cout << "\u{PER MILLE SIGN}";

I think this better expresses the intent than "\u2030".

Ville Voutilainen

Sep 15, 2013, 8:43:01 AM
to std-pr...@isocpp.org
Seems like a good idea. I played with some work-around ideas, and they.. ..don't work:

#include <iostream>
#include <string>

constexpr auto PER_MILLE_SIGN = "2030";
int main()
{
    auto x = "\u" PER_MILLE_SIGN;  // both clang and gcc complain that the universal-character name is incomplete
    auto y = "\u2030";
    std::cout << (std::string(x) == std::string(y));
}


#include <iostream>
#include <string>

constexpr auto PER_MILLE_SIGN = "\u2030";
int main()
{
    auto x = PER_MILLE_SIGN;  // sorta works, doesn't embed into strings
    auto y = "\u2030";
    std::cout << (std::string(x) == std::string(y));
}

Any kind of define would require string literal concatenation anyway, so it's never going to be as nice
as "\u{name}".
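
To make the "doesn't embed into strings" remark concrete: a constexpr pointer is not a string-literal token, so it cannot take part in adjacent-literal concatenation, but it can still be joined at run time. A minimal sketch, assuming C++11 and that building a std::string at run time is acceptable; with_per_mille is a made-up helper name, not something from this thread:

```cpp
#include <string>

// A const char* pointing at the encoded character, as in the second attempt above.
constexpr auto PER_MILLE_SIGN = "\u2030";

// auto bad = "123 " PER_MILLE_SIGN;  // ill-formed: PER_MILLE_SIGN is not a literal token

// Hypothetical helper: appends the sign to some text at run time.
inline std::string with_per_mille(const std::string& text)
{
    return text + PER_MILLE_SIGN;
}
```

Something like std::cout << with_per_mille("123 ") would then print the text followed by the sign, at the cost of a run-time allocation that a literal would not need.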

Maurice Bos

Sep 15, 2013, 8:50:18 AM
to std-pr...@isocpp.org
Should this only work inside string literals, or do you want to modify the 'universal-character-name' grammar to allow this in identifiers as well? (just like \uXXXX)



Eelis

Sep 15, 2013, 8:51:17 AM
to std-pr...@isocpp.org, Ville Voutilainen
On 2013-09-15 14:43, Ville Voutilainen wrote:
> Seems like a good idea. I played with some work-around ideas, and they..
> ..don't work:
>
> [...]
>
> it's never going to be as nice
> as "\u{name}".

Yeah, I think so too. If there is interest, I'll implement this in Clang
and then write proposed wording.

Eelis

Sep 15, 2013, 8:52:40 AM
to std-pr...@isocpp.org, Maurice Bos
On 2013-09-15 14:50, Maurice Bos wrote:
> Should this only work inside string literals, or do you want to modify
> the 'universal-character-name' grammar to allow this in identifiers as
> well? (just like \uXXXX)

I think doing it by extending universal-character-name makes the most sense.

Ville Voutilainen

Sep 15, 2013, 9:08:01 AM
to std-pr...@isocpp.org
You know.. I guess doing this stuff with a user-defined literal would be one option.
It would look like "\u{PER MILLE SIGN}"_symbolic_unicode ;)
The suffix is certainly up to bike-shedding, but it would avoid having to do it
in the compiler.

Ville Voutilainen

Sep 15, 2013, 9:11:00 AM
to std-pr...@isocpp.org
On 15 September 2013 16:08, Ville Voutilainen <ville.vo...@gmail.com> wrote:
> You know.. I guess doing this stuff with a user-defined literal would be one option.
> It would look like "\u{PER MILLE SIGN}"_symbolic_unicode ;)

...except it probably can't use \u inside the string, so it would need a different magical-cookie
to work, lest the universal-character name causes an error before the UDL is invoked.
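
A rough sketch of what such a user-defined literal might look like, assuming C++11, a "#{NAME}" cookie picked arbitrarily for illustration, and a tiny hand-written name table (a real one would have to be generated from the Unicode Character Database). Neither the cookie nor symbolic_names comes from this thread:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical, deliberately tiny name table.
inline const std::map<std::string, std::string>& symbolic_names()
{
    static const std::map<std::string, std::string> table = {
        {"PER MILLE SIGN", "\u2030"},
        {"FULLWIDTH DIGIT ONE", "\uFF11"},
    };
    return table;
}

// Replaces each "#{NAME}" with the named character. Unknown names and
// unterminated cookies are copied through unchanged.
inline std::string operator""_symbolic_unicode(const char* s, std::size_t n)
{
    const std::string in(s, n);
    std::string out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        const std::size_t open = in.find("#{", pos);
        const std::size_t close =
            open == std::string::npos ? std::string::npos : in.find('}', open + 2);
        if (close == std::string::npos) {       // no complete cookie left
            out.append(in, pos, std::string::npos);
            break;
        }
        out.append(in, pos, open - pos);        // text before the cookie
        const std::string name = in.substr(open + 2, close - open - 2);
        const auto it = symbolic_names().find(name);
        if (it != symbolic_names().end())
            out += it->second;                  // known name: substitute
        else
            out.append(in, open, close - open + 1);  // unknown: keep as written
        pos = close + 1;
    }
    return out;
}
```

With this, "5#{PER MILLE SIGN}"_symbolic_unicode yields the same bytes as "5\u2030". Note that the substitution happens at run time and an unknown name is not diagnosed by the compiler, which is one way this differs from doing it in the core language.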

David Krauss

Sep 15, 2013, 9:54:42 AM
to std-pr...@isocpp.org, ee...@eelis.net
I think it's better to simply use a comment. You can put comments between the pieces of a string by taking advantage of string literal catenation.

std::cout << "\uFF11" /* FULLWIDTH DIGIT ONE */ "\u2030" /* PER MILLE SIGN */;

Every Unicode draft will expand the dictionary of names. Even if compilers keep up by methodically adopting these, users won't reliably upgrade. So there's a portability issue.

Many character names are oddly spelled. I initially put "FULL WIDTH" in the above comment but had to fix it. Other oddities like "LAMDA" abound. This is a critical usability issue.

Also, there's an issue in what the stringize operator does to UCNs; according to the letter of the law it would require

#define S(X) #X

S("\u{DIGIT ZERO}") // => "\"\\u{DIGIT ZERO}\""

This is a minor functionality issue, but I mention it because I want that defect fixed… Exact spelling of UCNs is beyond the intent.
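
For comparison, here is how the stringize operator treats an ordinary escape sequence, which is the well-defined case. This is only a sketch of the existing rule (a backslash is inserted before each " and \ inside a string literal argument); it does not settle what should happen to UCNs. The function name is made up for the example:

```cpp
#include <string>

#define S(X) #X

// Stringizing a string literal keeps its spelling as text: the result
// below is the six characters  "a\nb"  (quotes and backslash included).
inline std::string stringized_newline_literal()
{
    return S("a\nb");  // same as the literal "\"a\\nb\""
}
```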

David Krauss

Sep 15, 2013, 10:12:22 AM
to std-pr...@isocpp.org


On Sunday, September 15, 2013 8:43:01 PM UTC+8, Ville Voutilainen wrote:
> On 15 September 2013 15:25, Eelis <ee...@eelis.net> wrote:
>> Wouldn't it be nice to be able to use Unicode character names in universal-character-names?
>>
>>     std::cout << "\u{PER MILLE SIGN}";
>>
>> I think this better expresses the intent than "\u2030".
>
> Seems like a good idea. I played with some work-around ideas, and they.. ..don't work:

UCNs are translated in phase 1, before anything else, so they're pretty much atomic.

Your best bet is to put it inside a string:

#define PER_MILLE_SIGN "\u2030"

#define CODEPOINT_(x) * U ## x // Prepend char32_t prefix, get first element of string literal.
#define CODEPOINT(x) CODEPOINT_(x) // Tame catenation operator: expand x before pasting.

const wchar_t *s1 = L"" PER_MILLE_SIGN; // const: string literals are arrays of const characters
const char *s2 = "123 " PER_MILLE_SIGN;
char16_t c{ CODEPOINT( PER_MILLE_SIGN ) }; // Narrowing safe; CODEPOINT is constant expression.
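
The compile-time claims can be checked with static_assert. A small sketch, assuming C++11 or later; the macros are repeated so the snippet stands alone, and per_mille is just an illustrative name:

```cpp
#define PER_MILLE_SIGN "\u2030"
#define CODEPOINT_(x) * U ## x
#define CODEPOINT(x) CODEPOINT_(x)

// The code point is usable in constant expressions.
static_assert(CODEPOINT(PER_MILLE_SIGN) == U'\u2030', "first element of the char32_t literal");
static_assert(CODEPOINT(PER_MILLE_SIGN) == 0x2030u, "numeric value of U+2030");

// Catenation with a prefixed literal adopts that prefix:
// one UTF-16 code unit plus the terminator.
static_assert(sizeof(u"" PER_MILLE_SIGN) == 2 * sizeof(char16_t), "char16_t[2]");

constexpr char32_t per_mille = CODEPOINT(PER_MILLE_SIGN);
```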


> Any kind of define would require string literal concatenation anyway, so it's never going to be as nice
> as "\u{name}".

Is there reasoning here, or just an aesthetic bias? String literal catenation may feel hackish, but it's nothing compared to UCNs which depend somewhat on context (inside a literal vs in an identifier), and still have a few ill-specified rough edges.

UCNs are their own text encoding. This proposal is about making a textual text encoding. Verging on XML territory here.

Eelis

Sep 15, 2013, 10:18:51 AM
to std-pr...@isocpp.org, David Krauss
On 2013-09-15 15:54, David Krauss wrote:
> Every Unicode draft will expand the dictionary of names. Even if
> compilers keep up by methodically adopting these, users won't reliably
> upgrade. So there's a portability issue.

An implementation could emit a diagnostic if the user attempts to use a
character name that is newer than the C++ standard used.

A typical implementation would allow this diagnostic to be overridden,
so that users could then make a conscious choice to reduce the
portability of their program by adding a Unicode requirement newer than
the C++ standard.
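
As a rough illustration of the check described above, a name table could record the Unicode version that introduced each character and compare it against a baseline. This is only a sketch: NamedChar, lookup_name and the three-entry table are invented for the example, the baseline passed in is arbitrary, and the version numbers are quoted from memory of the Unicode DerivedAge data, so they should be verified before being relied on.

```cpp
#include <cstring>

// Hypothetical table entry: a name plus the Unicode version that added it.
struct NamedChar {
    const char* name;
    char32_t    code_point;
    int         since_major;
    int         since_minor;
};

// Tiny illustrative excerpt, not a real compiler table.
static const NamedChar kNames[] = {
    {"PER MILLE SIGN", 0x2030,  1, 1},
    {"GRINNING FACE",  0x1F600, 6, 1},
    {"RUBLE SIGN",     0x20BD,  7, 0},
};

enum class Lookup { Ok, NewerThanBaseline, Unknown };

// Looks up a name and reports whether it postdates the given baseline version.
inline Lookup lookup_name(const char* name, int base_major, int base_minor, char32_t& out)
{
    for (const NamedChar& e : kNames) {
        if (std::strcmp(e.name, name) != 0)
            continue;
        out = e.code_point;
        const bool newer = e.since_major > base_major ||
                           (e.since_major == base_major && e.since_minor > base_minor);
        return newer ? Lookup::NewerThanBaseline : Lookup::Ok;
    }
    return Lookup::Unknown;
}
```

A compiler following this idea could map NewerThanBaseline to an overridable warning and Unknown to a hard error.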

Philipp Stephani

Sep 15, 2013, 11:34:08 AM
to std-pr...@isocpp.org
2013/9/15 Eelis <ee...@eelis.net>:
> Wouldn't it be nice to be able to use Unicode character names in universal-character-names?
>
>     std::cout << "\u{PER MILLE SIGN}";
>
> I think this better expresses the intent than "\u2030".

I think it's a very good idea. Many other languages have it, and it's a simple and localized change. 

stackm...@hotmail.com

Sep 17, 2013, 2:46:30 AM
to std-pr...@isocpp.org


On Sunday, 15 September 2013 17:34:08 UTC+2, Philipp Stephani wrote:
> I think it's a very good idea. Many other languages have it, and it's a simple and localized change.

Name a few please. I am not aware of a single one and a quick search on Google did not give any results.

Thiago Macieira

Sep 17, 2013, 10:24:23 AM
to std-pr...@isocpp.org
On Sunday, 15 September 2013 17:34:08, Philipp Stephani wrote:
> I think it's a very good idea. Many other languages have it, and it's a
> simple and localized change.

It only requires the compiler to have a full list of character names from the
Unicode database. It will also require the C++ standard to mandate a minimum
version of Unicode, update it once in a while, provide a macro to indicate
which version of Unicode is known, etc.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Zhihao Yuan

Sep 17, 2013, 10:56:15 AM
to std-pr...@isocpp.org
On Tue, Sep 17, 2013 at 10:24 AM, Thiago Macieira <thi...@macieira.org> wrote:
> It only requires the compiler to have a full list of character names from the
> Unicode database. It will also require the C++ standard to mandate a minimum
> version of Unicode [...]

Then it's much more than "only" :)

I don't like the idea because few people can remember
Unicode names, and programs are written for humans to
read. Even if you can remember those names, some input
methods, like Fcitx, support entering Unicode characters
by name; it's not an issue.


--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://4bsd.biz/

Martinho Fernandes

Sep 17, 2013, 11:19:37 AM
to std-pr...@isocpp.org, ee...@eelis.net
On Sun, Sep 15, 2013 at 3:54 PM, David Krauss <pot...@gmail.com> wrote:
> Many character names are oddly spelled. I initially put "FULL WIDTH" in the
> above comment but had to fix it. Other oddities like "LAMDA" abound. This is
> a critical usability issue.
>

It doesn't stop at oddities. There are several character names that
are just wrong, like ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴠ ᴡɪᴛʜ ʜᴏᴏᴋ (U+028B) and some
like ᴘʀᴇsᴇɴᴛᴀᴛɪᴏɴ ғᴏʀᴍ ғᴏʀ ᴠᴇʀᴛɪᴄᴀʟ ʀɪɢʜᴛ ᴡʜɪᴛᴇ ʟᴇɴᴛɪᴄᴜʟᴀʀ ʙʀᴀᴋᴄᴇᴛ
(U+FE18) actually have misspellings. The Unicode Stability Policy
prevents these mistakes from being fixed, as it states that the Name
property won't ever change once assigned.

I like this feature in principle, but since this is about readability
and it can actually work against readability, it's not as enticing.

Eelis

Sep 17, 2013, 1:36:39 PM
to std-pr...@isocpp.org, Thiago Macieira
On 2013-09-17 16:24, Thiago Macieira wrote:
> On domingo, 15 de setembro de 2013 17:34:08, Philipp Stephani wrote:
>> I think it's a very good idea. Many other languages have it, and it's a
>> simple and localized change.
>
> It only requires the compiler to have a full list of character names from the
> Unicode database. It will also require the C++ standard to mandate a minimum
> version of Unicode, update it once in a while, provide a macro to indicate
> which version of Unicode is known, etc.

Such a macro already exists: __STDC_ISO_10646__.


Thiago Macieira

Sep 17, 2013, 1:40:50 PM
to std-pr...@isocpp.org
Obviously the macro does not mean that character names are permitted because
they aren't right now. We can change the meaning of the macro and have it
contain a value that indicates the version of Unicode that is supported by the
compiler.

Ville Voutilainen

Sep 17, 2013, 1:47:01 PM
to std-pr...@isocpp.org
On 17 September 2013 20:40, Thiago Macieira <thi...@macieira.org> wrote:
> On Tuesday, 17 September 2013 19:36:39, Eelis wrote:
>>> It only requires the compiler to have a full list of character names from
>>> the Unicode database. It will also require the C++ standard to mandate a
>>> minimum version of Unicode, update it once in a while, provide a macro to
>>> indicate which version of Unicode is known, etc.
>>
>> Such a macro already exists: __STDC_ISO_10646__.
>
> Obviously the macro does not mean that character names are permitted because
> they aren't right now. We can change the meaning of the macro and have it
> contain a value that indicates the version of Unicode that is supported by the
> compiler.

It already contains the information about which C standard is the one matching the version
of C++ indicated by that macro (by the virtue of the aforementioned C++ standard referring
to a certain C standard), so if a given C++ version refers to a certain Unicode version,
the macro reveals that Unicode version, too, indirectly.

Martinho Fernandes

Sep 17, 2013, 1:49:11 PM
to std-pr...@isocpp.org
On Tue, Sep 17, 2013 at 7:40 PM, Thiago Macieira <thi...@macieira.org> wrote:
>> Such a macro already exists: __STDC_ISO_10646__.
>
> Obviously the macro does not mean that character names are permitted because
> they aren't right now. We can change the meaning of the macro and have it
> contain a value that indicates the version of Unicode that is supported by the
> compiler.

It already indicates enough for this feature. It states a year and
month and makes the required Unicode set consist of "all the
characters that are defined by ISO/IEC 10646, along with all
amendments and technical corrigenda as of the specified year and
month."
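
The macro's value has the form yyyymmL, so a program can inspect it when the implementation defines it. A small sketch, assuming C++11; Iso10646Date, decode_iso10646 and describe_iso10646 are names invented for this example:

```cpp
#include <string>

struct Iso10646Date {
    long year;
    long month;
};

// Splits a yyyymmL value into its year and month parts.
constexpr Iso10646Date decode_iso10646(long yyyymm)
{
    return Iso10646Date{yyyymm / 100, yyyymm % 100};
}

// Reports the macro if the implementation defines it (it is only
// conditionally defined, so the #else branch is a real possibility).
inline std::string describe_iso10646()
{
#ifdef __STDC_ISO_10646__
    const Iso10646Date d = decode_iso10646(__STDC_ISO_10646__);
    return std::to_string(d.year) + "-" + std::to_string(d.month);
#else
    return "not defined";
#endif
}
```

Note that this only tells you which characters wchar_t values are expected to cover; as pointed out above, it says nothing by itself about whether character names are accepted.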

Eelis

Sep 17, 2013, 1:48:05 PM
to std-pr...@isocpp.org, Martinho Fernandes
On 2013-09-17 17:19, Martinho Fernandes wrote:
> I like this feature in principle, but since this is about readability
> and it can actually work against readability, it's not as enticing.

Would you not agree that /most/ features in C++, including the ones we
love, can be abused? :)

Philipp Stephani

Sep 17, 2013, 2:50:25 PM
to std-pr...@isocpp.org


stackm...@hotmail.com wrote:
> On Sunday, 15 September 2013 17:34:08 UTC+2, Philipp Stephani wrote:
>> I think it's a very good idea. Many other languages have it, and it's a simple and localized change.
>
> Name a few please. I am not aware of a single one and a quick search on Google did not give any results.

Eelis

Sep 17, 2013, 4:15:05 PM
to std-pr...@isocpp.org, Philipp Stephani
Ah, cool. I did not know about these. Thanks!

Richard Smith

Sep 17, 2013, 6:03:58 PM
to std-pr...@isocpp.org
This thread seems to be missing justification. This proposal imposes a significant cost on compiler vendors (and indeed on programmers, who now need to learn another obscure lexical rule) and I've not seen anyone present any compelling use cases.

So, it would be useful if someone could provide:
 (a) some examples of actual code using this feature to good effect in Perl or Python
 (b) a demonstration that this should be a core language feature (as opposed to, say, a UDL, much as Ville proposed: R"(\u{PER MILLE SIGN})"_symbolic_unicode)

I would not expect this proposal to stand much chance in EWG without more analysis in this direction.

I note also that Perl's approach allows for character naming schemes other than the official Unicode character names, which might suggest to some that this proposal is insufficient as-is, and that a UDL might be a better approach.

Finally, compilers are increasingly allowing UTF-8 source files. Given such a compiler, when would this proposal be preferable to direct use of the relevant characters?

Eelis

Sep 17, 2013, 6:27:28 PM
to std-pr...@isocpp.org, Richard Smith
On 2013-09-18 00:03, Richard Smith wrote:
> Finally, compilers are increasingly allowing UTF-8 source files. Given
> such a compiler, when would this proposal be preferable to direct use of
> the relevant characters?

When the characters are nonprintable, or when project coding standards
require ASCII source files, for example.

Klaim - Joël Lamotte

Sep 17, 2013, 6:29:03 PM
to std-pr...@isocpp.org

On Wed, Sep 18, 2013 at 12:03 AM, Richard Smith <ric...@metafoo.co.uk> wrote:
> (a) some examples of actual code using this feature to good effect in Perl or Python

By the way, since Python is mostly built on standard proposals (PEPs), it's easy to look up the related paper for the rationale.
I'm not sure if it helps here but: http://www.python.org/dev/peps/pep-0263/

Zhihao Yuan

Sep 17, 2013, 7:41:13 PM
to std-pr...@isocpp.org
Richard is asking whether the Unicode names are useful, your
link is talking about the Unicode source file support...

The C++ standard knows Unicode, and an implementation can
pick any encoding to support Unicode. AFAIK, clang supports
Unicode in string literals as well as Unicode identifiers with UTF-8,
but it seems that it does not support other encodings (correct me
if I'm wrong); GCC up to 4.8 supports Unicode in string literals
with any encoding (-finput-charset; I love GB18030), but does
not support Unicode identifiers.

Considering that C++ is portable to systems which do
not even recognize ASCII, I think leaving the encoding
implementation-defined is the right approach.

In short, I don't worry about the readability of C++ source
code without support for Unicode character names.

David Krauss

Sep 17, 2013, 11:16:01 PM
to std-pr...@isocpp.org

The Perl feature goes much further and allows user-defined aliases, character sequences, and fuzzy matching. It appears to be a plugin, not part of their core language.

The Python 3.3 feature has more parity. That documentation links to a Unicode spec with abbreviations and corrections, including "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET" following Martinho's mention, but there's still no LAMBDA. (The hooked V thing looks to me more like an orthographical issue.)
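
To illustrate the corrections idea mentioned above: a lookup can consult the immutable formal names first and a separate list of corrected spellings second, so both the misspelled original and the fixed form resolve to the same code point. This is a sketch only; NameEntry, find_code_point and the two small tables are invented for the example and are not taken from any real implementation.

```cpp
#include <cstring>

struct NameEntry {
    const char* name;
    char32_t    code_point;
};

// Formal names, including the misspelling that can never be changed.
static const NameEntry kFormalNames[] = {
    {"PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET", 0xFE18},
    {"PER MILLE SIGN", 0x2030},
};

// Corrected spellings that alias entries in the formal table.
static const NameEntry kCorrections[] = {
    {"PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET", 0xFE18},
};

// Returns the code point for a name or corrected alias, or 0 if unknown.
// (0 works as a sentinel here only because no table entry maps to U+0000.)
inline char32_t find_code_point(const char* name)
{
    for (const NameEntry& e : kFormalNames)
        if (std::strcmp(e.name, name) == 0)
            return e.code_point;
    for (const NameEntry& e : kCorrections)
        if (std::strcmp(e.name, name) == 0)
            return e.code_point;
    return 0;
}
```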

I tried Googling for a while but couldn't find any other mention of this feature besides a bug report, and a StackOverflow answer which had copy-pasted the documentation extraneously. At best it's obscure.

Is there anything my technique above doesn't do? It even gets codepoints as compile-time constants, just like character literals. That's something even Perl doesn't do.

Once again:


#define PER_MILLE_SIGN "\u2030"

#define CODEPOINT_(x) * U ## x // Prepend char32_t prefix, get first element of string literal.
#define CODEPOINT(x) CODEPOINT_(x) // Tame catenation operator: expand x before pasting.

const wchar_t *s1 = L"" PER_MILLE_SIGN; // const: string literals are arrays of const characters
const char *s2 = "123 " PER_MILLE_SIGN;
char16_t c{ CODEPOINT( PER_MILLE_SIGN ) }; // Narrowing safe; CODEPOINT is constant expression.
