Add support for U+ prefix for Unicode literals


bumblebr...@gmail.com

Jan 7, 2017, 11:08:21 AM
to ISO C++ Standard - Future Proposals
I'd just use the preprocessor to define it as: 

#define _U+ 0x

that way it's easier to identify Unicode code points from other hex data at a glance.

D. B.

Jan 7, 2017, 11:12:49 AM
to std-pr...@isocpp.org
What is the proposal? This can already be done, though I suspect user-defined literals are a far cleaner solution than macros. Do you want the Committee to standardise a trivial macro or UDL? Seems unlikely.

bumblebr...@gmail.com

Jan 7, 2017, 11:14:30 AM
to ISO C++ Standard - Future Proposals
The main problem is that the preprocessor doesn't allow non-alphanumeric characters (other than underscores and hyphens) in macro names.

Nicol Bolas

Jan 7, 2017, 11:17:00 AM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com

We have a perfectly adequate way to identify a Unicode codepoint in C++:

char32_t c = U'\U0001F496';
char32_t c_short = U'\u1234';

If adding a macro makes this a bit more readable for you, that's up to you. But "U'\u1234'" is no shorter than "_U+(1234)" in terms of character count.

bumblebr...@gmail.com

Jan 7, 2017, 11:19:52 AM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
The plan was to #define _U+ to just U+; maybe I should update the OP? Oh, and + isn't supported in ISO 646 or di/trigraphs either, so it's pretty much impossible currently.

Nicol Bolas

Jan 7, 2017, 11:30:18 AM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
On Saturday, January 7, 2017 at 11:19:52 AM UTC-5, bumblebr...@gmail.com wrote:
The plan was to #define _U+ to just U+; maybe I should update the OP? Oh, and + isn't supported in ISO 646 or di/trigraphs either, so it's pretty much impossible currently.

I don't care. The existing code, while not entirely optimal, is good enough. Yes, `U+ABCD` is indeed significantly shorter than `U'\uABCD'`.

But that alone is not a good reason to add such a feature.

bumblebr...@gmail.com

Jan 7, 2017, 11:35:16 AM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
It's an unreadable mess at best.

bumblebr...@gmail.com

Jan 7, 2017, 11:36:46 AM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
Also, it's incredibly ironic that the minus symbol is supported, but not the plus symbol.

Nicol Bolas

Jan 7, 2017, 12:28:05 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
On Saturday, January 7, 2017 at 11:35:16 AM UTC-5, bumblebr...@gmail.com wrote:
It's an unreadable mess at best.

That's your opinion, and you are welcome to it. But if you want to make a change to the C++ language, you'll need better motivation than this.

I'm sure that for people who deal in Unicode codepoints constantly, the language can be unwieldy. But since a macro is perfectly capable of resolving that problem for that relatively small set of users, I see no need to add a new type of literal to the language just for them.

bumblebr...@gmail.com

Jan 7, 2017, 12:42:20 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
I literally just explained the problem with a macro to you.

The + symbol cannot be part of a macro's name; the preprocessor literally doesn't allow that.

When the preprocessor reads the "#define U+ 0x" line, the macro's name stops at the U; the "+ 0x" that follows becomes the replacement text, not part of the name.

Without language support, it will not work.
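
A concrete illustration of what actually happens, as a sketch (most compilers will at least warn about the missing whitespace after the macro name):

#define U+ 0x            // actually defines an object-like macro named U whose replacement list is "+ 0x"
char32_t c = U+1F496;    // expands to "+ 0x + 1F496": stray tokens, not a hex literal,
                         // so this line does not compile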

Nicol Bolas

Jan 7, 2017, 12:49:53 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
On Saturday, January 7, 2017 at 12:42:20 PM UTC-5, bumblebr...@gmail.com wrote:
I literally just explained the problem with a macro to you.

The + symbol cannot be part of a macro's name; the preprocessor literally doesn't allow that.

But you could use `U_`. The problem you're trying to solve is that the current syntax is unwieldy. The solution doesn't have to be "U+".
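
For what it's worth, a minimal sketch of such a macro, using token pasting (the exact spelling is illustrative, nothing standardized):

#define U_(hex) (char32_t(0x##hex))   // pastes 0x onto the digits: U_(1F496) becomes char32_t(0x1F496)

char32_t heart = U_(1F496);           // same value as U'\U0001F496'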

bumblebr...@gmail.com

Jan 7, 2017, 12:51:18 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
That's true, but U+ is THE standard way to represent Unicode codepoints in every single language except C and C++

D. B.

Jan 7, 2017, 12:57:39 PM
to std-pr...@isocpp.org
And now, 12 piecemeal posts in, we have something approaching a rationale.

bumblebr...@gmail.com

Jan 7, 2017, 1:10:19 PM
to ISO C++ Standard - Future Proposals
Yeah, sorry guys, this is my first time submitting a language feature proposal.

Nicol Bolas

Jan 7, 2017, 1:57:16 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
On Saturday, January 7, 2017 at 12:51:18 PM UTC-5, bumblebr...@gmail.com wrote:
That's true, but U+ is THE standard way to represent Unicode codepoints in every single language except C and C++

That might be a rationale if it were true.

But it isn't. Let's take a quick survey of programming languages and how to declare a Unicode codepoint:

Swift: "\u{1F496}". This gets a string rather than a character.

Go: '\u12e4'

Java: I'm not a Java expert, but my light Googling skills seem to reveal that there is no codepoint type in Java. `char` and `Character` both represent UTF-16 code units. They just use a regular 32-bit integer for codepoints. So the closest you'll get is `0x12E4`. You can of course get a string literal containing a codepoint: "\u12E4".

C#: Similar to Java.

Not one of these languages permits `U+XXXXXXXX` syntax for naming a codepoint. "U+XX" is the common way to name a codepoint by value in text, not in actual programming languages.

Andrey Semashev

Jan 7, 2017, 2:03:39 PM
to std-pr...@isocpp.org
On 01/07/17 20:51, bumblebr...@gmail.com wrote:
> That's true, but U+ is THE standard way to represent Unicode codepoints
> in every single language except C and C++

No, it's not.

Python:

>>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name
'\u0394'
>>> "\u0394" # Using a 16-bit hex value
'\u0394'
>>> "\U00000394" # Using a 32-bit hex value
'\u0394'

https://docs.python.org/3/howto/unicode.html


Ruby:

\unnnn Unicode code point U+nnnn (Ruby 1.9 and later)
\u{nnnnn} Unicode code point U+nnnnn with more than four hex digits
must be enclosed in curly braces

https://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals


D:

EscapeSequence:
[...]
\u HexDigit HexDigit HexDigit HexDigit
\U HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit
[...]

https://dlang.org/spec/lex.html


Java:

'\u03a9'

https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.10.4


JavaScript:

\uXXXX The Unicode character specified by the four hexadecimal digits
XXXX. For example, \u00A9 is the Unicode sequence for the copyright
symbol. See Unicode escape sequences.
\u{XXXXX} Unicode code point escapes. For example, \u{2F804} is the
same as the simple Unicode escapes \uD87E\uDC04.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Grammar_and_types#String_literals


PHP:

\u{[0-9A-Fa-f]+} the sequence of characters matching the regular
expression is a Unicode codepoint, which will be output to the string as
that codepoint's UTF-8 representation (added in PHP 7.0.0)

http://php.net/manual/en/language.types.string.php


Perl:

"\x{1d45b}"

http://www.perl.com/pub/2012/04/perlunicook-unicode-literals-by-number.html


Bash:

\0nn
\xnn

http://www.tldp.org/LDP/abs/html/escapingsection.html


VB.NET:

Seems to be no immediate way to escape Unicode literals, but here is a
workaround:

ChrW(&H25B2)

http://stackoverflow.com/a/3144774/4636534


C#:

'\u0066'

https://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx


FreePascal:

const
HalfNoteString = UnicodeString(#$D834#$DD5E);

http://stackoverflow.com/a/6963182/4636534


Common Lisp:

#\λ ; => #\GREEK_SMALL_LETTER_LAMDA
#\u03BB ; => #\GREEK_SMALL_LETTER_LAMDA

https://learnxinyminutes.com/docs/common-lisp/


In fact, is there a single language that uses the U+nnnn notation
directly? I'd say some variation of the \unnnn notation is the most widespread.

bumblebr...@gmail.com

Jan 7, 2017, 2:14:02 PM
to ISO C++ Standard - Future Proposals
Except Swift, Go, and most of the other languages on the list are based on or inspired by C++.

That's like saying we shouldn't do X in Fortran because C doesn't support it. Who cares? How is it relevant?

Also, the Unicode standard itself uses the U+ prefix, which should be given a bit more weight.

Nicol Bolas

Jan 7, 2017, 2:18:30 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
On Saturday, January 7, 2017 at 2:14:02 PM UTC-5, bumblebr...@gmail.com wrote:
Except Swift, Go, and most of the other languages on the list are based on or inspired by C++.

That's like saying we shouldn't do X in Fortran because C doesn't support it. Who cares? How is it relevant?

It's relevant because you claimed that "U+ is THE standard way to represent Unicode codepoints in every single language except C and C++". That was your justification for the feature: that other languages have it. That justification is not true, and therefore your feature has no justification.

bumblebr...@gmail.com

Jan 7, 2017, 2:21:20 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com
I did go a bit far with that part, you're right, but frankly I don't care how other languages do it (especially when they inherited exactly what we're talking about right now; circular reasoning much?).

The STANDARD uses U+ almost exclusively; the PDF is right here: http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf

Nicol Bolas

Jan 7, 2017, 2:28:46 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com

The thing you're not getting is this: the fact that U+ is how the Unicode standard names codepoints is not a good enough reason to make such a change to C++. Currently, we have a perfectly adequate solution for how to specify a codepoint. Therefore, your justification needs to demonstrate the following:

1) That our present solution is inadequate for the needs of a significant number of C++ programmers.

2) That "U+" syntax is significantly better than existing alternatives.

D. B.

Jan 7, 2017, 2:57:54 PM
to std-pr...@isocpp.org
Talk of the Unicode Standard seems of limited utility when we're talking about (A) mere notation, not behaviour and (B) notation that directly conflicts with current parsing and conventions in C++. Even if the wording in the Standard says something, that doesn't mean the language should or can bend its own established syntax to accommodate that notation, in a way that would be confusing to many. As long as the observable behaviour of the final program doesn't contradict the Standard, I don't really see an issue.

(As an aside, the Unicode Standard occasionally does things like saying the character to use in contractions is the single right quote mark, not, y'know, an actual apostrophe... so I give it a healthy dose of scepticism.)

Thiago Macieira

Jan 7, 2017, 4:55:11 PM
to std-pr...@isocpp.org
On Saturday, January 7, 2017, at 08:36:46 PST, bumblebr...@gmail.com wrote:
> Also, it's incredibly ironic that the minus symbol is supported, but not
> the plus symbol.

The minus symbol is not supported. The underscore symbol is.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Matthew Woehlke

Jan 9, 2017, 11:41:48 AM
to std-pr...@isocpp.org, bumblebr...@gmail.com
On 2017-01-07 12:51, bumblebr...@gmail.com wrote:
> That's true, but U+ is THE standard way to represent Unicode codepoints in
> every single language except C and C++

Let's ignore whether or not this is true for a moment.

You are proposing:

auto x = U+1234; // equivalent to U'\u1234'

But...

auto U = 5;
auto x = U+1234; // had better be 1239

This is just not going to work; reserving `U` as a keyword is not going
to happen, and that would be the only conceivable way that the meaning
of `U+` could be changed from its current meaning, which is an
identifier and an operator.

I think a better idea would be to have either `U` or `_U` as a literal
suffix. I believe the second can be a UDL, and is the same amount of typing.
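
Something like this, as a rough sketch of the `_U` spelling (returning char32_t here is an assumption for illustration, not part of any proposal):

constexpr char32_t operator""_U(unsigned long long cp)
{
    return static_cast<char32_t>(cp);   // narrow the literal's value to a codepoint type
}

static_assert(0x1F496_U == U'\U0001F496', "same codepoint either way");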

--
Matthew

Nicol Bolas

Jan 9, 2017, 12:13:28 PM
to ISO C++ Standard - Future Proposals, bumblebr...@gmail.com, mwoehlk...@gmail.com

In order to do that, you'd have to explicitly use a `0x` prefix for hexadecimal. That way, the compiler can tell the difference between the identifier `ABCD_U` and the literal `0xABCD_U`. And I'd say that you want that underscore, since `0xABCDU` just looks weird (though functional for parsing purposes).
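
With a `_U` suffix along the lines sketched in the previous message (illustrative only), that distinction reads like this:

constexpr char32_t c = 0xABCD_U;   // a hex literal carrying the _U suffix
int ABCD_U = 42;                   // still an ordinary identifier; no clash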

Thiago Macieira

Jan 9, 2017, 12:52:55 PM
to std-pr...@isocpp.org
On Monday, January 9, 2017 11:41:32 PST, Matthew Woehlke wrote:
> I think a better idea would be to have either `U` or `_U` as a literal
> suffix. I believe the second can be a UDL, and is the same amount of typing.

U suffix is already in use and has been since the 1970s.

1234U