C++11 Feature Proposal: Allowing/Requiring u"" for constant UTF-16 strings


Avi Drissman

Sep 24, 2014, 5:38:53 PM
to Chromium-dev, blink-dev
What:
C++ now has the syntax u"" to declare a constant UTF-16 string.

Why:
base::ASCIIToUTF16 is used all over the place with constant string parameters to create string16 instances. It would be nice to eliminate it (and its ASCIITo/ToASCII friends) and switch over to constant UTF-16 strings built by the compiler. It's less typing, probably faster, and cleaner.

A random example from src/net/ftp/ftp_util.cc:

    CHECK_EQ(1, map_[ASCIIToUTF16("jan")]);
    CHECK_EQ(2, map_[ASCIIToUTF16("feb")]);

becomes

    CHECK_EQ(1, map_[u"jan"]);
    CHECK_EQ(2, map_[u"feb"]);

Non-constant uses of ASCIIToUTF16 (and ASCIIToWide) would then be handled by the UTF8 variants.

(This might require redeclaring string16 as being built on top of char16_t rather than uint16 as it is today, with lossless conversions to/from wstring on Windows where today it is a typedef for wstring.)
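
Roughly, the redeclaration would look something like this (a sketch of the idea, not the actual contents of base/strings/string16.h; the custom char_traits for the uint16 case are elided):

    // Today (simplified):
    #if defined(WCHAR_T_IS_UTF16)
    typedef wchar_t char16;
    typedef std::wstring string16;   // Windows: free interop with Win32
    #else
    typedef uint16 char16;
    typedef std::basic_string<char16> string16;
    #endif

    // Proposed: one definition everywhere, so u"" literals just work.
    typedef char16_t char16;
    typedef std::basic_string<char16_t> string16;  // i.e. std::u16string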

Avi

Victor Khimenko

Sep 24, 2014, 5:51:28 PM
to Avi Drissman, Chromium-dev, blink-dev
Have you actually looked in detail at the ability to go from here to there? I think it's a pretty good proposal, but it depends heavily on the ability to introduce the feature piecemeal.

A single patch that changes thousands of files all at once is much harder to swallow than a gradual transition.

Another question is efficiency: u"jan" is obviously faster than ASCIIToUTF16("jan") (or at least not slower), but Windows uses UTF-16 heavily in its API. If the result of such a conversion is a multitude of places where char16_t strings are copied to wchar_t-based strings and back, then this is not a good idea.
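
To illustrate the worry (a hypothetical sketch, assuming string16 becomes std::u16string): every Win32 call site would pay a copy in each direction:

    #include <windows.h>
    #include <string>

    void SetTitle(HWND hwnd, const std::u16string& title) {
      // Copy #1: char16_t -> wchar_t, purely to satisfy the API signature.
      std::wstring wide(title.begin(), title.end());
      SetWindowTextW(hwnd, wide.c_str());
    }

    std::u16string GetTitle(HWND hwnd) {
      wchar_t buf[256];
      int len = GetWindowTextW(hwnd, buf, 256);
      // Copy #2: wchar_t -> char16_t on the way back out.
      return std::u16string(buf, buf + len);
    }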

Peter Kasting

Sep 24, 2014, 6:15:37 PM
to Avi Drissman, Chromium-dev, blink-dev
On Wed, Sep 24, 2014 at 2:38 PM, Avi Drissman <a...@chromium.org> wrote:
(This might require redeclaring string16 as being built on top of char16_t rather than uint16 as it is today, with lossless conversions to/from wstring on Windows where today it is a typedef for wstring.)

This was the worry I had.  I think we should find out whether this is the case.  If it is, we should probably propose "use char16_t in place of our existing char16 everywhere" as a first step before allowing u"".

PK 

Scott Graham

Sep 24, 2014, 8:40:14 PM
to Avi Drissman, Chromium-dev, blink-dev
Doesn't seem to compile on current VS, fwiw, so it'd have to be limited to non-Windows for now.



d:\src\x>type x.cc
#include <windows.h>
#include <stdio.h>

int main() {
  HANDLE f = CreateFileW(u"wee.txt", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                         FILE_ATTRIBUTE_ARCHIVE, NULL);
  printf("%p\n", f);
}

d:\src\x>cl /nologo x.cc
x.cc
x.cc(5) : error C2065: 'u' : undeclared identifier
x.cc(5) : error C2143: syntax error : missing ')' before 'string'
x.cc(6) : error C2059: syntax error : ')'



Avi Drissman

Sep 24, 2014, 11:55:20 PM
to Scott Graham, Chromium-dev, blink-dev
If u"" doesn't even compile on Windows today then this is a non-starter. I thought that was a language feature.

Avi

Nico Weber

Sep 25, 2014, 12:00:43 AM
to Avi Drissman, Scott Graham, Chromium-dev, blink-dev
It's a language feature, but MSVS doesn't implement all language features. http://msdn.microsoft.com/en-us/library/hh567368.aspx apparently has a list; I'll move the things they don't support to the banned table. (Sorry for not doing that in the first place.)

Avi Drissman

Sep 25, 2014, 12:04:54 AM
to Nico Weber, Scott Graham, Chromium-dev, blink-dev
When you do that, can you make sure to note that it's there because of MSVS, and that when MSVS supports the feature we should reconsider it?

Nico Weber

Sep 25, 2014, 12:05:27 AM
to Avi Drissman, Scott Graham, Chromium-dev, blink-dev

Victor Khimenko

Sep 25, 2014, 5:18:09 AM
to Avi Drissman, Scott Graham, Chromium-dev, blink-dev
On Thu, Sep 25, 2014 at 7:54 AM, Avi Drissman <a...@chromium.org> wrote:
If u"" doesn't even compile on Windows today then this is a non-starter. I thought that was a language feature.

Let's not read that too formally, shall we? The goal of this suggestion was not to enable a particular feature but to make people's lives easier through its use. Yes, "raw" Unicode literals are not supported by MSVC and should obviously be banned.

But that does not mean they are useless for us. MSVC has always supported UTF-16 strings; it just called them wstrings. And it provided a way to spell them in code; it just used the L"blah-blah-blah" syntax. And it uses UTF-16 in its API quite extensively.

Thus we could switch char16 to char16_t on non-Windows platforms and leave it as wchar_t on Windows. Then we could provide something like _T to use with constants (we probably should not use _T itself since it's in a reserved namespace, but something like UTF8CONST would work).

Yes, this will not be as nice as "raw" Unicode literals, but it will provide 90% of the expected benefits anyway, and it's a more backward-compatible change to boot.
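
Something like this untested sketch (OS_WIN is the existing platform define; the macro name follows the correction in the next message):

    #if defined(OS_WIN)
    #define UTF16CONST(s) L##s   // wchar_t is 16-bit UTF-16 on Windows
    #else
    #define UTF16CONST(s) u##s   // char16_t everywhere else
    #endif

    // Usage: base::string16 month = UTF16CONST("jan");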


Victor Khimenko

Sep 25, 2014, 5:19:15 AM
to Avi Drissman, Scott Graham, Chromium-dev, blink-dev
On Thu, Sep 25, 2014 at 1:17 PM, Victor Khimenko <kh...@chromium.org> wrote:

Then we could provide something like _T to use with constants (we probably should not use _T itself since it's in a reserved namespace, but something like UTF8CONST would work).

Oops. UTF16CONST("jan"), of course.

Rachel Blum

Sep 8, 2016, 7:05:33 PM
to blink-dev, a...@chromium.org, sco...@chromium.org, chromi...@chromium.org, kh...@chromium.org
Since VS2015 does support this now, can we revive this?

dan...@chromium.org

Sep 9, 2016, 6:27:50 PM
to Rachel Blum, blink-dev, Avi Drissman, Scott Graham, chromium-dev, Victor Khimenko
On Thu, Sep 8, 2016 at 4:02 PM, Rachel Blum <gr...@chromium.org> wrote:
Since VS2015 does support this now, can we revive this?

Peter Kasting

Sep 9, 2016, 6:31:42 PM
to Dana Jansens, Rachel Blum, blink-dev, Avi Drissman, Scott Graham, chromium-dev, Victor Khimenko
On Fri, Sep 9, 2016 at 3:26 PM, <dan...@chromium.org> wrote:
On Thu, Sep 8, 2016 at 4:02 PM, Rachel Blum <gr...@chromium.org> wrote:
Since VS2015 does support this now, can we revive this?


Yes.

Raw string literals are useful for cases that would otherwise have a lot of escaped characters.

Unicode string literals are useful for declaring a UTF-16 string constant as such, instead of as a narrow or wide string constant that then has to be converted by a function call at runtime.

The former is a bigger readability win in the few cases where it's applicable.  The latter is a ton of smaller wins in a huge number of places.
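
A quick illustration of both (hypothetical snippets; the second assumes string16 becomes char16_t-based):

    // Raw string literal: pays off where escaping would obscure the content.
    const char* path = R"(C:\dir\name "quoted" \\server\share)";
    // versus: "C:\\dir\\name \"quoted\" \\\\server\\share"

    // UTF-16 literal: a compile-time constant instead of a runtime call.
    base::string16 before = ASCIIToUTF16("jan");
    base::string16 after = u"jan";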

PK

Rachel Blum

Sep 12, 2016, 7:48:47 PM
to Peter Kasting, Dana Jansens, blink-dev, Avi Drissman, Scott Graham, chromium-dev, Victor Khimenko
Yay? Nay? Indifferent? Let's discuss user-defined literals instead?

Scott Graham

Sep 12, 2016, 7:57:15 PM
to Rachel Blum, Peter Kasting, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
I guess this is probably to be expected, but char16_t isn't wchar_t, so interaction with Win32 doesn't look like it'd be great. But maybe that doesn't matter too much for us.

But otherwise, seems fine to me?

d:\src\x>type x.cc
#include <windows.h>
#include <stdio.h>

int main() {
  HANDLE f = CreateFileW(u"wee.txt", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                         FILE_ATTRIBUTE_ARCHIVE, NULL);
  printf("%p\n", f);
}

d:\src\x>cl /Bv /nologo x.cc
Compiler Passes:
 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe:        Version 19.00.24213.1
 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\c1.dll:        Version 19.00.24213.1
 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\c1xx.dll:      Version 19.00.24213.1
 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\c2.dll:        Version 19.00.24213.1
 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\link.exe:      Version 14.00.24213.1
 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\mspdb140.dll:  Version 14.00.24210.0
 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\1033\clui.dll: Version 19.00.24213.1

x.cc
x.cc(6): error C2664: 'HANDLE CreateFileW(LPCWSTR,DWORD,DWORD,LPSECURITY_ATTRIBUTES,DWORD,DWORD,HANDLE)': cannot convert argument 1 from 'const char16_t [8]' to 'LPCWSTR'
x.cc(6): note: Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast

Peter Kasting

Sep 12, 2016, 10:31:45 PM
to Scott Graham, Rachel Blum, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
On Mon, Sep 12, 2016 at 4:56 PM, Scott Graham <sco...@chromium.org> wrote:
x.cc
x.cc(6): error C2664: 'HANDLE CreateFileW(LPCWSTR,DWORD,DWORD,LPSECURITY_ATTRIBUTES,DWORD,DWORD,HANDLE)': cannot convert argument 1 from 'const char16_t [8]' to 'LPCWSTR'
x.cc(6): note: Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast

Hmm.  MSDN says wchar_t and char16_t are both 16-bit character types.  Doesn't say if they differ in signedness; I assume both are signed.  If so, this is irritating.  The types are effectively the same, but because they're fundamental types with different labels, the compiler claims they're unrelated.

Since string16 is just a basic_string<...>, we probably can't (sanely) add a constructor for basic_string<wchar_t> from char16_t*.  And if we make string16 build on char16_t (hint: there's already one of these called std::u16string; we should switch to that if we want to do this), then we probably can't do .c_str() on one and pass it to a win32 API anymore.

Maybe we can officially Not Care about the latter, and force people to hop through more awkward casts?  It would probably make sense if it lets us get rid of a lot of ASCIIToUTF16() calls.
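
The awkward cast in question would look something like this (a sketch; it works in practice on Windows because both character types are 16-bit there, though it is formally type-punning):

    std::u16string name = u"wee.txt";
    HANDLE f = CreateFileW(reinterpret_cast<LPCWSTR>(name.c_str()),
                           GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_ARCHIVE, NULL);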

PK

Rachel Blum

Sep 13, 2016, 1:56:37 AM
to Peter Kasting, Scott Graham, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
Doesn't that (char16_t/wchar_t interchangeability) change depending on if UNICODE is defined or not? (Or some similar arcane incantations?)


Adam Rice

Sep 13, 2016, 2:29:09 AM
to Rachel Blum, Peter Kasting, Scott Graham, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
Can't we just stop using UTF-16?

Avi Drissman

Sep 13, 2016, 11:00:27 AM
to Adam Rice, Rachel Blum, Peter Kasting, Scott Graham, Dana Jansens, blink-dev, chromium-dev, Victor Khimenko
On Tue, Sep 13, 2016 at 2:28 AM, Adam Rice <ri...@chromium.org> wrote:
Can't we just stop using UTF-16?

Replacing it with... ?

Re std::u16string, I would totally be on-board with replacing our custom "string16" class with it, a standard type, now that we have it in C++11.

Avi

Scott Graham

Sep 13, 2016, 12:55:15 PM
to Rachel Blum, Peter Kasting, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
On Mon, Sep 12, 2016 at 10:55 PM, Rachel Blum <gr...@chromium.org> wrote:
Doesn't that (char16_t/wchar_t interchangeability) change depending on if UNICODE is defined or not? (Or some similar arcane incantations?)

I don't think /DUNICODE changes anything (but maybe someone else knows better?). AFAIK, they are both 16-bit unsigned values.

In practice Win32 *W() APIs have been defined as taking UTF-16 for a long time. Maybe they were kept as different for UCS-2 legacy reasons, or UTF-16LE vs. BE? Or maybe because L'' is 32-bit on other platforms, so it'd make things confusing if you started mixing L'' with u''. But anyway, I guess from a language pov, wchar_t is __wchar_t and char16_t is ... whatever, and they just ain't the same.
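
The distinction is easy to demonstrate (illustrative snippet):

    #include <type_traits>

    // Same size on MSVC (wchar_t is 32-bit on most other platforms)...
    static_assert(sizeof(wchar_t) == sizeof(char16_t), "same size");
    // ...but always distinct fundamental types, so no implicit conversion.
    static_assert(!std::is_same<wchar_t, char16_t>::value, "distinct types");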

I think the reason for UTF-16 is that Windows and ICU work natively with UTF-16.

Peter Kasting

Sep 13, 2016, 3:10:40 PM
to Scott Graham, Rachel Blum, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
On Tue, Sep 13, 2016 at 9:54 AM, Scott Graham <sco...@chromium.org> wrote:
On Mon, Sep 12, 2016 at 10:55 PM, Rachel Blum <gr...@chromium.org> wrote:
Doesn't that (char16_t/wchar_t interchangeability) change depending on if UNICODE is defined or not? (Or some similar arcane incantations?)

I don't think /DUNICODE changes anything (but maybe someone else knows better?). AFAIK, they are both 16-bit unsigned values.

In practice Win32 *W() APIs have been defined as taking UTF-16 for a long time. Maybe they were kept as different for UCS-2 legacy reasons, or UTF-16LE vs. BE?

AFAIK, the win32 APIs really deal in UCS-2 and not UTF-16, but I could be misremembering.
 
But anyway, I guess from a language pov, wchar_t is __wchar_t and char16_t is ... whatever, and they just ain't the same.

Yeah, I think they're just different types at the core, even if they're the same size and signedness.  I would be surprised if there was a switch to work around this.

I wonder if there would be a way to rapid-prototype switching string16 to u16string, to see just how bad the workarounds to keep Windows compiling would be.  It feels like this is probably the right way forward.

I think the reason for UTF-16 is that Windows and ICU work natively with UTF-16.

There were huge threads on UTF-8 vs. UTF-16 in different places in the browser Back In The Day, but I don't remember what they said.

PK 

Allen Bauer

Sep 13, 2016, 3:17:32 PM
to Peter Kasting, Scott Graham, Rachel Blum, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
On Tue, Sep 13, 2016 at 12:09 PM, 'Peter Kasting' via Chromium-dev <chromi...@chromium.org> wrote:
On Tue, Sep 13, 2016 at 9:54 AM, Scott Graham <sco...@chromium.org> wrote:
On Mon, Sep 12, 2016 at 10:55 PM, Rachel Blum <gr...@chromium.org> wrote:
Doesn't that (char16_t/wchar_t interchangeability) change depending on if UNICODE is defined or not? (Or some similar arcane incantations?)

I don't think /DUNICODE changes anything (but maybe someone else knows better?). AFAIK, they are both 16-bit unsigned values.

In practice Win32 *W() APIs have been defined as taking UTF-16 for a long time. Maybe they were kept as different for UCS-2 legacy reasons, or UTF-16LE vs. BE?

AFAIK, the win32 APIs really deal in UCS-2 and not UTF-16, but I could be misremembering.

Windows should be UTF-16. It was UCS-2 in the Win9x era and early NT4. They now support UTF-16 surrogate pairs needed to handle code points outside the Basic Multilingual Plane (BMP).
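
For example:

    // U+1F600 is outside the BMP, so in UTF-16 it takes two char16_t
    // code units, a surrogate pair:
    const char16_t smiley[] = u"\U0001F600";
    // smiley[0] == 0xD83D (high surrogate)
    // smiley[1] == 0xDE00 (low surrogate)
    // smiley[2] == 0 (terminator)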

-- 
Allen Bauer


Peter Kasting

Sep 13, 2016, 3:24:03 PM
to Allen Bauer, Scott Graham, Rachel Blum, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
On Tue, Sep 13, 2016 at 12:16 PM, Allen Bauer <kyl...@google.com> wrote:
On Tue, Sep 13, 2016 at 12:09 PM, 'Peter Kasting' via Chromium-dev <chromi...@chromium.org> wrote:
On Tue, Sep 13, 2016 at 9:54 AM, Scott Graham <sco...@chromium.org> wrote:
On Mon, Sep 12, 2016 at 10:55 PM, Rachel Blum <gr...@chromium.org> wrote:
Doesn't that (char16_t/wchar_t interchangeability) change depending on if UNICODE is defined or not? (Or some similar arcane incantations?)

I don't think /DUNICODE changes anything (but maybe someone else knows better?). AFAIK, they are both 16-bit unsigned values.

In practice Win32 *W() APIs have been defined as taking UTF-16 for a long time. Maybe they were kept as different for UCS-2 legacy reasons, or UTF-16LE vs. BE?

AFAIK, the win32 APIs really deal in UCS-2 and not UTF-16, but I could be misremembering.

Windows should be UTF-16. It was UCS-2 in the Win9x era and early NT4. They now support UTF-16 surrogate pairs needed to handle code points outside the Basic Multilingual Plane (BMP).

Cool, that actually makes me feel a lot better, since I've been worried about whether we had any UCS-2-is-not-UTF-16 issues on Windows.

It also means that, if we can find some convenient way to reinterpret_cast between UTF-16 and Wide strings on Win, stuff should Just Work.

PK 

Rachel Blum

Sep 13, 2016, 4:12:56 PM
to Peter Kasting, Scott Graham, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko

On Tue, Sep 13, 2016 at 12:09 PM, Peter Kasting <pkas...@google.com> wrote:
I wonder if there would be a way to rapid-prototype switching string16 to be u16string

Don't we just need to replace 
typedef std::wstring string16;

with
typedef std::u16string string16;

to get a basic idea? (And I suspect the basic idea is "this will be painful")
 
I'd suspect we can ease some of the pain via dcheng's rewrite tools - most of the issues are likely to be "takes LPCWSTR, passing in char16_t*". Anybody with a Windows machine willing to try? :)

Peter Kasting

Sep 13, 2016, 4:18:46 PM
to Rachel Blum, Scott Graham, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko
On Tue, Sep 13, 2016 at 1:10 PM, Rachel Blum <gr...@chromium.org> wrote:

On Tue, Sep 13, 2016 at 12:09 PM, Peter Kasting <pkas...@google.com> wrote:
I wonder if there would be a way to rapid-prototype switching string16 to be u16string

Don't we just need to replace 
typedef std::wstring string16;

with
typedef std::u16string string16;

to get a basic idea? (And I suspect the basic idea is "this will be painful")

You ignored the second half of my sentence :)

The important bit to prototype was to see what the Windows fixes would look like.  Prototyping just switching string16's type is trivial.

I'd suspect we can ease some of the pain via dcheng's rewrite tools - most of the issues are likely to be "takes LPCWSTR, passing in char16_t*". Anybody with a Windows machine willing to try? :)

Sure, "takes LPCWSTR, passing char16_t*" is the primary problem case, the question is what the solution is.

We could reverse course on banning UTF16ToWide() (and the reverse) in Windows code, reintroduce it everywhere we make these conversions, and implement it as a cast.  It would be kind of depressing to do this after we spent so long removing it.

We could see if there's a way to add a cast-to-char16_t* sort of operator or some kind of implicit reinterpret_cast somehow.  The only ideas I can think of either probably wouldn't compile or would require doing something like placing methods in the std:: namespace.

We could try to find a way to get wstring constructible with a char16_t*, which probably has similar issues.
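
For the cast-based route, Windows-only helpers might look like this (a sketch with made-up names; the pointer-level reinterpret_cast works in practice because the types are layout-identical there, though the standard doesn't bless it):

    inline const wchar_t* AsWide(const char16_t* utf16) {
      return reinterpret_cast<const wchar_t*>(utf16);
    }
    inline const char16_t* AsUTF16(const wchar_t* wide) {
      return reinterpret_cast<const char16_t*>(wide);
    }

    // e.g. CreateFileW(AsWide(path.c_str()), GENERIC_WRITE, ...);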

PK

Rachel Blum

Sep 13, 2016, 5:29:14 PM
to Peter Kasting, Scott Graham, Dana Jansens, blink-dev, Avi Drissman, chromium-dev, Victor Khimenko

On Tue, Sep 13, 2016 at 1:18 PM, Peter Kasting <pkas...@google.com> wrote:
the question is what the solution is.

All Windows versions we support use UTF-16 (instead of UCS-2), so we don't need UTF16ToWide(); a reinterpret_cast will suffice. I'd lean towards automated code rewriting instead of adding an implicit cast, though. (That's what I meant by 'rewrite tools', and that was supposed to address the second part :)

Also: even if we inject into the std:: namespace, I don't think there's a way to make it work implicitly: assignments, cast operators, and converting constructors cannot be freestanding.

The question is, do we want to invest that time, or is u"" such a minor gain that we don't care right now?
