Is pointer aliasing with wchar_t, char16_t, char32_t allowed?


Myriachan

Dec 1, 2016, 2:33:55 PM
to ISO C++ Standard - Discussion
Is pointer aliasing legal between wchar_t, char16_t, char32_t and their underlying types?  In other words, is this legal on a platform for which char16_t and wchar_t have an underlying type of "unsigned short", as in Windows?

std::size_t CountNonzeroUnsignedShorts(const unsigned short *p)
{
    return std::wcslen(reinterpret_cast<const wchar_t *>(p));
}

Or its inverse case:

wchar_t *wmemset(wchar_t *dest, wchar_t c, size_t count)
{
    // __stosw is an MSVC intrinsic (<intrin.h>) that stores 16-bit words.
    __stosw(reinterpret_cast<unsigned short *>(dest), c, count);
    return dest;
}


I'm wondering because of the seemingly ambiguous wording in the Standard:

10.4 a type that is the signed or unsigned type corresponding to the dynamic type of the object;

Is unsigned short an "unsigned type corresponding" to wchar_t (on platforms such as Windows NT, Nintendo 3DS, etc.)?

If wchar_t is not allowed to alias its underlying type, then perhaps it should be mentioned in [diff.wchar.t] that this aliasing incompatibility is another one of the differences from C.

Although unrelated, I'm wondering about this wording:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

What is meant by "access the stored value"--particularly, does it include writing in addition to reading?  When you write to a glvalue, you don't access the stored *value* at all, because you're replacing it.  This seems ambiguous to me.

Melissa

Nicol Bolas

Dec 1, 2016, 2:41:59 PM
to ISO C++ Standard - Discussion
On Thursday, December 1, 2016 at 2:33:55 PM UTC-5, Myriachan wrote:
If wchar_t is not allowed to alias its underlying type, then perhaps it should be mentioned in [diff.wchar.t] that this aliasing incompatibility is another one of the differences from C.

[basic.lval]/8 is the principal location that deals with whether aliasing is allowed, and it says nothing about the "underlying type" of any type.

So you can't even alias between `int` and `enum Name : int`, let alone between the character types and their underlying types. So there's no reason to mention it there.
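
A minimal sketch of what this means in practice, assuming C++11 or later. The helper name `ToWide` is hypothetical and does not come from the thread. The `static_assert`s only confirm that these are distinct types; the function converts the values element by element instead of reinterpreting the pointer, which sidesteps the aliasing question entirely.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// wchar_t, char16_t and an enum with a fixed underlying type are all
// distinct types, whatever integer type shares their representation.
enum Name : int {};
static_assert(!std::is_same<wchar_t, unsigned short>::value, "distinct type");
static_assert(!std::is_same<wchar_t, int>::value, "distinct type");
static_assert(!std::is_same<char16_t, unsigned short>::value, "distinct type");
static_assert(!std::is_same<Name, int>::value, "distinct type");

// Hypothetical helper: copy the values instead of aliasing the storage.
// Each element is read as unsigned short and converted to wchar_t.
std::wstring ToWide(const unsigned short *p, std::size_t count)
{
    std::wstring out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        out.push_back(static_cast<wchar_t>(p[i]));
    return out;
}
```

The copy costs an allocation, but every read goes through an lvalue of the object's own type, so nothing depends on how [basic.lval]/8 is interpreted.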

What is meant by "access the stored value"--particularly, does it include writing in addition to reading?  When you write to a glvalue, you don't access the stored *value* at all, because you're replacing it.  This seems ambiguous to me.

From [defns.access]: "to read or modify the value of an object"

FrankHB1989

Dec 4, 2016, 1:10:51 PM
to ISO C++ Standard - Discussion


On Friday, December 2, 2016 at 3:41:59 AM UTC+8, Nicol Bolas wrote:

Totally agreed.

However, there may be further problems with interoperability.

How are these identically named types (like `wchar_t`) mapped from C to C++?

While similar strict aliasing rules also apply in C, a `wchar_t` lvalue there is allowed to alias its "underlying type" because the two are the same type. Things are different in C++: technically, the distinct type `wchar_t` in C++ is not the same entity as `wchar_t` in C. Does this mean that a function declared with `wchar_t` under `extern "C"`, in code translated by a C++ implementation, can never link in a well-defined manner to a function definition written as a strictly conforming C program? Is it true that any C++ code assuming that `wchar_t` is identical between C and C++ TUs violates the ODR, making the program ill-formed, no diagnostic required?

And how about `wchar_t` in the standard library?

(As for real implementations, I don't think they can analyze aliasing across TUs written in unspecified foreign languages, so relying on these rules is the only way to enable type-based alias analysis of a C++ program with such foreign definitions. However, the benefits are doubtful here.)

Nicol Bolas

Dec 4, 2016, 2:53:47 PM
to ISO C++ Standard - Discussion
On Sunday, December 4, 2016 at 1:10:51 PM UTC-5, FrankHB1989 wrote:
How are these identically named types (like `wchar_t`) mapped from C to C++?

Is that a question C++ is supposed to answer? Interop is generally implementation-defined territory; indeed, the C++ object model doesn't even match up with the C object model. In C, you can malloc some memory, cast the pointer, and you suddenly have a valid object of the cast type. That doesn't work in C++; [intro.object]/1 does not permit malloc alone to create an object. You can only create a C++ object via a declaration, temporary, new expression, or union-manipulation.

So how "identical" these things are depends on your implementation. As it always has.
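
A minimal sketch of that distinction, under the C++ rules as described in the post above. The names `Point`, `MakePoint` and `DestroyPoint` are hypothetical. `std::malloc` only obtains storage; the new-expression is what creates the object.

```cpp
#include <cstdlib>
#include <new>

struct Point { int x; int y; };

// Hypothetical helper: allocate with malloc, then create the object with
// placement new. Returns nullptr if the allocation fails.
Point *MakePoint(int x, int y)
{
    void *storage = std::malloc(sizeof(Point));
    if (!storage)
        return nullptr;
    return ::new (storage) Point{x, y};  // the object's lifetime starts here
}

// Counterpart: end the object's lifetime, then release the storage.
void DestroyPoint(Point *p)
{
    if (!p)
        return;
    p->~Point();
    std::free(p);
}
```

Casting the result of `std::malloc` to `Point *` and assigning through it is the C idiom; the version above uses the pointer returned by the new-expression instead.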

FrankHB1989

Dec 4, 2016, 8:19:01 PM
to ISO C++ Standard - Discussion


On Monday, December 5, 2016 at 3:53:47 AM UTC+8, Nicol Bolas wrote:
Is that a question C++ is supposed to answer?
It is, because `extern "C"` is mandated by ISO C++, and it is certainly not within the scope of ISO C.
 
Interop is generally implementation-defined territory; indeed, the C++ object model doesn't even match up with the C object model.
Yes, in general it is implementation-defined here, as specified in [dcl.link]/2. But `extern "C"` is the exception.
 
In C, you can malloc some memory, cast the pointer, and you suddenly have a valid object of the cast type.
Not exactly. Since C99 there has been the notion of effective type, which plays a role similar to that of dynamic type in C++. For a stored object with no declared type, the cast-to type is only the fallback that determines the effective type, and it takes effect only when the object is actually accessed.

 
That doesn't work in C++; [intro.object]/1 does not permit malloc alone to create an object. You can only create a C++ object via a declaration, temporary, new expression, or union-manipulation.

So how "identical" these things are depends on your implementation. As it always has.
However, unlike the OP's question, the identity at issue here concerns first the type in a declarator, which is closer to the static type of an expression. Depending on the answer, it also affects how to determine the dynamic types associated with expressions introduced by the parameters in that declarator.

Strictly speaking, the current rules suffice to answer the question (though they are a bit unclear), but is that answer intended? If so, there should at least be diagnostics. It would also be better to add something to [diff.wchar.t], as the OP suggested, though I don't know whether that would count as editorial.

 

Myriachan

Dec 5, 2016, 3:45:44 PM
to ISO C++ Standard - Discussion

I just tested it on GCC and Clang, and sure enough, the behavior is different in C from C++ in both compilers:

#ifdef __cplusplus
#include <cwchar>
#else
#include <wchar.h>
#endif

unsigned Meow(int *a, wchar_t *b)
{
    *a = 5;
    *b = 10;
    return *a;
}

With optimizations enabled, the C build returns 10 and the C++ build returns 5 when both pointers refer to the same object.  Note that this is a Linux x86-64 test; on other platforms, wchar_t and int may not be the same type in C.
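
For comparison, a sketch of a variant that stays well-defined in C++ even when both pointers refer to the same storage. The name `MeowViaMemcpy` is hypothetical and not part of the test above. It stores the `wchar_t` value with `std::memcpy`, which the compiler has to treat as possibly modifying `*a`.

```cpp
#include <cstring>

// Hypothetical variant of Meow: b is an untyped pointer, and the wchar_t
// value is stored bytewise, so no wchar_t lvalue ever touches an int object.
unsigned MeowViaMemcpy(int *a, void *b)
{
    *a = 5;
    const wchar_t value = 10;
    std::memcpy(b, &value, sizeof value);  // may overwrite the bytes of *a
    return *a;                             // has to observe the memcpy
}
```

Called with both arguments pointing at the same `int`, this returns 10 on a platform where `wchar_t` and `int` have the same size and representation, such as the Linux x86-64 setup described above.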

Melissa