--
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std...@netlab.cs.rpi.edu]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]
std::string is fine for storing UTF-8, you do not need unsigned char to do
this.
/Leigh
--
Use std::vector instead; you can access the data through &v[0].
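As a rough illustration of the vector suggestion, here is a minimal sketch that copies raw UTF-8 bytes into a std::vector<unsigned char> and reaches the storage through &v[0]. The helper name make_utf8_buffer is hypothetical, and &v[0] is only valid when the vector is not empty.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies n raw UTF-8 bytes into a vector of unsigned char.
std::vector<unsigned char> make_utf8_buffer(const char* bytes, std::size_t n)
{
    std::vector<unsigned char> v(n);
    if (n != 0)
        std::memcpy(&v[0], bytes, n);   // &v[0] must not be used on an empty vector
    return v;
}

void example_vector_buffer()
{
    const char e_acute[] = "\xC3\xA9";  // UTF-8 encoding of U+00E9
    const std::vector<unsigned char> v = make_utf8_buffer(e_acute, 2);
    assert(v.size() == 2);
    assert(v[0] == 0xC3 && v[1] == 0xA9);
}
```

Because the bytes are copied with memcpy, no conversion between char and unsigned char values takes place.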
--
> news:wr2dnVFb14zHA9_W...@giganews.com...
> > I want to use a std::string (or equivalent) to store UTF-8
> > characters can it always do this?
> std::string is fine for storing UTF-8, you do not need
> unsigned char to do this.
In practice. Formally, the results of assigning a value which
is not representable in the target type is unspecified. And if
char is signed, and 8 bits, things like 0xC3 aren't
representable.
--
James Kanze
If he must have a string of unsigned char then why not use
std::basic_string<unsigned char> instead of std::vector? However I don't
think this is necessary and std::string is just fine. When converting from
UTF-8 to some other encoding (e.g. UTF-16 and std::wstring) you can always
cast characters to unsigned as required.
/Leigh
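A minimal sketch of the "cast to unsigned as required" idea: the bytes stay in a std::string, and each one is converted to unsigned char only at the point where its bit pattern is inspected. The helper utf8_sequence_length is hypothetical; it does not reject overlong forms or other invalid lead bytes such as 0xC0.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts with
// `lead`, or 0 if `lead` cannot start a sequence (for example, a continuation byte).
int utf8_sequence_length(char lead)
{
    // Conversion to an unsigned type is well defined (reduction modulo 2^CHAR_BIT).
    const unsigned char u = static_cast<unsigned char>(lead);
    if (u < 0x80)           return 1;   // 0xxxxxxx
    if ((u & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((u & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((u & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;
}

void example_sequence_length()
{
    const std::string s = "\xC3\xA9";            // UTF-8 encoding of U+00E9
    assert(utf8_sequence_length(s[0]) == 2);     // lead byte of a 2-byte sequence
    assert(utf8_sequence_length(s[1]) == 0);     // continuation byte
    assert(utf8_sequence_length('a') == 1);
}
```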
From the Wikipedia article on C++0x (although I'm pretty sure that I read
it in the official stuff as well):
For the purpose of enhancing support for Unicode in C++ compilers, the
definition of the type char has been modified to be both at least the
size necessary to store an eight-bit coding of UTF-8 and large enough
to contain any member of the compiler's basic execution character set.
It was previously defined as only the latter.
There are three Unicode encodings that C++0x will support: UTF-8,
UTF-16, and UTF-32. In addition to the previously noted changes to the
definition of char, C++0x will add two new character types: char16_t
and char32_t. Each of these is designed to store UTF-16 and UTF-32
respectively.
Note that size() is always the number of char or char16_t elements rather
than the number of printing characters (glyphs). This pretty much has to
be so unless you have a much more restricted interface or a very slow
size() function (that might fail!).
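To make the size() point concrete, here is a small sketch that counts code points in a UTF-8 std::string by skipping continuation bytes. The helper utf8_code_points is hypothetical and assumes well-formed input; a code point count is still not a glyph count, since one glyph may be built from several code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char u = static_cast<unsigned char>(s[i]);
        if ((u & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

void example_size_vs_code_points()
{
    const std::string s = "caf\xC3\xA9";   // "cafe" with U+00E9 as the last letter
    assert(s.size() == 5);                 // five char elements
    assert(utf8_code_points(s) == 4);      // four code points
}
```

Note that the count needs a pass over the whole string, which is the cost the post alludes to.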
So I would stick to using std::string rather than vector.
Another issue that is beyond the scope of any string class, and yet one you
might want to consider, is when/whether/how to output a
BOM (byte order mark) for UTF-16/UTF-32.
The value obtained by assigning 0xC3 to a signed 8-bit integer is
perfectly specified.
What isn't is how it is represented.
--
"James Kanze" <james...@gmail.com> wrote in message
news:a4b66b5a-1997-4591...@v25g2000yqk.googlegroups.com...
> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>> news:wr2dnVFb14zHA9_W...@giganews.com...
>>> I want to use a std::string (or equivalent) to store UTF-8
>>> characters can it always do this?
>> std::string is fine for storing UTF-8, you do not need
>> unsigned char to do this.
> In practice. Formally, the results of assigning a value which
> is not representable in the target type is unspecified. And if
> char is signed, and 8 bits, things like 0xC3 aren't
> representable.
If by "unspecified" you mean "Implementation Defined" then yes, however I
take issue with your definition of representable:
#include <cassert>

int main()
{
    unsigned char ch1 = 0xC3;
    char ch2 = ch1;
    unsigned char ch3 = ch2;
    // following is fine in VC++ and g++ where char is signed yet can represent
    // 0xC3 because there are sufficient bits.
    assert(ch3 == 0xC3);
    // following is also fine
    assert((ch2 & 0xFF) == 0xC3);
}
/Leigh
--
> "James Kanze" <james...@gmail.com> wrote in message
> news:a4b66b5a-1997-4591...@v25g2000yqk.googlegroups.com...
>> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>>> news:wr2dnVFb14zHA9_W...@giganews.com...
>>>> I want to use a std::string (or equivalent) to store UTF-8
>>>> characters can it always do this?
>>> std::string is fine for storing UTF-8, you do not need
>>> unsigned char to do this.
>> In practice. Formally, the results of assigning a value which
>> is not representable in the target type is unspecified. And if
>> char is signed, and 8 bits, things like 0xC3 aren't
>> representable.
> If by "unspecified" you mean "Implementation Defined" then yes, however I
> take issue with your definition of representable:
>
> int main()
> {
> unsigned char ch1 = 0xC3;
> char ch2 = ch1;
> unsigned char ch3 = ch2;
> // following is fine in VC++ and g++ where char is signed yet can
> // represent 0xC3 because there are sufficient bits.
> assert(ch3 == 0xC3);
I don't think your example demonstrates that "char" can represent 0xC3
there.
It cannot represent the *value* 0xC3. What you have there is storing an
implementation defined value into ch2, and *that* value can be represented
(probably they will just take the bit pattern of 0xC3 from ch1 and store that
one into ch2). If they do that, then on a two's complement representation
when assigning back to ch3, of course the value won't have changed from ch1
to ch3, but this won't say anything with regard to whether 0xC3 is
representable by "char" on that platform.
For an example, -1 cannot be represented in "unsigned char", but instead if
you assign to one you will store the value UCHAR_MAX, which can be
represented.
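The -1 example can be checked directly. Conversion to an unsigned type is defined as reduction modulo 2^n, so the sketch below should hold on any conforming implementation; the function names are only illustrative.

```cpp
#include <cassert>
#include <climits>

// Converting -1 to unsigned char does not store -1; the value is reduced
// modulo 2^CHAR_BIT, which gives UCHAR_MAX.
unsigned char minus_one_as_uchar()
{
    const int minus_one = -1;
    return static_cast<unsigned char>(minus_one);
}

void example_minus_one()
{
    assert(minus_one_as_uchar() == UCHAR_MAX);
}
```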
> // following is also fine
> assert((ch2 & 0xFF) == 0xC3);
Same issue: The operation promotes "ch2" to an int - that integer will
probably be negative on these platforms. Masking off all but the lowest 8
bits yields a value that's the same as 0xC3. But again, this says nothing
about whether 0xC3 is representable by "char".
--
> On Jan 7, 3:54 am, James Kanze <james.ka...@gmail.com> wrote:
>> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
>>
>> > "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
>> >news:wr2dnVFb14zHA9_W...@giganews.com...
>> > > I want to use a std::string (or equivalent) to store UTF-8
>> > > characters can it always do this?
>> > std::string is fine for storing UTF-8, you do not need
>> > unsigned char to do this.
>>
>> In practice. Formally, the results of assigning a value which
>> is not representable in the target type is unspecified.
>
> The value obtained by assigning 0xC3 to a signed 8-bit integer is
> perfectly specified.
> What isn't is how it is represented.
>
4.7 Integral conversions: "If the destination type is signed, the value is
unchanged if it can be represented in the destination type (and bit-field
width); otherwise, the value is implementation-defined."
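One way to sidestep the conversion quoted from 4.7, sketched here under the assumption that only the stored byte matters and not its numeric value as a char: copy the object representation with memcpy in both directions, so that no integral conversion is performed. The helper names byte_to_char and char_to_byte are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helpers: move one byte between unsigned char and plain char by
// copying its object representation instead of converting its value.
char byte_to_char(unsigned char byte)
{
    char c;
    std::memcpy(&c, &byte, 1);
    return c;
}

unsigned char char_to_byte(char c)
{
    unsigned char byte;
    std::memcpy(&byte, &c, 1);
    return byte;
}

void example_memcpy_round_trip()
{
    assert(char_to_byte(byte_to_char(0xC3)) == 0xC3);
}
```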
--
"Unspecified" means that the standard does not tell you which of the
various reasonable alternatives (usually listed in the specification)
will happen. "Implementation defined" means that a conforming
implementation must document what it does. Showing that code does what
you expect it to do does not tell you that the behavior is
implementation defined, nor that it is unspecified. You determine that
by reading the standard.
--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of
"The Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)
AFAIK, C89, C99, C++98 and C++03 all already mandate that CHAR_BIT is
at least 8, so it's just redundancy.
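The CHAR_BIT guarantee can be stated in code. This is a small sketch in the pre-C++0x style, using a negative array size as a compile-time check; the typedef and function names are illustrative only.

```cpp
#include <cassert>
#include <climits>

// Compile-time check in C++03 style: the array size becomes -1, and the
// program fails to compile, if CHAR_BIT were ever less than 8.
typedef char char_has_at_least_8_bits[CHAR_BIT >= 8 ? 1 : -1];

// Number of bits in a char on this implementation.
int bits_per_char()
{
    return CHAR_BIT;
}

void example_char_bit()
{
    assert(bits_per_char() >= 8);
}
```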
--
However, consider:
unsigned char unsignedChar = foo();
signed char signedChar = unsignedChar;
1) The compiler cannot know in advance what value unsignedChar has so must
always generate the same code (probably optimized to a simple single store
instruction).
2) Stripping off the high bit of unsignedChar makes even less sense than
simply converting its value to a signed (negative) value.
3) There should be a 1 to 1 mapping of any unsigned value with the high bit
set to some signed (negative) value (assuming two's complement of course)
which in effect means that signed char *can* represent any unsigned value
albeit sometimes as a negative number.
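Whether the three points above hold on a given compiler can be checked exhaustively. The sketch below reports whether every unsigned char value survives a trip through signed char on the implementation it is compiled with; it observes one implementation and says nothing about what the standard guarantees. The function name is hypothetical.

```cpp
#include <cassert>
#include <climits>

// Returns true if, on this implementation, every unsigned char value survives
// the conversion unsigned char -> signed char -> unsigned char.
bool signed_char_round_trip_is_lossless()
{
    for (int v = 0; v <= UCHAR_MAX; ++v) {
        const unsigned char in = static_cast<unsigned char>(v);
        // Implementation-defined result when v > SCHAR_MAX (4.7 in C++03).
        const signed char mid = static_cast<signed char>(in);
        const unsigned char out = static_cast<unsigned char>(mid);
        if (out != in)
            return false;
    }
    return true;
}

void example_round_trip()
{
    // Expected to hold on common two's complement implementations.
    assert(signed_char_round_trip_is_lossless());
}
```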
Although actual behaviour is "implementation-defined" I doubt there are many
sane implementations that do not follow the above and it is perhaps wise to
consider the behaviour of your target implementation in the real world over
and above some hypothetical implementation in fantasy land. :) Mixing
std::string and std::basic_string<unsigned char> in the same code base can
be a PITA.
If in doubt always check your compiler's documentation.
/Leigh
--
> From wikipedia article on C++0X (although I'm pretty sure that I read
> it in the official stuff as well):
>
> There are three Unicode encodings that C++0x will support: UTF-8,
> UTF-16, and UTF-32.
I believe this is a bit misleading, as C++0x seems to offer a funny mixture
of support for UTF-8, UTF-16, UCS-2 (which is a subset of UTF-16), and
UTF-32 (which is essentially identical to UCS-4). A lot depends on what you
mean by "support".
I suggest you consult these threads for discussions related to this topic:
http://tinyurl.com/yeafdzv
http://tinyurl.com/y8pzsd9
Scott
In g++, char is signed by default, but this can be controlled by
a command line option.
Note that this newsgroup is comp.STD.c++. ``Proof by several compilers''
isn't good enough. Of course, it's good enough in practical development
where portability is weighed against economics: schedules, budgets,
markets, software lifetimes.
> 0xC3 because there are sufficient bits.
If char is signed and 8 bits wide, the value 0xC3 is
not representable. The maximum value of the type is 127,
whereas 0xc3 is 195. 195 being greater than 127 is a significant
obstacle to representability.
Of course, we can /encode/ values outside of the range of char,
by sacrificing some other values.
If we don't need any negative values, then we can use the values
-CHAR_MAX through -1 to encode values beyond CHAR_MAX.
That's not a direct representation: then it's the char type plus an
additional convention we have imposed on it which is doing the
representing.
> assert(ch3 == 0xC3);
This assertion proves nothing, other than that the
implementation-defined mapping of the 0xC3 value to the char type is
reversible by a conversion to unsigned char, on those two compilers.
This need not be the case. An implementation can treat out-of-range
numbers by clamping them, so that char c = 195 produces a value of
127. Good idea or not, this is valid implementation-defined behavior.
Programs which avoid such a conversion are more portable (even if only
in a mathematical sense, rather than in the real world).
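To see why the assertion is not portable, here is a sketch that models the clamping implementation described above. The function clamp_to_signed_8bit is a hypothetical stand-in for such an implementation's conversion to an 8-bit signed char; it is not how VC++ or g++ behave.

```cpp
#include <cassert>

// Hypothetical model of an implementation that clamps out-of-range values
// when converting to an 8-bit signed char.
int clamp_to_signed_8bit(int value)
{
    if (value > 127)  return 127;
    if (value < -128) return -128;
    return value;
}

// Models unsigned char -> (clamping) char -> unsigned char.
unsigned char round_trip_with_clamping(unsigned char in)
{
    const int stored = clamp_to_signed_8bit(in);
    return static_cast<unsigned char>(stored);
}

void example_clamping()
{
    assert(round_trip_with_clamping(0xC3) == 127);   // 195 is lost
    assert(round_trip_with_clamping(0x41) == 0x41);  // in-range values survive
}
```

On such an implementation, assert(ch3 == 0xC3) would fail even though the conversion itself is conforming.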
> // following is also fine
> assert((ch2 & 0xFF) == 0xC3);
This additionally relies on two behaviors:
- conversions of an out-of-range value to an integer type are treated
by bit truncation, whereby a value bit in the original value which
corresponds to the narrower type's sign bit is preserved into
that sign bit.
- the two's complement representation is used for signed integers,
thus obeying sign-extension when such values are converted to a wider
signed type.
This means that the out-of-range value 0xC3 becomes a two's complement
value in the type char, which has the bit pattern 0xC3: there are 8
bits, which are simply preserved in the conversion; bit 7 becomes the
sign bit.
When this char object is evaluated, it produces a negative value of type
int (due to promotion) which is sign-extended; i.e. the least significant
8 bits of this int value continue to hold the value 0xC3, and the sign
bit is propagated through all the higher bits including the sign bit.
When we reduce this value with the bit operation & 255, we retain the
least significant 8 bits which hold the pattern 0xC3. That's the binary
value we obtain, since bit 7 is no longer treated as the sign bit.
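The steps just described can be traced with plain int arithmetic, which keeps the sketch independent of whether plain char is signed on the compiler at hand. It assumes an 8-bit two's complement char and a two's complement int for the masking step; both function names are hypothetical.

```cpp
#include <cassert>

// Value that an 8-bit two's complement signed char holds when its bit
// pattern is `byte` (0..255); this is also the value seen after promotion to int.
int as_signed_8bit(int byte)
{
    return byte >= 128 ? byte - 256 : byte;
}

// Keeps the least significant 8 bits of the promoted value.
int low_8_bits(int promoted)
{
    return promoted & 0xFF;
}

void example_sign_extension()
{
    const int promoted = as_signed_8bit(0xC3);
    assert(promoted == -61);                 // 195 - 256
    assert(low_8_bits(promoted) == 0xC3);    // masking recovers the pattern
}
```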
> However, consider:
> unsigned char unsignedChar = foo();
> signed char signedChar = unsignedChar;
> 1) The compiler cannot know in advance what value unsignedChar
> has so must always generate the same code (probably optimized
> to a simple single store instruction).
Certainly (except that it might not be a single store
instruction.)
> 2) Stripping off the high bit of unsignedChar makes even less
> sense than simply converting its value to a signed (negative)
> value.
And raising an implementation-defined signal makes more sense
than either.
> 3) There should be a 1 to 1 mapping of any unsigned value with
> the high bit set to some signed (negative) value (assuming
> two's complement of course) which in effect means that signed
> char *can* represent any unsigned value albeit sometimes as a
> negative number.
Why should there be a 1 to 1 mapping? The standard certainly
doesn't require it, and there is hardware being sold today which
doesn't support it.
> Although actual behaviour is "implementation-defined" I doubt
> there are many sane implementations that do not follow the
> above
I would expect a robust implementation to raise the signal in
such cases. It has definite runtime costs (which is why it
isn't widely done), but it's the most robust solution---what one
would want in a critical system for example.
> and it is perhaps wise to consider the behaviour of your
> target implementation in the real world over and above some
> hypothetical implementation in fantasy land. :) Mixing
> std::string and std::basic_string<unsigned char> in the same
> code base can be a PITA.
No one said that you should mix the two. That does cause
problems. (Starting with the fact that instantiating
std::basic_string< unsigned char > is undefined behavior.)
--
James Kanze
> >> On Jan 6, 6:47 pm, "Leigh Johnston" <le...@i42.co.uk> wrote:
> >>> "Peter Olcott" <NoS...@SeeScreen.com> wrote in message
> >>news:wr2dnVFb14zHA9_W...@giganews.com...
[...]
> > int main()
> > {
> > unsigned char ch1 = 0xC3;
> > char ch2 = ch1;
> > unsigned char ch3 = ch2;
> > // following is fine in VC++ and g++ where char is signed yet can represent
[...]
> > assert(ch3 == 0xC3);
> This assertion proves nothing, other than that the
> implementation-defined mapping of the 0xC3 value to the char
> type is reversible by a conversion to unsigned char, on those
> two compilers.
> This need not be the case. An implementation can treat
> out-of-range numbers by clamping them, so that char c = 195
> produces a value of 127. Good idea or not, this is valid
> implementation-defined behavior.
You don't have to go to that point (although a conforming
implementation could even raise an implementation defined signal
if the value overflows). The simplest implementation of
converting between signed and unsigned is just to copy the bits.
A 1's complement or signed magnitude implementation doesn't have
this liberty when converting from signed to unsigned, however,
so it's quite possible that while the unsigned to signed does
what is wanted (here), the signed back to unsigned doesn't. (I
know of at least two machines where this would be an issue.
Both have plain char unsigned, however. Probably intentionally,
to avoid this issue.)
--
James Kanze
--
Can it really raise? My understanding was that C allows this, while C++
doesn't. C++ says "... otherwise, the value is implementation-defined." while
C says "either the result is implementation-defined or an
implementation-defined signal is raised.". To me the C++ wording reads as if
anything other than producing a value is not allowed. I suspect I'm reading it
wrongly, but which part? Is there some difference between C and C++ wrt what
"value" means? I.e., can it include traps in C++ while in C it can't? C says
"An unspecified value cannot be a trap representation." and C++ says "For POD
types, the /value representation/ is a set of bits in the object
representation that determines a /value/, which is one discrete element of an
implementation-defined set of values". I'm not getting a clue from the C++
wording wrt traps.
Why is instantiating std::basic_string<unsigned char> undefined behaviour?
I thought std::basic_string was a container like any other. The
unspecialized char_traits should work fine for unsigned char I think after
taking a quick look at character traits requirements. The standard mentions
char-like objects but I cannot find a mention of UB:
"The class template basic_string describes objects that can store a sequence
consisting of a varying number of arbitrary char-like objects with the first
element of the sequence at position zero. Such a sequence is also called a
"string" if the type of the char-like objects that it holds is clear from
context. In the rest of this Clause, the type of the char-like objects held
in a basic_string object is designated by charT."
/Leigh
Yeah I guess it is implementation defined behaviour (not undefined
behaviour) as the standard only mandates the following for unspecialized
char_traits:
template<class charT> struct char_traits;
I know it is not proof of anything but VC++, g++ and Comeau all seem to
provide a functioning unspecialized char_traits that is compatible with
unsigned char.
/Leigh
Unless there's a requirement that the implementation document its
behavior, the behavior is not implementation defined. That's a formal
term in the standard, and it's not equivalent to the informal
"implementation specific".
--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of
"The Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)