I just wanted to know how exactly it handles wchar_t internally?
Thanks and Regards,
Sadanand
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Why NOT?
Unicode has more than 65536 code points and the C++ standard allows
wchar_t to be big enough to represent that. Does "implementation-
defined" ring a bell?
Cheers,
SG
So it can store things that are not what you call "simple" chars. The
purpose of wchar_t is to encode the entire execution wide character set
with a fixed number of bytes per code point. The standard allows a
trivial implementation of wchar_t, where the execution wide character
set is the same as the execution character set, which can be stored at
one byte per code point in a char object. However, the whole point of
having wchar_t is to allow the wide set to be larger.
What gcc actually uses for wchar_t, by default, is either UTF-16 or
UTF-32, depending upon the size of wchar_t; both the size of wchar_t and
the character set used are configurable on the gcc command line. UTF-32
can represent 1,114,112 different code points in the Universal Character Set.
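If you want to see what your own gcc was configured with, a quick check
like the following (my own sketch, not anything mandated by the standard)
prints the size and WCHAR_MAX; the -fshort-wchar and -fwide-exec-charset=
switches are the command-line knobs referred to above, though changing
them can break ABI compatibility with the libraries you link against:

#include <cwchar>
#include <iostream>

int main()
{
    // With g++ on Linux/x86 this typically prints 4 and 2147483647.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    std::cout << "WCHAR_MAX       = "
              << static_cast<unsigned long>(WCHAR_MAX) << '\n';
}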
wchar_t is supposed to hold a character of the wide execution
character set. GCC uses UTF-32 as its WECS (unless you force it to use
UTF-16, which is possible, and the default for MinGW). Thus, wchar_t
is a 32-bit type.
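For what it's worth, here is a small sketch that assumes gcc on GNU/Linux
with its 32-bit UTF-32 wchar_t; a code point outside the Basic
Multilingual Plane fits in a single wchar_t there, which it would not
with a 16-bit wchar_t (e.g. MinGW's default):

#include <cstdio>

int main()
{
    // MUSICAL SYMBOL G CLEF, code point U+1D11E, lies outside the BMP.
    wchar_t clef = L'\U0001D11E';
    std::printf("U+%lX fits in one %lu-byte wchar_t\n",
                static_cast<unsigned long>(clef),
                static_cast<unsigned long>(sizeof(wchar_t)));
}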
Sebastian
It is using UCS-4 to encode wide character strings; see www.unicode.org for
information on that encoding. The gist is that it doesn't only allow Latin
characters but also Greek, Cyrillic, Arabic, East Asian and lots of others.
Uli
--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
{ edits: quoted sig & banner removed. don't quote extraneous material. tia., -mod }
I know Wikipedia isn't the best source but
http://en.wikipedia.org/wiki/Wide_character
The idea of wchar_t is to support characters outside of the 256
available in your standard 8-bit ASCII-derived character set. This is
also why a char is larger than a byte in languages like Java and C# (and
why they have a separate byte type). Coming from a mostly Windows world,
I'm used to wchar_t being 16 bits for UTF-16 (which threw me off when
reading your post), while GCC & friends use UTF-32 (thus your 4-byte
sizes).
The short answer to your question of why the 3 extra bytes: so that,
going beyond the languages derived from the basic Latin character set,
you can represent every character from any language and then some
(4,294,967,295 individual character slots in 32 bits).
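To make that concrete, a little sketch of my own (nothing Windows- or
GCC-specific, beyond assuming wchar_t is at least 16 bits): characters
from several non-Latin scripts each fit in a single wchar_t value:

#include <cstddef>
#include <cstdio>

int main()
{
    const wchar_t samples[] = { L'\u0394',   // Greek capital delta
                                L'\u044F',   // Cyrillic small letter ya
                                L'\u0634',   // Arabic letter sheen
                                L'\u4E2D' }; // CJK ideograph "middle"

    for (std::size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        std::printf("U+%04lX stored in %lu bytes\n",
                    static_cast<unsigned long>(samples[i]),
                    static_cast<unsigned long>(sizeof(wchar_t)));
}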
4 bytes is the minimum needed to encode any Unicode character.
Most languages that use Unicode use UTF-16, but this means that either
a character isn't always a single element or else you have to be careful
about the distinction between the element index and the character index.
Google Unicode and UTF-32
In practice you can't write truly portable code with wchar_t because
it isn't defined what the encoding is - I think that the
next standard is supposed to mandate types for UTF-16 and UTF-32
explicitly.
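Something like the following is what I have in mind; it assumes a
compiler that already implements the proposed char16_t/char32_t types
and u""/U"" literals. It also shows the element-vs-character issue,
since the same character takes two UTF-16 code units but only one
UTF-32 code unit:

#include <iostream>

int main()
{
    // One character (U+1D11E, MUSICAL SYMBOL G CLEF), two encodings:
    const char16_t u16[] = u"\U0001D11E"; // two UTF-16 code units (a surrogate pair)
    const char32_t u32[] = U"\U0001D11E"; // one UTF-32 code unit

    std::cout << "UTF-16 code units: " << (sizeof u16 / sizeof u16[0]) - 1 << '\n'; // 2
    std::cout << "UTF-32 code units: " << (sizeof u32 / sizeof u32[0]) - 1 << '\n'; // 1
}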
Why not? The standard doesn't mandate a specific size.
> a simple char will take 1 byte
> and for what reason GCC needs extra 3 bytes for wchar_t?
Probably because the wide character set it uses requires that much
space.
> I just wanted to know how exactly it handles wchar_t internally?
char holds a character of the narrow character set, wchar_t holds a
character of the wide character set.
Both are platform- and potentially locale- and environment-specific.
On GNU/Linux, char is in the locale character set and wchar_t is in
UTF-32.
On Microsoft Windows, char is in the ANSI character set and wchar_t is
in UCS-2 or UTF-16 (note wchar_t is 2 bytes there).
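As an illustration (my own sketch, assuming GNU/Linux with glibc and a
UTF-8 locale), the standard mbstowcs() converts a narrow string in the
locale character set into whatever the platform's wide encoding is:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "");            // use the environment's locale

    const char narrow[] = "h\xc3\xa9";     // "hé" as UTF-8 bytes in a char string
    wchar_t wide[8];
    std::size_t n = std::mbstowcs(wide, narrow, 8);

    if (n == static_cast<std::size_t>(-1)) {
        std::puts("conversion failed (is the locale UTF-8?)");
        return 1;
    }

    // With glibc and a UTF-8 locale this prints U+0068 and U+00E9;
    // on Windows the values would be UTF-16 code units instead.
    for (std::size_t i = 0; i < n; ++i)
        std::printf("U+%04lX\n", static_cast<unsigned long>(wide[i]));
}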
> 4 bytes is the minimum needed to encode any Unicode character.
Actually, that's not true.
3 bytes is already more than enough. You just need 21 bits.
While we're picking nits, let me mention that you need 21 bits for the
codepoint, but some characters (or should I say glyphs?) are actually
expressed using multiple codepoints...
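For example (my own sketch, nothing platform-specific): the visible
character "é" can be written either as one code point or as 'e' plus a
combining accent, so counting wchar_t elements is not the same as
counting what the user sees:

#include <cstdio>
#include <cwchar>

int main()
{
    const wchar_t decomposed[]  = L"e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    const wchar_t precomposed[] = L"\u00E9";  // single code point U+00E9

    std::printf("decomposed:  %lu code points\n",
                static_cast<unsigned long>(std::wcslen(decomposed)));  // 2
    std::printf("precomposed: %lu code points\n",
                static_cast<unsigned long>(std::wcslen(precomposed))); // 1
}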
Oh, and don't get me started on the size of a byte... ;)
Cheers!
Uli
It would have been nice if Microsoft and GCC agreed on sizeof(wchar_t).
Since they are different, it seems to me that you should use char to have
portable code.
> In practice you can't write truly portable code with wchar_t because
> it isn't defined what the encoding is - I think that the
> next standard is supposed to mandate types for UTF-16 and UTF-32
> explicitly.
For the lurkers like me who want more details:
http://stackoverflow.com/questions/872491/new-unicode-characters-in-c0x
http://www.devx.com/cplus/10MinuteSolution/34328/1954
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html
Why? The only potential problem is with binary data, and mostly that is a
problem regardless.
You do know that even char varies from platform to platform. It is
commonly an octet, but isn't always.
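For the record, the number of bits in a char (i.e. in a byte, as C++
defines it) is CHAR_BIT from <climits>; it is at least 8, but not
required to be exactly 8:

#include <climits>
#include <iostream>

int main()
{
    // On mainstream desktop platforms this prints 8, but some DSPs differ.
    std::cout << "bits in a char (one byte): " << CHAR_BIT << '\n';
}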
--
Note that robinton.demon.co.uk addresses are no longer valid.
This is not a question about compilers but about operating systems.
Linux uses 32 bits to store a Unicode character, whereas Windows has
standardised on 16 bits (if I understand correctly, UTF-16). There
are probably also operating systems out there that don't support
anything but some basic 8-bit character sets.
Do not expect compiler writers to write compilers that will make it
more difficult to use on the systems they're supposed to work on.
/Peter
There is no rule that a particular type cannot be larger than it has
to be. After all, with gcc, the size of a C++ bool type is also four
bytes.
Greg
I just tested that with GCC 4.2 on Linux/x86; the 'bool' type there is
exactly one byte. Where did you see four bytes for that?
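For anyone who wants to check their own setup, a trivial test along the
same lines (results will of course vary by compiler and ABI):

#include <iostream>

int main()
{
    std::cout << "sizeof(bool)    = " << sizeof(bool) << '\n';    // 1 with g++ on Linux/x86
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n'; // 4 there
}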
Uli