Our group is moving towards using unicode characters for all i/o in our systems. We are looking for a standard implementation of a unicode class which handles all of the basic string functions in the context of unicode characters. Could any of you point us to such a class or to any other information regarding this?
> Our group is moving towards using unicode characters for all i/o in > our systems. We are looking for a standard implementation of a unicode > class which handles all of the basic string functions in the context > of unicode characters. Could any of you point us to such a class or to > any other information regarding this?
Have you looked at std::wstring? It's a specialization of template class basic_string for elements of type wchar_t. wchar_t is the 16-bit wide character (aka Unicode) data type.
--
+---------------------------------+
|          Paul Grealish          |
|       GEOPAK-TMS Limited        |
|       Cambridge, England        |
| paul.greal...@uk.geopak-tms.com |
+---------------------------------+
[ Send an empty e-mail to c++-h...@netlab.cs.rpi.edu for info ] [ about comp.lang.c++.moderated. First time posters: do this! ]
> > Our group is moving towards using unicode characters for all i/o in > > our systems. We are looking for a standard implementation of a unicode > > class which handles all of the basic string functions in the context > > of unicode characters. Could any of you point us to such a class or to > > any other information regarding this?
> Have you looked at std::wstring? > It's a specialization of template class > basic_string for elements of type wchar_t. > wchar_t is the 16-bit wide character (aka > Unicode) data type.
Correction: wchar_t may be a 16-bit wide character. It may also be an 8-bit wide character. The standard makes no guarantees.
Realistically, *if* the implementation supports Unicode, I would expect it to use wchar_t to do so.
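Since the width is implementation-defined, it may be worth checking it on each compiler you target before relying on it. A minimal sketch, assuming nothing beyond the standard headers (the function names are made up for illustration):

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t, in bits, on the compiler building this code.
inline std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if a single wchar_t is wide enough to hold any 16-bit code unit.
inline bool wchar_holds_16_bits() {
    return wchar_bits() >= 16;
}
```

Printing wchar_bits() once per platform, or refusing to build when wchar_holds_16_bits() is false, makes the assumption explicit instead of silent.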
--
James Kanze    GABI Software, Sàrl
Conseils en informatique orienté objet --
-- Beratung in industrieller Datenverarbeitung
mailto: ka...@gabi-soft.fr    mailto: James.Ka...@dresdner-bank.com
You wrote: > > Our group is moving towards using unicode characters for all i/o in > > our systems. We are looking for a standard implementation of a unicode > > class which handles all of the basic string functions in the context > > of unicode characters. Could any of you point us to such a class or to > > any other information regarding this?
> Have you looked at std::wstring? > It's a specialization of template class > basic_string for elements of type wchar_t. > wchar_t is the 16-bit wide character (aka > Unicode) data type.
Using the wchar_t type is fine, but hard-wires the finished binary for Unicode strings.
Are you developing for Windows NT? If so, you can write code which can be rebuilt to cater for either ANSI or Unicode strings with the flip of a couple of define's.
You could start by typedef'ing some string classes like this:
If _UNICODE is defined (e.g. project settings or hard-wired into the source code), TString and TStringStream become Unicode string classes and all template operations work a treat. If _UNICODE is undefined, TString and TStringStream become 8-bit string classes as per usual.
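A minimal sketch of how such typedefs might look, assuming TString and TStringStream are simply aliases for the standard narrow or wide classes (this is a guess at the shape, not necessarily the exact definitions meant above):

```cpp
#include <sstream>
#include <string>

// Assumption: _UNICODE is set in the project settings (or not at all).
#ifdef _UNICODE
typedef std::wstring       TString;        // wide build
typedef std::wstringstream TStringStream;
#else
typedef std::string        TString;        // 8-bit build
typedef std::stringstream  TStringStream;
#endif
```

Code written against TString and TStringStream, for example streaming a number into a TStringStream and reading back str(), compiles unchanged in either build.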
If _UNICODE is defined, UNICODE needs to be defined too so the Win32 Unicode API gets called instead of the ANSI API.
The string-handling C run-time functions should be replaced with their text-mapped equivalents (e.g. strlen() is replaced by _tcslen()). Character variables should be declared using the _TCHAR type (e.g. _TCHAR strBuf[64]). String literals should be wrapped in the _T() or _TEXT() text-mapping macros (e.g. const _TCHAR *mystring = _T("This is a string")). On occasion you may find yourself having to convert between 8-bit and 16-bit character arrays using the wcstombs() and mbstowcs() run-time routines. Using the above approach, defining _UNICODE and rebuilding will generate a Unicode binary. Undefining _UNICODE will result in an ANSI string binary.
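The conversion routines mentioned above are part of the standard C library, so they can be wrapped portably. A rough sketch, assuming the current C locale describes the 8-bit encoding (to_wide and to_narrow are hypothetical helper names, and returning an empty string on failure is just one possible policy):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Convert a multibyte (8-bit) string to a wide string using the current
// C locale. Returns an empty string if the input cannot be converted.
inline std::wstring to_wide(const std::string& in) {
    // A wide string never has more elements than the input has bytes.
    std::vector<wchar_t> buf(in.size() + 1);
    const std::size_t n = std::mbstowcs(&buf[0], in.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(&buf[0], n);
}

// Convert a wide string to a multibyte string using the current C locale.
// Returns an empty string if a character has no multibyte representation.
inline std::string to_narrow(const std::wstring& in) {
    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> buf(in.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(&buf[0], in.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    return std::string(&buf[0], n);
}
```

Plain ASCII text converts in the default "C" locale; anything else depends on the locale selected with setlocale().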
There is lots of help regarding this in the MSDN help libraries, although you have to trawl through them for a while to make sense of it all.
Good luck.
__________
Sean Dynan
Senior Software Analyst
C-C-C Technology Ltd
sdy...@cccgroup.co.uk
Er, the typical wchar_t is actually UCS-2, or a two-octet character set, which is able to represent most of the world's characters. The standard also defines UCS-4, a four-octet character set, which does represent all of the characters in the world.
Once you have everything represented internally in UCS-n, you have to provide I/O in UTF-8 format for compatibility with legacy tools and with other UTF-8 sources and sinks. UTF-8 is an encoding method that encodes all 128 ASCII characters using one octet apiece and uses escape bits in the lead octet to represent the remaining characters as multi-octet sequences; the original definition allows sequences of up to 6 octets. This provides a compact and compatible solution. I'm not sure if it is compatible with IBM's DBCS, but it would be nice if it were.
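To make the escape-bit scheme concrete, here is a sketch of an encoder for a single code point. It assumes the current definition of UTF-8, which stops at U+10FFFF and so never needs more than 4 octets; encode_utf8 is a hypothetical name, and returning an empty string for invalid input is only one possible error policy. It needs a C++11 compiler for <cstdint>.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                        // 0xxxxxxx (ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate range
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The number of leading 1 bits in the first octet gives the length of the sequence, and every continuation octet starts with the bits 10, which is what lets ASCII-only tools pass the data through untouched.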
Alan Bellingham wrote in message <36aaafbb.340110...@news.lspace.org>...
[snip]
>>Are you developing for Windows NT? If so, you can write code which can be >>rebuilt to cater for either ANSI or Unicode strings with the flip of a >>couple of define's.
>But why bother? If you've coped with the concept of UTF-16 [1], why not >stick with it?
>If you've decided there's benefit in going up to 16-bit chars, why not >stick with it. Allowing a compiler switch to flip between them is going >to lead to some very subtle and hard to find bugs, if you're not very >careful.
For most projects, you might at any time be faced with a need to port the code to environments that do not support UTF-16 as well as one would wish; for example, code developed for NT might at short notice have to be ported to Win98. If you are using "bare" wchar_t, the port can then be very troublesome.
This seems like a classic situation for using #ifdef:
    #include <string>

    #ifdef NO_UNICODE
    typedef char char_t;
    #else
    typedef wchar_t char_t;
    #endif
    typedef std::basic_string<char_t> string_t;
    // and so on
A few typedef's, and perhaps a few templates with specialization, definition of const's, etc, can make it decently easy to port.
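As an illustration of the "few typedef's and templates with specialization", here is one way the rest might be filled in. This is only a sketch: TXT, ostream_t, console and greeting are hypothetical names introduced here, not part of any library.

```cpp
#include <iostream>
#include <string>

#ifdef NO_UNICODE
typedef char char_t;
#define TXT(s) s        // hypothetical literal macro, narrow build
#else
typedef wchar_t char_t;
#define TXT(s) L##s     // hypothetical literal macro, wide build
#endif

typedef std::basic_string<char_t>  string_t;
typedef std::basic_ostream<char_t> ostream_t;

// Pick the matching standard output stream by specializing on the
// character type, so calling code never names cout or wcout directly.
template <typename Ch> struct console;
template <> struct console<char> {
    static std::ostream& out() { return std::cout; }
};
template <> struct console<wchar_t> {
    static std::wostream& out() { return std::wcout; }
};

// Example of code that compiles unchanged in both configurations.
inline string_t greeting() { return string_t(TXT("hello")); }
```

Calling code would write console<char_t>::out() << greeting(); and pick up the right stream and literal type from the single NO_UNICODE switch.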
The NT tricks are a bit less elegant (lot of preprocessor use, since C is supported as well as C++), so, more care may indeed be needed to avoid "subtle bugs", but the basic idea is the same, and quite usable (the advantage is that all of the boilerplate code has been written for you already, in <tchar.h> etc...). Still, rolling your own will ease possible future porting to non-Win32 platforms, and, since ease of porting is the whole idea here, the investment (which is not all that much) can easily repay itself.
Alex
> Are you developing for Windows NT? If so, you can write code which can be > rebuilt to cater for either ANSI or Unicode strings with the flip of a > couple of define's.
> You could start by typedef'ing some string classes like this:
> If _UNICODE is defined (e.g. project settings or hard-wired into the > source code), TString and TStringStream become Unicode string classes and > all template operations work a treat. If _UNICODE is undefined, TString > and TStringStream become 8-bit string classes as per usual.
I think you'd be better off using the preprocessor in the same way that the <tchar.h> header does. Create a header file that defines all the character-type C++ entities (example given at end).
> If _UNICODE is defined, UNICODE needs to be defined too so the Win32 > Unicode API gets called instead of the ANSI API.
You should not define the symbol UNICODE (with no underscore) yourself. You should only ever set the symbol _UNICODE (with underscore). The Win32 API headers will internally set UNICODE depending on whether _UNICODE is set.
--
+---------------------------------+
|          Paul Grealish          |
|       GEOPAK-TMS Limited        |
|       Cambridge, England        |
| paul.greal...@uk.geopak-tms.com |
+---------------------------------+
You wrote: > sdy...@cccgroup.co.uk (Sean Dynan) wrote:
> >Using the wchar_t type is fine, but hard-wires the finished binary for > >Unicode strings.
> >Are you developing for Windows NT? If so, you can write code which can be > >rebuilt to cater for either ANSI or Unicode strings with the flip of a > >couple of define's.
> But why bother? If you've coped with the concept of UTF-16 [1], why not > stick with it?
Because it's one less headache when the marketing dept., for example, decides the product should also run on Windows 95.
--
__________
Sean Dynan
Senior Software Analyst
C-C-C Technology Ltd
sdy...@cccgroup.co.uk
>But why bother? If you've coped with the concept of UTF-16 [1], why not >stick with it?
You must mean UCS-2. I don't believe that NT uses UTF-16 very much. NT tends to use UTF-8 for storage, so that the transformation to ANSI display terminals is relatively straightforward. Remember that Windows 95 supports ANSI and MBCS but not Unicode. Conversion from ANSI to UTF-8 is direct.
UCS-2 is the two-octet character set. UTF-16 is a 16-bit transformation format for all unicode encodings, including UCS-1, which does not support much internationalization.
-John
John Duncan wrote in message <78593a$lq...@usenet01.srv.cis.pitt.edu>... >>But why bother? If you've coped with the concept of UTF-16 [1], why not >>stick with it?
>You must mean UCS-2. I don't believe that NT uses UTF-16 very much. >NT tends to use UTF-8 for storage, so that the transformation to >ANSI display terminals is relatively straightforward. Remember that >Windows 95 supports ANSI and MBCS but not Unicode. Conversion from >ANSI to UTF-8 is direct.
No. Conversion from ASCII (with MSB unset) to UTF-8 is direct. Characters not fitting in there are mapped to a multi-byte representation.
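A small sketch of the difference, assuming the 8-bit input is ISO 8859-1 (Latin-1), where each byte value equals its Unicode code point. Windows "ANSI" code pages such as 1252 differ from Latin-1 in the 0x80 to 0x9F range and would need a lookup table instead. The function name latin1_to_utf8 is made up for illustration.

```cpp
#include <string>

// Convert ISO 8859-1 text to UTF-8.
// Bytes below 0x80 are copied as-is; the rest become two-byte sequences.
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII: direct
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation
        }
    }
    return out;
}
```

Pure ASCII input comes out byte-for-byte identical, while an accented character such as 0xE9 (e-acute) grows to the two bytes 0xC3 0xA9.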
Thiemo Seufer
> > > Our group is moving towards using unicode characters for all i/o in > > > our systems. We are looking for a standard implementation of a unicode > > > class which handles all of the basic string functions in the context > > > of unicode characters. Could any of you point us to such a class or to > > > any other information regarding this?
> > Have you looked at std::wstring? > > It's a specialization of template class > > basic_string for elements of type wchar_t. > > wchar_t is the 16-bit wide character (aka > > Unicode) data type.
> Correction: wchar_t may be a 16-bit wide character. It may also be an > 8-bit wide character. The standard makes no guarantees.
Even though you are discussing specifically Win32 platforms, I should still point out for the benefit of others reading this thread that most other 32-bit operating systems define wchar_t to be 32 bits wide, not 16 bits.
If you are considering writing cross-platform code, one of the great shortcomings of standard C++ is the lack of a UNICODE character type (such as runes on Plan 9, for example). Of course, wchar_t pretty much predates the gradual standardization on UNICODE.
Further, the issue of 16- vs. 32-bit representation overlooks important issues such as support for characters that need surrogates in UTF-16 (such as Klingon :-) and Japanese gaiji characters), which fall outside of the first plane in UNICODE and thus require the full UCS-4 (32-bit) representation (unless one is willing to use multi-unit 16-bit sequences).
This particular problem has bitten my employer (Inso Corporation) in supporting XML across platforms. I finally recommended using a specialization of basic_string combined with an in-house-defined unsigned 16-bit integral type.
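For readers with a newer compiler, the same idea can be sketched with the C++11 type std::u16string, used here as a stand-in for the in-house 16-bit string type (which is not shown in this thread). The helper append_utf16 is a hypothetical name; it shows how a character outside the first plane turns into a two-unit surrogate sequence.

```cpp
#include <string>

// Append one code point to a UTF-16 string.
// Returns false, leaving the string unchanged, for invalid input.
inline bool append_utf16(std::u16string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // lone surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);            // first plane: one unit
    } else {
        const char32_t v = cp - 0x10000;             // 20 bits to split
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return true;
}
```

Note that size() on such a string counts 16-bit units, not characters, so any code that indexes or truncates it has to be aware of surrogate pairs.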
Take a gander at the UNICODE Consortium's standard (version 2.1 is current) for more details at http://www.unicode.org.
--binkley