Our group is moving towards using Unicode characters for all I/O in
our systems. We are looking for a standard implementation of a Unicode
string class which handles all of the basic string functions in the
context of Unicode characters. Could any of you point us to such a class
or to any other information regarding this?
Thanks for all your help in advance.
Best regards,
Issac Alphonso
Institute for Signal and Information Processing
WWW: http://www.isip.msstate.edu/
[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]
[ about comp.lang.c++.moderated. First time posters: do this! ]
Have you looked at std::wstring?
It's a specialization of template class
basic_string for elements of type wchar_t.
wchar_t is the 16-bit wide character (aka
Unicode) data type.
--
+---------------------------------+
| Paul Grealish |
| GEOPAK-TMS Limited |
| Cambridge, England |
| paul.g...@uk.geopak-tms.com |
+---------------------------------+
Correction: wchar_t may be a 16-bit wide character. It may also be an
8-bit wide character. The standard makes no guarantees.
Realistically, *if* the implementation supports Unicode, I would expect
it to use wchar_t to do so.
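
As a rough illustration of the point above, here is a minimal sketch showing that std::wstring offers the same operations as std::string, and how to check how wide wchar_t actually is on a given implementation. The function names wchar_bits and make_greeting are made up for this example.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Number of bits in wchar_t on this implementation.
// The standard does not fix this value, so check rather than assume 16.
inline std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// The usual basic_string operations work on std::wstring
// exactly as they do on std::string.
inline std::wstring make_greeting(const std::wstring& name) {
    std::wstring s = L"Hello, ";
    s += name;           // concatenation
    s.append(1, L'!');   // append a single wide character
    return s;
}
```

Calling make_greeting(L"world") yields the wide string L"Hello, world!". Whether those wide characters are really Unicode code units depends on the implementation, as noted above.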
--
James Kanze GABI Software, Sàrl
Conseils en informatique orienté objet --
-- Beratung in industrieller Datenverarbeitung
mailto: ka...@gabi-soft.fr mailto: James...@dresdner-bank.com
Using the wchar_t type is fine, but hard-wires the finished binary for
Unicode strings.
Are you developing for Windows NT? If so, you can write code which can be
rebuilt to cater for either ANSI or Unicode strings with the flip of a
couple of defines.
You could start by typedef'ing some string classes like this:
typedef std::basic_string<_TCHAR> TString;
typedef std::basic_ostringstream<_TCHAR> TStringStream;
If _UNICODE is defined (e.g. project settings or hard-wired into the
source code), TString and TStringStream become Unicode string classes and
all template operations work a treat. If _UNICODE is undefined, TString
and TStringStream become 8-bit string classes as per usual.
If _UNICODE is defined, UNICODE needs to be defined too so the Win32
Unicode API gets called instead of the ANSI API.
The string handling C run time functions should be replaced with their
text-mapped equivalents (e.g., strlen() is replaced by _tcslen()).
Character variables should be declared using the _TCHAR type (e.g. _TCHAR
strBuf[64]). String literals should be wrapped in the _T() or _TEXT()
text-mapping macros (e.g. _TCHAR mystring[] = _T("This is a string")). On
occasion you may find yourself having to convert between 8-bit and 16-bit
character arrays using the "wcstombs()" and "mbstowcs()" run time
routines. Using the above approach, defining _UNICODE and
rebuilding will generate a Unicode binary. Undefining _UNICODE will
result in an ANSI string binary.
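
The conversion step mentioned above can be sketched in standard C++ roughly as follows. The helper names to_wide and to_narrow are hypothetical, and the result depends on the current C locale (only plain ASCII is safe in the default "C" locale), so treat this as a sketch rather than production code.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Widen a narrow (multibyte) string using the current C locale.
// Returns an empty string if the input cannot be converted.
std::wstring to_wide(const std::string& in) {
    if (in.empty()) return std::wstring();
    // One wide character per input byte is always enough room.
    std::vector<wchar_t> buf(in.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], in.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(&buf[0], n);
}

// Narrow a wide string using the current C locale.
// Returns an empty string if the input cannot be converted.
std::string to_narrow(const std::wstring& in) {
    if (in.empty()) return std::string();
    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> buf(in.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(&buf[0], in.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    return std::string(&buf[0], n);
}
```

For anything beyond ASCII you would first need to select a suitable locale with setlocale(), which is outside the scope of this sketch.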
There is lots of help regarding this in the MSDN help libraries, although
you have to trawl through them for a while to make sense of it all.
Good luck.
__________
Sean Dynan
Senior Software Analyst
C-C-C Technology Ltd
sdy...@cccgroup.co.uk
Once you have everything represented internally in UCS-n, you have to
provide I/O in UTF-8 format for compatibility with legacy tools and
with other UTF-8 sources and sinks. UTF-8 is an encoding method that
is able to encode all 128 ASCII characters using one octet apiece
and contains escape bits to represent the remaining combinations,
with some characters reaching 5 or 6 octets (I can't remember). This
provides a compact and compatible solution. I'm not sure if it is
compatible with IBM's DBCS, but it would be nice if it were.
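
To make the encoding scheme concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name encode_utf8 is made up for this example. It assumes the range U+0000 to U+10FFFF, which needs at most 4 octets; the original UTF-8 design also defined 5 and 6 octet forms for larger values, which this sketch does not produce.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII: one octet, unchanged.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two octets: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Surrogate code points are not valid on their own.
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;
        // Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+0041 ('A') stays a single octet 0x41, while U+00E9 (e with acute accent) becomes the two octets 0xC3 0xA9.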
The unicode standard is on the web at:
and there is a C++ library called Rosette at:
I haven't evaluated it, but it supports UCS-2 and UTF-n
formats, and also conversion between a variety of character sets.
-John
For most projects, at any time you might find yourself faced with
a need to port the code to environments that do not support UTF-16
as well as one would wish; for example, code developed for NT might
at short notice have to be ported to Win98, and so on. If you
are using "bare" wchar_t, the porting can then be very troublesome.
This seems like a classic situation for using #ifdef:
#ifdef NO_UNICODE
typedef char char_t;
#else
typedef wchar_t char_t;
#endif
typedef std::basic_string<char_t> string_t;
// and so on
A few typedefs, and perhaps a few templates with specialization,
definitions of consts, etc., can make it decently easy to port.
The NT tricks are a bit less elegant (lot of preprocessor use,
since C is supported as well as C++), so, more care may indeed
be needed to avoid "subtle bugs", but the basic idea is the same,
and quite usable (the advantage is that all of the boilerplate code
has been written for you already, in <tchar.h> etc...). Still, rolling
your own will ease possible future porting to non-Win32 platforms,
and, since ease of porting is the whole idea here, the investment
(which is not all that much) can easily repay itself.
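
A slightly fuller sketch of the roll-your-own approach might look like this. The names LIT, ostringstream_t and greet are hypothetical and only there for illustration; NO_UNICODE, char_t and string_t follow the fragment above.

```cpp
#include <sstream>
#include <string>

#ifdef NO_UNICODE
typedef char char_t;
#define LIT(s) s        // narrow literal, left as written
#else
typedef wchar_t char_t;
#define LIT(s) L##s     // paste an L prefix onto the literal
#endif

typedef std::basic_string<char_t>        string_t;
typedef std::basic_ostringstream<char_t> ostringstream_t;

// The same source compiles for either character type.
string_t greet(const string_t& name) {
    ostringstream_t out;
    out << LIT("Hello, ") << name << LIT("!");
    return out.str();
}
```

Building with -DNO_UNICODE gives a plain char version and building without it gives a wchar_t version, with no change to the calling code as long as literals go through LIT().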
Alex
I think you'd be better off using the preprocessor
in the same way that the <tchar.h> header does.
Create a header file that maps all the character type
C++ entities (example given at end).
> If _UNICODE is defined, UNICODE needs to be defined too so the Win32
> Unicode API gets called instead of the ANSI API.
You should not define the symbol UNICODE (with no
underscore) yourself. You should only ever set
the symbol _UNICODE (with underscore). The Win32
API headers will internally set UNICODE depending
on whether _UNICODE is set.
--
+---------------------------------+
| Paul Grealish |
| GEOPAK-TMS Limited |
| Cambridge, England |
| paul.g...@uk.geopak-tms.com |
+---------------------------------+
#ifdef _UNICODE
#define _tstring std::wstring
#define _tcin std::wcin
#define _tcout std::wcout
#define _tcerr std::wcerr
#define _tclog std::wclog
#define _tios std::wios
#define _tstreambuf std::wstreambuf
#define _tistream std::wistream
#define _tostream std::wostream
#define _tiostream std::wiostream
#define _tstringbuf std::wstringbuf
#define _tistringstream std::wistringstream
#define _tostringstream std::wostringstream
#define _tstringstream std::wstringstream
#define _tfilebuf std::wfilebuf
#define _tifstream std::wifstream
#define _tofstream std::wofstream
#define _tfstream std::wfstream
#else
#define _tstring std::string
#define _tcin std::cin
#define _tcout std::cout
#define _tcerr std::cerr
#define _tclog std::clog
#define _tios std::ios
#define _tstreambuf std::streambuf
#define _tistream std::istream
#define _tostream std::ostream
#define _tiostream std::iostream
#define _tstringbuf std::stringbuf
#define _tistringstream std::istringstream
#define _tostringstream std::ostringstream
#define _tstringstream std::stringstream
#define _tfilebuf std::filebuf
#define _tifstream std::ifstream
#define _tofstream std::ofstream
#define _tfstream std::fstream
#endif
Because it's one less headache when the marketing dept., for example,
decides the product should also run on Windows 95.
--
__________
Sean Dynan
Senior Software Analyst
C-C-C Technology Ltd
sdy...@cccgroup.co.uk
You must mean UCS-2. I don't believe that NT uses UTF-16 very much.
NT tends to use UTF-8 for storage, so that the transformation to
ANSI display terminals is relatively straightforward. Remember that
Windows 95 supports ANSI and MBCS but not Unicode. Conversion from
ANSI to UTF-8 is direct.
UCS-2 is the two-octet character set. UTF-16 is a 16-bit transformation
format for all unicode encodings, including UCS-1, which does not
support much internationalization.
-John
No. Conversion from ASCII (with MSB unset) to UTF-8 is direct. Characters
not fitting in there are mapped to a multi-byte representation.
Thiemo Seufer
Even though you are discussing specifically Win32 platforms, I should
still point out for the benefit of others reading this thread that most
other 32-bit operating systems define wchar_t to be 32 bits wide, not
16 bits.
If you are considering writing cross-platform code, one of the great
shortcomings of standard C++ is the lack of a UNICODE character type
(such as runes on Plan 9, for example). Of course, wchar_t pretty much
precedes the gradual standardization on UNICODE.
Further, the issue of 16- v. 32-bit representation of UTF16 overlooks
important issues such as support for surrogates (such as Klingon :-) and
gaiji characters (Japanese), which fall outside of the first plane in
UNICODE, and thus require the full UCS-4 (32-bit) representation (unless
one is willing to use multi-character 16-bit sequences).
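
To show what the multi-character 16-bit sequences look like in practice, here is a minimal sketch that encodes one code point as UTF-16, using a surrogate pair for anything outside the first plane. The function name encode_utf16 is made up, and unsigned short is assumed to be a 16-bit type here, which the standard does not strictly guarantee.

```cpp
#include <vector>

// Encode one code point as UTF-16 code units.
// Assumes unsigned short holds at least 16 bits.
// Returns an empty vector for lone surrogates and values above U+10FFFF.
std::vector<unsigned short> encode_utf16(unsigned long cp) {
    std::vector<unsigned short> out;
    if (cp < 0x10000) {
        // First plane: one unit, unless it is a reserved surrogate value.
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;
        out.push_back(static_cast<unsigned short>(cp));
    } else if (cp <= 0x10FFFF) {
        // Other planes: split the 20-bit offset into two 10-bit halves.
        unsigned long v = cp - 0x10000;
        out.push_back(static_cast<unsigned short>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<unsigned short>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, U+10000 becomes the pair 0xD800 0xDC00. Code that assumes one 16-bit unit per character will miscount or split such pairs.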
This particular problem has bitten my employer (Inso Corporation) in
supporting XML across platforms. I finally recommended using a
specialization of basic_string combined with an inhouse-defined unsigned
16-bit integral type.
Take a gander at the UNICODE Consortium's standard (version 2.1 is
current) for more details at http://www.unicode.org.
--binkley