Unicode Implementation

Issac Alphonso

Jan 16, 1999, 3:00:00 AM

Hi,

Our group is moving towards using unicode characters for all i/o in
our systems. We are looking for a standard implementation of a unicode
class which handles all of the basic string functions in the context
of unicode characters. Could any of you point us to such a class or to
any other information regarding this?

Thanks for all your help in advance.

Best regards,

Issac Alphonso
Institute for Signal and Information Processing
WWW: http://www.isip.msstate.edu/

[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]
[ about comp.lang.c++.moderated. First time posters: do this! ]

Paul Grealish

Jan 18, 1999, 3:00:00 AM

Issac Alphonso wrote:
>
> Hi,
>
> Our group is moving towards using unicode characters for all i/o in
> our systems. We are looking for a standard implementation of a unicode
> class which handles all of the basic string functions in the context
> of unicode characters. Could any of you point us to such a class or to
> any other information regarding this?

Have you looked at std::wstring?
It's a specialization of template class
basic_string for elements of type wchar_t.
wchar_t is the 16-bit wide character (aka
Unicode) data type.

James...@dresdner-bank.com

Jan 18, 1999, 3:00:00 AM

In article <36A32C...@uk.geopak-tms.com>,

paul.g...@uk.geopak-tms.com wrote:
> Issac Alphonso wrote:
> >
> > Hi,
> >
> > Our group is moving towards using unicode characters for all i/o in
> > our systems. We are looking for a standard implementation of a unicode
> > class which handles all of the basic string functions in the context
> > of unicode characters. Could any of you point us to such a class or to
> > any other information regarding this?
>
> Have you looked at std::wstring?
> It's a specialization of template class
> basic_string for elements of type wchar_t.
> wchar_t is the 16-bit wide character (aka
> Unicode) data type.

Correction: wchar_t may be a 16-bit wide character. It may also be an
8-bit wide character. The standard makes no guarantees.

Realistically, *if* the implementation supports Unicode, I would expect
it to use wchar_t to do so.
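
The width a particular compiler gives wchar_t can be checked directly rather than assumed. A minimal sketch; the function name wchar_bits is purely illustrative and not part of any library:

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// Number of bits in wchar_t on this implementation.
// Often 16 on Win32 compilers and 32 on many Unix systems,
// but the standard does not promise either value.
inline std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Code that depends on a specific width can test this value at start-up and refuse to run if the assumption does not hold.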

--
James Kanze GABI Software, Sàrl
Conseils en informatique orienté objet --
-- Beratung in industrieller Datenverarbeitung
mailto: ka...@gabi-soft.fr mailto: James...@dresdner-bank.com


Sean Dynan

Jan 19, 1999, 3:00:00 AM

You wrote:
> > Our group is moving towards using unicode characters for all i/o in
> > our systems. We are looking for a standard implementation of a unicode
> > class which handles all of the basic string functions in the context
> > of unicode characters. Could any of you point us to such a class or to
> > any other information regarding this?
>
> Have you looked at std::wstring?
> It's a specialization of template class
> basic_string for elements of type wchar_t.
> wchar_t is the 16-bit wide character (aka
> Unicode) data type.

Using the wchar_t type is fine, but hard-wires the finished binary for
Unicode strings.

Are you developing for Windows NT? If so, you can write code which can be
rebuilt to cater for either ANSI or Unicode strings with the flip of a
couple of define's.

You could start by typedef'ing some string classes like this:

typedef std::basic_string<_TCHAR> TString;
typedef std::basic_ostringstream<_TCHAR> TStringStream;

If _UNICODE is defined (e.g. project settings or hard-wired into the
source code), TString and TStringStream become Unicode string classes and
all template operations work a treat. If _UNICODE is undefined, TString
and TStringStream become 8-bit string classes as per usual.

If _UNICODE is defined, UNICODE needs to be defined too so the Win32
Unicode API gets called instead of the ANSI API.

The string handling C run time functions should be replaced with their
text-mapped equivalents (e.g., strlen() is replaced by _tcslen()).
Character variables should be declared using the _TCHAR type (e.g. _TCHAR
strBuf[64]). String literals should be wrapped in the _T() or _TEXT()
text-mapping macros (e.g. _TCHAR mystring[] = _T("This is a string")). On
occasion you may find yourself having to convert between 8-bit and 16-bit
character arrays (and vice-versa) using the "wcstombs()" and "mbstowcs()"
run time routines. Using the above approach, defining _UNICODE and
rebuilding will generate a Unicode binary. Undefining _UNICODE will
result in an ANSI string binary.
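
The narrow-to-wide conversion mentioned above might look roughly like the following. This is only a sketch: the helper name widen is hypothetical, and the result depends on the current C locale (in the default "C" locale only plain ASCII is certain to convert).

```cpp
#include <cstddef>   // std::size_t
#include <cstdlib>   // std::mbstowcs
#include <string>
#include <vector>

// Convert a narrow (multibyte) string to a wide string using the
// current C locale. Returns an empty string if the input contains
// a byte sequence that is invalid in that locale.
std::wstring widen(const std::string& narrow)
{
    // A wide string never needs more elements than the narrow one has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(&buf[0], n);
}
```

The opposite direction follows the same pattern with wcstombs(), except that the output buffer may need several bytes per wide character.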

There is lots of help regarding this in the MSDN help libraries, although
you have to trawl through them for a while to make sense of it all.

Good luck.
__________
Sean Dynan
Senior Software Analyst
C-C-C Technology Ltd
sdy...@cccgroup.co.uk

John Duncan

Jan 19, 1999, 3:00:00 AM

Er, the typical wchar_t is actually UCS-2, or a two-octet character set,
which is able to represent most of the world's characters. The standard
also defines UCS-4, a four-octet character set, which does represent
all of the characters in the world.

Once you have everything represented internally in UCS-n, you have to
provide I/O in UTF-8 format for compatibility with legacy tools and
with other UTF-8 sources and sinks. UTF-8 is an encoding method that
is able to encode all 128 ASCII characters using one octet apiece
and contains escape bits to represent the remaining combinations,
with some characters reaching 5 or 6 octets (I can't remember). This
provides a compact and compatible solution. I'm not sure if it is
compatible with IBM's DBCS, but it would be nice if it were.
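
The encoding scheme described above can be sketched as follows. The function name utf8_encode is illustrative only. The sketch covers code points up to U+10FFFF, which need at most four octets; the five- and six-octet forms belong to the original ISO 10646 definition for 31-bit values and are left out here. No check is made for surrogate values.

```cpp
#include <string>

// Encode a single code point as a UTF-8 octet sequence.
// Returns an empty string for values above U+10FFFF.
std::string utf8_encode(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                 // up to 7 bits: one octet, same as ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // up to 11 bits: two octets
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // up to 16 bits: three octets
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x110000) {      // up to 21 bits: four octets
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The high bits of the first octet say how many octets follow, and every continuation octet starts with the bits 10, so a reader can resynchronise from any position in the stream.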

The unicode standard is on the web at:

http://www.unicode.org/

and there is a C++ library called Rosette at:

http://unicode.basistech.com/

I haven't evaluated it, but it supports UCS-2 and UTF-n
formats, and also conversion between a variety of character sets.

-John

Alex Martelli

Jan 20, 1999, 3:00:00 AM

Alan Bellingham wrote in message <36aaafbb....@news.lspace.org>...
[snip]

>>Are you developing for Windows NT? If so, you can write code which can be
>>rebuilt to cater for either ANSI or Unicode strings with the flip of a
>>couple of define's.
>

>But why bother? If you've coped with the concept of UTF-16 [1], why not
>stick with it?
>
>If you've decided there's benefit in going up to 16-bit chars, why not
>stick with it. Allowing a compiler switch to flip between them is going
>to lead to some very subtle and hard to find bugs, if you're not very
>careful.

For most projects, at any time you might find yourself faced with
a need to port the code to environments that do not support UTF-16
as well as one would wish; for example, code developed for NT might
on short notice be required to be ported to Win98, etc. If you
are using "bare" wchar_t, the porting can then be very troublesome.

This seems like a classic situation for using #ifdef:

#include <string>

#ifdef NO_UNICODE
typedef char char_t;
#else
typedef wchar_t char_t;
#endif
typedef std::basic_string<char_t> string_t;
// and so on

A few typedef's, and perhaps a few templates with specialization,
definition of const's, etc, can make it decently easy to port.

The NT tricks are a bit less elegant (lot of preprocessor use,
since C is supported as well as C++), so, more care may indeed
be needed to avoid "subtle bugs", but the basic idea is the same,
and quite usable (the advantage is that all of the boilerplate code
has been written for you already, in <tchar.h> etc...). Still, rolling
your own will ease possible future porting to non-Win32 platforms,
and, since ease of porting is the whole idea here, the investment
(which is not all that much) can easily repay itself.
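
A slightly fuller sketch of the roll-your-own approach, adding a literal macro so that string constants follow the chosen character type. The names STR and greeting are hypothetical, introduced only for illustration; NO_UNICODE is the switch from the typedefs above.

```cpp
#include <string>

#ifdef NO_UNICODE
typedef char char_t;
#define STR(s) s        // narrow string literal
#else
typedef wchar_t char_t;
#define STR(s) L##s     // wide string literal
#endif

typedef std::basic_string<char_t> string_t;

// Compiles unchanged in either build, since both the literal
// and the string class follow char_t.
inline string_t greeting(const string_t& name)
{
    return STR("Hello, ") + name;
}
```

A call such as greeting(STR("world")) then yields a narrow or a wide string depending only on whether NO_UNICODE is defined at build time.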

Alex

Paul Grealish

Jan 21, 1999, 3:00:00 AM

Sean Dynan wrote:
>
> (snip)

>
> Are you developing for Windows NT? If so, you can write code which can be
> rebuilt to cater for either ANSI or Unicode strings with the flip of a
> couple of define's.
>

> You could start by typedef'ing some string classes like this:
>
> typedef std::basic_string<_TCHAR> TString;
> typedef std::basic_ostringstream<_TCHAR> TStringStream;
>
> If _UNICODE is defined (e.g. project settings or hard-wired into the
> source code), TString and TStringStream become Unicode string classes and
> all template operations work a treat. If _UNICODE is undefined, TString
> and TStringStream become 8-bit string classes as per usual.

I think you'd be better off using the preprocessor
in the same way that the <tchar.h> header does.
Create a header file that maps all the character-type C++
entities (example given at end).

> If _UNICODE is defined, UNICODE needs to be defined too so the Win32
> Unicode API gets called instead of the ANSI API.

You should not define the symbol UNICODE (with no
underscore) yourself. You should only ever set
the symbol _UNICODE (with underscore). The Win32
API headers will internally set UNICODE depending
on whether _UNICODE is set.

#ifdef _UNICODE
#define _tstring std::wstring
#define _tcin std::wcin
#define _tcout std::wcout
#define _tcerr std::wcerr
#define _tclog std::wclog
#define _tios std::wios
#define _tstreambuf std::wstreambuf
#define _tistream std::wistream
#define _tostream std::wostream
#define _tiostream std::wiostream
#define _tstringbuf std::wstringbuf
#define _tistringstream std::wistringstream
#define _tostringstream std::wostringstream
#define _tstringstream std::wstringstream
#define _tfilebuf std::wfilebuf
#define _tifstream std::wifstream
#define _tofstream std::wofstream
#define _tfstream std::wfstream
#else
#define _tstring std::string
#define _tcin std::cin
#define _tcout std::cout
#define _tcerr std::cerr
#define _tclog std::clog
#define _tios std::ios
#define _tstreambuf std::streambuf
#define _tistream std::istream
#define _tostream std::ostream
#define _tiostream std::iostream
#define _tstringbuf std::stringbuf
#define _tistringstream std::istringstream
#define _tostringstream std::ostringstream
#define _tstringstream std::stringstream
#define _tfilebuf std::filebuf
#define _tifstream std::ifstream
#define _tofstream std::ofstream
#define _tfstream std::fstream
#endif

Sean Dynan

Jan 21, 1999, 3:00:00 AM

You wrote:

> sdy...@cccgroup.co.uk (Sean Dynan) wrote:
>
> >Using the wchar_t type is fine, but hard-wires the finished binary for
> >Unicode strings.
> >

> >Are you developing for Windows NT? If so, you can write code which can be
> >rebuilt to cater for either ANSI or Unicode strings with the flip of a
> >couple of define's.
>

> But why bother? If you've coped with the concept of UTF-16 [1], why not
> stick with it?

Because it's one less headache when the marketing dept., for example,
decides the product should also run on Windows 95.
--

__________
Sean Dynan
Senior Software Analyst
C-C-C Technology Ltd
sdy...@cccgroup.co.uk

John Duncan

Jan 21, 1999, 3:00:00 AM

>But why bother? If you've coped with the concept of UTF-16 [1], why not
>stick with it?

You must mean UCS-2. I don't believe that NT uses UTF-16 very much.
NT tends to use UTF-8 for storage, so that the transformation to
ANSI display terminals is relatively straightforward. Remember that
Windows 95 supports ANSI and MBCS but not Unicode. Conversion from
ANSI to UTF-8 is direct.

UCS-2 is the two-octet character set. UTF-16 is a 16-bit transformation
format for all unicode encodings, including UCS-1, which does not
support much internationalization.

-John

Thiemo Seufer

Jan 22, 1999, 3:00:00 AM

John Duncan wrote in message <78593a$lq6$1...@usenet01.srv.cis.pitt.edu>...

>>But why bother? If you've coped with the concept of UTF-16 [1], why not
>>stick with it?
>
>You must mean UCS-2. I don't believe that NT uses UTF-16 very much.
>NT tends to use UTF-8 for storage, so that the transformation to
>ANSI display terminals is relatively straightforward. Remember that
>Windows 95 supports ANSI and MBCS but not Unicode. Conversion from
>ANSI to UTF-8 is direct.

No. Conversion from ASCII (with MSB unset) to UTF-8 is direct. Characters
not fitting in there are mapped to a multi-byte representation.
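
To make the difference concrete, here is a sketch of a converter, assuming the 8-bit input is ISO 8859-1 (Latin-1), where each byte value equals its Unicode code point. That is an assumption: a Windows "ANSI" code page generally differs from Latin-1 and would need a lookup table instead. The function name latin1_to_utf8 is illustrative.

```cpp
#include <string>

// Convert ISO 8859-1 (Latin-1) text to UTF-8.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            // MSB unset: plain ASCII, copied through unchanged.
            out += static_cast<char>(c);
        } else {
            // MSB set: becomes a two-octet sequence.
            out += static_cast<char>(0xC0 | (c >> 6));    // lead octet
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation octet
        }
    }
    return out;
}
```

Only the bytes below 0x80 survive as single octets; every other byte grows to two, so 8-bit text with accented letters is not already valid UTF-8.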

Thiemo Seufer

B. K. Oxley (binkley) at Home

Jan 28, 1999, 3:00:00 AM

James...@dresdner-bank.com wrote:
>
> In article <36A32C...@uk.geopak-tms.com>,
> paul.g...@uk.geopak-tms.com wrote:
> > Issac Alphonso wrote:
> > >
> > > Hi,
> > >

> > > Our group is moving towards using unicode characters for all i/o in
> > > our systems. We are looking for a standard implementation of a unicode
> > > class which handles all of the basic string functions in the context
> > > of unicode characters. Could any of you point us to such a class or to
> > > any other information regarding this?
> >
> > Have you looked at std::wstring?
> > It's a specialization of template class
> > basic_string for elements of type wchar_t.
> > wchar_t is the 16-bit wide character (aka
> > Unicode) data type.
>

> Correction: wchar_t may be a 16-bit wide character. It may also be an
> 8-bit wide character. The standard makes no guarantees.

Even though you are discussing specifically Win32 platforms, I should
still point out for the benefit of others reading this thread that most
other 32-bit operating systems define wchar_t to be 32 bits wide, not
16 bits.

If you are considering writing cross-platform code, one of the great
shortcomings of standard C++ is the lack of a UNICODE character type
(such as runes on Plan 9, for example). Of course, wchar_t pretty much
precedes the gradual standardization on UNICODE.

Further, the issue of 16- v. 32-bit representation of UTF-16 overlooks
important issues such as support for surrogates (such as Klingon :-) and
gaiji characters (Japanese), which fall outside of the first plane in
UNICODE, and thus require the full UCS-4 (32-bit) representation (unless
one is willing to use multi-character 16-bit sequences).

This particular problem has bitten my employer (Inso Corporation) in
supporting XML across platforms. I finally recommended using a
specialization of basic_string combined with an inhouse-defined unsigned
16-bit integral type.
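
As a rough illustration of the multi-unit 16-bit sequences mentioned above, the sketch below turns one code point into UTF-16 code units. The names utf16_unit and utf16_encode are hypothetical and are not the in-house types referred to here; unsigned short is assumed to be 16 bits wide.

```cpp
#include <vector>

typedef unsigned short utf16_unit;   // assumed to be 16 bits wide

// Encode one code point as one or two UTF-16 code units.
// Returns an empty vector for values above U+10FFFF.
std::vector<utf16_unit> utf16_encode(unsigned long cp)
{
    std::vector<utf16_unit> out;
    if (cp < 0x10000) {
        // First plane: a single 16-bit unit.
        out.push_back(static_cast<utf16_unit>(cp));
    } else if (cp < 0x110000) {
        // Outside the first plane: split the remaining 20 bits
        // across a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<utf16_unit>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<utf16_unit>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}
```

With this scheme a string of 16-bit units can hold any character, at the cost that the number of units no longer equals the number of characters.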

Take a gander at the UNICODE Consortium's standard (version 2.1 is
current) for more details at http://www.unicode.org.

--binkley