
New project - unicode or not?


JiiPee

Jan 12, 2016, 7:21:58 PM
If a program uses strings (in files, the user interface, etc.), what is
the best default character set? Unicode? Does it change anything if all
current users are English speakers? Should I choose string or wstring by
default in my projects? Does wstring work mostly the same as string; do
most of the STL functions work the same way as if I used string? I mean,
if I use (only) English letters, do the wstring versions of the STL
functions (like find and replace, insert, delete, parse) work the same
way as the string versions?

Specifically for MFC:
In a Visual Studio MFC project, I guess it's better to use wstring
rather than CString to manipulate the text in files?
Let's say, for example, that I have to split the text in a file by commas:
22, 3, 55, 6, 7, 23

and find the integers in it. Is it better to use wstring or CString to
parse that? I kind of like wstring, but then I have to convert wstring
to CString later on in the code, because MFC uses CStrings.

Nobody

Jan 12, 2016, 8:25:57 PM
On Wed, 13 Jan 2016 00:21:40 +0000, JiiPee wrote:

> If the program uses string (in files, user interface etc), what is best
> to be the default character set type? Unicode? How about if currently
> all users are english, does it change anything? So should I chose string
> or wstring by default in my projects? Does wstring work mostly the same
> as string; do most of the STL functions work the same way as if I used
> string? I mean if I use (only) english letters, does wstring version of
> STL functions (like find and replace, insert, delete, parse) work the
> same way as string versions?

std::string is a typedef for std::basic_string<char>.

std::wstring is a typedef for std::basic_string<wchar_t>.

Both support the same methods. std::basic_string<T> is more or less
std::vector<T> with a few more methods.
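
A minimal sketch of what "same methods" means in practice, using the
comma-separated integer list from the original question. The helper
name parse_int_list is hypothetical (it is not from the thread or from
any library); it is written once as a template over the character type
and relies only on find, substr and std::stoi, which exist for both
std::string and std::wstring.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: parse a comma-separated list of integers from any
// std::basic_string<CharT>, so one body serves std::string and std::wstring.
template <typename CharT>
std::vector<int> parse_int_list(const std::basic_string<CharT>& text)
{
    std::vector<int> values;
    std::size_t start = 0;
    while (start <= text.size()) {
        // find() and npos are members of basic_string for every CharT.
        std::size_t comma = text.find(CharT(','), start);
        if (comma == std::basic_string<CharT>::npos)
            comma = text.size();
        const std::basic_string<CharT> field = text.substr(start, comma - start);
        // std::stoi is overloaded for std::string and std::wstring. It skips
        // leading whitespace and throws std::invalid_argument on an empty or
        // non-numeric field.
        values.push_back(std::stoi(field));
        start = comma + 1;
    }
    return values;
}

// Usage:
//   parse_int_list(std::string("22, 3, 55, 6, 7, 23"));
//   parse_int_list(std::wstring(L"22, 3, 55, 6, 7, 23"));
// Both calls yield the same six integers.
```

For plain English (ASCII) input the two instantiations behave
identically; the differences only show up at the boundaries, where the
text has to be converted to or from another encoding.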

The biggest difference is that if you use std::wstring, you will
inevitably find yourself having to convert to std::string and/or char*
occasionally or even frequently. If you use std::string, it's often
possible to never need to use std::wstring or wchar_t* for anything
(although this is somewhat less viable on Windows).

OTOH, the Windows OS functions all use wide strings (wchar_t*) as their
"string" type (filenames, registry keys, etc). The versions which take
char* are just shallow wrappers around the wchar_t* functions. If you want
to be able to open any file, regardless of the current locale, you need to
use the wide-string functions (and you'll need to use the non-standard
fstream constructors/methods to open such files as fstreams).

For files, the default choice should be UTF-8 if you actually need to
treat the data as text (e.g. you need to use <cctype> functions or
convert to wide strings or whatever).

If the data is almost entirely ASCII, you need to be able to "deal with"
whatever the user throws at it, and it doesn't matter if non-ASCII
characters aren't handled entirely correctly, using ISO-8859-* has the
advantage that decoding never fails (any sequence of bytes is valid). So
if the program reads a file that's actually some other "extended ASCII"
encoding, you get a few mojibake characters where UTF-8 would give you a
decoding error.
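
The trade-off above can be sketched as follows. Both function names are
hypothetical and the code is only an illustration: decode_latin1 can
never reject its input, while is_valid_utf8 reports a failure for the
same bytes when they are not well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// ISO-8859-1 decoding: each byte maps to the code point with the same
// value, so every byte sequence decodes (possibly to mojibake).
std::u32string decode_latin1(const std::string& bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char32_t>(b));
    return out;
}

// UTF-8 check: returns false for stray or missing continuation bytes,
// truncated sequences, overlong forms, surrogates and values > U+10FFFF.
bool is_valid_utf8(const std::string& bytes)
{
    static const char32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;                              // invalid lead byte
        if (bytes.size() - i <= extra) return false;    // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;       // not a continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                               // overlong or out of range
        i += extra + 1;
    }
    return true;
}
```

For example, the Latin-1 bytes "caf\xE9" fail the UTF-8 check, whereas
feeding the UTF-8 bytes "caf\xC3\xA9" to decode_latin1 succeeds but
yields five code points instead of four, which is the mojibake case
described above.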

Jorgen Grahn

Jan 13, 2016, 4:26:44 AM
On Wed, 2016-01-13, JiiPee wrote:
> If the program uses string (in files, user interface etc), what is best
> to be the default character set type? Unicode? How about if currently
> all users are english, does it change anything? So should I chose string
> or wstring by default in my projects? Does wstring work mostly the same
> as string; do most of the STL functions work the same way as if I used
> string? I mean if I use (only) english letters, does wstring version of
> STL functions (like find and replace, insert, delete, parse) work the
> same way as string versions?
>
> Specifically for MFC:
> And in Visual Studio MFC project [...]

I won't try to answer, but is this project a Windows-only thing?
I think the right answer will be different for Windows, or for other
environments.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

seeplus

Jan 13, 2016, 4:49:50 AM
On Wednesday, January 13, 2016 at 11:21:58 AM UTC+11, JiiPee wrote:
>
> Specifically for MFC:
> And in Visual Studio MFC project I guess its better to use wstring
> rather than CString to manipulate the texts in files?

For any future MS/MFC projects you MUST use Unicode.
As of VS2013, MBCS support in MFC is deprecated and the MBCS libraries
are no longer part of the default install.
You can still get the MBCS libraries as an add-on for later VS versions,
but it doesn't seem to be available for Community.

MS >> "The goal is to remove MBCS support entirely in a subsequent release.
MFC would then support only Unicode".

CString always works, at least for English, and there is always a
suitable converter available both ways for just about any other string
type.

Paavo Helde

Jan 13, 2016, 6:13:28 AM
(A word of warning: most of the following is pretty Windows/MFC specific.)

If your projects are Windows-only, then always use UNICODE and use
std::wstring. In this case the CString and std::wstring ought to be
functionally the same, both should use UTF-16 encoding. The std::wstring
has a bit safer and nicer interface than CString IMO, but it's up to
your preferences which to use.

For better control it might be useful to suppress automatic ANSI-Unicode
conversions built into CString. For that
#define _CSTRING_DISABLE_NARROW_WIDE_CONVERSION
before including MFC headers. The default conversion can then be
restored by CA2W and CW2A where appropriate:
    } catch (const std::exception& e) {
        AfxMessageBox(CA2W(e.what()));
    }

If your projects are portable e.g. to Linux, then use std::string with
UTF-8 encoding. CStrings cannot be made to use UTF-8, so they should
stay at UNICODE/UTF-16. Inside MFC this means that you cannot use
default conversions or CA2W, instead you need to define your own UTF-8
<--> UTF-16 helper functions and translate the strings immediately
before passing or retrieving them to/from MFC.

    } catch (const std::exception& e) {
        AfxMessageBox(UTF8To16(e.what()));
    }

(implementing UTF8To16() left as an exercise to the reader). Note that
for avoiding accidental invalid conversions it is essential to #define
_CSTRING_DISABLE_NARROW_WIDE_CONVERSION if your narrow strings are in UTF-8.
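
One possible shape for the UTF8To16() helper mentioned above, as a
portable sketch only. It is an assumption of this example that the
helper returns std::u16string and throws std::invalid_argument on
malformed input; in an MFC build one would more likely return
std::wstring or CStringW (wchar_t is 16 bits wide on Windows) and
could delegate to the Win32 function MultiByteToWideChar with CP_UTF8
instead of decoding by hand.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch of a UTF-8 -> UTF-16 conversion. Return type and error handling
// are choices made for this example, not something the thread specifies.
std::u16string UTF8To16(const std::string& utf8)
{
    static const char32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else throw std::invalid_argument("UTF8To16: invalid lead byte");
        if (utf8.size() - i <= extra)
            throw std::invalid_argument("UTF8To16: truncated sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("UTF8To16: bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("UTF8To16: invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

Throwing on bad input is a deliberate choice here: a narrow string that
is not valid UTF-8 usually means a string in some other encoding leaked
in, which is exactly the kind of accident the #define is meant to catch
at compile time.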


HTH
Paavo









JiiPee

Jan 13, 2016, 8:14:10 AM
Yes, the one I am working on currently is. But the same question applies
to all projects, though.
Well, I currently write programs only for Windows, so for me it's only
Windows. But obviously there is a possibility that the programs will
need to run on Linux etc. too, so those things might need consideration.
So far all my programs run only on Windows.

JiiPee

Jan 13, 2016, 8:21:59 AM
Yes, it works, but is it the most efficient choice, and better than the
STL? The STL might have better functionality?

JiiPee

Jan 13, 2016, 8:29:41 AM
On 13/01/2016 11:13, Paavo Helde wrote:
> On 13.01.2016 2:21, JiiPee wrote:
>> If the program uses string (in files, user interface etc), what is best
>> to be the default character set type? Unicode? How about if currently
>> all users are english, does it change anything? So should I chose string
>> or wstring by default in my projects? Does wstring work mostly the same
>> as string; do most of the STL functions work the same way as if I used
>> string? I mean if I use (only) english letters, does wstring version of
>> STL functions (like find and replace, insert, delete, parse) work the
>> same way as string versions?
>>
>> Specifically for MFC:
>> And in Visual Studio MFC project I guess its better to use wstring
>> rather than CString to manipulate the texts in files?
>> Like lets say for example I have to split the text in file by commas:
>> 22, 3, 55, 6, 7, 23
>>
>> and find the integers what are there. Is better to use wstring or
>> CString to parse that? I kind of like wstring, but then I have to
>> convert wstring to CString later on in the code....because MFC uses
>> CStrings.
>
> (A word of warning: most of the following is pretty Windows/MFC
> specific.)
>
> If your projects are Windows-only, then always use UNICODE and use
> std::wstring. In this case the CString and std::wstring ought to be
> functionally the same, both should use UTF-16 encoding.

OK, this was a good answer. I'll save it.

> The std::wstring has a bit safer and nicer interface than CString IMO,
> but it's up to your preferences which to use.

The STL also has more functionality, doesn't it? And it works with C++14.
One more thing: if I use wstring to manipulate the file contents first
and then pass the result to CString, what is the best way to copy from
the STL string to CString? They both use UTF-16, so can I just copy the
data byte by byte or char by char?

Alf P. Steinbach

Jan 13, 2016, 8:59:33 AM
On 1/13/2016 1:21 AM, JiiPee wrote:
> If the program uses string (in files, user interface etc), what is best
> to be the default character set type? Unicode? How about if currently
> all users are english, does it change anything? So should I chose string
> or wstring by default in my projects? Does wstring work mostly the same
> as string; do most of the STL functions work the same way as if I used
> string? I mean if I use (only) english letters, does wstring version of
> STL functions (like find and replace, insert, delete, parse) work the
> same way as string versions?
>
> Specifically for MFC:
> And in Visual Studio MFC project I guess its better to use wstring
> rather than CString to manipulate the texts in files?

Not necessarily. As I recall CString provides practical printf-like
formatting functions and practical conversions. It's like, when in Rome,
do as the Romans, and when in Rama, do as the Ramans.


> Like lets say for example I have to split the text in file by commas:
> 22, 3, 55, 6, 7, 23
>
> and find the integers what are there. Is better to use wstring or
> CString to parse that? I kind of like wstring, but then I have to
> convert wstring to CString later on in the code....because MFC uses
> CStrings.

It may be that CString has some awful unpleasantness that means you
should avoid it. I can't recall. But unless it's like that, I'd use the
framework's string type, the same as when using e.g. Qt: use their
strings.

Do note that Microsoft's documentation specifically states that the C
level `setlocale` does not accept UTF-8, that it will fail at any
attempt to set an UTF-8 locale. So UTF-8 is not a practical proposition
for narrow (`char`-based) strings in Windows. Even if a lot of Unix-land
programs have been ported to or co-developed for Windows with UTF-8
narrow string convention, so that there is absolutely no lack of people
who will recommend this and tell you that it works all fine & good, no
problemo. But I've never been particularly impressed by allegedly
Windows programs that e.g. don't manage to handle spaces in paths, or
paths with non-ASCII characters. I remember my early attempts to install
the Qt animal: the installation dialog was unable to handle the
backspace key (yes, it's true!), and I did NOT laugh. They just don't
test things for Windows, and then /believe/ that it works.

Cheers & hth.,

- Alf

Jorgen Grahn

Jan 13, 2016, 9:20:05 AM
On Wed, 2016-01-13, JiiPee wrote:
> On 13/01/2016 09:26, Jorgen Grahn wrote:
>> On Wed, 2016-01-13, JiiPee wrote:
>>> If the program uses string (in files, user interface etc), what is best
>>> to be the default character set type? Unicode? How about if currently
>>> all users are english, does it change anything? So should I chose string
>>> or wstring by default in my projects? Does wstring work mostly the same
>>> as string; do most of the STL functions work the same way as if I used
>>> string? I mean if I use (only) english letters, does wstring version of
>>> STL functions (like find and replace, insert, delete, parse) work the
>>> same way as string versions?
>>>
>>> Specifically for MFC:
>>> And in Visual Studio MFC project [...]
>> I won't try to answer, but is this project a Windows-only thing?
>> I think the right answer will be different for Windows, or for other
>> environments.
>>
>>
>
> yes this one is what am doing currently. But the same question applies
> to all projects though.

Yes, but not the same /answer/, which was my point.

> oh.. well, I currently do only programs for Windows... so for me its
> only Windows. But obviously there is a possiblitity that programs need
> to run also on linux etc, so those things might need consideration. So
> far all my programs run only in Windows

It's possible that if you need to be portable between Windows
and the rest of the world, you need a third answer ... but I'm
no expert.

Öö Tiib

Jan 13, 2016, 10:24:59 AM
Typically MSVC is configured so that both contain exactly the same UTF-16, so:

std::wstring ws;

// ... do stuff with ws

CString cs = ws.c_str();

// ... do stuff with cs

If it gives warnings or errors about that, then something is
apparently misconfigured.

Additionally, if you happen to use Windows COM (and that likely
happens sooner or later, since a lot of the better things in
Windows are COM), then there is BSTR (again the same UTF-16
text) and the '_bstr_t' wrapper class for it.

JiiPee

Jan 13, 2016, 11:15:37 AM
On 13/01/2016 15:24, Öö Tiib wrote:
> Typically MSVC is configured so both contain exactly same UTF-16 so:
>
> std::wstring ws;
>
> // ... do stuff with ws
>
> CString cs = ws.c_str();

Oh OK, it works that easily. Although, if I remember correctly, I tried
to copy like this before and it did not work... but maybe I remember
wrongly. I'll try this.

JiiPee

Jan 13, 2016, 11:24:58 AM
On 13/01/2016 13:59, Alf P. Steinbach wrote:
> On 1/13/2016 1:21 AM, JiiPee wrote:
>> If the program uses string (in files, user interface etc), what is best
>> to be the default character set type? Unicode? How about if currently
>> all users are english, does it change anything? So should I chose string
>> or wstring by default in my projects? Does wstring work mostly the same
>> as string; do most of the STL functions work the same way as if I used
>> string? I mean if I use (only) english letters, does wstring version of
>> STL functions (like find and replace, insert, delete, parse) work the
>> same way as string versions?
>>
>> Specifically for MFC:
>> And in Visual Studio MFC project I guess its better to use wstring
>> rather than CString to manipulate the texts in files?
>
> Not necessarily. As I recall CString provides practical printf-like
> formatting functions and practical conversions. It's like, when in
> Rome, do as the Romans, and when in Rama, do as the Ramans.
>

I see, but am I right that wstring has more functionality? Is the STL
richer than MFC (CString)?
Also, if I use third-party libraries (like an XML parser/writer to store
settings in a file), then they probably use the STL, right?
I am also a bit more familiar with wstring and the STL than with
MFC/CString at the moment, although CString is easy to use too. But the
string helper functions/tools I have written are all done with the STL.
Are these things important to consider as well?

I am not talking about handling strings in dialog boxes; of course I use
CString there. I am talking about handling files and the text in them:
reading text from files and storing it.

Paavo Helde

Jan 13, 2016, 11:58:44 AM
Make sure you configure your MS projects as "Use Unicode character set"
(that's MS-speak for UTF-16). You can add this somewhere in your code to
be sure:

#if !defined(_UNICODE) || !defined(UNICODE)
#error Something wrong!
#endif

HTH
Paavo

Öö Tiib

Jan 13, 2016, 12:11:38 PM
Try to use a modular architecture. Try to limit each module to a single
class for representing text. Convert to another kind only at the
interface with another module that uses a different text class (if
there is one).

If later profiling shows that a lot of conversions are taking place at
some interface between two modules, that indicates that responsibilities
are unclear and have diffused between the two modules.

Paavo Helde

Jan 13, 2016, 12:24:42 PM
On 13.01.2016 18:24, JiiPee wrote:

> I see, but am I right that wstring has more functionality? STL is richer
> than MFC (CString)?

Depends on viewpoint. CString is better integrated into MFC,
std::wstring is better integrated into standard C++.

> Also if I use third party libraries (like xml parser/writer to store
> setting to a file) then they propably use STL, right?
> Also am a bit more familiar currently with wstring & STL than
> MFC/CString... although its easy to use CString also. But I mean my
> helper functions/tools for strings I create they are all done in STL.
> Are these things important to consider as well?
>
> I am not talking about handling strings in dialog boxes.... of course
> there I use CString... but am talking about handling files and text in
> them. Reading text from files and storing .

By all means, if you have already some code using
std::string/std::wstring, or there is any remote chance the code will be
ported to other platforms, use std::wstring as much as possible, and
CString as little as possible.

Beware that for files there are two topics here: the filename itself and
the file content. Unicode support for both of these topics is
non-trivial and full of gotchas. Standard C++ works pretty well only in
Linux world where everything is typically in UTF-8, including filenames
and file contents. But anyway, these issues are not related to whether
one uses CString or std::wstring, these both would work basically to the
same extent.

Cheers
Paavo


Alf P. Steinbach

Jan 13, 2016, 2:12:38 PM
On 1/13/2016 6:15 PM, Stefan Ram wrote:
> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>> JiiPee <n...@notvalid.com> writes:
>>> Specifically for MFC:
>>> And in Visual Studio MFC project I guess its better to use wstring
>>> rather than CString to manipulate the texts in files?
>> I think with windows internally, one uses UTF-16 (formerly UCS-2
>> and TCHAR and before that Windows-1252), but to exchange data with
>> other programs, UTF-8.
>
> And an interesting web page in this regard is
>
> utf8everywhere.org

It's interesting -- psychologically. ;-)

It would also be nice if every OS was Linux-based.

Would simplify things, really.


Cheers,

- Alf :-p

(Reminds me of the time I proposed to ditch everything of the standard
library except the STL subset. I thought it was clearly a joke. It was
taken seriously.)

Jorgen Grahn

Jan 13, 2016, 2:41:43 PM
On Wed, 2016-01-13, Alf P. Steinbach wrote:
> On 1/13/2016 6:15 PM, Stefan Ram wrote:
>> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>> JiiPee <n...@notvalid.com> writes:
>>>> Specifically for MFC:
>>>> And in Visual Studio MFC project I guess its better to use wstring
>>>> rather than CString to manipulate the texts in files?
>>> I think with windows internally, one uses UTF-16 (formerly UCS-2
>>> and TCHAR and before that Windows-1252), but to exchange data with
>>> other programs, UTF-8.
>>
>> And an interesting web page in this regard is
>>
>> utf8everywhere.org
>
> It's interesting -- psychologically. ;-)
>
> It would also be nice if every OS was Linux-based.
>
> Would simplify things, really.

In the FAQ part of that page:

Q: Are you a linuxer? Is this a concealed religious fight against
Windows?
A: No, I grew up on Windows, and I am primarily a Windows
developer. I believe Microsoft made a wrong design choice in
the text domain, because they did it earlier than others.

Öö Tiib

Jan 13, 2016, 3:47:00 PM
It is likely correct that Microsoft made a wrong choice (and dragged
Java and Qt along), but why should an application developer fix it?
Converting between UTF-8 and UTF-16 LE is cheap. If someone wants
to do that in their Windows app without a clear need, it likely won't
matter either way. However, it is pointless to pretend that doing so
will somehow save the world from Microsoft's mistake.

Nobody

Jan 18, 2016, 10:58:13 PM
On Wed, 13 Jan 2016 14:59:12 +0100, Alf P. Steinbach wrote:

> Do note that Microsoft's documentation specifically states that the C
> level `setlocale` does not accept UTF-8, that it will fail at any
> attempt to set an UTF-8 locale. So UTF-8 is not a practical proposition
> for narrow (`char`-based) strings in Windows.

UTF-8 is a perfectly practical proposition for internal storage or stream
I/O. Just don't try to pass it to "ANSI"-mode OS functions (which you
probably shouldn't be using anyhow; Windows 95/98/ME is extinct now).

Wide strings have the advantage that you can just pass a pointer to the
data to most OS functions without needing any data conversion.

Byte strings have the advantage that you can read/write them to files,
sockets, pipes, etc (as well as many third-party APIs) without needing any
data conversion.
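
The conversion needed when wide strings meet a byte-oriented file or
socket can be sketched like this. UTF16To8 is a hypothetical helper
name, and using std::u16string for the wide side is an assumption made
to keep the example portable (on Windows the same code units would
normally live in a std::wstring). Replacing unpaired surrogates with
U+FFFD rather than failing is also just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Sketch of a UTF-16 -> UTF-8 conversion for writing wide text to a
// byte stream. Unpaired surrogates are replaced with U+FFFD.
std::string UTF16To8(const std::u16string& utf16)
{
    std::string out;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        char32_t cp = utf16[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < utf16.size() &&
            utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Whichever string type a program uses internally, one such conversion
ends up at one boundary or the other: at the OS calls for byte strings,
or at file and network I/O for wide strings.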
