"Marco Hung" <marco.h...@gmail.com> wrote in message news:ucAsGC37...@TK2MSFTNGP06.phx.gbl...
If you start now with a new project, there is no reason not to go Unicode.
The only reason for MBCS is to support Win 9x (with a new project?),
to learn about things that will be obsolete in 2-3 years,
or for being a masochist :-)
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
> I've created an MFC project in MBCS. I need to show some special
> characters
> (ASCII code > 128) in a CStatic control. They show correctly on
> English-locale Windows.
> However, all those special characters become "?" on non-English Windows.
> How to solve this problem?
Hi,
This is a classical example of the importance of using *Unicode* to store
characters and strings.
IMHO, you should forget about ANSI (or MBCS) and use *Unicode* as the
type for characters and strings (as modern programming languages such as
Java, Python, and C# do).
Basically, Unicode provides a *unique* number for every character, no matter
what the programming language, the operating system, etc.
I don't know what character you want to display, but suppose, for example,
that you want to display a lower-case Greek "omega" (a kind of "w").
In the Unicode UTF-16 encoding, the "unique number" associated with this
character is 0x03C9 (hex; note that it's 16 bits, not 8 bits as in ANSI).
The C++ code to display that character in a message box looks like this:
// Build a string of Unicode UTF-16 characters:
// "omega" (0x03C9), end-of-string (0x0000)
wchar_t omega[] = { 0x03C9, 0x0000 };
// Display Unicode text (note the W and the L)
MessageBoxW( NULL, omega, L"Unicode Test", MB_OK );
The L before the "Unicode Test" string literal identifies the string as
Unicode rather than ANSI.
The W after MessageBox is a Win32 API naming convention to identify the
Unicode (and not the ANSI) version of MessageBox API.
If you compile in Unicode mode, you can avoid the W and just write
MessageBox; the C/C++ preprocessor will expand MessageBox as MessageBoxW.
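That preprocessor expansion can be illustrated with a small self-contained sketch. Note that ShowTextA/ShowTextW here are hypothetical stand-ins used only to mimic the convention, not the real Win32 API:

```cpp
#include <string>

// Two explicit variants, mirroring how the Win32 headers declare
// MessageBoxA (ANSI) and MessageBoxW (Unicode):
inline std::string  ShowTextA(const char* s)    { return std::string("A:") + s; }
inline std::wstring ShowTextW(const wchar_t* s) { return std::wstring(L"W:") + s; }

// The generic name is a preprocessor macro that expands to one of the
// two variants, exactly as MessageBox expands to MessageBoxA or MessageBoxW:
#ifdef UNICODE
#define ShowText ShowTextW
#else
#define ShowText ShowTextA
#endif
```

With UNICODE defined (a Unicode build) the generic name resolves to the wide variant; without it, to the ANSI one.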
You might find the Unicode FAQ http://unicode.org/faq/ and Mihai's blog
http://www.mihai-nita.net/ interesting.
Giovanni
TCHAR stringToShow[] = { 129, 130, 131, 132, 133, 134, 135, 136, 137 };
or
TCHAR stringToShow[] = _T("\x81\x82\x83\x84\x85\x86\x87\x88\x89");
You should NOT be using GetDlgItem; it should be considered obsolete except in
very rare and exotic situations, of which this is not one. Create member
variables.
Part of the problem is that you are using MBCS, which means that character codes >=128 are
not actual characters, but part of a multibyte encoding, and therefore they are going to
be misinterpreted in all kinds of fascinating ways.
As already pointed out, forget that MBCS exists. It is dead technology. Use Unicode.
There is no real choice these days.
joe
Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
For the most part I agree with what you say here, the only exception
being... if you are using a lot of strings and doing a lot of string
handling and don't need anything except English, then using MBCS may be a bit
faster to execute, lighter on memory, and quicker for reading and writing
files, since Unicode doubles every character's size whether needed or not. I
wish Windows/MFC/all those good things had better support for other encodings
like UTF-8, which would give results similar to MBCS.
That said, the differences in most cases are not all that significant, and
I've gone to using Unicode all the time.
Tom
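Tom's storage point can be made concrete with a quick portable sketch (no Windows calls, just counting bytes; the helper names are illustrative):

```cpp
#include <cstddef>
#include <string>

// For text limited to ASCII (plain English), UTF-8 stores one byte per
// character while UTF-16 stores two bytes per character; that is, the
// "doubling" happens whether the extra byte is needed or not.
inline std::size_t Utf8Bytes(const std::string& s)     { return s.size() * sizeof(char); }
inline std::size_t Utf16Bytes(const std::u16string& s) { return s.size() * sizeof(char16_t); }
```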
"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns99A2712...@207.46.248.16...
But much of the inherent speed advantage of MBCS is negated by the native
API in Win2K/XP/Vista being Unicode, so having a Unicode app allows us to
call these API's directly and not go through thunks. But I've not done
speed tests.
-- David
Tom
"David Ching" <d...@remove-this.dcsoft.com> wrote in message
news:8dGDi.1727$Sd4...@nlpi061.nbdc.sbc.com...
Hi Tom,
I completely agree with this analysis by David, at least on the (real)
operating systems like Win2K/XP/Vista, that are Unicode-native.
(Win9x "toys" are a different thing; maybe there ANSI is faster than
Unicode, because they are ANSI/MBCS-native, but the Win9x family is not
interesting to me.)
G
"Tom Serface" <tom.n...@camaswood.com> wrote in message news:uUzAeJ37...@TK2MSFTNGP04.phx.gbl...
I understand that Unicode is the best way to handle strings in a modern
application. However, my application needs to communicate with an "old" system
through some API calls, which will always return strings in "single character"
format. I think MBCS may be the only choice for it.
I've tried to convert the string to Unicode using functions like
"MultiByteToWideChar" and "SetWindowTextW", but got the same output on display.
Is there any way to make the conversion work correctly on all language versions
of Windows?
Marco
"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:amatd3d0d2eu0chtp...@4ax.com...
Marco:
Please don't send HTML mail to the newsgroups. Text only.
--
David Wilkinson
Visual C++ MVP
Marco:
If you know the code page of the 8-bit strings, then
MultiByteToWideChar() should work. If you don't, you are in trouble.
You can't just say "MultiByteToWideChar" since there are critical parameters that you have
omitted telling us about, such as what code page you specified, and whether or not you
have true MBCS (e.g., UTF-7, UTF-8) or just 8-bit characters. Certainly the example you
gave of 128, 129, 130, ...137 is not UTF-8, and in fact these code points are not defined
in most character sets (although 128 is the official Euro symbol in a lot of fonts), so
you have supplied rather incomplete information on what you are doing, trying to do, and
how you are doing it. MBCS is *not* a substitute for ANSI, since there are no APIs that
actually use it. So you need to say a lot more about what is going on here before the
question even begins to make sense.
joe
In an experiment I ran, Unicode is on the average slightly faster than ANSI, for something
as simple as a repeated SetWindowText, although the variance of the samples is high.
joe
My application will call an external DLL, which will return a string as the
result (it should be a list of ASCII codes from 0~255). My application will
then display the result in an edit box.
The result consists only of characters from A~Z plus 2 special characters
( ‡ (0x87) & ¤ (0xA4) ). The edit box displays correctly if I run my
application on English Windows. However, on non-English systems, these 2
characters display as "?".
Here's the exact code in my application.
OnStart(CString strCommand)
{
    CMyLiberaryObject MyLibObj;
    char *strResult = MyLibObj.ProcessCommand( (LPCTSTR) strCommand ); // return type is char*

    BSTR bstr = NULL;
    int nConvertedLen = MultiByteToWideChar(1252, MB_COMPOSITE, strResult, -1, NULL, NULL);
    bstr = ::SysAllocStringLen(NULL, nConvertedLen);
    if (bstr != NULL)
        MultiByteToWideChar(1252, MB_COMPOSITE, (LPCTSTR)strResult, -1, bstr, nConvertedLen);
    SetWindowTextW(GetDlgItem(IDC_ED_CMDRESULT)->GetSafeHwnd(), bstr);
    SysFreeString(bstr);
    MyLibObj.Complete();
}
Rgds,
Marco
"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:novud35mmqurjajnn...@4ax.com...
Then maybe the best thing is to make the whole application Unicode, and
convert back and forth when you communicate with the legacy part.
Marco:
If your characters are all ISO-8859-1 characters (as would seem to be
the case) then, as I said before, you should just be able to copy (not
convert) them into an array of wchar_t, and use SetWindowTextW. This is
because the first 256 code points of Unicode (and the UTF-16 encoding of
it) are the same as ISO-8859-1. Or you could use MultiByteToWideChar()
with the code page always set to English. You do not want to use
MultiByteToWideChar() with the local code page.
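David's copy-not-convert suggestion can be sketched portably like this (only the widening step is shown; the SetWindowTextW call is left out):

```cpp
#include <string>

// Widen an ISO-8859-1 (Latin-1) byte string to a wide string.
// This works because Unicode code points U+0000..U+00FF are identical
// to Latin-1: each byte simply becomes one wide character.
inline std::wstring Latin1ToWide(const char* s)
{
    std::wstring out;
    for (; *s != '\0'; ++s) {
        // Cast through unsigned char first to avoid sign extension for
        // bytes >= 0x80 (e.g. 0xA4 must become U+00A4, not a garbage value).
        out += static_cast<wchar_t>(static_cast<unsigned char>(*s));
    }
    return out;
}
```

The resulting wide string could then be passed straight to SetWindowTextW.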
Actually, I am confused by your code. The only purpose of using TCHAR,
LPCTSTR, etc., is to have an app that will compile as both ANSI and
Unicode. This surely cannot be the case for you, as it would mean that
your legacy CMyLiberaryObject::ProcessCommand() would have to accept a
const wchar_t* and return a char*.
I think you would do best to write your whole app in Unicode and convert
to and from 8-bit strings only when using your legacy library.
Yes, Joe! The key point is the conversion ANSI -> Unicode made internally
by Windows, as you pointed out.
Giovanni
>Sorry for my misleading question. Let me explain my problem in more detail.
>
>My application will call an external DLL, which will return a string as the
>result (it should be a list of ASCII codes from 0~255). My application will
>then display the result in an edit box.
>
>The result consists only of characters from A~Z plus 2 special characters
>( ‡ (0x87) & ¤ (0xA4) ). The edit box displays correctly if I run my
>application on English Windows. However, on non-English systems, these 2
>characters display as "?".
>
>Here's my exact coding in my application.
>
>OnStart(CString strCommand)
>{
> CMyLiberaryObject MyLibObj;
> char *strResult = MyLibObj.ProcessCommand( (LPCTSTR) strCommand ); //
>return type is char*
****
There's a problem here. What is the parameter of the function ProcessCommand? Is it
really LPCTSTR (8-bit or Unicode depending on compilation mode)? Or is it 8-bit? The
LPCTSTR cast would be dangerous in a Unicode build if the function takes char *.
Given that it returns a char *, who frees it? Returning a pointer to a fixed
buffer is inherently dangerous, so it should really return a CStringA, or at
the very least a char * on the heap which the caller must free.
Code that returns a pointer to a fixed buffer is not thread-safe, and should be considered
*dangerously obsolete* at this point (think Unicode, think multithreading, ALWAYS)
***
>
> BSTR bstr = NULL;
****
Why are you allocating a BSTR here? Why not an LPWSTR? BSTRs carry additional
overhead (a length prefix, and allocation through the SysAllocString family),
and since you are not using any of that, an LPWSTR would be fine.
****
> int nConvertedLen = MultiByteToWideChar(1252, MB_COMPOSITE, strResult
>, -1, NULL, NULL);
****
This tells it to convert the string using code page 1252 (Windows-1252, a
superset of ISO-8859-1/Latin-1). Given that you have said you only use A-Z and
two special characters, MB_COMPOSITE has no meaning here and should be omitted.
****
> bstr = ::SysAllocStringLen(NULL, nConvertedLen);
****
LPWSTR bstr = new WCHAR[nConvertedLen];
there was no need to declare and initialize a pointer before it is used, and
there is certainly no need for a BSTR, so get rid of it
****
> if (bstr != NULL)
> MultiByteToWideChar(1252, MB_COMPOSITE, (LPCTSTR)strResult , -1,
>bstr, nConvertedLen);
****
Get rid of the MB_COMPOSITE
****
> SetWindowTextW(GetDlgItem(IDC_ED_CMDRESULT)->GetSafeHwnd(), bstr);
****
Create a control variable. Generally, assume that if you have written GetDlgItem, except
in EXTREMELY RARE CIRCUMSTANCES (of which this is not one) you have made a fundamental
design error. Because you are trying to write a Unicode string in an ANSI app, you would
need to write
::SetWindowTextW(c_Result.m_hWnd, bstr);
although it would make much more sense to compile this as a Unicode app (beware the
parameter issue already mentioned!) and just write
c_Result.SetWindowText(bstr);
****
> SysFreeString(bstr);
*****
delete [] bstr;
why use something as complicated as a BSTR for such a trivial purpose?
Now you've got some other issues here. For example, what font is loaded into the edit
control? Is the result of the MultiByteToWideChar correct, or does it already have the
erroneous '?' in it? There are too many variables here and you have not isolated the
problem adequately.
****
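Pulling these corrections together (no BSTR, no MB_COMPOSITE, a control member variable), the conversion step might look roughly as below. The 1252-to-Unicode mapping is hand-rolled here so the sketch stays portable and covers only the characters the poster mentioned; in the real app you would call MultiByteToWideChar(1252, 0, ...) instead. All names here are illustrative:

```cpp
#include <string>

// Convert a Windows-1252 result string to a wide string.
// Sketched only for the range the poster uses: ASCII plus 0x87 and 0xA4.
inline std::wstring Cp1252ToWide(const char* s)
{
    std::wstring out;
    for (; *s != '\0'; ++s) {
        unsigned char c = static_cast<unsigned char>(*s);
        if (c == 0x87)
            out += static_cast<wchar_t>(0x2021); // 0x87 is the double dagger in cp1252
        else
            out += static_cast<wchar_t>(c);      // 0x00-0x7F and 0xA0-0xFF match Unicode
    }
    return out;
}

// Outline of the corrected handler (Win32/MFC parts shown as comments):
//   std::wstring wide = Cp1252ToWide(MyLibObj.ProcessCommand(strCommand));
//   c_Result.SetWindowText(wide.c_str()); // c_Result: a DDX control member, not GetDlgItem
```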
Marco:
I see you are already converting using code page 1252 (I didn't notice
that before). This should work if you do it correctly, but I'm not sure
you are (see Joe's reply).
Tom
"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:q90vd3picmebnvab9...@4ax.com...
You can get this to work so long as you know the code page you need for the
language or you are running only on the machine where that language is
installed and the correct region is set. We tried this for years and could
never get it to work right since our software was installed in so many
configurations so we finally went to Unicode and we just convert the
external strings and files to Unicode to use them rather than trying to go
the other way. So far this approach has worked well. So to answer your
question, yes you can theoretically get it to work, but the number of
parameters involved is often difficult to control.
Tom
"Marco Hung" <marco.h...@gmail.com> wrote in message
news:OQmEQVC8...@TK2MSFTNGP03.phx.gbl...
Tom
"Giovanni Dicanio" <giovanni...@invalid.it> wrote in message
news:eJ81NXB8...@TK2MSFTNGP05.phx.gbl...
Huh?
> (a bad name choice).
Yes "ANSI" is a bad name choice, but the meaning is the same as MBCS.
> MBCS uses sequences of 8-bit characters to represent characters,
Yes.
> and as far as I know, there are no API calls that take MBCS strings.
The ones that end in "A" take MBCS strings. Most of them work by converting
to Unicode before calling NT internal routines and converting back to MBCS
before returning to the caller. Some such as WTSQuerySessionInformationA
don't work. (ANSI applications have to call WTSQuerySessionInformationW
explicitly, including the W, and do the conversions themselves.)
> They take either ANSI or Unicode.
Yes. The ones that end in A take "ANSI" i.e. MBCS, and the ones that end in
W take Unicode i.e. UTF-16.
"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:novud35mmqurjajnn...@4ax.com...
ASCII codes are 0~127.
If you're having code page problems it's because you're dealing with ANSI
code pages other than ASCII. Some code pages (mostly European) are 0~255.
Some (Asian) are basically 0~65535, but of course some portions of that
range can't be used, so they use 0~127 and part of 32768~65535.
If a value isn't a valid character in your code page (for example number 529
in code page 1252 or number 129 in code page 932) then of course you get
garbage.
"Marco Hung" <marco.h...@gmail.com> wrote in message
news:%23u9Vf5E...@TK2MSFTNGP02.phx.gbl...
Suppose I have 10MB of 'characters'. In ANSI, these would take 10MB of RAM; in
Unicode they would take 20MB of RAM. On a typical end-user machine with 1GB of
memory, that means 8-bit strings would occupy about 1% of physical RAM, or 0.5%
of my 2GB virtual address space, and Unicode a whopping 2% of physical RAM and
1% of my virtual address space. I somehow cannot get excited about this
problem, given all the additional problems of complex code, possibility of
error, cost of development and debugging, etc., that using 8-bit characters
would bring.
joe
On Fri, 7 Sep 2007 00:04:10 +0200, "Giovanni Dicanio" <giovanni...@invalid.it> wrote:
>
>"Tom Serface" <tom.n...@camaswood.com> wrote in message
>news:Oxv$i6J8HH...@TK2MSFTNGP03.phx.gbl...
>
>> but it's difficult to quantify the difference and I suspect it is
>> negligible so Unicode seems a better way to go in my opinion.
>
>Hi Tom,
>
>I agree with you.
>
>And maybe if memory space saving is the main target, UTF-8 could be used as
>the encoding for Unicode, instead of UTF-16.
>But maybe for historical reasons, it seems that internal Windows format for
>Unicode is UTF-16 :(
>On the other hand, IIRC Mac OS X and Linux tend to use UTF-8, but I may be
>mistaken...
>
>
>> If you really need to minimize memory space (like you're trying to run an
>> MFC application on your watch or something) then perhaps, but ...
>
>IIRC, Windows CE (which should be suited to embedded platforms and platforms
>with memory limits, not like the "huge" 1-2 GB of RAMs in current desktop
>PCs) uses Unicode (UTF-16) and not ANSI :)
>
>Giovanni
Tom
"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:d6n1e3938ejv9bb7j...@4ax.com...
Almost.
ANSI can be SBCS or MBCS. But it is one of them.
The system has one ANSI code page and only one at a certain time
(the system code page), and changing it requires a reboot.
932 (Shift-JIS), 950 (Big5), etc, are all MBCS.
Any one of them can be the ANSI code page in a certain session.
But not all of them.
Then you have other code pages, like EUC-JP or GBK, that are DBCS,
but cannot be ANSI (they can never be used as system locale).
But this is just lingo.
For a programmer using Dev Studio the lingo means something else.
If you go in Dev Studio you only have 3 options for Character set
1. Not set (nothing defined)
2. Multi-Byte Character Set (_MBCS defined)
3. Unicode Character Set (_UNICODE and UNICODE defined)
In most cases there is no difference between 1. and 2.
If you use MessageBox, in cases 1. and 2. it will become MessageBoxA,
and in case 3. it will become MessageBoxW.
But look at things like _tcsclen.
In case 1. it will become strlen, in case 2. it becomes _mbslen,
and in case 3. it becomes wcslen.
This is why you sometimes have to be very careful what you use
when you convert to generic text handling. Will you replace strlen
with _tcslen, or with _tcsclen?
(in most cases the answer is _tcslen, but there are exceptions)
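The byte-count versus character-count difference behind strlen/_mbslen/_tcsclen can be illustrated portably with a multibyte string. _mbslen itself is MSVC-specific and works on the active code page; here UTF-8 stands in as the multibyte encoding, and the helper is illustrative:

```cpp
#include <cstddef>
#include <cstring>

// Count characters (not bytes) in a UTF-8 string: a byte of the form
// 10xxxxxx is a continuation byte and does not start a new character.
inline std::size_t Utf8CharCount(const char* s)
{
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n; // lead byte (or plain ASCII): starts a character
    }
    return n;
}
```

For the two-byte sequence 0xCF 0x89 (omega in UTF-8), strlen reports 2 bytes while the character count is 1; that mismatch is exactly why replacing strlen with the wrong generic-text macro bites.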
As a general rule: UTF-16 for processing, UTF-8 for transfer/storage
(and like any general rule it has exceptions, but you have to know when
to do that)
Mac OS X string APIs use UTF-16. Same for Apache Xerces
(the XML parsing library), ICU (IBM's International Components for Unicode),
Qt, and Java.
Here is a good read: http://unicode.org/notes/tn12/tn12-1.html
> As a general rule: UTF-16 for processing, UTF-8 for transfer/storage
> (and like any general rule it has exceptions, but you have to know when
> to do that)
Yes.
> Mac OS X string APIs use UTF-16. Same for Apache Xerces
> (the XML parsing library), ICU (IBM's International Components for Unicode),
> Qt, and Java.
> Here is a good read: http://unicode.org/notes/tn12/tn12-1.html
Thank you for having corrected my wrong information about Mac OS X.
I'm going to read the web page you linked.
Giovanni
> I somehow cannot get excited about this problem, given all the
> additional problems of complex code, possibility of error, cost of
> development and
> debugging, etc. that it would cost to use 8-bit characters.
I believe that both you and me (and others, of course) use Unicode for
strings.
My point was about UTF-16 vs UTF-8 (both *Unicode*, not ANSI 8 bits).
Giovanni
Hi Tom,
Yes, ANSI is kind of computer archaeology these days :)
G.
Of course any program I've ever done has either MBCS or UNICODE defined so
perhaps that where I'm getting it.
Tom
"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns99A4420...@207.46.248.16...
Tom
"Giovanni Dicanio" <giovanni...@invalid.it> wrote in message
news:u3sgcLT8...@TK2MSFTNGP05.phx.gbl...
Hi Tom,
VC6 has no problem with Unicode...
http://www.mihai-nita.net/article.php?artID=20060723a
...Am I missing something?
G
Tom
"Giovanni Dicanio" <giovanni...@invalid.it> wrote in message
news:uPRnQIW8...@TK2MSFTNGP06.phx.gbl...
Is there a way to override the system regional code page setting to force a
VB 6 application to use "English (United States)"?
On Fri, 7 Sep 2007 09:46:06 -0700, PackAddict <PackA...@discussions.microsoft.com>
wrote:
This is still true for VS 2003.
VS 2005 was the first one to switch (and it is still buggy at that).
I would agree that SBCS is just a subset of DBCS,
and DBCS a subset of MBCS.
ANSI is the MBCS that is currently the system code page :-)
The MS lingo in this area is a mess, so one should be pretty
flexible with the definitions here :-)
For a programmer the only important part is: what are the implications
of defining _MBCS / UNICODE / _UNICODE / nothing?
So, in VC6 or VS2003 Unicode-built app we can't have e.g. a string-table
resource with Japanese characters in Unicode?
Is there any workaround?
Should we use external custom file encoded e.g. in UTF-8 and read it and
convert it dynamically to UTF-16?
Thanks in advance,
Giovanni
I think people have more trouble updating from VC6 to VS.NET than they do
updating to any other version since then. I think it would make sense for
Microsoft to make a really easy upgrade path from VC6/VS6 to VS 2008 to
encourage people to move up.
Tom
"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns99A4E42A...@207.46.248.16...
Tom
"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns99A4E54C...@207.46.248.16...
In 2003 you can have a Unicode RC file, but it is initially created in MBCS
and you just have to open the .RC file in Notepad then save it back as
Unicode. The IDE will use it after that. I think 2005 creates them as
Unicode in the first place.
In VC6 and 2003 (using ANSI) it relies on the code page and fonts to display
the correct characters, so you can have Japanese, but it wouldn't be Unicode.
I think there are some characters that MBCS can't handle, but I don't know
what they are offhand.
Tom
"Giovanni Dicanio" <giovanni...@invalid.it> wrote in message
news:eRG6Umf8...@TK2MSFTNGP03.phx.gbl...
IDS_MU L"Gray Cats say \x03BC!"
will produce the right result. The problem is that I'm no longer sure how to produce the
L" form of the string short of hand-editing, and if you just type in \x it converts it to
\\x. But it works, and the correct result is displayed provided the font you use has
the Greek letter 'mu' in it.
joe
The compiled resource files are always Unicode.
The source resource files (.rc) can be Unicode, but you cannot edit them with
the resource editor in VS 6/2002/2003
OK, in the VS 2005 editor, if you know about some of the bugs:
- The RichEdit controls in dialogs are always ANSI
(http://www.mihai-nita.net/article.php?artID=20050709b)
- The .rc is not Unicode unless you ask for it
(http://www.mihai-nita.net/article.php?artID=20051030a)
- The DLGINIT used for combo boxes in MFC is always ANSI
(I have reported it for Orcas; marked as fixed)
And only UTF-16LE is supported (no UTF-8!)
> Is there any workaround?
> Should we use external custom file encoded e.g. in UTF-8 and read it and
> convert it dynamically to UTF-16?
Set the system locale to Japanese and reboot.
(http://www.mihai-nita.net/article.php?artID=20050611a)
It is the best option, because you need WYSIWYG for proper resizing.
This might also come in handy:
http://www.mihai-nita.net/article.php?artID=20070503a
Giovanni
----
"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns99A5C87C...@207.46.248.16...
It's the other way around. In VC6 or VS2003 Unicode-built apps, we can't
have e.g. a string-table resource with any NON-JAPANESE characters in
Unicode.
> Is there any workaround?
Use Notepad to edit the RC file. (Facts are funnier than jokes, eh?)
You'd be surprised how many times I do this sort of thing. The problem is
with pre-2005 versions if you edit the wrong resource by mistake the RC
editor would trash all of your other resources (yielding ???) unless you
were in the correct region (locale) while editing. Fortunately, this
doesn't seem to be a problem with Unicode RC files. Still, I use Notepad to
make some changes since the search and replace works so much nicer :o)
Tom
"Norman Diamond" <ndia...@community.nospam> wrote in message
news:OfFqHQ08...@TK2MSFTNGP03.phx.gbl...
A better description might be that there are Asc and Chr function calls
throughout the code, not duplication of algorithms throughout the code.
Those Asc and Chr function calls cause problems when a Chinese code page is
set as the default language. Each time we hit a hex value with no
corresponding ASCII value in the code page, we get "?" returned.
Needless to say, that causes some significant discrepancies when
encrypting/decrypting a string of data.
I figured that I was going to have to move to byte arrays, but thought I'd
take a stab in the dark at a solution that would allow me to just override
the code page.
> You'd be surprised how many times I do this sort of thing.
Well, the graphics/image-editing capabilities of Visual Studio are not great
either, so for editing images too it is good to go to external "ad hoc"
programs...
G
On Mon, 10 Sep 2007 08:52:14 -0700, PackAddict <PackA...@discussions.microsoft.com>