MultiByteToWideChar and invalid characters

Chris Shearer Cooper

unread,

Jul 23, 2007, 1:41:20 PM7/23/07

to

Can anyone tell me what MultiByteToWideChar() does if you pass it an invalid
input character but do not set the MB_ERR_INVALID_CHARS flag?

By "invalid character" I have a specific test case, which is trying to
convert character 0xFA in code page 1255 (Hebrew) into Unicode. There is no
character at that position in that code page.

I'm trying to understand the documentation for MultiByteToWideChar(), it
says an invalid character is "a character that is not the default character
in the source string but translates to the default character when
MB_ERR_INVALID_CHARS is not set". How do I know what "the default
character" is in the source string, and I assume that "the default
character" could be a different value in the destination string?

What I see on my machine, is that MultiByteToWideChar() succeeds (as it
should, I didn't set the MB_ERR_INVALID_CHARS flag) and in my output string
it has placed the Unicode character 0xF894 which is in the "Private Use"
area of Unicode. Can I rely on MultiByteToWideChar() to convert invalid
8-bit characters to Unicode characters in the range E000-F8FF? Or is there
a Windows function to get the value of "the default character"?

Thanks,
CHris

Bob Eaton

unread,

Jul 23, 2007, 6:47:21 PM7/23/07

to

On my XP/SP2 system, the code page 1255 (Ansi--Hebrew) shows the following
information:

Max bytes/character: 1
Default legacy character: (?) (i.e. question mark = d63)
Default Unicode character: (?) (i.e. the same = u003f)

and it does have a Unicode equivalent for d250 (=0xFA) which is ת (u05ea).

But I think the answer to your question is if you were to pass it an invalid
input character (which I don't think 0xFA is), then it should return a
question mark character at that position.

The Windows function to get the default character information is:
GetCPInfoEx

Bob

"Chris Shearer Cooper" <chri...@sc3.net> wrote in message
news:13a9q28...@corp.supernews.com...

Michael S. Kaplan [MSFT]

unread,

Jul 24, 2007, 11:44:04 AM7/24/07

to

Actually, if you pass MB_ERR_INVALID_CHARS, the conversion will fail.

If you don't then the default character is inserted....

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Bob Eaton" <pete_de...@hotmail.com> wrote in message
news:ulVi4tXz...@TK2MSFTNGP06.phx.gbl...

> On my XP/SP2 system, the code page 1255 (Ansi--Hebrew) shows the following
> information:
>
> Max bytes/character: 1
> Default legacy character: (?) (i.e. question mark = d63)
> Default Unicode character: (?) (i.e. the same = u003f)
>

> and it does have a Unicode equivalent for d250 (=0xFA) which is ? (u05ea).

Skywing [MVP]

unread,

Jul 24, 2007, 11:50:24 AM7/24/07

to

<OffTopic>
Funny. I was just thinking "you know, this would be a great Michael Kaplan
blog question", and here you are following the newsgroups.
</OffTopic>

--
Ken Johnson (Skywing)
Windows SDK MVP
http://www.nynaeve.net
"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message
news:%23GLohlg...@TK2MSFTNGP02.phx.gbl...

Michael S. Kaplan [MSFT]

unread,

Jul 24, 2007, 1:12:51 PM7/24/07

to

Like I could resist THAT. :-)

Ok, what is behind the issue can be found here, in a blog post:

http://blogs.msdn.com/michkap/archive/2007/07/24/4031609.aspx

A classic off-by-one error on Chris's part, along with an under-documented
"feature" of code pages I will probably also blog about soon....

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Skywing [MVP]" <skywing_...@valhallalegends.com> wrote in message
news:Ob5tZpgz...@TK2MSFTNGP05.phx.gbl...

Chris Shearer Cooper

unread,

Jul 24, 2007, 4:41:05 PM7/24/07

to

I'm (mostly) honored - I've run into your blog several times recently, as I
look for answers to these sorts of questions, and now I'm mentioned by name
in there! I'm not sure I agree that 4 newsgroups is "a whole mess", though
I'm glad that at least one of them caught your attention.

You are of course right, I meant to type 0xFB instead of 0xFA. I had been
doing my testing using both characters (one valid, the other not) and typed
the wrong on into the post.

Is the 0xFB/0xFA confusion what you are calling the "classic off-by-one
error"?

So I'm thinking the documentation is just plain wrong ... it seems to imply
that invalid characters (like 0xFB) should map to the "default character"
when MB_INVALID_CHARS is not set, but in this case that's definitely not
what's happening.

Thanks,
Chris

"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message

news:O4LQ0Xhz...@TK2MSFTNGP05.phx.gbl...

Michael S. Kaplan [MSFT]

unread,

Jul 24, 2007, 10:09:25 PM7/24/07

to

I meant "whole mess of" just in the slang sense, though since it wasn't as
kernel mode issue or even a Hebrew language issue as opposed to a generic
code page one, one or more of those groups may be extraneous. :-)

Yes, the off-by-one error was the code point being wrong.

I do agree these EUDC mappings kind of suck, FWIW. I'll be blogging about
them soon. The doc situation is also an interesting one, I'll probably cover
those issues at the same time....

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Chris Shearer Cooper" <chri...@sc3.net> wrote in message

news:13acov3...@corp.supernews.com...

Michael S. Kaplan [MSFT]

unread,

Jul 25, 2007, 10:55:41 AM7/25/07

to

Ok, it's done. :-)

Explanation here:

http://blogs.msdn.com/michkap/archive/2007/07/25/4037646.aspx

Both the unexpected behavior and the undocumented mechanics explained....

I don't envy the doc writers on this one -- this is not going to be easy to
fix!

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message

news:%237Hw%23CmzH...@TK2MSFTNGP02.phx.gbl...

Chris Shearer Cooper

unread,

Jul 25, 2007, 11:30:09 AM7/25/07

to

Excellent - thanks!

At the very least, if they can document (and ensure) that if
MB_ERR_INVALID_CHARS is not set, invalid 8-bit characters get mapped into
the Unicode PUA, that would be great. I would hate to write code that
assumed that, and then find out that in Vista (or whatever), invalid 8-bit
characters actually do get mapped to the current "Unicode default character"
...

"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message

news:ub9XLvsz...@TK2MSFTNGP03.phx.gbl...

Michael S. Kaplan [MSFT]

unread,

Jul 25, 2007, 11:43:49 AM7/25/07

to

Well, the mappings in code pages cannot change -- and all of the mappings
are listed in that blog post.

The behavior with and without flags can't change, or that would alter the
behavior of the code page. Which can't happen.

Note that characters in the "C1 control" range are not mapped to the PUA and
do not fail with an error when the flag is passed. So the behavior is
designed to be mildly incomplete, unfortunately. :-(

The Unicode default character is clearly not for this purpose, it is
something for the code to shove in if it has problematic data (mainly CJK
invalid sequences, I believe). It is not as useful as the docs would tend to
imply, because of this....

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Chris Shearer Cooper" <chri...@sc3.net> wrote in message

news:13aer42...@corp.supernews.com...

Chris Shearer Cooper

unread,

Jul 25, 2007, 11:52:42 AM7/25/07

to

Why can't the mappings in code pages change? When I've seen the
documentation for the various code pages, characters like 0xFB in the Hebrew
code page (1255) are simply left blank, so there is no documentation there
that would limit what code page 1255 might do in the future.

Why can't Windows 2008 (or whatever) come with a slightly different version
of code page 1255 - one where position 0xFB maps to something totally
different? I guess I just don't understand the underlying code page
technology ...

Thanks,
Chris

"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message

news:updhEKtz...@TK2MSFTNGP04.phx.gbl...

Michael S. Kaplan [MSFT]

unread,

Jul 25, 2007, 12:13:55 PM7/25/07

to

Stability guarantees are not only about what is documented; they are also
about behavior. Code pages cannot change their behavior, EVER.

This particular behavior has stood for over a decade and a half without
anyone ever complaining (since fundamentally it makes no sense to use code
pages for data they are not meant to contain).

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Chris Shearer Cooper" <chri...@sc3.net> wrote in message

news:13aesea...@corp.supernews.com...

Bob Eaton

unread,

Jul 25, 2007, 1:47:34 PM7/25/07

to

Not that I would like to question Michael... but I don't think this is true.

Isn't it true that code page 1252 was modified to add the Euro symbol at
some point post-Win95?

I know that I had to do a number of extra steps when converting several
documents given to me by someone using Win95 (or maybe 98) because of this
very problem. The definition of cp1252 was different on his machine vs.
mine.

(or is that why you said "over a decade" :-)

Bob

"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message

news:e8ZQ5atz...@TK2MSFTNGP05.phx.gbl...

Michael S. Kaplan [MSFT]

unread,

Jul 25, 2007, 10:41:12 PM7/25/07

to

Actually, it was the Euro insert that convinced people how bad this ends up
when such changes happen:

http://blogs.msdn.com/michkap/archive/2005/02/06/368081.aspx

:-)

--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Bob Eaton" <pete_de...@hotmail.com> wrote in message

news:OUvymPuz...@TK2MSFTNGP05.phx.gbl...