By "invalid character" I have a specific test case, which is trying to
convert character 0xFA in code page 1255 (Hebrew) into Unicode. There is no
character at that position in that code page.
I'm trying to understand the documentation for MultiByteToWideChar(). It
says an invalid character is "a character that is not the default character
in the source string but translates to the default character when
MB_ERR_INVALID_CHARS is not set". How do I know what "the default
character" is in the source string? And I assume that "the default
character" could be a different value in the destination string?
What I see on my machine is that MultiByteToWideChar() succeeds (as it
should, since I didn't set the MB_ERR_INVALID_CHARS flag), and in my output
string it has placed the Unicode character 0xF894, which is in the "Private
Use" area of Unicode. Can I rely on MultiByteToWideChar() to convert invalid
8-bit characters to Unicode characters in the range E000-F8FF? Or is there
a Windows function to get the value of "the default character"?
Thanks,
Chris
On my XP/SP2 system, the code page 1255 (Ansi--Hebrew) shows the following
information:

Max bytes/character: 1
Default legacy character: (?) (i.e. question mark = d63)
Default Unicode character: (?) (i.e. the same = u003f)
and it does have a Unicode equivalent for d250 (=0xFA) which is ת (u05ea).
But I think the answer to your question is that if you were to pass it an
invalid input character (which I don't think 0xFA is), then it should return
a question mark character at that position.
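As a quick cross-check, Python's bundled cp1255 codec (built from the same vendor mapping table, though of course not MultiByteToWideChar itself) agrees on those facts:

```python
# Python's cp1255 codec as a cross-check of the code page tables.
# Note: this is not MultiByteToWideChar; Python's strict handler raises
# on undefined bytes instead of substituting anything.

tav = b"\xfa".decode("cp1255")
assert tav == "\u05ea"   # d250 / 0xFA is HEBREW LETTER TAV
print(tav)               # -> ת
```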
The Windows function to get the default character information is:
GetCPInfoEx
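A minimal ctypes sketch of that call, assuming the CPINFOEXW layout from winnls.h (the Windows branch only runs on Windows; elsewhere the helper just returns None):

```python
import ctypes
import sys

# Constants from winnls.h.
MAX_DEFAULTCHAR = 2
MAX_LEADBYTES = 12
MAX_PATH = 260

class CPINFOEXW(ctypes.Structure):
    _fields_ = [
        ("MaxCharSize", ctypes.c_uint),
        ("DefaultChar", ctypes.c_ubyte * MAX_DEFAULTCHAR),
        ("LeadByte", ctypes.c_ubyte * MAX_LEADBYTES),
        ("UnicodeDefaultChar", ctypes.c_uint16),   # WCHAR
        ("CodePage", ctypes.c_uint),
        ("CodePageName", ctypes.c_uint16 * MAX_PATH),
    ]

def default_chars(codepage):
    """Return (legacy default byte, Unicode default char), or None off-Windows."""
    if sys.platform != "win32":
        return None
    info = CPINFOEXW()
    if not ctypes.windll.kernel32.GetCPInfoExW(codepage, 0, ctypes.byref(info)):
        raise ctypes.WinError()
    return info.DefaultChar[0], chr(info.UnicodeDefaultChar)

# For cp1255 this should report (63, '?') per the numbers above (63 == ord('?')).
print(default_chars(1255))
```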
Bob
"Chris Shearer Cooper" <chri...@sc3.net> wrote in message
news:13a9q28...@corp.supernews.com...
If you don't, then the default character is inserted....
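That substitution is easy to see with Python's cp1255 codec going the other direction (Unicode back to the code page, which is where the legacy default character gets used); the 'replace' error handler stands in for the default-character behavior here:

```python
# Round trip for a character the code page does have.
assert "\u05ea".encode("cp1255") == b"\xfa"   # ת -> 0xFA

# A character with no cp1255 slot becomes the default character '?'.
assert "\u0100".encode("cp1255", errors="replace") == b"?"   # Ā -> '?'
```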
--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap
This posting is provided "AS IS" with
no warranties, and confers no rights.
"Bob Eaton" <pete_de...@hotmail.com> wrote in message
news:ulVi4tXz...@TK2MSFTNGP06.phx.gbl...
> On my XP/SP2 system, the code page 1255 (Ansi--Hebrew) shows the following
> information:
>
> Max bytes/character: 1
> Default legacy character: (?) (i.e. question mark = d63)
> Default Unicode character: (?) (i.e. the same = u003f)
>
> and it does have a Unicode equivalent for d250 (=0xFA) which is ת (u05ea).
--
Ken Johnson (Skywing)
Windows SDK MVP
http://www.nynaeve.net
"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message
news:%23GLohlg...@TK2MSFTNGP02.phx.gbl...
Ok, what is behind the issue can be found here, in a blog post:
http://blogs.msdn.com/michkap/archive/2007/07/24/4031609.aspx
A classic off-by-one error on Chris's part, along with an under-documented
"feature" of code pages I will probably also blog about soon....
--
MichKa [Microsoft]
"Skywing [MVP]" <skywing_...@valhallalegends.com> wrote in message
news:Ob5tZpgz...@TK2MSFTNGP05.phx.gbl...
You are of course right; I meant to type 0xFB instead of 0xFA. I had been
doing my testing using both characters (one valid, the other not) and typed
the wrong one into the post.
Is the 0xFB/0xFA confusion what you are calling the "classic off-by-one
error"?
So I'm thinking the documentation is just plain wrong ... it seems to imply
that invalid characters (like 0xFB) should map to the "default character"
when MB_ERR_INVALID_CHARS is not set, but in this case that's definitely not
what's happening.
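For comparison, Python's cp1255 codec simply treats 0xFB as undefined; it neither substitutes the default character nor maps into the PUA, which makes the Windows behavior look even more like a special case:

```python
# 0xFB is an unassigned position in cp1255: strict decoding fails outright.
try:
    b"\xfb".decode("cp1255")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# A lenient handler substitutes U+FFFD -- not '?', and not a PUA code point
# like the 0xF8xx value MultiByteToWideChar produced.
assert b"\xfb".decode("cp1255", errors="replace") == "\ufffd"
```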
Thanks,
Chris
"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message
news:O4LQ0Xhz...@TK2MSFTNGP05.phx.gbl...
Yes, the off-by-one error was the code point being wrong.
I do agree these EUDC mappings kind of suck, FWIW. I'll be blogging about
them soon. The doc situation is also an interesting one, I'll probably cover
those issues at the same time....
--
MichKa [Microsoft]
"Chris Shearer Cooper" <chri...@sc3.net> wrote in message
news:13acov3...@corp.supernews.com...
Explanation here:
http://blogs.msdn.com/michkap/archive/2007/07/25/4037646.aspx
Both the unexpected behavior and the undocumented mechanics explained....
I don't envy the doc writers on this one -- this is not going to be easy to
fix!
--
MichKa [Microsoft]
"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message
news:%237Hw%23CmzH...@TK2MSFTNGP02.phx.gbl...
At the very least, if they can document (and ensure) that if
MB_ERR_INVALID_CHARS is not set, invalid 8-bit characters get mapped into
the Unicode PUA, that would be great. I would hate to write code that
assumed that, and then find out that in Vista (or whatever), invalid 8-bit
characters actually do get mapped to the current "Unicode default character"
...
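If I do end up relying on it, I'd at least make the assumption explicit with a range check (is_pua is just my own helper name, not a Windows API):

```python
def is_pua(ch):
    """True if ch is in the BMP Private Use Area, U+E000..U+F8FF."""
    return 0xE000 <= ord(ch) <= 0xF8FF

assert is_pua("\uf894")       # the value I saw for the invalid byte
assert not is_pua("\u05ea")   # a genuine cp1255 mapping (ת)
```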
"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message
news:ub9XLvsz...@TK2MSFTNGP03.phx.gbl...
The behavior with and without flags can't change; that would alter the
behavior of the code page, which can't happen.
Note that characters in the "C1 control" range are not mapped to the PUA and
do not fail with an error when the flag is passed. So the behavior is
designed to be mildly incomplete, unfortunately. :-(
The Unicode default character is clearly not for this purpose, it is
something for the code to shove in if it has problematic data (mainly CJK
invalid sequences, I believe). It is not as useful as the docs would tend to
imply, because of this....
--
MichKa [Microsoft]
"Chris Shearer Cooper" <chri...@sc3.net> wrote in message
news:13aer42...@corp.supernews.com...
Why can't Windows 2008 (or whatever) come with a slightly different version
of code page 1255 - one where position 0xFB maps to something totally
different? I guess I just don't understand the underlying code page
technology ...
Thanks,
Chris
"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message
news:updhEKtz...@TK2MSFTNGP04.phx.gbl...
This particular behavior has stood for over a decade and a half without
anyone ever complaining (since fundamentally it makes no sense to use code
pages for data they are not meant to contain).
--
MichKa [Microsoft]
"Chris Shearer Cooper" <chri...@sc3.net> wrote in message
news:13aesea...@corp.supernews.com...
Isn't it true that code page 1252 was modified to add the Euro symbol at
some point post-Win95?
I know that I had to do a number of extra steps when converting several
documents given to me by someone using Win95 (or maybe 98) because of this
very problem. The definition of cp1252 was different on his machine vs.
mine.
(or is that why you said "over a decade" :-)
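The change is easy to demonstrate today: byte 0x80, unassigned in the original cp1252, now decodes to the euro sign:

```python
# Post-Euro-update cp1252: 0x80 maps to U+20AC.
assert b"\x80".decode("cp1252") == "\u20ac"
print(b"\x80".decode("cp1252"))  # -> €
```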
Bob
"Michael S. Kaplan [MSFT]" <mic...@online.microsoft.com> wrote in message
news:e8ZQ5atz...@TK2MSFTNGP05.phx.gbl...
http://blogs.msdn.com/michkap/archive/2005/02/06/368081.aspx
:-)
--
MichKa [Microsoft]
"Bob Eaton" <pete_de...@hotmail.com> wrote in message
news:OUvymPuz...@TK2MSFTNGP05.phx.gbl...