
console - umlaut characters


Gernot Frisch

Mar 17, 2009, 1:02:36 PM

Hi,


nice problem here.

The user calls this program:
-------------------
int main()
{
char fn[1024];
gets(fn);
FILE* pF = fopen(fn,"wb"); fprintf(pF, "xx"); fclose(pF);
return 0;
}
--------------------

and types:
äöü.txt

In the windows explorer I see a file: " ".txt (or something).

The problem is the fact that the console uses IBM charset and my source code
ANSI. Fine.
fopen("äöü.txt", "wb"); creates a file "äöü.txt", thus the source editor is
correct. It's the console that's wrong.

Funnily, the CMD.exe:
ECHO "xx" > äöü.txt

will create a file "äöü.txt", which proves that the windows command line
tools somehow _are_ aware of this and can handle it properly.

Question: How can I do that?

Regards,
-Gernot


Gernot Frisch

Mar 17, 2009, 1:06:28 PM

> Funnily, the CMD.exe:
> ECHO "xx" > äöü.txt
>
> will create a file "äöü.txt

Wrong - same false result!

Alf P. Steinbach

Mar 17, 2009, 1:12:45 PM
* Gernot Frisch:

Well, there are several levels of solution.

First off, for your own convenience when using narrow character based command
line tools, you should set the command interpreter window to codepage 1252,

chcp 1252

That should make the example above work nicely.

Second, to avoid doing that all the time you should change the default OEM
codepage to 1252. That can be done via an undocumented registry value. See <url:
http://alfps.izfree.com/tutorials/w32cpptut/html/w32cpptut_01_02_06.html>.

Third, Windows ANSI narrow character set is better than what you had, but you
really want to access command line arguments as Unicode, for a much wider range
of characters. To do that you can either use the Windows API directly,
GetCommandLineW and CommandLineToArgsW (if memory serves me right, check), or if
your compiler supports it you can use wMain instead of main. The wMain thing is
a Microsoft language extension. Hence, it probably only works with Visual C++,
so I'd choose to use the Windows API -- it also gives better control.
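
A rough sketch of the Windows API route (untested, and checked now: the parse
function is CommandLineToArgvW, declared in <shellapi.h>, link with shell32.lib).
Note that your program reads the name from standard input rather than the command
line, so treat this purely as an illustration of the Unicode argument handling:

#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */

int main()
{
    int argc = 0;
    /* Parse the full command line as Unicode (UTF-16). */
    LPWSTR* argv = CommandLineToArgvW(GetCommandLineW(), &argc);

    if (argv != 0 && argc > 1)
    {
        /* Create the file named by the first argument via the wide API. */
        HANDLE h = CreateFileW(argv[1], GENERIC_WRITE, 0, 0,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
        if (h != INVALID_HANDLE_VALUE)
        {
            DWORD written = 0;
            WriteFile(h, "xx", 2, &written, 0);
            CloseHandle(h);
        }
    }
    if (argv != 0)
        LocalFree(argv);   /* the argv block comes from a single LocalAlloc */
    return 0;
}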


Cheers & hth.,

- Alf

--
Due to hosting requirements I need visits to <url: http://alfps.izfree.com/>.
No ads, and there is some C++ stuff! :-) Just going there is good. Linking
to it is even better! Thanks in advance!

Codeplug

Mar 17, 2009, 3:11:32 PM
Alf P. Steinbach wrote:
> * Gernot Frisch:
>>
>> The user calls this program:
>> -------------------
>> int main()
>> {
>> char fn[1024];
>> gets(fn);
>> FILE* pF = fopen(fn,"wb"); fprintf(pF, "xx"); fclose(pF);
>> return 0;
>> }
>> --------------------
>>
>> and types:
>> äöü.txt

>>
>> In the windows explorer I see a file: " ".txt (or something).
>>
>> The problem is the fact that the console uses IBM charset and my
>> source code ANSI. Fine.
>> fopen("äöü.txt", "wb"); creates a file "äöü.txt", thus the source
>> editor is correct. It's the console that's wrong.
>>
>> Funnily, the CMD.exe:
>> ECHO "xx" > äöü.txt
>>
>> will create a file "äöü.txt", which proves that the windows command
>> line tools somehow _are_ aware of this and can handle it properly.
>>
>> Question: How can I do that?
>
> Well, there are several levels of solution.
>
> First off, for your own convenience when using narrow character based
> command line tools, you should set the command interpreter window to
> codepage 1252,
>
> chcp 1252
>
> That should make the example above work nicely.

That depends on the system's Ansi codepage (ACP).

> Second, to avoid doing that all the time you should change the default
> OEM codepage to 1252. That can be done via an undocumented registry
> value. See <url:
> http://alfps.izfree.com/tutorials/w32cpptut/html/w32cpptut_01_02_06.html>.

This might be ok for *your* machine, but not the best solution for the
general case.

What you have to keep in mind is that there's the console input
codepage, and the system ACP - which may not be the same. The Ansi
versions of the Win32 API use the ACP, and fopen eventually calls
CreateFileA(). So if the console input CP is different from the ACP,
then the input needs to be converted to the ACP:

#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <windows.h>

int main()
{
    char fn[1024];

    const char *p;
    FILE *pF;

    /* not required for this example, but for consistency... */
    setlocale(LC_ALL, "");

    printf("incp = %u\n"
           "outcp = %u\n"
           "acp = %u\n",
           GetConsoleCP(),
           GetConsoleOutputCP(),
           GetACP());

    printf("filename: ");
    gets(fn);

    if (GetConsoleCP() != GetACP())
    {
        wchar_t wtmp[sizeof(fn)];
        int srclen = strlen(fn);

        /* "normalize" the input into Unicode */
        int dstlen = MultiByteToWideChar(GetConsoleCP(), 0,
                                         fn, srclen,
                                         wtmp, sizeof(fn));
        /* convert to ACP for fopen(), which
           eventually calls CreateFileA() */
        if (dstlen)
            WideCharToMultiByte(GetACP(), 0,
                                wtmp, dstlen,
                                fn, sizeof(fn),
                                0, 0);
    }

    for (p = fn; *p; ++p)
        printf("0x%X,", (unsigned int)(unsigned char)*p);
    printf("\n");

    pF = fopen(fn, "wb");

    if (pF)
    {
        fprintf(pF, "xx");
        fclose(pF);
    }
    else
        printf("fopen failed\n");

    return 0;
}

> Third, Windows ANSI narrow character set is better than what you had,
> but you really want to access command line arguments as Unicode, for a
> much wider range of characters. To do that you can either use the
> Windows API directly, GetCommandLineW and CommandLineToArgsW (if memory
> serves me right, check), or if your compiler supports it you can use
> wMain instead of main. The wMain thing is a Microsoft language
> extension. Hence, it probably only works with Visual C++, so I'd choose
> to use the Windows API -- it also gives better control.

Working with Unicode API's is another option.

gg

Alf P. Steinbach

Mar 17, 2009, 3:40:40 PM
* Codeplug:

> Alf P. Steinbach wrote:
>> * Gernot Frisch:
>>>
>>> The user calls this program:
>>> -------------------
>>> int main()
>>> {
>>> char fn[1024];
>>> gets(fn);
>>> FILE* pF = fopen(fn,"wb"); fprintf(pF, "xx"); fclose(pF);
>>> return 0;
>>> }
>>> --------------------
>>>
>>> and types:
>>> äöü.txt

>>>
>>> In the windows explorer I see a file: " ".txt (or something).
>>>
>>> The problem is the fact that the console uses IBM charset and my
>>> source code ANSI. Fine.
>>> fopen("äöü.txt", "wb"); creates a file "äöü.txt", thus the source
>>> editor is correct. It's the console that's wrong.
>>>
>>> Funnily, the CMD.exe:
>>> ECHO "xx" > äöü.txt
>>>
>>> will create a file "äöü.txt", which proves that the windows command
>>> line tools somehow _are_ aware of this and can handle it properly.
>>>
>>> Question: How can I do that?
>>
>> Well, there are several levels of solution.
>>
>> First off, for your own convenience when using narrow character based
>> command line tools, you should set the command interpreter window to
>> codepage 1252,
>>
>> chcp 1252
>>
>> That should make the example above work nicely.
>
> That depends on the system's Ansi codepage (ACP).

Which I'm quite sure is 1252 for the case at hand.


>> Second, to avoid doing that all the time you should change the default
>> OEM codepage to 1252. That can be done via an undocumented registry
>> value. See <url:
>> http://alfps.izfree.com/tutorials/w32cpptut/html/w32cpptut_01_02_06.html>.
>>
>
> This might be ok for *your* machine, but not the best solution for the
> general case.
>
> What you have to keep in mind is that there's the console input
> codepage, and the system ACP - which may not be the same.

Where did you get the impression that I could be unaware of that, when that's
exactly what I helped with?


> The Ansi
> versions of the Win32 API uses the ACP, and fopen eventually calls
> CreateFileA(). So if the console input CP is different from the ACP,
> then the input needs to be converted to the ACP:

That's a very Very VERY ungood idea.

Please don't post bad advice in the group.

If you're going to deal with Unicode, then use Unicode, don't translate back up
from a previous lossy down-translation.

The term "lossy" in the previous sentence /should/ alert you to an aspect of the
above that's just plain silly, and anyway includes the problem that Gernot
wanted a solution for (in other words, the above is not a solution).

With that strong hint, can you see what the problem is?


>> Third, Windows ANSI narrow character set is better than what you had,
>> but you really want to access command line arguments as Unicode, for a
>> much wider range of characters. To do that you can either use the
>> Windows API directly, GetCommandLineW and CommandLineToArgsW (if
>> memory serves me right, check), or if your compiler supports it you
>> can use wMain instead of main. The wMain thing is a Microsoft language
>> extension. Hence, it probably only works with Visual C++, so I'd
>> choose to use the Windows API -- it also gives better control.
>
> Working with Unicode API's is another option.

GetCommandLineW etc. /are/ Unicode API functions.

Codeplug

Mar 17, 2009, 10:04:49 PM

That doesn't change the fact that the example does NOT "work nicely" if
the ACP is different from the input-CP.

>>> Second, to avoid doing that all the time you should change the
>>> default OEM codepage to 1252. That can be done via an undocumented
>>> registry value. See <url:
>>> http://alfps.izfree.com/tutorials/w32cpptut/html/w32cpptut_01_02_06.html>.
>>>
>>
>> This might be ok for *your* machine, but not the best solution for the
>> general case.
>>
>> What you have to keep in mind is that there's the console input
>> codepage, and the system ACP - which may not be the same.
>
> Where did you get the impression that I could be unaware of that, when
> that's exactly what I helped with?

Nowhere in your post, or in that link, do you mention CreateFileA or
the ACP. This being the root of the OP's problem, why would I expect you
to be aware of it if you didn't mention it?

The *point* is that changing the default in the registry is NO solution.
Bottom line is that fopen() eventually calls CreateFileA(), which in
turn uses the ACP. If the ACP is different from the input-CP, then a
conversion must be done. (Actually, you only need to convert if there
are any characters > 0x7F).
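
That check is trivial, by the way; something like this (untested sketch):

/* Returns non-zero if the string contains any non-ASCII bytes -
   only then is a codepage conversion actually needed. */
static int needs_conversion(const char *s)
{
    for ( ; *s; ++s)
        if ((unsigned char)*s > 0x7F)
            return 1;
    return 0;
}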

>> The Ansi versions of the Win32 API uses the ACP, and fopen eventually
>> calls CreateFileA(). So if the console input CP is different from the
>> ACP, then the input needs to be converted to the ACP:
>
> That's a very Very VERY ungood idea.
>
> Please don't post bad advice in the group.

3 "very"'s in a row, you must be convinced! Now how about trying to
convince someone else - with some actual evidence or explanations to
back up your statements. It's not even "advice", it's just a fact. Sure,
the CP-to-CP conversion may be "lossy", but that doesn't change the fact
that it has to be done for the characters to be anything meaningful.

> If you're going to deal with Unicode, then use Unicode, don't translate
> back up from a previous lossy down-translation.

Not sure what you mean here, but just to be clear, a CP-to-Unicode
conversion typically isn't "lossy" under Windows.

> The term "lossy" in the previous sentence /should/ alert you to an
> aspect of the
> above that's just plain silly, and anyway includes the problem that Gernot
> wanted a solution for (in other words, the above is not a solution).
>
> With that strong hint, can you see what the problem is?

I'm very impressed that you know CP-to-CP conversions can be lossy. But
that doesn't change the fact that the root of the OP's problem was
that the CP-to-CP conversion was NOT being done. Standard input uses
one CP value, and fopen() uses another CP value - if they aren't the
same, you must convert (or utilize another solution).

>>> Third, Windows ANSI narrow character set is better than what you had,
>>> but you really want to access command line arguments as Unicode, for
>>> a much wider range of characters. To do that you can either use the
>>> Windows API directly, GetCommandLineW and CommandLineToArgsW (if
>>> memory serves me right, check), or if your compiler supports it you
>>> can use wMain instead of main. The wMain thing is a Microsoft
>>> language extension. Hence, it probably only works with Visual C++, so
>>> I'd choose to use the Windows API -- it also gives better control.
>>
>> Working with Unicode API's is another option.
>
> GetCommandLineW etc. /are/ Unicode API functions.

Thanks for the clarification... I think my quoting style (or lack thereof)
is adding meaning to my posts which I don't intend...

I agree that using Unicode from the onset is the better/best solution,
and avoids potential "round-trip" conversion loss. The point was to
inform the OP and readers of the two separate codepages and show a
correct "Ansi only" solution for the general case - ie. no /faulty/ ACP
== input-CP assumptions.

gg

Codeplug

Mar 17, 2009, 10:06:16 PM
Code correction:
int srclen = strlen(fn);
Should be:
int srclen = strlen(fn) + 1;
to include the null terminator in the conversion.

gg

Alf P. Steinbach

Mar 17, 2009, 10:43:42 PM
* Codeplug:
> [Idiocy]

I'm sorry I responded to Codeplug's article, I didn't mean to feed the troll.


Cheers, & again, sorry,

Codeplug

Mar 17, 2009, 11:44:05 PM
Alf P. Steinbach wrote:
> * Codeplug:
>> [Idiocy]

I welcome you, or anyone, to a [mature] debate on the fact that I've
provided correct and pertinent information.

gg

Gernot Frisch

Mar 18, 2009, 4:52:05 AM


> First off, for your own convenience when using narrow character based
> command line tools, you should set the command interpreter window to
> codepage 1252,
>
> chcp 1252

I see now.
I have a few problems.
With chcp 1252, the input of "äöü" looks wrong.
I found that I can convert input/output with OemToAnsi and AnsiToOem - is
that a good idea?
I can't build a Unicode application, for compatibility with a lot of old code.

Thank you for the quick replies.


Codeplug

Mar 18, 2009, 7:28:06 AM
Gernot Frisch wrote:
> With chcp 1252, the input of "äöü" looks wrong.

You'll have to be more descriptive than "looks wrong".

> I found that I can convert input/output with OemToAnsi and AnsiToOem -
> is that a good idea?

No. It doesn't do what you need. The code I posted is a correct
"ansi-only" implementation. A better implementation is to use _wfopen()
to avoid a round-trip codepage conversion:

#include <stdio.h>
#include <string.h>
#include <windows.h>

int main()
{
    char fn[1024];

    wchar_t wfn[sizeof(fn)];
    FILE *pF;

    printf("filename: ");
    gets(fn);

    MultiByteToWideChar(GetConsoleCP(), 0,
                        fn, strlen(fn) + 1,
                        wfn, sizeof(fn));

    pF = _wfopen(wfn, L"wb");

    if (pF)
    {
        fprintf(pF, "xx");
        fclose(pF);
    }
    else
        printf("fopen failed\n");

    return 0;
}

gg

Alf P. Steinbach

Mar 18, 2009, 10:16:47 AM
* Gernot Frisch:

>
>> First off, for your own convenience when using narrow character based
>> command line tools, you should set the command interpreter window to
>> codepage 1252,
>>
>> chcp 1252
>
> I see now.
> I have a few problems.
> With chcp 1252, the input of "äöü" looks wrong.

You should also set a TrueType font for console windows, e.g. Lucida Console.

Right-click window title in console window, choose [Defaults]. This affects new
console windows, not the current one.


> I found that I can convert input/output with OemToAnsi and AnsiToOem -
> is that a good idea?

No, it isn't, regardless of which OemToAnsi you're talking about.

Converting OEM -> ANSI means that you already have the input converted down to
the OEM narrow character encoding, which is a lossy conversion, and you follow
that by another lossy conversion, i.e. you then have UNICODE -> OEM -> ANSI.

Instead, if you absolutely need to use ANSI Windows functions (e.g. via standard
library functions), and you absolutely need to have interactive console input,
change the console's OEM codepages programmatically to match the ANSI one.

Note that there are two OEM codepages, one for input and one for output
(Microsoft thought that naturally, you'll use different codepages for input and
output).

Change both (and of course, restore originals on exit).
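
A minimal sketch of what I mean (untested), doing the actual i/o via the
standard library:

#include <stdio.h>
#include <string.h>
#include <windows.h>

int main()
{
    char fn[1024];
    FILE* pF;

    /* Remember the user's codepages so they can be restored on exit. */
    UINT oldInCP  = GetConsoleCP();
    UINT oldOutCP = GetConsoleOutputCP();

    /* Make both console codepages match the ANSI codepage. */
    SetConsoleCP(GetACP());
    SetConsoleOutputCP(GetACP());

    printf("filename: ");
    if (fgets(fn, sizeof(fn), stdin) != 0)
    {
        fn[strcspn(fn, "\n")] = '\0';   /* strip the trailing newline */
        pF = fopen(fn, "wb");
        if (pF) { fprintf(pF, "xx"); fclose(pF); }
    }

    /* Restore the user's original console codepages. */
    SetConsoleCP(oldInCP);
    SetConsoleOutputCP(oldOutCP);
    return 0;
}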


> I can't build a unicode appliction for compatibitily with a lot of old
> code.

OK, then you're limited to narrow characters, and the least bad solution is then
as outlined above, programmatically changing the OEM codepages to match internal
requirements (this way you have only /one/ lossy conversion, and it's the same
one as for most other narrow character based programs).

Actually there could have been a better solution, namely UTF-8 as in *nix-land,
except that the UTF-8 codepage in Windows XP Does Not Work.

It's so bad that when setting UTF-8 codepage the command interpreter refuses to
run batch files or e.g. the 'more' command. The latter complains that there's
not enough memory. The whole thing is very messed up, which is another reason to
aim for simplicity -- giving Windows fewer cracks to drop bugs in. :-)

Codeplug

Mar 18, 2009, 11:23:17 AM
Alf P. Steinbach wrote:
> * Gernot Frisch:
[snip]

> Instead, if you absolutely need to use ANSI Windows functions (e.g. via
> standard library functions), and you absolutely need to have interactive
> console input, change the console's OEM codepages programmatically to
> match the ANSI one.

The whole purpose of 'chcp', and the previously mentioned registry key,
is so that the *user* can specify the codepage (CP) to use. What if you
wanted to pipe input from a file encoded with a CP different from the
ACP? What if the user wants to enter glyphs only belonging to a CP
different from the ACP? etc...

> Note that there are two of OEM codepages, one for input and one for
> output (Microsoft thought that naturally, you'll use different codepages
> for input and output).

The more correct terminology is "console input/output codepage". The
term "OEM codepage" is typically used to describe the default codepage
used by the console, which is also the value returned by GetOEMCP(),
which is also the value changed by the previously mentioned registry key.

>> I can't build a unicode appliction for compatibitily with a lot of old
>> code.
>
> OK, then you're limited to narrow characters, and the least bad solution
> is then as outlined above, programmatically changing the OEM codepages
> to match internal requirements (this way you have only /one/ lossy
> conversion, and it's the same one as for most other narrow character
> based programs).

The _wfopen() solution already provided is just as lossy, and doesn't
rudely change the user's console input-CP setting.

gg

Alf P. Steinbach

Mar 18, 2009, 11:57:00 AM
* Codeplug:
> [trolling]

> * Alf P. Steinbach wrote:
>> * Gernot Frisch:
> [snip]
>> Instead, if you absolutely need to use ANSI Windows functions (e.g.
>> via standard library functions), and you absolutely need to have
>> interactive console input, change the console's OEM codepages
>> programmatically to match the ANSI one.
>
> The whole purpose of 'chcp', and the previously mentioned registry key,
> is so that the *user* can specify the codepage (CP) to use.

In a way. Note that the registry key is largely undocumented and the chcp
command is full of bugs for use other than the concrete usage I showed. So it's
not exactly geared towards ordinary users: it's for programmers.


> What if you
> wanted to pipe input from a file encoded with a CP different from the
> ACP? What if the user wants to enter glyphs only belonging to a CP
> different form the ACP? etc...

Both examples you mention are examples of *different* problems, with different
solutions.

What if the user wants to color his or her hair red?

Then some other solution than what has been discussed here is called for, yes.


>> Note that there are two of OEM codepages, one for input and one for
>> output (Microsoft thought that naturally, you'll use different
>> codepages for input and output).
>
> The more correct terminology is "console input/output codepage". The
> term "OEM codepage" is typically used to describe the default codepage
> used by the console, which is also the value returned by GetOEMCP(),
> which is also the value changed by the previously mentioned registry key.

Microsoft has never managed to define a terminology for anything.

So in this context the only criterion for "more correct" is ease of understanding.

Measure it if you have any questions, and/or before providing opinions on the
matter.


>>> I can't build a unicode appliction for compatibitily with a lot of
>>> old code.
>>
>> OK, then you're limited to narrow characters, and the least bad
>> solution is then as outlined above, programmatically changing the OEM
>> codepages to match internal requirements (this way you have only /one/
>> lossy conversion, and it's the same one as for most other narrow
>> character based programs).
>
> The _wfopen() solution already provided is just as lossy, and doesn't
> rudely change the users console input-CP setting.

_wfopen /is/ Unicode.

Codeplug

Mar 18, 2009, 12:26:52 PM
Alf P. Steinbach wrote:
>> What if you wanted to pipe input from a file encoded with a CP
>> different from the ACP? What if the user wants to enter glyphs only
>> belonging to a CP different form the ACP? etc...
>
> Both examples you mention are examples of *different* problems, with
> different solutions.
>
> What if the user wants to color his or her hair red?

You seem to be confused. The OP's problem relates to getting standard
input. The "what-if's" I mentioned above are related to the *usage* of
standard input. Part of getting [non-wide] standard input is the
input-CP used - which the user chooses. Your advice/suggestion of
changing the input-CP, ignoring the user's selection, is simply rude,
and prevents the "what-if" use-cases above from being possible.

> Microsoft has never managed to define a terminology for anything.
>
> So in this context the only criterion for "more correct" is ease of
> understanding.

In this case, anyone who's literate can read MSDN and see the clearly used
terminology - which you were obviously misusing.
http://msdn.microsoft.com/en-us/library/ms683162(VS.85).aspx
http://msdn.microsoft.com/en-us/library/ms683169(VS.85).aspx

gg

Alf P. Steinbach

Mar 18, 2009, 12:55:49 PM
* Codeplug:
> [trolling]

> Alf P. Steinbach wrote:
>>> What if you wanted to pipe input from a file encoded with a CP
>>> different from the ACP? What if the user wants to enter glyphs only
>>> belonging to a CP different form the ACP? etc...
>>
>> Both examples you mention are examples of *different* problems, with
>> different solutions.
>>
>> What if the user wants to color his or her hair red?
>
> You seem to be confused.

Thanks. :)


> The OP's problem relates to getting standard input.

Really.


> The "what-if's" I mentioned above are related to the *usage* of
> standard input.

Oh, I sort of ignored that, didn't comment on it. Sorry. The examples you gave
were indeed /related/ in some way to the OP's problem.

And so is the problem of coloring your hair red.

For that, you need money, and doing programming work is ultimately about someone
earning money.


> Part of getting [non-wide] standard input is the
> input-CP used - which the user chooses. Your advice/suggestion of
> changing the input-CP, ignoring the user's selection, is simply rude,

I'm sorry, but regarded as a technical posting (instead of the intentional
trolling that it surely is) that's pure plain idiocy.

However, for other readers, who unfortunately may be swayed by your trolling:

With conversion in your own code, from narrow to narrow, the conversions
involved when using a filename typed by the user are

Unicode -> OEM -> ANSI -> Unicode

With conversion by changing the OEM codepages (while the program runs) to the
ANSI one, the conversions are

Unicode -> ANSI -> Unicode

In case you wonder about the "-> Unicode", that's a conversion applied by the
Windows API when you, indirectly or directly, call an ANSI API routine.

The first conversion sequence, which you appear to champion (I'm writing "appear
to" because I'm sure you're trolling, you can't be that dumb!), will drop more
characters, that is, it's unable to represent them at every step of the
conversion chain, than the second.


> and prevents the "what-if" use-cases above from being possible.

Technically that's an unwarranted and incorrect conclusion.


>> Microsoft has never managed to define a terminology for anything.
>>
>> So in this context the only criterion for "more correct" is ease of
>> understanding.
>
> In this case, anyone who's literate can read MSDN and see clearly used
> terminology - of which you were obviously miss-using.
> http://msdn.microsoft.com/en-us/library/ms683162(VS.85).aspx
> http://msdn.microsoft.com/en-us/library/ms683169(VS.85).aspx

This latter idiocy is just pure trolling, no technical content.


Cheers, & hth.,

Codeplug

Mar 18, 2009, 3:27:45 PM
I'm new to usenet, so I had to look up this "trolling" you keep referring
to...I found this:
http://captaininfinity.us/rightloop/alttrollFAQ.htm

Based on the definitions provided there, aren't *you* the one who is
"trolling" me with all the name calling and "statements designed to
invoke some emotional response"?

You seem to be having a hard time participating in a mature,
intellectual debate of the subject at hand. If I make a statement that
you don't agree with, then state what it is you don't agree with and
why. That's all I've ever done towards you. All this name-calling, and
"you're a troll", immature nonsense is a *waste* of everyone's time.

If *any* other readers think that the content or style of my posts is
"troll-like" in any way, please point it out to me, because it certainly
isn't my intention and I wouldn't want to repeat the offense.

In response to your on-topic statements...
> The first conversion sequence, which you appear to champion ... will
> drop more characters, that is, it's unable to represent them at every
> step of the conversion chain, than the second.

I understand that the conversion is lossy. I acknowledge that setting
the input-CP to the ACP, [SetConsoleCP(GetACP())], prevents this lossy
conversion.

The "better" solution I proposed, and have been referring to, used
_wfopen(). Here's both my solution, and your solution (as I understand
it) in code:

#include <stdio.h>
#include <string.h>
#include <windows.h>

/*#define USE_ACP_INPUT*/

int main()
{
    char fn[1024];

    wchar_t wfn[sizeof(fn)];
    FILE *pF;

#ifdef USE_ACP_INPUT
    SetConsoleCP(GetACP());
#endif

    printf("filename: ");
    gets(fn);

#ifdef USE_ACP_INPUT

    pF = fopen(fn, "wb");

#else

    MultiByteToWideChar(GetConsoleCP(), 0,
                        fn, strlen(fn) + 1,
                        wfn, sizeof(fn));

    pF = _wfopen(wfn, L"wb");

#endif

    if (pF)
    {
        fprintf(pF, "xx");
        fclose(pF);
    }
    else
        printf("fopen failed\n");

#ifdef USE_ACP_INPUT
    /* todo - restore original input-CP */
#endif

    return 0;
}

My interpretation of your solution is the resulting code when
USE_ACP_INPUT is defined. My solution is with it un-defined. Please tell
me if I've missed something.

As previously stated, your conversion is:
Unicode -> ACP -> Unicode
And mine is:
Unicode -> OEM -> Unicode

As you can see, the conversion steps are identical, except I use the
user's chosen input-CP.

I claim that mine is "better" because it uses the user's input-CP
instead of ignoring it. Otherwise, I see no other differences between
the two methods.

gg

Alf P. Steinbach

Mar 18, 2009, 3:56:36 PM
* Codeplug:
> I'm new to usenet

No, that's a lie.


>, so I had to lookup this "trolling" you keep referring
> to...I found this:
> http://captaininfinity.us/rightloop/alttrollFAQ.htm
>
> Based on the definitions provided there, aren't *you* the one who is
> "trolling" me with all the name calling and "statements designed to
> invoke some emotional response"?

Trolling is about trying to evoke emotional responses, yes.

One of the most effective ways is to post idiotic but technical sounding (non-)
arguments in a technical group, as you have done and continue to do.

That's why you're anonymous.


[snip]


> If *any* other readers think that the content or style of my posts are
> "troll-like" in any way, please point it out to me because it certainly
> isn't my intention, and wouldn't want to repeat the offense.

Your offense is trolling and you're doing it right here.


> In response to your on-topic statements...
>> The first conversion sequence, which you appear to champion ... will
>> drop more characters, that is, it's unable to represent them at every
>> step of the conversion chain, than the second.
>
> I understand that the conversion is lossy. I acknowledge that setting
> the input-CP to the ACP, [SetConsoleCP(GetACP())], prevents this lossy
> conversion.

Good.

The argument above is a new one, comparing apples to mules (or whatever).

Indeed, as I stated at the start of this thread and you strenuously opposed,
it's better to use Unicode in Windows, so if anyone could own a method (that's
idiocy, by the way) it would be "my" method, not yours, as you claim.

However, your implementation here of Unicode handling is ungood, even when
disregarding that you're using unsafe C (allowing buffer overflows etc.) instead
of C++. When the program uses Unicode internally there's no need to go down to
the narrow character level at all. Instead do interactive i/o also in Unicode.

Codeplug

Mar 18, 2009, 5:51:59 PM
Alf P. Steinbach wrote:
> * Codeplug:
>> I'm new to usenet
>
> No, that's a lie.

Sorry to disappoint. I've previously only posted at
http://www.codeguru.com/forum/ and http://cboard.cprogramming.com/,
under the user-name "Codeplug". All of my nntp postings to date are
viewable from Google groups.

> One of the most effective ways is to post idiotic but technical sounding
> (non-) arguments in a technical group, as you have done and continue to do.

I challenge you to identify and quote any of my postings, anywhere, and
prove them to be "idiotic", technically unsound, or a "non-argument".

>> As previously stated, your conversion is:
>> Unicode -> ACP -> Unicode
>> And mine is:
>> Unicode -> OEM -> Unicode
>>
>> As you can see, the conversion steps are identical, except I use the
>> user's chosen input-CP.
>>
>> I claim that mine is "better" because it uses the user's input-CP
>> instead of ignoring it. Otherwise, I see no other differences between
>> the two methods.
>
> The argument above is a new one, comparing apples to mules (or whatever).

Please explain why you think this is a new "argument"? At the very
beginning of this sub-thread, I posted: "A better implementation is to
use _wfopen() to avoid a round-trip codepage conversion: " and the code
followed (this is post #13 on Google groups). You then replied with your
SetConsoleCP(GetACP()) solution. From that point on, I argued the
reasons why your solution was sub-par to mine - but we obviously were
not "on the same page". My last quote on post #15 is "The _wfopen()

solution already provided is just as lossy, and doesn't rudely change

the users console input-CP setting." This is exactly what I "claim"
above - my argument is not a new one.

> Indeed, as I stated at the start of this thread and you strenuously
> opposed, it's better to use Unicode in Windows

At no point did I oppose the use of Unicode. Please point out where in
my posts you got this impression.

Your first post had a "First", "Second", and "Third" parts...
First, I pointed out the fact that only changing the input-CP via 'chcp'
does not guarantee that the OP's example will "work nicely".
Second, I pointed out that setting the default input-CP via the registry
isn't a general solution - even if you weren't implying it as such.
Third, you mention using the Unicode command line - which is fine,
except the OP's example used standard input.

> However, your implementation here of Unicode handling is ungood

If you have any criticisms of the code, I would love to hear them. Just
telling me it's "bad" doesn't point out where I may have gone wrong. I'm
always willing to learn from my mistakes, so please explain what part of
the code is bad.

Please tell me, does the "USE_ACP_INPUT code" accurately represent your
proposed solution? If not, would you please post a compilable solution so we
can compare the code. Or do you concede that leaving USE_ACP_INPUT
undefined is better?

> even when disregarding that you're using unsafe C (allowing buffer
> overflows etc.) instead of C++.

Well, the OP's original example was in C, so I kept my code as C. And
yes, gets() is a horrible function. I actually started to change it to
fgets(), but then I would have to remove the newline...so I just kept
the gets() in the interest of brevity.

> When the program uses Unicode internally there's no need to go down
> to the narrow character level at all. Instead do interactive i/o also
> in Unicode.

I agree 100%, however, it's a bit more work since the MS-CRT doesn't
support true Unicode I/O from the console (stdin). Unicode input from
the console is only possible with ReadConsoleW() or ReadConsoleInputW().
The real root of this limitation of the MS-CRT is that all "read"
functions eventually call ReadFile, which does not support the reading
of Unicode from a console - as stated here:
http://msdn.microsoft.com/en-us/library/ms683458(VS.85).aspx
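
For completeness, here's roughly what that looks like (untested; it only
works when stdin really is a console, not a redirected pipe or file):

#include <stdio.h>
#include <windows.h>

int main()
{
    wchar_t wfn[1024];
    DWORD nRead = 0;
    FILE *pF;

    printf("filename: ");
    fflush(stdout);

    /* Read the filename as UTF-16 directly from the console. */
    if (!ReadConsoleW(GetStdHandle(STD_INPUT_HANDLE),
                      wfn, 1023, &nRead, 0))
        return 1;
    wfn[nRead] = L'\0';

    /* Strip the trailing CR/LF that ReadConsoleW leaves in the buffer. */
    while (nRead && (wfn[nRead - 1] == L'\r' || wfn[nRead - 1] == L'\n'))
        wfn[--nRead] = L'\0';

    pF = _wfopen(wfn, L"wb");

    if (pF)
    {
        fprintf(pF, "xx");
        fclose(pF);
    }
    else
        printf("fopen failed\n");

    return 0;
}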

gg

Uwe Sieber

Mar 19, 2009, 2:04:24 AM
Gernot Frisch wrote:
>
> The user calls this program:
> -------------------
> int main()
> {
> char fn[1024];
> gets(fn);
> FILE* pF = fopen(fn,"wb"); fprintf(pF, "xx"); fclose(pF);
> return 0;
> }
> --------------------
>
> and types:
> äöü.txt
>
> In the windows explorer I see a file: " ".txt (or something).
>
> The problem is the fact that the console uses IBM charset and my source
> code ANSI. Fine.
> fopen("äöü.txt", "wb"); creates a file "äöü.txt", thus the source editor
> is correct. It's the console that's wrong.

Windows can do the conversion for you. You can control
whether a conversion is done by calling
SetFileApisToOEM and SetFileApisToANSI.

When dealing with file names from user's console input,
call SetFileApisToOEM before. When dealing with file names
from source code, call SetFileApisToANSI before.

Yes, it's lossy but good enough for umlauts. Since
the default ANSI and OEM codepages correlate, there
is no loss for 'real live characters' found on the
user's keyboard.
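
A minimal sketch of what I mean (untested):

#include <stdio.h>
#include <string.h>
#include <windows.h>

int main()
{
    char fn[1024];
    FILE *pF;

    printf("filename: ");
    if (fgets(fn, sizeof(fn), stdin) == 0)
        return 1;
    fn[strcspn(fn, "\n")] = '\0';   /* strip the newline from fgets */

    /* The name came from the console, so let the file APIs
       interpret narrow strings as OEM from now on. */
    SetFileApisToOEM();

    pF = fopen(fn, "wb");        /* ends up in CreateFileA */
    if (pF)
    {
        fprintf(pF, "xx");
        fclose(pF);
    }

    SetFileApisToANSI();         /* back to the default */
    return 0;
}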

Changing the codepage is not good. The default console
font is a bitmap font, and changing the codepage to ANSI
leads to a wrong display with it.
And I would not bet on setting the console codepage to
the user's ANSI codepage working on all localized Windows
versions. It's wrong to assume that 1252 is the one and
only ANSI codepage.

Using Unicode is more flexible but also has drawbacks.
While for umlauts there is no difference, there is
one for other charsets.
For instance, when a Russian enters Cyrillic letters on
his CP 866 console, then he probably prefers to have
the Cyrillic characters in ANSI CP 1251 on his drive
instead of Unicode, because many old applications cannot
deal with Unicode.
No right or wrong here, just a decision.


Uwe


Uwe Sieber

Mar 19, 2009, 4:02:32 AM
Uwe Sieber wrote:
> Using Unicode is more flexible but also has drawbacks.
> While for umlauts there is no difference, there is
> one for other charsets.
> For instance when a russian enters kyrillc letters on
> his CP 866 console, then he probably prefers to have
> the kyrillic charcters in ANSI CP 1251 on his drive
> instead of unicode, because many old applications cannot
> deal with unicode.

Never mind this section, it's nonsense...

There is no decision between having ANSI or Unicode on disk.
Writing with the A functions to an NTFS file system, it's
translated to Unicode anyway. Writing with the W functions
to a FAT file system, it's translated to ANSI.
Using Unicode is ok but more effort.


Uwe

Gernot Frisch

Mar 19, 2009, 4:31:27 AM

Changing the codepage worked for me. Excellent idea.

Please stop arguing, it hurts. I wouldn't dare ask a question if that were
the result ;).

-Gernot

Uwe Sieber

Mar 19, 2009, 8:45:46 AM

I was wrong again :-(
FAT file systems support Unicode via the VFAT extension,
so there is no ANSI translation required.

Nevertheless, just using SetFileApisToOEM and SetFileApisToANSI
is right. The char string is passed to CreateFileA, which
translates it to WCHAR and calls CreateFileW.
CreateFileW translates it to UNICODE_STRING and calls
ZwCreateFile. Only one translation is done.


Uwe


Codeplug

Mar 19, 2009, 10:45:04 AM
Uwe Sieber wrote:
[snip]

>
> Nevertheless just using SetFileApisToOEM and SetFileApisToANSI
> is right. The char string is passed to CreateFileA which
> translates it to WCHAR and calles CreateFileW.
> CreateFileW translates it to UNICODE_STRING and calles
> ZwCreateFile. Only one translation is done.

My only concern with using SetFileApisToOEM() is that it may use GetOEMCP()
internally for the conversion. If this is the case, then it could "do
the wrong thing" if GetOEMCP() != GetConsoleCP().

gg

Uwe Sieber

Mar 20, 2009, 2:18:37 AM

Good point, something to check.

Uwe

Codeplug

Mar 20, 2009, 9:48:30 AM

Thanks - I created a little experiment to check the behavior:

/* main.cpp */
#include <stdio.h>
#include <windows.h>

void uputc(wchar_t c)
{
    // Note: Use Lucida Console font for best results
    DWORD written;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), &c, 1, &written, 0);
}//uputc

int main()
{
    // attempt to ensure GetOEMCP() and GetConsoleCP() are different
    if (GetOEMCP() == GetConsoleCP())
        SetConsoleCP(1252);

    if (GetOEMCP() == GetConsoleCP())
    {
        puts("This experiment requires GetOEMCP() != GetConsoleCP()");
        return 1;
    }//if

    const wchar_t o_diaeresis = L'\u00F6';
    char o_diaeresis_oemcp;

    // determine value for o_diaeresis_oemcp
    BOOL bDefCharUsed = 0;
    int res = WideCharToMultiByte(GetOEMCP(), WC_NO_BEST_FIT_CHARS,
                                  &o_diaeresis, 1,
                                  &o_diaeresis_oemcp, 1,
                                  0, &bDefCharUsed);
    if ((res != 1) || bDefCharUsed)
    {
        puts("small o with diaeresis not in OEMCP");
        return 1;
    }//if

    printf("OEMCP = %u\n"
           "InputCP = %u\n"
           "o_diaeresis_oemcp = 0x%X\n",
           GetOEMCP(), GetConsoleCP(),
           (unsigned)(unsigned char)o_diaeresis_oemcp);

    SetFileApisToOEM();

    char fn[] = "o_diaeresis_X";
    fn[12] = o_diaeresis_oemcp;

    FILE* pF = fopen(fn, "wb");

    if (!pF)
    {
        puts("fopen failed");
        return 1;
    }//if

    fprintf(pF, "xx");
    fclose(pF);

    // determine what character would display if 'o_diaeresis_oemcp' were
    // converted to Unicode using the input-CP
    wchar_t wrong_src_cp_char;
    res = MultiByteToWideChar(GetConsoleCP(), 0,
                              &o_diaeresis_oemcp, 1,
                              &wrong_src_cp_char, 1);
    if (!res)
        wrong_src_cp_char = L'?';

    printf("Which filename do you see?\n"
           "'o_diaeresis_");
    uputc(o_diaeresis);
    printf("' - SetFileApisToOEM() used GetOEMCP()\n"
           "'o_diaeresis_");
    uputc(wrong_src_cp_char);
    printf("' - SetFileApisToOEM() used GetConsoleCP()\n");

    return 0;
}//main
/* end main.cpp */

My output:

OEMCP = 437
InputCP = 1252
o_diaeresis_oemcp = 0x94
Which filename do you see?
'o_diaeresis_ö' - SetFileApisToOEM() used GetOEMCP()
'o_diaeresis_”' - SetFileApisToOEM() used GetConsoleCP()

The file created was in fact "o_diaeresis_ö", which implies that
SetFileApisToOEM() does use GetOEMCP().

gg

Uwe Sieber

Mar 20, 2009, 4:07:15 PM
Codeplug wrote:
>>> Uwe Sieber wrote:
>>> [snip]
>>>>
>>>> Nevertheless just using SetFileApisToOEM and SetFileApisToANSI
>>>> is right. The char string is passed to CreateFileA which
>>>> translates it to WCHAR and calles CreateFileW.
>>>> CreateFileW translates it to UNICODE_STRING and calles
>>>> ZwCreateFile. Only one translation is done.
>>>
>>> My only concern with using SetFileApisToOEM() is that it uses
>>> GetOEMCP() internally for the conversion. If this is the case, then
>>> it could "do the wrong thing" if GetOEMCP() != GetConsoleCP().
>>
>> Good point, something to check.
>
> Thanks - I created a little experiment to check the behavior:
>
> /* main.cpp */
> [snip]

> /* end main.cpp */
>
> My output:
>
> OEMCP = 437
> InputCP = 1252
> o_diaeresis_oemcp = 0x94
> Which filename do you see?
> 'o_diaeresis_ö' - SetFileApisToOEM() used GetOEMCP()
> 'o_diaeresis_”' - SetFileApisToOEM() used GetConsoleCP()
>
> The file created was in fact "o_diaeresis_ö", which implies that
> SetFileApisToOEM() does use GetOEMCP().

Same result here and I agree with the conclusion.


Uwe
