
Reading UTF-8 input from the console

Matthias Kluwe

Mar 26, 2009, 9:11:00 AM
Hi!

I have a small program working with text input here. The inner
workings of the program are "8-bit-clean", which means it does not
care about the encoding of the text fed to it.

I just tried to work with console input and UTF-8, but it may not be
as easy as I thought. Basically, it boils down to this example:

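#include <windows.h>

#include <iostream>

int main() {
    // Ask the console to deliver keyboard input encoded as UTF-8.
    SetConsoleCP( CP_UTF8 );
    char line[100];
    // Echo each input line back unchanged.
    while ( std::cin.getline( line, 100 ) )
        std::cout << line << std::endl;
}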

getline does not read a line when the console input contains a
character like "é".

What am I missing here?

Regards,
Matthias

Mihai N.

Mar 27, 2009, 3:55:59 AM
> What am I missing here?
The fact that Windows is not Linux and the console does not officially
support anything outside the OEM code page.


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Matthias Kluwe

Mar 27, 2009, 4:18:11 AM
Hi!

On 27 Mar., 08:55, "Mihai N." <nmihai_year_2...@yahoo.com> wrote:
> > What am I missing here?
>
> The fact that Windows is not Linux and the console does not officially
> support anything outside the OEM code page.

Hmm, then what is "SetConsoleCP" all about? The documentation says

"Sets the input code page used by the console associated with the
calling process. A console uses its input code page to translate
keyboard input into the corresponding character value."

which means to me that the console does support different code pages,
officially.

Regards,
Matthias

Mihai N.

Mar 27, 2009, 4:49:55 AM
> Hmm, then what is "SetConsoleCP" all about? The documentation says
>
> "Sets the input code page used by the console associated with the
> calling process. A console uses its input code page to translate
> keyboard input into the corresponding character value."
>
> which means to me that the console does support different code pages,
> officially.

And do you really believe everything you read in MSDN?
:-)


<<ANSI functions of Win32 API assume the text is encoded for the current
console code page, which the system locale defines by default. The
SetConsoleCP and SetConsoleOutputCP functions change the code page used in
these operations.>>
http://msdn.microsoft.com/en-us/goglobal/bb688114.aspx#E2F

But UTF-8 cannot be an ANSI code page.


<<There is also more information in the Windows XP documentation, which does
hint at a problem in its small list of "supported" code pages:>>
http://blogs.msdn.com/michkap/archive/2006/03/06/544251.aspx

Matthias Kluwe

Mar 27, 2009, 11:47:45 AM
Hi!

On 27 Mar., 09:49, "Mihai N." <nmihai_year_2...@yahoo.com> wrote:
> > Hmm, then what is "SetConsoleCP" all about? The documentation says
>
> > "Sets the input code page used by the console associated with the
> > calling process. A console uses its input code page to translate
> > keyboard input into the corresponding character value."
>
> > which means to me that the console does support different code pages,
> > officially.
>
> And do you really believe everything you read in MSDN?
> :-)

Basically, yes. This is not the worst documentation I've seen yet.

> <<ANSI functions of Win32 API assume the text is encoded for the current
> console code page, which the system locale defines by default. The
> SetConsoleCP and SetConsoleOutputCP functions change the code page used in
> these operations.>>
>    http://msdn.microsoft.com/en-us/goglobal/bb688114.aspx#E2F

I've read this document, but I still cannot see how it applies to my
problem, as I'm not using the Win32 API directly. OK, I may be deep
inside the MS implementation of the C++ standard library, but this
should be nothing to worry about...

I just tried

> chcp 65001
> echo ä > out.txt

in a new console, and this writes 0xC3 0xA4 as expected in the
textfile.

If I enter the character 'ä' in my sample program, I would expect the
console to emit the same two bytes and std::cin.getline to read them.
Unfortunately, I'm still without an explanation of what is
_really_happening_ here.

Regards,
Matthias

Mihai N.

Mar 28, 2009, 6:12:41 AM
> Unfortunately, I'm still without an explanation of what is
> _really_happening_ here.

The explanation is simple, but you don't want to accept it:
the console is an ancient technology, and does not properly
support what you want.

Matthias Kluwe

Mar 29, 2009, 2:03:20 PM
Hi!

On 28 Mar., 12:12, "Mihai N." <nmihai_year_2...@yahoo.com> wrote:

> > Unfortunately, I'm still without an explanation of what is
> > _really_happening_ here.
>
> The explanation is simple, but you don't want to accept it:
> the console is an ancient technology, and does not properly
> support what you want.

That may be the case, and if so, I can accept it. But I'm still
curious and interested in an explanation, in technical terms. If the
explanation is simple, all the better.

I'd be glad if you could point me to some more suitable place to ask,
if you know one...

Regards,
Matthias

Mihai N.

Mar 30, 2009, 4:54:42 AM
> That may be the case, and if so, I can accept it. But I'm still curious and
> interested in an explanation, in technical terms. If the explanation
> is simple, all the better.

I don't have an explanation.
Probably the console is old and considered too low a priority to fix.
Since I don't work for MS, I don't have access to sources, or anything.


> I'd be glad if you could point me to some more suitable place to ask,
> if you know one...

Maybe Michael Kaplan (kind of slow with the posts lately)
http://blogs.msdn.com/michkap/

Or Raymond Chen (he deals with "OS history")
http://blogs.msdn.com/oldnewthing/

Anyway, the answer would only be for curiosity, nothing you can use.
There are better alternatives out there that are Unicode and fully
supported (see PowerShell).
