UTF-8 and Windows Console

Attila Tajti

Dec 2, 2010, 3:01:35 PM
to golang-nuts
Some time back I wrote an app in Go that happened to use UTF-8 strings all over the place. Everything worked perfectly until I tried the app on Windows XP SP3.

Yesterday I got fed up with this and decided to see what I could do. The easiest way seemed to be to set the output to codepage 65001 for UTF-8. Unfortunately this breaks batch files on XP: even the batch file currently being executed is aborted after a command changes the codepage to 65001, and even simple programs like more are broken once it is set.

Therefore I decided to try the WriteConsole API, which has both ANSI and Unicode versions.

I ended up writing a program to test my findings; it includes a replacement syscall.Write() for Windows:


func Write(fd int, p []byte) (n int, errno int) {
	var mode uint32
	var done uint32
	if handleIsConsole, _ := GetConsoleMode(int32(fd), &mode); handleIsConsole {
		// The console wants UTF-16, so convert the UTF-8 bytes first.
		buf16 := utf16.Encode([]int(string(p)))
		if ok, e := WriteConsole(int32(fd), buf16, &done); !ok {
			return 0, e
		}
		// Convert the count of UTF-16 units written back to a byte count.
		if done == uint32(len(buf16)) {
			done = uint32(len(p))
		} else {
			written := done
			done = 0
			for _, r := range utf16.Decode(buf16[:written]) {
				done += uint32(utf8.RuneLen(r))
			}
		}
	} else {
		// Not a console (file, pipe, ...): write the bytes unchanged.
		if ok, e := syscall.WriteFile(int32(fd), p, &done, nil); !ok {
			return 0, e
		}
	}
	return int(done), 0
}

This can be used like the current syscall.Write(), e.g.:

Write(syscall.Stdout, []byte("Hello, 世界●\n"))
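
The GetConsoleMode and WriteConsole helpers used above follow the old (ok bool, errno int) convention and are not part of the standard syscall package; a minimal sketch of how such wrappers might be declared on top of kernel32.dll (names and signatures here are assumed for illustration, not taken from the original program):

import (
	"syscall"
	"unsafe"
)

var (
	kernel32           = syscall.NewLazyDLL("kernel32.dll")
	procGetConsoleMode = kernel32.NewProc("GetConsoleMode")
	procWriteConsoleW  = kernel32.NewProc("WriteConsoleW")
)

// GetConsoleMode reports whether fd is a console handle; *mode receives the
// current console mode flags on success.
func GetConsoleMode(fd int32, mode *uint32) (ok bool, errno int) {
	r, _, e := procGetConsoleMode.Call(uintptr(fd), uintptr(unsafe.Pointer(mode)))
	if r == 0 {
		if en, isErrno := e.(syscall.Errno); isErrno {
			return false, int(en)
		}
		return false, -1
	}
	return true, 0
}

// WriteConsole writes UTF-16 units to a console handle; *done receives the
// number of units actually written.
func WriteConsole(fd int32, buf []uint16, done *uint32) (ok bool, errno int) {
	var bufp *uint16
	if len(buf) > 0 {
		bufp = &buf[0]
	}
	r, _, e := procWriteConsoleW.Call(uintptr(fd),
		uintptr(unsafe.Pointer(bufp)), uintptr(len(buf)),
		uintptr(unsafe.Pointer(done)), 0)
	if r == 0 {
		if en, isErrno := e.(syscall.Errno); isErrno {
			return false, int(en)
		}
		return false, -1
	}
	return true, 0
}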

I wonder if anyone has a simpler/better solution with which one can actually print text into the console even on Windows ;)

-- Attila

rh

Dec 3, 2010, 11:37:29 AM
to golang-nuts
Hello Attila,

> I wonder if anyone has a simpler/better solution with which one can actually print text into the console even on Windows ;)

The least painful way I know of is to use _setmode and wprintf.
However, this does introduce a dependency into your program.

You may find this article helpful: http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

Unfortunately, there is another problem: the default raster console
font contains very few characters outside ASCII.

Hope this helped a little.
Robert

Russell Newquist

Dec 4, 2010, 11:43:42 AM
to rh, golang-nuts
As it happens, we're dealing with this exact problem for erGo. UTF-8 under Windows is an annoying beast, for historical reasons. However, it turns out that the solution is a lot easier than you might think. There are two steps you have to take in order to get clean, beautiful UTF-8 printing in the console.

First, you have to change the console font to one that can actually display Unicode characters ("Lucida Console" should be available for this). This can be done manually by changing the console properties of an open console. If you're running Windows Vista or higher (Server 2008 or higher in the server line), you can do it programmatically using the system call SetCurrentConsoleFontEx. Unfortunately, there seems to be no way to do this programmatically on older versions of Windows, but it can still be done manually.

Second, you have to change the console code page to UTF-8. There seems to be no easy way to do this manually, but you can do it programmatically with a call to SetConsoleOutputCP. 65001 is the UTF-8 code page.

Make these two calls at the beginning of your program and UTF-8 will look great. Unfortunately, it leaves the code page and font changed on your console, which might affect other programs that are expecting the default behavior. The nice thing to do is change it back. However, as soon as you do that, any text still displayed in the console will revert to the old font as well, and you'll lose all your beautiful UTF-8 text. A little bit of a catch-22 there.

We're adding code to erGo's runtime to change the font and code page automatically, but we're also going to add a way for users to disable this feature if they want to play nice with others. Another way to handle it would be to add an easy one-step call to change it to a low-level package such as syscall or os. I leave it up to the MinGW port team as to how they'd like to handle it (or whether they even want to worry about it at all), but I'm providing the info so they can do as they like.

We'll have a blog post up about this and more UTF-8 issues sometime on Monday at http://ergolang.org/blog

The actual code to do this looks something like this (remember, this only works under Vista/Server 2008 or higher):

// Define some Win32 constants
const (
	UTF8_CODEPAGE = 65001
	LF_FACESIZE   = 32
)

// Duplicate the Win32 CONSOLE_FONT_INFOEX structure
type CONSOLE_FONT_INFOEX struct {
	cbSize     uint32
	nFont      uint32
	x, y       uint16 // COORD dwFontSize
	fontFamily uint32
	fontWeight uint32
	faceName   [LF_FACESIZE]uint16
}

// "Lucida Console" as a NUL-terminated UTF-16 string
var lcString = []uint16{'L', 'u', 'c', 'i', 'd', 'a', ' ', 'C', 'o', 'n', 's', 'o', 'l', 'e', 0}

// Declare the system calls - the implementation for these will be different
// under the hood for existing Go compilers than it is for erGo.
// These could be good candidates to add to the syscall package.
func SetConsoleOutputCP(codePage uint32) int32
func SetCurrentConsoleFontEx(console uintptr, maximumWindow int32, fontInfo *CONSOLE_FONT_INFOEX) int32

// Wherever you want to add the actual code:
SetConsoleOutputCP(UTF8_CODEPAGE)

var lucidaConsole CONSOLE_FONT_INFOEX
lucidaConsole.cbSize = 84 // sizeof(CONSOLE_FONT_INFOEX)
copy(lucidaConsole.faceName[:], lcString)
SetCurrentConsoleFontEx(uintptr(syscall.Stdout), 0, &lucidaConsole)
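
If the program should change the code page back on exit, one way is to remember the original value first. A minimal sketch, reusing the declarations above plus an assumed GetConsoleOutputCP wrapper declared in the same style (and import "fmt"); note that defer only runs when main returns normally, not on os.Exit or a crash, which is part of why restoring is not entirely trivial:

func GetConsoleOutputCP() uint32

func main() {
	oldCP := GetConsoleOutputCP() // remember the code page the console had before
	SetConsoleOutputCP(UTF8_CODEPAGE)
	defer SetConsoleOutputCP(oldCP) // change it back when main returns

	fmt.Println("Hello, 世界")
}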

Michael Teichgräber

Dec 4, 2010, 8:38:49 PM
to golang-nuts
I used a very similar approach for a port of plan9port to
Windows. Inside open(), or elsewhere when setting
up a file descriptor based on a Handle, GetConsoleMode() would
be called on the handle to check whether it should be
treated specially, i.e. whether a translation using ReadConsoleW
and WriteConsoleW should take place.

Perhaps the Go runtime on Windows could do
something like this when setting
up os.Stdin, -out, and -err during init.

If a file `test.txt' contains the UTF-8 characters
'β ≥ ‰ б', and a program `cat.exe' exists
which works the way described, a command `cat.exe <
test.txt' would make these characters visible in the
console window (as I just checked), even if its codepage
is configured as `850'. Similarly, `cat.exe > test2.txt' would read
console (cp850) input and correctly write UTF-8 characters
to "test2.txt".

Michael

Attila Tajti

Dec 5, 2010, 12:45:41 PM
to Russell Newquist, golang-nuts

On 4 Dec 2010, at 17:43, Russell Newquist wrote:

> Make these two calls at the beginning of your program and UTF-8 will look great. Unfortunately, it leaves the code page and font changed on your console, which might affect other programs that are expecting the default behavior. The nice thing to do is change it back. However, as soon as you do that, any text still displayed in the console will revert to the old font as well, and you'll lose all your beautiful UTF-8 text. A little bit of a catch-22 there.

I would prefer it if the apps I run did not change my console font, especially if it is already set to a Unicode-capable one. I would like a warning better. If someone uses non-ASCII characters at least some of the time, they will have changed the font anyway.

Additionally, I think this would need some kind of Atexit() function in the runtime as well, wouldn't it? Otherwise the codepage may not be switched back from UTF-8 to something Windows XP-compatible (see the problems with batch files and even simple programs once codepage 65001 is set).

-- Attila

Attila Tajti

Dec 5, 2010, 12:59:14 PM
to Michael Teichgräber, golang-nuts
On 5 Dec 2010, at 02:38, Michael Teichgräber wrote:

> I used a very similar approach for a port of plan9port to
> Windows. Inside open(), or elsewhere when setting
> up a file descriptor based on a Handle, GetConsoleMode() would
> be called on the handle to check whether it should be
> treated specially, i.e. whether a translation using ReadConsoleW
> and WriteConsoleW should take place.
>
> Perhaps the Go runtime on Windows could do
> something like this when setting
> up os.Stdin, -out, and -err during init.

As far as I can tell, file descriptors in Go for Windows are plain file handles. Therefore this could only be implemented in syscall.Write() and syscall.Read(), in my opinion. I have an example for syscall.Write(), and presumably syscall.Read() would be just as easy to write. I just could not actually test it, because I have not yet figured out how to compile Go on Windows. Perhaps I should simply cross-compile it from within a Linux virtual machine.
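
A sketch of what the Read counterpart could look like, mirroring the Write example from the first message (GetConsoleMode and ReadConsole are assumed wrappers in the same style, not standard syscall functions; imports "syscall" and "unicode/utf16" are needed, and line-ending and ^Z handling are left out):

func Read(fd int, p []byte) (n int, errno int) {
	var mode uint32
	if handleIsConsole, _ := GetConsoleMode(int32(fd), &mode); handleIsConsole {
		// Each UTF-16 unit expands to at most 3 UTF-8 bytes, so this many
		// units are guaranteed to fit back into p after conversion.
		buf16 := make([]uint16, len(p)/3)
		var read uint32
		if ok, e := ReadConsole(int32(fd), buf16, &read); !ok {
			return 0, e
		}
		// Convert the UTF-16 units back to UTF-8 and copy them into p.
		n = copy(p, string(utf16.Decode(buf16[:read])))
		return n, 0
	}
	var done uint32
	if ok, e := syscall.ReadFile(int32(fd), p, &done, nil); !ok {
		return 0, e
	}
	return int(done), 0
}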

> If a file `test.txt' contains the UTF-8 characters
> 'β ≥ ‰ б', and a program `cat.exe' exists
> which works the way described, a command `cat.exe <
> test.txt' would make these characters visible in the
> console window (as I just checked), even if its codepage
> is configured as `850'. Similarly, `cat.exe > test2.txt' would read
> console (cp850) input and correctly write UTF-8 characters
> to "test2.txt".

I suspect cat.exe simply uses ReadConsoleW/WriteConsoleW, so it does not have to care about encodings at all. But thanks for the info, I will give it a try, too.

--Attila

brainman

Dec 5, 2010, 10:01:40 PM
to golang-nuts
On Dec 6, 4:59 am, Attila Tajti <attila.ta...@gmail.com> wrote:
> On 5 Dec 2010, at 02:38, Michael Teichgräber wrote:
> ...

I only use ASCII chars, so I don't have that problem myself.
But if you have some ideas how to help yourself one way or the other,
I'll be happy to help you along (http://golang.org/doc/contribute.html).

> ... Only I could not actually test it, because I have not yet figured out how to compile Go on Windows. Perhaps I should simply cross compile it within a Linux Virtual Machine.

You could build it on Windows, but the process is too slow for my
liking. Instead, I use linux/386 to build windows/386 executables and
test them by running them on Windows. If you can use a "unix"
environment, it should be easy for you.

Alex

Michael Teichgräber

Dec 7, 2010, 7:35:13 PM
to golang-nuts
On 5 Dec., 18:59, Attila Tajti <attila.ta...@gmail.com> wrote:
>
> I suspect cat.exe simply uses ReadConsoleW/WriteConsoleW, so it does not have to care about encodings at all. But thanks for the info, I will give it a try, too.

(It is using `plain' read and write calls which -- only in case an fd
is a console fd -- internally branch into a handler using
Read/WriteConsoleW. Other programs, using printf which relies on
write, would work the same.)

I have put together a prototype implementing console input and output
without changing the codepage, located at
http://ib.wmipf.de/go-windows-console.tar.gz.
It contains the modified files (a few additions in
syscall/syscall_windows.go, some changes in `os'), and also a
`gocat' example executable using the implementation, based on cat.go
from the Go tutorial.

The implementation is derived partly from your example program
ct.go, and partly from a similar, MinGW-based C implementation.

Do you think something like this would work for you?

Michael

Some notes about the prototype:

The implementation has been done in os/file*.go, which seemed
less complicated than hiding it within package syscall (this
probably would have required a more complex Fd type than
`int', with a lot of changes in non-windows files, or some
kind of bookkeeping).

A field has been added to os.File which contains function
references to the read and write syscalls. This allows the Windows
implementation of the file functions (in file_windows.go) to
override the read and write syscalls with console-specific functions
in case a file is actually a console. The corresponding check
is done only once.

os.NewFile calls a method file.adjustSyscalls. The unix version
of that function is empty; the windows version contains the
code that checks whether a new fd (or rather handle) is a console
handle or not. Depending on read or write access, one of the
read/write syscall function pointers in the File structure
is overridden by a closure.
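
A rough sketch of that structure (field and helper names below are guesses for illustration, not the actual names from the tar file; newConsoleReader/newConsoleWriter stand for the closures built around ReadConsoleW/WriteConsoleW):

type File struct {
	fd   int
	name string
	// Overridable syscall hooks; by default these point at the plain
	// read and write syscalls.
	read  func(fd int, p []byte) (n int, errno int)
	write func(fd int, p []byte) (n int, errno int)
}

// Windows version: swap in console-aware closures when fd is a console handle.
func (f *File) adjustSyscalls() {
	var mode uint32
	if isConsole, _ := GetConsoleMode(int32(f.fd), &mode); isConsole {
		f.read = newConsoleReader(f.fd)
		f.write = newConsoleWriter(f.fd)
	}
}

// Unix version: nothing to adjust.
// func (f *File) adjustSyscalls() {}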

The console read and write syscalls are written as closures to
allow carrying some state with them, mostly buffers that don't
need to be allocated on every file.Write or file.Read call.

The write call handles incomplete UTF-8 characters (characters
that are located partly at the end of one write buffer and partly
at the beginning of the following one), so that a unix-style `cat'
would work, as it simply copies binary data from one fd
to another. In such a case the fragment is stored in `frag',
which gets prepended to the byte buffer at the next write
call. For normal fmt.Print-style output no fragments will
occur, and output from different goroutines should not lead
to inconsistent console display.
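
A sketch of how splitting a buffer at the last complete rune boundary could look (illustrative only, not the code from the tar file); the returned tail is what would be kept in `frag' and prepended to the next write:

import "unicode/utf8"

// splitAtRuneBoundary returns the longest prefix of p ending on a complete
// UTF-8 character, plus the trailing incomplete fragment, if any.
func splitAtRuneBoundary(p []byte) (complete, frag []byte) {
	// Look back at most utf8.UTFMax-1 bytes for the start byte of the
	// final character.
	for i := len(p) - 1; i >= 0 && i > len(p)-utf8.UTFMax; i-- {
		if utf8.RuneStart(p[i]) {
			if !utf8.FullRune(p[i:]) {
				// The last character is only partially present;
				// hold it back until the next write.
				return p[:i], p[i:]
			}
			break
		}
	}
	return p, nil
}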

The read call recognizes ^Z as the end-of-file character.

Tajti Attila

Dec 8, 2010, 8:04:14 AM
to Michael Teichgräber, golang-nuts
I like your solution much better, because it handles the problem in os.NewFile instead of messing up syscall.Read/Write.

The only minor objection is that syscall.Write(syscall.Stdout, []byte("hello 世界!")) would not work with it, but I do not think this is a big problem.

For the record, I came up with a third possible solution: on Windows, syscall.Stdin/out/err could point to pipes that could be handled by goroutines implementing the ReadConsoleW/WriteConsoleW API. This way even syscall.Read/Write would work, but I guess this would complicate things too much.
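
For the output side, the idea would look roughly like the sketch below, shown at the os level for brevity (the real proposal would swap the handle stored in syscall.Stdout); consoleWrite stands for a hypothetical UTF-16-aware writer along the lines of the earlier Write example. One caveat: anything still buffered in the pipe when the program exits is lost.

func redirectStdoutThroughPipe() error {
	r, w, err := os.Pipe()
	if err != nil {
		return err
	}
	console := os.Stdout
	os.Stdout = w // writers now feed the pipe instead of the console
	go func() {
		buf := make([]byte, 4096)
		for {
			n, err := r.Read(buf)
			if n > 0 {
				consoleWrite(console, buf[:n]) // WriteConsoleW under the hood
			}
			if err != nil {
				return
			}
		}
	}()
	return nil
}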

So in the end it is perhaps you who should follow the Contribution Guidelines, as Alex suggested :)

Best regards

Michael Teichgräber

Dec 22, 2010, 7:44:21 PM
to Tajti Attila, golang-nuts
On 08.12.2010 14:04, Tajti Attila wrote:
> For the record, I came up with a third possible solution: on Windows
> syscall.Stdin/out/err could point to pipes, that could be handled in
> goroutines implementing the ReadConsoleW/WriteConsoleW API. This way
> even syscall.Read/Write would work, but I guess this would complicate
> things too much.

I happened to implement the Write part using an anonymous
pipe before trying the os-based approach of my previous
posting. I liked the fact that it could be hidden within
pkg syscall -- the write handle of the pipe would simply
replace the original console handle.

A problem I observed with this approach is that some of
the output might still be in the pipe when a program exits.
If it is only printing a few lines (e.g. on os.Stderr),
and then exits, one would see no output. Some kind of
synchronization would be needed, making this
approach more complex.

> So in the end it is perhaps you, who should follow the Contribution
> Guidelines as Alex suggested :)

Nice. For the nonce I updated the tar file
specified in my previous posting. The console read
and write functions have been wrapped within
goroutines now to avoid problems in case of concurrent
access. I'll need to do some more tests for edge cases,
then, probably at the end of January, try the next
steps.

Regards,
Michael

Attila Tajti

Dec 23, 2010, 2:51:57 AM
to Michael Teichgräber, golang-nuts

On 23 Dec 2010, at 01:44, Michael Teichgräber wrote:

> On 08.12.2010 14:04, Tajti Attila wrote:
>> For the record, I came up with a third possible solution: on Windows
>> syscall.Stdin/out/err could point to pipes, that could be handled in
>> goroutines implementing the ReadConsoleW/WriteConsoleW API. This way
>> even syscall.Read/Write would work, but I guess this would complicate
>> things too much.
>
> I happened to implement the Write part using an anonymous
> pipe before trying the os-based approach of my previous
> posting. I liked the fact that it could be hidden within
> pkg syscall -- the write handle of the pipe would simply
> replace the original console handle.
>
> A problem I observed with this approach is that some of
> the output might still be in the pipe when a program exits.
> If it is only printing a few lines (e.g. on os.Stderr),
> and then exits, one would see no output. Some way of
> synchronization code would be needed, making this
> approach more complex.

In the end I like the implementation being in the higher-level os package. This way os.File always works in UTF-8 mode when accessing the console, but one can still use syscall if a specific codepage is necessary.

>> So in the end it is perhaps you, who should follow the Contribution
>> Guidelines as Alex suggested :)
>
> Nice. For the nonce I updated the tar file
> specified in my previous posting. The console read
> and write functions have been wrapped within
> goroutines now to avoid problems in case of concurrent
> access. I'll need to do some more tests for edge cases,
> then, probably at the end of January, try the next
> steps.

I have no idea whether goroutines or closures are better in this case; I like them both.

There are only two things I noticed: the goto statement in consoleWrite() could be replaced with a break [1] if the outer for loop had a label.
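
For illustration, a labeled break of the kind meant here looks like this (a generic example, not the actual consoleWrite code):

func containsZero(bufs [][]byte) bool {
	found := false
outer:
	for _, buf := range bufs {
		for _, b := range buf {
			if b == 0 {
				found = true
				break outer // leaves both loops, where a goto would otherwise be used
			}
		}
	}
	return found
}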

The second thing is the introduction of the newFile function: is this really necessary? Would it be unreasonable to replace both the read and write functions whenever GetConsoleMode() returns TRUE? When relying on newFile(), something like os.Open("$CONOUT", O_WRONLY) would not work.

[1] http://golang.org/doc/go_spec.html#Break_statements

Best regards
Attila
