How to stop Go trying to think for me?


ceving

unread,
Dec 7, 2014, 6:16:32 AM12/7/14
to golan...@googlegroups.com
The way Go encodes stdout differently drives me crazy.

I have a program which reads a UTF-8 string from a web server, and I would like to run it on Windows. Windows uses UTF-16. Go seems to have some kind of magic built in which translates UTF-8 to UTF-16 when it writes to a console (however "console" may be defined):

  C:\>go run webget.go
  Der Himmel über Berlin (1987)


But when I write the output to a file, this magic seems to fail:

  C:\>go run webget.go > stdout
  C:\>type stdout
  Der Himmel ├╝ber Berlin (1987)

The same applies to Emacs' compile buffer. The output is broken.

  -*- mode: compilation; default-directory: "c:/" -*-
  Compilation started at Sun Dec 7 11:43:01

  go run webget.go
  Der Himmel über Berlin (1987)

  Compilation finished at Sun Dec 7 11:43:04

And the same for eshell:

  Welcome to the Emacs shell

  c: $ go run webget.go
  Der Himmel über Berlin (1987)


To Go, everything but cmd.exe does not seem to be a console, although eshell is obviously a console from the user's point of view.

I would say Go's default, sometimes doing this and sometimes that, is broken by design (tm).

So the question is either

  - How can I stop the UTF-16 magic for all output, so that I can perform the UTF-16 encoding myself?
  - Or how can I enable the UTF-16 magic for all output, not just the cmd.exe console?

Regards,
Sascha



Jan Mercl

unread,
Dec 7, 2014, 7:09:51 AM12/7/14
to ceving, golan...@googlegroups.com



On Sun Dec 07 2014 at 12:16:38 ceving <cev...@gmail.com> wrote:

> I would say Go's default, sometimes doing this and sometimes that, is broken by design (tm).

When a program talks to a terminal, it is expected to talk in the encoding the user set that terminal up to use (the output can reasonably be assumed to be a text stream).

When a program's output is redirected to a pipe or a file, it is expected that the output is not touched by any [re]encoding machinery (the output cannot reasonably be assumed to be a text stream; it could be anything else, like a picture, a program, etc.).

You might want to try setting your terminal to UTF-8 (e.g. [0], not tested).

[0]: http://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8/269857#269857

-j

Tamás Gulácsi

unread,
Dec 7, 2014, 7:40:27 AM12/7/14
to golan...@googlegroups.com

If you have to live with a non-UTF-8 terminal, you can do the following:

import (
    "io"
    "os"

    "github.com/tgulacsi/go/loghlp"
    "github.com/tgulacsi/go/text"
)

var stdout, stderr io.Writer = os.Stdout, os.Stderr

func init() {
    if ttyEncoding := loghlp.GetTTYEncoding(); ttyEncoding != nil {
        stdout = text.NewWriter(stdout, ttyEncoding)
        stderr = text.NewWriter(stderr, ttyEncoding)
    }
}

and use stdout and stderr everywhere instead of os.Stdout and os.Stderr.
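
For example (a sketch only, assuming the loghlp/text packages behave as described above; add "fmt" to the imports):

    fmt.Fprintln(stdout, "Der Himmel über Berlin (1987)")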

Manlio Perillo

unread,
Dec 7, 2014, 10:37:56 AM12/7/14
to golan...@googlegroups.com
On Sunday, December 7, 2014 at 12:16:32 PM UTC+1, ceving wrote:
The way Go encodes stdout differently drives me crazy.

I have a program which reads a UTF-8 string from a web server, and I would like to run it on Windows. Windows uses UTF-16. Go seems to have some kind of magic built in which translates UTF-8 to UTF-16 when it writes to a console (however "console" may be defined):


It's not magic.
A terminal is an interactive device with its own, well-defined, encoding.

If you want to set the encoding explicitly:

Note that the code seems to work incorrectly, but I'm not sure why.

> [...]


Regards, Manlio

peterGo

unread,
Dec 7, 2014, 10:43:45 AM12/7/14
to golan...@googlegroups.com
ceving,

Go is open source, so there is no magic. See

go/src/os/file_windows.go

Peter

Konstantin Khomoutov

unread,
Dec 7, 2014, 12:22:40 PM12/7/14
to ceving, golan...@googlegroups.com
On Sun, 7 Dec 2014 03:16:32 -0800 (PST)
ceving <cev...@gmail.com> wrote:

> The way Go encodes stdout differently drives me crazy.
>
> I have a program which reads a UTF-8 string from a web server, and I
> would like to run it on Windows. Windows uses UTF-16. Go seems to
> have some kind of magic built in which translates UTF-8 to UTF-16
> when it writes to a console (however "console" may be defined):
[...]
> The same applies to Emacs' compile buffer. The output is broken.
[...]
> And the same for eshell:
[...]
> To Go, everything but cmd.exe does not seem to be a console, although
> eshell is obviously a console from the user's point of view.
>
> I would say Go's default, sometimes doing this and sometimes that, is
> broken by design (tm).
>
> So the question is either
>
> - How can I stop the UTF-16 magic for all output, so that I can
> perform the UTF-16 encoding myself?
> - Or how can I enable the UTF-16 magic for all output, not just the
> cmd.exe console?

You appear to have a wrong idea about how Windows console works, how
Windows programs have to deal with it and how Go deals with it.

Let's cover the Windows console first.
Windows inherited it from MS-DOS, and hence "the console" (initially
only in the form of cmd.exe) was non-Unicode and operated in the "DOS
codepage" matching the current user's locale settings. At the API
level this means that if you would write raw bytes to the standard
output stream of your program while it's being connected to a console,
that stream would be interpreted as a series of bytes in that DOS
codepage.
At some point the console gained Unicode support (dunno about
the Windows NT lineage -- it might have had Unicode support on the
console right from the start), but only in the form of special API
call WriteConsoleW() [1]; should you send raw bytes to the stdout, the
rules above apply no matter if your system supports Unicode on the
console or not. So, to make Unicode output on the console possible, one
must detect if our standard streams are connected to the console and use
special API to handle output on those streams.

So, different programming languages/libraries which need to handle
cases of the standard output/error streams being connected to either
console or a file or a pipe take different approaches to this.

Some pieces of software just ignore the fact they might be connected to
a console. Typical culprits are programs written in C++ and relying on
std::cout. These programs are just broken: they work until they
try to output non-ASCII text; since the chance this text will be
encoded in a matching MS-DOS codepage is close to zero these days,
the console output will be garbled.

Better-written software uses Unicode internally and switches its output
routines depending on whether its streams are connected to the console.
Then there's a choice of what to do with the encoding of the streams
when they are not connected to the console. Here, approaches vary.
To name two examples, the Tcl [2] runtime by default inserts a
re-encoding filter into the stream's channel which converts the
internal encoding of the strings (which is Unicode) to the active
Windows codepage.
PowerShell by default uses UTF-16 (or UCS-2, I dunno) for streams
redirected to files and pipes (thus causing major pain in the necks of
those who are unaware of this).

Now let's move to Go.
Go postulates that its strings are encoded using UTF-8. Now if you have a
fmt.Println("привет, мир!") call in your program, there are two
possible cases:
1) Your program's stdout is connected to the console. The Go runtime
will switch to the Unicode-aware console API and make sure you
actually see Unicode output on your console (if you have the correct
fonts there, of course).
2) Your program's stdout is redirected to a file or to a pipe.
Now Go just adheres to its rule that strings are UTF-8-encoded
and outputs plain UTF-8.

If you think about this second case a bit more, you'll see that the same
is true for regular files created with os.OpenFile(): if you
output strings containing non-ASCII characters into such streams,
you'll get UTF-8 in the files. If you want something else, use
external packages providing support for non-UTF-8 encodings.

If you need to differentiate between the console and regular
files/pipes, you'll have to do that yourself, calling out into Win32
API. I've attached a simple program doing so.

TL;DR
1) There's no written standard on which encoding a program on
Windows has to use for streams redirected to files/pipes.
The "active Windows codepage" was a de-facto standard for some time.
2) Go outputs strings as is, in their internal encoding, when
writing them to files/pipes.
3) This internal encoding is specified to be UTF-8.
4) Outputting to the console is a special case on Windows; Go uses
the special Unicode-aware API to make things just work in this case.

1. http://msdn.microsoft.com/en-us/library/windows/desktop/ms687401
2. http://www.tcl.tk
wincon.go
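
(The attached wincon.go is not reproduced in this archive. A minimal sketch of the same idea -- a reconstruction, not the actual attachment -- is to call GetConsoleMode(), which succeeds only for console handles:)

package main

import (
    "fmt"
    "os"
    "syscall"
    "unsafe"
)

// Windows-only sketch.
var (
    kernel32           = syscall.NewLazyDLL("kernel32.dll")
    procGetConsoleMode = kernel32.NewProc("GetConsoleMode")
)

// isConsole reports whether f is attached to a Windows console.
// GetConsoleMode() fails for handles that refer to files or pipes.
func isConsole(f *os.File) bool {
    var mode uint32
    r, _, _ := procGetConsoleMode.Call(f.Fd(), uintptr(unsafe.Pointer(&mode)))
    return r != 0
}

func main() {
    if isConsole(os.Stdout) {
        fmt.Println("stdout is a console")
    } else {
        fmt.Println("stdout is a file or a pipe")
    }
}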

ceving

unread,
Dec 7, 2014, 3:09:49 PM12/7/14
to golan...@googlegroups.com, cev...@gmail.com
Thanks for the detailed reply.


On Sunday, December 7, 2014 at 6:22:40 PM UTC+1, Konstantin Khomoutov wrote:

1) There's no written standard on which encoding a program on
   Windows has to use for streams redirected to files/pipes.
   The "active Windows codepage" was a de-facto standard for some time.


It seems this is still the case.

C:\>echo ü > ü

C:\>PrintHex.exe -h ü
81 20 0D 0A

0x81 is ü in the old 850 codepage.

I understand that it is hard to do anything right on Windows, and that this is the reason they need the BOM.

So the right behavior for Go would be to write a BOM and UTF-16 when writing to a file or pipe on Windows?

 

Benjamin Measures

unread,
Dec 7, 2014, 4:40:29 PM12/7/14
to golan...@googlegroups.com, cev...@gmail.com
On Sunday, 7 December 2014 20:09:49 UTC, ceving wrote:
So the right behavior for Go would be to write a BOM and UTF-16 when writing to a file or pipe on Windows?

No, to do so would force all files written to be in UTF-16.

Given an arbitrary byte sequence, the correct behaviour is to write the bytes as is.

Since you're reading from a UTF-8 source, writing this will result in a UTF-8 file [1].

[1] Observe that ü encoded in UTF-8 is C3 BC. If these bytes are output directly to a Windows console as is (as "type" does), they'll be displayed in whatever codepage the console is set to. For cp437, this is ├╝.

Konstantin Khomoutov

unread,
Dec 8, 2014, 7:10:51 AM12/8/14
to ceving, golan...@googlegroups.com
On Sun, 7 Dec 2014 12:09:49 -0800 (PST)
ceving <cev...@gmail.com> wrote:

> Thanks for the detailed reply.
> > 1) There's no written standard on which encoding a program on
> > Windows has to use for streams redirected to files/pipes.
> > The "active Windows codepage" was a de-facto standard for some time.
> >
> >
> It seems this is still the case.
>
> C:\>echo ü > ü
>
> C:\>PrintHex.exe -h ü
> 81 20 0D 0A
>
> 0x81 is ü in the old 850 codepage.

Unfortunately, you've basically tested nothing with this: the `echo`
command is a cmd.exe built-in, so it behaves like the same
command did in MS-DOS's command.com back then.

I'm also afraid you might be confusing the "active DOS codepage" (the
one used by default in cmd.exe) with the "active Windows codepage".
The latter is a different thing -- a code page for non-Unicode GUI
applications. You appear to use a German locale, so your OEM
codepage is 850 [1] but your Windows codepage is 1252 [2], and they
are incompatible even though they encode mostly the same repertoire of
characters.

So, to reiterate, I stated that the de-facto standard for files/pipes
was to use the active *Windows* codepage, not the OEM/DOS one.

> I understand that it is hard to do anything right on Windows and that
> is the reason why they need the BOM.

No, the reason for writing out BOMs is to let the reader of the data
stream guess which Unicode encoding the stream is in.
Note that BOMs do not really solve the problem anyway: for instance,
it's impossible to tell UTF-16 from UCS-2 using a BOM, even though they
are incompatible to a degree.

> So the right behavior for Go would be to write a BOM and UTF16 while
> writing to a file or pipe on Windows?

I tried to convey that there was (and is) no written standard on this.
Most programs seemed to use the active Windows code page for writing
plain text files, as I've said. My guess is that a critical mass of
Windows software was reached before Windows started to offer
Unicode support across all the flavours then in active use. So even
when most libraries/toolkits/IDEs for writing Windows software advanced
to using Unicode internally (which on Windows most of the time means
UTF-16), it probably wasn't too wise to just start using UTF-16
when writing plain text. Here we're drifting a fair way into a rather
philosophical domain, because it's hard to define precisely
what "plain text" means anyway; but the seemingly accepted view is
that it implies something roughly ASCII-backwards-compatible; code
pages and UTF-8 fulfill this, UTF-16 doesn't.

So, Go uses Unicode internally for its strings, which is great.
What should it do when writing text to plain text files, then?
Inserting a filter converting the text to the active Windows code page
(as Tcl does) is clever and most of the time provides sensible results,
but there are problems with this approach:
* Tcl is way higher-level than Go. One of Go's strengths is that
the programmer can easily estimate the costs of operations.
Transparently inserting a re-encoding filter runs against that.
* What should happen if a character not covered by the repertoire of
the target encoding occurs in the text being written out?
Should it be silently replaced by some other character?
Should the write fail? Or even panic?
IMO, all these possibilities are unacceptable in the default mode of
operation -- simply because the defaults must be as dead simple
as possible, and they are.

Considering that, IMO, Go does the best thing possible: it clearly
states in the docs that its strings are encoded as UTF-8, and just
writing them to byte streams will produce byte-exact representations of
those strings as they were kept in memory. Simple, predictable, consistent.

Don't like this? That's okay. Then assess the required changes in the
behaviour of your program and implement them. For instance, if you want
your program to output text in the active Windows code page on its
stdout when it's redirected to a file or a pipe, write the code which
detects such a case (I've provided it), and if the stdout is
redirected, insert a text re-encoding filter before the actual stdout --
thanks to Go's interfaces, it's trivial because all the filter has to
do is implement the io.Writer interface.
For instance, you could use go-charset [3].
Other implementations exist.

With go-charset, you'd do something like

import (
"code.google.com/p/go-charset/charset"
"log"
"os"
)
...
ww, _ := charset.NewWriter("windows-1252", os.Stdout)
log.SetOutput(ww)
...
log.Println("Tschüß Welt!")

Now if you run this code in a program which has its stdout redirected
to a file, you should get Windows-1252-encoded text there.

To reiterate, one approach to understanding the problem with encodings
better is to postulate that the language runtime keeps the strings it
manages in some truly internal encoding you don't know, and the only
thing you know about strings is that their characters are in the
Unicode codeset. Then it follows that you can't just output these
strings -- no matter what the method or destination is -- because on
media and on the wire strings never exist without some strictly
specified encoding. Hence, before outputting text, you have to make
sure you convert it to such an encoding. Go's approach is just a little
bit different in that it fixes the internal encoding of its strings, so
if you need to output UTF-8 you simply don't need to re-encode your text.

1. http://en.wikipedia.org/wiki/Code_page_850
2. http://en.wikipedia.org/wiki/Windows-1252
3. http://godoc.org/code.google.com/p/go-charset/charset

Manlio Perillo

unread,
Dec 8, 2014, 8:58:23 AM12/8/14
to golan...@googlegroups.com, cev...@gmail.com
On Sunday, December 7, 2014 at 9:09:49 PM UTC+1, ceving wrote:
Thanks for the detailed reply.

On Sunday, December 7, 2014 at 6:22:40 PM UTC+1, Konstantin Khomoutov wrote:

1) There's no written standard on which encoding a program on
   Windows has to use for streams redirected to files/pipes.
   The "active Windows codepage" was a de-facto standard for some time.


It seems this is still the case.

C:\>echo ü > ü

C:\>PrintHex.exe -h ü
81 20 0D 0A

0x81 is ü in the old 850 codepage.


Codepage 850 is the default encoding for the cmd.exe program.
You can change the encoding; see

> [...]

Regards, Manlio

ceving

unread,
Dec 8, 2014, 9:39:32 AM12/8/14
to golan...@googlegroups.com


On Monday, December 8, 2014 at 1:10:51 PM UTC+1, Konstantin Khomoutov wrote:

With go-charset, you'd do something like

import (
    "code.google.com/p/go-charset/charset"
    "log"
    "os"
)
...
ww, _ := charset.NewWriter("windows-1252", os.Stdout)
log.SetOutput(ww)
...
log.Println("Tschüß Welt!")

Now if you run this code in a program which has its stdout redirected
to a file, you should get Windows-1252-encoded text there.


Does not seem to work on Windows. This:

package main

import (
    "code.google.com/p/go-charset/charset"
    "log"
    "os"
)

func main() {
    ww, _ := charset.NewWriter("windows-1252", os.Stdout)
    log.SetOutput(ww)
    log.Println("Tschüß Welt!")
}

// Local Variables:
// compile-command: "go run out.go"
// End:

panics with the following error:

go run out.go
charset: cannot open "charsets.json": open \usr\local\lib\go-charset\datafiles\charsets.json: Das System kann den angegebenen Pfad nicht finden.
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x14 pc=0x42daee]

goroutine 16 [running]:
runtime.panic(0x4d49a0, 0x571aa2)
    c:/go/src/pkg/runtime/panic.c:279 +0xe9
log.(*Logger).Output(0x114c4210, 0x2, 0x114c01b0, 0xf, 0x0, 0x0)
    c:/go/src/pkg/log/log.go:153 +0x3ae
log.Println(0x343f84, 0x1, 0x1)
    c:/go/src/pkg/log/log.go:282 +0x6c
main.main()

I think I will give up writing anything to the console on Windows.

Opening a message box seems to be much easier. This works fine:

    // Fragment from a larger program; needs the "syscall" and "unsafe"
    // imports, and question (the message text) is defined elsewhere.
    var mod = syscall.NewLazyDLL("user32.dll")
    var proc = mod.NewProc("MessageBoxW")
    var MB_ICONQUESTION = 0x00000020
    var MB_YESNO = 0x00000004
    var IDYES uint = 6
    var headline string = "Rename the following directories?"
    ret, _, _ := proc.Call(0,
        uintptr(unsafe.Pointer(syscall.StringToUTF16Ptr(question))),
        uintptr(unsafe.Pointer(syscall.StringToUTF16Ptr(headline))),
        uintptr(MB_ICONQUESTION|MB_YESNO))
    if uint(ret) == IDYES {
        // ...
    }

No need to bother with code pages and Go's missionary zeal for generating UTF-8, which works fine on Unix but does not work on Windows.


Konstantin Khomoutov

unread,
Dec 8, 2014, 10:20:36 AM12/8/14
to ceving, golan...@googlegroups.com
On Mon, 8 Dec 2014 06:39:32 -0800 (PST)
ceving <cev...@gmail.com> wrote:

> Does not seem to work on Windows. This:
>
> package main
>
> import (
> "code.google.com/p/go-charset/charset"
> "log"
> "os"
> )
>
> func main() {
> ww, _ := charset.NewWriter("windows-1252", os.Stdout)
> log.SetOutput(ww)
> log.Println("Tschüß Welt!")
> }
>
> // Local Variables:
> // compile-command: "go run out.go"
> // End:
>
> panics with the following error:
>
> go run out.go
> charset: cannot open "charsets.json": open
> \usr\local\lib\go-charset\datafiles\charsets.json: Das System kann
> den angegebenen Pfad nicht finden.
[...]

That's because the package by default expects you to ship the encoding
files along with the binary. If you want them built in, you need to
import its "data" subpackage, via

import _ "code.google.com/p/go-charset/data"

I've attached a minimal program which does this and works
when its stdout is redirected to a file (a screenshot of the file's
contents interpreted as cp1252 is attached).
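
(The attachment itself is not reproduced in this archive; a minimal program consistent with the description above, assuming go-charset's NewWriter API as used earlier in the thread, might look like this:)

package main

import (
    "log"
    "os"

    "code.google.com/p/go-charset/charset"
    _ "code.google.com/p/go-charset/data" // compiles the encoding tables in
)

func main() {
    ww, err := charset.NewWriter("windows-1252", os.Stdout)
    if err != nil {
        log.Fatal(err)
    }
    log.SetOutput(ww)
    log.Println("Tschüß Welt!")
}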

> I think I will give up writing anything to the console on Windows.
[...]
> No need to bother with code pages and Go's missionary zeal for
> generating UTF-8, which works fine on Unix but does not work on Windows.

Next time, please consider advertising your religious beliefs about
software up front so that I don't spend time explaining things to
someone whose non-technical inclinations impede discussing purely
technical matters.

I've decided to reply to this message anyway in the hope it can be
useful to the next person who googles for this problem. Have fun.
cset.go
tschuess-welt-cp1252.png

Konstantin Khomoutov

unread,
Dec 8, 2014, 10:23:29 AM12/8/14
to Manlio Perillo, golan...@googlegroups.com, cev...@gmail.com
On Mon, 8 Dec 2014 05:58:23 -0800 (PST)
Manlio Perillo <manlio....@gmail.com> wrote:

[...]
> > It seems to be still the case.
> >
> > C:\>echo ü > ü
> >
> > C:\>PrintHex.exe -h ü
> > 81 20 0D 0A
> >
> > 0x81 is ü in the old 850 codepage.
> >
> >
> > Codepage 850 is the default encoding for the cmd.exe program.

That's incorrect. It's only the default on Windows when the current
user's locale calls for it (Western European, that is). For example,
the Russian locale has 866 as the "OEM" codepage and 1251 as the "ANSI" one.

ceving

unread,
Dec 8, 2014, 11:18:12 AM12/8/14
to golan...@googlegroups.com
On Monday, December 8, 2014 at 4:20:36 PM UTC+1, Konstantin Khomoutov wrote:

I've attached a minimal program which does this and works
when its stdout is redirected to a file (a screenshot of the file's
contents interpreted as cp1252 is attached).


Wow. It really works in Emacs' compile buffer. I did not expect that to happen any more. So this seems to be the "right way" for Windows pipes, although it breaks cmd.exe now. That might be because the support for the dedicated terminal API is lost now. But I can imagine that both might work together.

But now we come exactly to the point: if this is the way it works on Windows (it may be ugly, but that does not help; it is how it is), why is Go's default on Windows to do something different, something which does not work on Windows?

And I cannot find anything religious in the wish for a program that works out of the box on a Windows system. But promoting UTF-8 on Windows, where it is virtually a foreign object, is religious.

Thanks anyway!

Manlio Perillo

unread,
Dec 8, 2014, 11:25:09 AM12/8/14
to Konstantin Khomoutov, golan...@googlegroups.com, cev...@gmail.com
On Mon, Dec 8, 2014 at 4:22 PM, Konstantin Khomoutov <flat...@users.sourceforge.net> wrote:
On Mon, 8 Dec 2014 05:58:23 -0800 (PST)
Manlio Perillo <manlio....@gmail.com> wrote:

> [...]

> Codepage 850 is the default encoding for the cmd.exe program.

That's incorrect. It's only the default on Windows when the current
user's locale calls for it (Western European, that is). For example,
the Russian locale has 866 as the "OEM" codepage and 1251 as the "ANSI" one.

Thanks.

Manlio Perillo

Andre Polykanine

unread,
Dec 8, 2014, 11:39:03 AM12/8/14
to Konstantin Khomoutov, ceving, golan...@googlegroups.com
Hello Konstantin,

First of all, thanks for your thorough answers. Encoding issues
interest me very much, because I always try to make my apps
localizable; my main work is actually the i18n of hardware and
software.
I could successfully output characters in different languages to the
Windows console in UTF-8, but I'm still puzzled about why we
need to import the "data" subpackage. What are the encoding files you
talked about?
Thanks a lot!



--
With best regards from Ukraine,
Andre
Skype: Francophile
Twitter: @m_elensule; Facebook: menelion
My blog: http://menelion.oire.org/

Manlio Perillo

unread,
Dec 8, 2014, 11:40:27 AM12/8/14
to golan...@googlegroups.com
On Monday, December 8, 2014 at 5:18:12 PM UTC+1, ceving wrote:
On Monday, December 8, 2014 at 4:20:36 PM UTC+1, Konstantin Khomoutov wrote:

> [...] 
But now we came exactly to the point: if this is the way it works on Windows (it may be ugly but that does not help, it is how it is), why is Go's default on Windows to do something different, something which does not work on Windows?


Because it is not reliable, IMHO.

The encoding is known and reliable only for interactive devices, like the console.

Since the user can change the system default codepage, it is more practical to
just write the data to a file using Go's internal UTF-8 encoding, since UTF-8 is
universal and is recognized by most editors.


Regards, Manlio Perillo

Konstantin Khomoutov

unread,
Dec 8, 2014, 12:45:49 PM12/8/14
to ceving, golan...@googlegroups.com
On Mon, 8 Dec 2014 08:18:12 -0800 (PST)
ceving <cev...@gmail.com> wrote:

> > I've attached a minimal program which does this and works
> > when its stdout is redirected to a file (a screenshot of the
> > file's contents interpreted as cp1252 is attached).
> >
> Wow. It really works in Emacs' compile buffer. I did not expect that
> to happen any more. So this seems to be the "right way" for Windows
> pipes, although it breaks cmd.exe now. That might be because the
> support for the dedicated terminal API is lost now. But I can imagine
> that both might work together.
>
> But now we come exactly to the point: if this is the way it works on
> Windows (it may be ugly, but that does not help; it is how it is), why
> is Go's default on Windows to do something different, something which
> does not work on Windows?

OK, let's try to reiterate once again. ;-)

Working with the console on Windows is either broken or weird, depending
on how you look at it (not that, say, on Linux it's absolutely
brilliant -- there it has its own quirks, though not as gross), and,
again, there's no written standard on how to do it. Please, please
read this bit carefully until it fully sinks in! I mean, the fact that
Emacs' compile buffer *now* happens to show you sensible non-ASCII text
does not mean we've fixed your program or successfully fought some
idiocy of Go's; it just means we have fulfilled the expectations of one
particular program, Emacs. The pain of encodings on the command line
is almost everywhere on Windows. If you used Mercurial with non-ASCII
commit messages, you experienced it; if you used Git < 1.8 with
non-ASCII pathnames and/or commit messages, you experienced it.
Subversion's command-line client sort of escapes the problem, but it
uses the GNU iconv library internally to do all sorts of text re-encodings
-- just like we did in your program. And I've only named a handful of
highly visible programs.

For a fun example, let's take .NET, the currently reigning programming
solution on Windows. Here's a sample program; its text is encoded in
UTF-8 -- the encoding .NET tools assume by default for their source
files:

----8<----
using System;
using System.IO;

public static class Program
{
    [STAThread]
    public static void Main(string[] args)
    {
        Console.WriteLine("Tschüß Welt!");
        foreach (var s in args) {
            Console.WriteLine(s);
        }
    }
}
----8<----

Now let's build it (`csc.exe foo.cs`, I've used .NET 4.0 on Windows XP)
and run. What you'll experience is that the text "Tschüß Welt!" will
be output re-encoded in the current *OEM* code page, so to unbotch its
appearance on my console I had to switch its code page to 850 because
mine is 866 by default (Cyrillic).

The fun starts when we add text supplied on the command line to the
picture: on my system I have a Russian locale, so I supplied the
program a piece of Cyrillic text ("hello, world" in Russian).
The end result is that this text is (correctly) pulled in via the
Unicode-aware API, but on output it's re-encoded to whatever code page
is currently active on the console. Now if I have 850 active, I can
see the German text but the Russian text is botched; if I have 866
active, I see the reverse.

If I redirect the output to a file, its encoding again depends on what
code page was active on the console, and there is simply no way to have
*both* pieces of text readable in the resulting file. I've attached a
screenshot of this console session for your pleasure. Now let's add
more fun and open the resulting file in Notepad. As expected, both
pieces of text are unreadable there: Notepad, if it fails to guess that
the file's content is encoded using some Unicode encoding, interprets
the content as being encoded in the currently active "ANSI" code page.
Screenshot attached.

Is this behaviour correct and expected? Yes? Please look me in the
eyes and say that with a straight face. ;-)

And no, that wouldn't make your Emacs compile buffer happy either,
because Emacs expects the text there to be encoded in the "ANSI" code
page, not the "OEM" one.

> And I cannot find anything religious in the wish for a program that
> works out of the box on a Windows system. But promoting UTF-8 on
> Windows, where it is virtually a foreign object, is religious.

IMO, what is religious is sticking on labels without thorough and
cold-headed technical consideration. The whole issue of text encodings
is complicated enough, and when it comes to the CLI and console on
Windows, these matters get considerably more complicated.

I have tried to show you that you just can't get this "Right" for
everyone, because different programs have different expectations about
the data you generate. Go does the right thing when outputting Unicode
text to the console window, and uses UTF-8 when writing strings to
files. Doing that is as good and as bad as using any other encoding: you
will always find a program or a person unhappy about that
particular encoding.

Maybe Go would benefit from a special library wrapping the Windows
console in one way or another. Maybe modelling what .NET does would be
OK for some. I dunno. You could try implementing that. Then people
sharing your stance on this subject would use your library.
cmd-csharp40-console.png
notepad.png

Konstantin Khomoutov

unread,
Dec 8, 2014, 1:22:36 PM12/8/14
to ceving, golan...@googlegroups.com
On Mon, 8 Dec 2014 08:18:12 -0800 (PST)
ceving <cev...@gmail.com> wrote:

[...]
> Wow. It really works in Emacs' compile buffer. I did not expect that
> to happen any more. So this seems to be the "right way" for Windows
> pipes, although it breaks cmd.exe now. That might be because the
> support for the dedicated terminal API is lost now. But I can imagine
> that both might work together.

And by the way, all that encoding mess is the reason why I always
have a port of GNU iconv [1] installed on my Windows systems: it lets
me bolt something like

| iconv -f cp850 -t utf-8 >foo.txt

onto any call to a command-line utility and get readable text no matter
how its designer decided to handle the problem of encodings.

And while we're at it, you could have your original program unmodified
and pipe its output to

| iconv -f utf-8 -t cp1252

to keep Emacs compile buffer happy.

1. http://gnuwin32.sourceforge.net/packages/libiconv.htm

Konstantin Khomoutov

unread,
Dec 8, 2014, 1:37:21 PM12/8/14
to Andre Polykanine, Konstantin Khomoutov, ceving, golan...@googlegroups.com
On Mon, 8 Dec 2014 18:38:24 +0200
Andre Polykanine <an...@oire.org> wrote:

Hi!

Your questions appear a bit strange to me, as I've explicitly
touched on the issues you asked about in my messages to this thread.
So please consider re-reading those messages. I'm afraid my wording
is not an exemplar of lucidity, but at least I'm trying to write as
precisely and comprehensively as possible.

[...]

As to your questions as stated...

> I could successfully output to Windows console the characters in
> different languages in Utf-8,

If you're talking about using unadorned calls to fmt.Print* or
log.Print* here, then they are not outputting UTF-8. They end up calling
the Unicode-aware, console-specific Windows API, re-encoding strings from
Go's internal UTF-8 encoding to UTF-16 before passing them to those API
calls.
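
(As an illustration of that path -- a sketch only, not the runtime's actual code -- one can call the console API directly through the syscall package's WriteConsole binding on Windows:)

package main

import (
    "os"
    "syscall"
    "unicode/utf16"
)

func main() {
    // Re-encode a UTF-8 Go string to UTF-16 and hand it to WriteConsoleW,
    // which is roughly what the runtime does when stdout is a console.
    // This only works when stdout actually is a console handle.
    u := utf16.Encode([]rune("Tschüß Welt!\r\n"))
    var written uint32
    syscall.WriteConsole(syscall.Handle(os.Stdout.Fd()),
        &u[0], uint32(len(u)), &written, nil)
}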

I'm nitpicking because you actually *can* enable a UTF-8 code page
in the Windows console, using

chcp 65001

but it's hardly useful since no locale on Windows uses UTF-8 as the
encoding for non-Unicode-aware programs.

> but I'm still puzzled about why we need to import the "data"
> subpackage. What are the encoding files you talked about?

This is a completely separate issue. I advised the OP to use a special
third-party Go package named "go-charset". It provides means to
re-encode strings from UTF-8 to a user-specified non-Unicode 8-bit
charset/encoding among those supported by the package. Re-encoding is
done using static translation tables which map Unicode code points to
code points of the target charset/encoding. By default the package
expects you to ship the encoding table files it provides along with your
program -- as a set of files in some directory. If you instead want to
compile these tables in, the package provides a special sub-package of
itself which you may import to have all the charset tables built in
and registered with the package automatically, so the resulting program
has no external data dependencies.

Just in case it's not clear: you might only want to use a package such
as go-charset if you really need explicit re-encoding of
some textual data you output or input to/from an encoding other than
UTF-8.

Nigel Tao

unread,
Dec 8, 2014, 6:30:34 PM12/8/14
to Konstantin Khomoutov, ceving, golang-nuts
On Tue, Dec 9, 2014 at 2:19 AM, Konstantin Khomoutov
<flat...@users.sourceforge.net> wrote:
> That's because the package by default expects you to ship the encoding
> files along with the binary. If you want them built in, you need to
> import its "data" subpackage, via

The golang.org/x/text/encoding/charmap package speaks Windows-1252 and
doesn't require shipping separate encoding files.

http://godoc.org/golang.org/x/text/encoding/charmap

The golang.org/x/text/encoding/charmap package's design was reviewed
by Roger Peppe, the author of go-charset, among others:
https://groups.google.com/forum/#!topic/golang-dev/UfT00vJBW8Y
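
For example, a minimal sketch using those packages as documented (charmap's Windows-1252 encoder wrapped around stdout via a transform.Writer):

package main

import (
    "os"

    "golang.org/x/text/encoding/charmap"
    "golang.org/x/text/transform"
)

func main() {
    // Wrap stdout so that UTF-8 output is transcoded to Windows-1252.
    w := transform.NewWriter(os.Stdout, charmap.Windows1252.NewEncoder())
    w.Write([]byte("Tschüß Welt!\n"))
}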

ceving

unread,
Dec 9, 2014, 3:52:47 AM12/9/14
to golan...@googlegroups.com
On Monday, December 8, 2014 at 6:45:49 PM UTC+1, Konstantin Khomoutov wrote:

----8<----
using System;
using System.IO;

public static class Program
{
        [STAThread]
        public static void Main(string[] args)
        {
                Console.WriteLine("Tschüß Welt!");
                foreach (var s in args) {
                        Console.WriteLine(s);
                }
        }
}
----8<----

Now let's build it (`csc.exe foo.cs`, I've used .NET 4.0 on Windows XP)
and run.  What you'll experience is that the text "Tschüß Welt!" will
be output re-encoded in the current *OEM* code page, so to unbotch its
appearance on my console I had to switch its code page to 850 because
mine is 866 by default (Cyrillic).


There is a difference between the system language and the user language on Windows:
https://github.com/ceving/rename-after-movie-title/blob/master/rename-after-movie-title.go#L222

Maybe that causes the problem?

ceving

unread,
Dec 9, 2014, 3:54:10 AM12/9/14
to golan...@googlegroups.com, cev...@gmail.com


On Monday, December 8, 2014 at 7:22:36 PM UTC+1, Konstantin Khomoutov wrote:

And while we're at it, you could have your original program unmodified
and pipe its output to

    | iconv -f utf-8 -t cp1252

to keep Emacs compile buffer happy.

1. http://gnuwin32.sourceforge.net/packages/libiconv.htm

Thanks! I will add this to my Windows survival kit.
 

Konstantin Khomoutov

unread,
Dec 12, 2014, 2:28:00 PM12/12/14
to ceving, golan...@googlegroups.com
On Tue, 9 Dec 2014 00:52:46 -0800 (PST)
ceving <cev...@gmail.com> wrote:

[...]
> > Now let's build it (`csc.exe foo.cs`, I've used .NET 4.0 on Windows
> > XP) and run. What you'll experience is that the text "Tschüß
> > Welt!" will be output re-encoded in the current *OEM* code page, so
> > to unbotch its appearance on my console I had to switch its code
> > page to 850 because mine is 866 by default (Cyrillic).
> >
> >
> There is a difference between the system language and the user
> language on Windows:
[...]
> Maybe that causes the problem?

No, I think not.
The concept of OEM vs ANSI code pages is orthogonal to MUI packs.
It has to do with keeping backwards compatibility with MS-DOS programs
(which used OEM code pages).

The system vs user language is more about locales, not encodings.
MUI-enabled systems just allow you to have a user "language" (locale,
in fact) different from the default one, dubbed "system".

Sure, the pair of codepages (OEM and ANSI) depends on the locale (say,
yours, German, has 850 and 1252 while mine, Russian, has 866 and 1251),
but that's the only distinction.

What I demonstrated is that the implementation of the Console.Write*
methods in .NET seems to query the current (user's) OEM codepage and
use it for output, performing a lossy best-effort conversion (you
might have observed in my example that it converted u-umlaut to plain u
and replaced characters which could not be converted with question
marks).

Look up GetACP() and GetOEMCP() on MSDN for more info.
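
(For reference, both can also be queried from Go -- an untested sketch using the raw kernel32 entry points:)

package main

import (
    "fmt"
    "syscall"
)

// Windows-only sketch.
var (
    kernel32     = syscall.NewLazyDLL("kernel32.dll")
    procGetACP   = kernel32.NewProc("GetACP")
    procGetOEMCP = kernel32.NewProc("GetOEMCP")
)

func main() {
    acp, _, _ := procGetACP.Call()     // "ANSI" code page, e.g. 1252
    oemcp, _, _ := procGetOEMCP.Call() // "OEM" code page, e.g. 850
    fmt.Printf("ANSI code page: %d, OEM code page: %d\n", acp, oemcp)
}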