On Sun, 7 Dec 2014 12:09:49 -0800 (PST)
ceving <cev...@gmail.com> wrote:
> Thanks for the detailed reply.
> > 1) There's no written standard on which encoding a program on
> > Windows has to use for streams redirected to files/pipes.
> > "Active Windows codepage" was a de-facto standard for some time.
> >
> >
> It seems to be still the case.
>
> C:\>echo ü > ü
>
> C:\>PrintHex.exe -h ü
> 81 20 0D 0A
>
> 0x81 is ü in the old 850 codepage.
Unfortunately, you've basically tested nothing with this: the `echo`
command is a cmd.exe built-in, so it behaves just like the same command
did in MS-DOS's command.com back then.
I'm also afraid you might be confusing the "active DOS codepage" (the
one cmd.exe uses by default) with the "active Windows codepage", which
is a different thing -- the code page used by non-Unicode GUI
applications. You appear to use a German locale, so your OEM code page
is 850 [1] while your Windows code page is 1252 [2], and the two are
incompatible even though they encode mostly the same repertoire of
characters.
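If you want to see both values on your machine, here's a minimal sketch
of mine (assuming Windows) that asks kernel32 directly via GetACP and
GetOEMCP:

package main

import (
    "fmt"
    "syscall"
)

func main() {
    kernel32 := syscall.NewLazyDLL("kernel32.dll")
    // GetACP returns the active Windows (ANSI) code page,
    // GetOEMCP the active DOS (OEM) code page.
    acp, _, _ := kernel32.NewProc("GetACP").Call()
    oem, _, _ := kernel32.NewProc("GetOEMCP").Call()
    fmt.Printf("Windows code page: %d, OEM code page: %d\n", acp, oem)
}

On a German system this should print 1252 and 850, respectively.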
So, to reiterate, I stated that the de-facto standard for files/pipes
was to use the active *Windows* codepage, not the OEM/DOS one.
> I understand that it is hard to do anything right on Windows and that
> is the reason why they need the BOM.
No, the reason for writing out BOMs is to let the reader of the data
stream guess which Unicode encoding the stream is in.
Note that BOMs do not really solve the problem anyway: for instance,
it's impossible to tell UTF-16 from UCS-2 by the BOM alone, even though
they are incompatible to a degree.
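Just to illustrate what such guessing amounts to, here's a tiny
hypothetical helper of mine (not something from this thread) that
inspects the first bytes of a stream:

// sniffBOM guesses the Unicode encoding from a leading BOM, if any.
func sniffBOM(b []byte) string {
    switch {
    case len(b) >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF:
        return "UTF-8"
    case len(b) >= 2 && b[0] == 0xFF && b[1] == 0xFE:
        return "UTF-16LE" // or UCS-2LE -- the BOM can't tell them apart
    case len(b) >= 2 && b[0] == 0xFE && b[1] == 0xFF:
        return "UTF-16BE" // or UCS-2BE
    default:
        return "unknown (no BOM)"
    }
}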
> So the right behavior for Go would be to write a BOM and UTF16 while
> writing to a file or pipe on Windows?
I tried to convey that there was (and is) no written standard on this.
Most programs seemed to use the active Windows code page for writing
plain text files, as I've said. My guess is that a critical mass of
Windows software had been reached before Windows started to offer
Unicode support across all its flavours then in active use. So even
when most libraries/toolkits/IDEs for writing Windows software advanced
to using Unicode internally (which on Windows most of the time means
UTF-16), it probably wasn't too wise to just start using UTF-16 when
writing plain text. Here we're drifting into a rather philosophical
domain, because it's hard to define precisely what "plain text" means
anyway; but the commonly accepted view is that it implies something
roughly ASCII-backwards-compatible: code pages and UTF-8 fulfill this,
UTF-16 doesn't.
So, Go uses Unicode internally for its strings, which is great.
What should it do when writing text to plain text files, then?
Inserting a filter that converts the text to the active Windows code
page (as Tcl does) is clever and most of the time produces sensible
results, but there are problems with this approach:
* Tcl is way higher-level than Go. One of Go's strengths is that
  the programmer can easily estimate the costs of operations.
  Transparently inserting a re-encoding filter would undermine that.
* What should happen if the text being written out contains a character
  not covered by the repertoire of the target encoding?
  Should it be silently replaced by some other character?
  Should the write fail? Or even panic?
  IMO, all of these possibilities are unacceptable in the default mode
  of operation -- simply because the defaults must be as dead simple
  as possible, and they are.
Considering that, IMO, Go does the right thing: it clearly states in
the docs that its strings are encoded as UTF-8, and just writing them
to byte streams produces byte-exact representations of those strings
as they were kept in memory. Simple, predictable, consistent.
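A trivial illustration of that (my own snippet, nothing special):

package main

import (
    "fmt"
    "os"
)

func main() {
    s := "Tschüß Welt!"
    // Writing the string to a byte stream emits its UTF-8 bytes as-is:
    os.Stdout.WriteString(s + "\n")
    // The same bytes, spelled out:
    fmt.Printf("% x\n", []byte(s)) // 54 73 63 68 c3 bc c3 9f 20 57 65 6c 74 21
}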
Don't like this? That's okay. Then assess the required changes in the
behaviour of your program and implement them. For instance, if you want
your program to output text in the active Windows code page on its
stdout when stdout is redirected to a file or a pipe, write the code
which detects such a case (I've provided it), and if stdout is indeed
redirected, insert a text re-encoding filter in front of the actual
stdout -- thanks to Go's interfaces this is trivial, because all the
filter has to do is implement the io.Writer interface.
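In case it helps, the detection itself boils down to roughly this --
a sketch of mine assuming Windows (see the earlier message for the code
I mentioned): GetConsoleMode only succeeds for handles attached to a
real console, so a failure means stdout was redirected.

import (
    "os"
    "syscall"
)

// stdoutRedirected reports whether os.Stdout does NOT refer to a
// console, i.e. it has been redirected to a file or a pipe.
func stdoutRedirected() bool {
    var mode uint32
    err := syscall.GetConsoleMode(syscall.Handle(os.Stdout.Fd()), &mode)
    return err != nil
}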
For instance, you could use go-charset [3]. Other implementations exist.
With go-charset, you'd do something like:
import (
    "log"
    "os"

    "code.google.com/p/go-charset/charset"
    _ "code.google.com/p/go-charset/data" // statically embed the charset tables
)

...

ww, err := charset.NewWriter("windows-1252", os.Stdout)
if err != nil {
    log.Fatal(err)
}
log.SetOutput(ww)

...

log.Println("Tschüß Welt!")
Now if you run this code in a program which has its stdout redirected
to a file, you should get Windows-1252-encoded text there.
To reiterate, one approach to understanding the problem with encodings
better is to postulate that the language runtime keeps the strings it
manages in some purely internal encoding you don't know; the only thing
you do know about a string is that its characters are from the Unicode
codeset. It then follows that you can't just output these strings -- no
matter what the method or destination is -- because on media and on the
wire strings never exist without some strictly specified encoding.
Hence, before outputting text, you have to make sure you convert it to
that encoding. Go's approach is just a little bit different in that it
fixes the internal encoding of its strings, so if you need to output
UTF-8 you simply don't need to re-encode your text.
1. http://en.wikipedia.org/wiki/Code_page_850
2. http://en.wikipedia.org/wiki/Windows-1252
3. http://godoc.org/code.google.com/p/go-charset/charset