on eshell's encoding

Daniel Bastos

Jul 26, 2016, 10:25:59 AM
I'm running eshell. My current modeline is

U\--- *eshell* [...]

But after a git commit, I get garbage out from my utf-8 string given in
the command line. It must be git's fault. Do you confirm? (I don't
have the same problem if I input the string in a file.)

%gc -a -m 'Função pra esvaziar a fila.'
[cooper 95bca82] Função pra esvaziar a fila.
2 files changed, 5 insertions(+), 1 deletion(-)
%

(*) My encoding in details

U -- utf-8-dos (alias: mule-utf-8-dos)

UTF-8 (no signature (BOM))
Type: utf-8 (UTF-8: Emacs internal multibyte form)
EOL type: CRLF
This coding system encodes the following charsets:
unicode

Eli Zaretskii

Jul 26, 2016, 11:05:32 AM
to help-gn...@gnu.org
> From: Daniel Bastos <dba...@toledo.com>
> Date: Tue, 26 Jul 2016 11:25:55 -0300
>
> I'm running eshell. My current modeline is
>
> U\--- *eshell* [...]
>
> But after a git commit, I get garbage out from my utf-8 string given in
> the command line. It must be git's fault. Do you confirm? (I don't
> have the same problem if I input the string in a file.)
>
> %gc -a -m 'Função pra esvaziar a fila.'
> [cooper 95bca82] Função pra esvaziar a fila.
> 2 files changed, 5 insertions(+), 1 deletion(-)
> %

Is this on MS-Windows? If so, you cannot invoke programs from Emacs
with command-line arguments encoded in anything but the system
codepage. And UTF-8 cannot be a system codepage on Windows.

I suggest to put the commit message in a file and use the -F switch to
"git commit". Or use the built-in VC commands, they will do this
automatically for you (if you have Emacs 25).

Daniel Bastos

Jul 26, 2016, 12:49:23 PM
Hi, Eli.

Eli Zaretskii <el...@gnu.org> writes:

>> From: Daniel Bastos <dba...@toledo.com>
>> Date: Tue, 26 Jul 2016 11:25:55 -0300
>>
>> I'm running eshell. My current modeline is
>>
>> U\--- *eshell* [...]
>>
>> But after a git commit, I get garbage out from my utf-8 string given in
>> the command line. It must be git's fault. Do you confirm? (I don't
>> have the same problem if I input the string in a file.)
>>
>> %gc -a -m 'Função pra esvaziar a fila.'
>> [cooper 95bca82] Função pra esvaziar a fila.
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>> %
>
> Is this on MS-Windows? If so, you cannot invoke programs from Emacs
> with command-line arguments encoded in anything but the system
> codepage. And UTF-8 cannot be a system codepage on Windows.

You're right. This is MS-Windows. But I thought MS-Windows would not
interfere here. Why does it interfere? I thought the messages would go
straight into git's ARGV. Does Windows read() and write() interpret the
bytes?

> I suggest to put the commit message in a file and use the -F switch to
> "git commit". Or use the built-in VC commands, they will do this
> automatically for you (if you have Emacs 25).

If I put the commit message in a file, even without using -F switch, it
works as expected.

(*) Version

GNU Emacs 24.3.1 (i386-mingw-nt6.2.9200) of 2013-03-17 on MARVIN

Eli Zaretskii

Jul 26, 2016, 1:17:31 PM
to help-gn...@gnu.org
> From: Daniel Bastos <dba...@toledo.com>
> Date: Tue, 26 Jul 2016 13:49:15 -0300
>
> > Is this on MS-Windows? If so, you cannot invoke programs from Emacs
> > with command-line arguments encoded in anything but the system
> > codepage. And UTF-8 cannot be a system codepage on Windows.
>
> You're right. This is MS-Windows. But I thought MS-Windows would not
> interfere here. Why does it interfere? I thought the messages would go
> straight into git's ARGV.

How can it go "straight"? Eshell is not a real shell, it's a Lisp
program that pretends to be a shell. When you type RET at the end of
a command line, Eshell takes the command and calls a Windows API that
invokes programs, passing it the command you typed. But the API that
Emacs calls accepts strings encoded in the system codepage. So the
UTF-8 string you typed is interpreted as encoded in that codepage, and
that's why you get it back garbled.
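
The garbling described above can be reproduced outside Emacs. What follows is a minimal sketch in Python, assuming the system codepage is Windows-1252; the thread never says which codepage the machine uses, so cp1252 is only a guess.

```python
# Assumption: the system (ANSI) codepage is Windows-1252; the thread does not say.
message = "Função pra esvaziar a fila."

# The bytes that result when the argument is encoded as UTF-8.
utf8_bytes = message.encode("utf-8")

# The text that results when those same bytes are read as cp1252.
garbled = utf8_bytes.decode("cp1252")

print(garbled)
```

Each non-ASCII character occupies two bytes in UTF-8, and each of those bytes is read as a separate cp1252 character, which is why only the accented letters come back wrong.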

If the characters you typed can be encoded by your system codepage,
then what you do should still work, if you tell Git that log messages
are encoded in that codepage. Read about the i18n.commitEncoding
configuration parameter in the Git documentation. However, I don't
recommend doing that, because you (and whoever else participates in
that project) will then have to confine yourselves to that encoding.
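
Whether the characters "can be encoded by your system codepage" is easy to check. This is a small sketch, where the helper name fits_codepage is hypothetical and cp1252 is again an assumed codepage, not one named in the thread.

```python
def fits_codepage(text: str, codepage: str) -> bool:
    """Return True if every character of text can be encoded in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 is an assumption; substitute the codepage the system really uses.
print(fits_codepage("Função pra esvaziar a fila.", "cp1252"))
print(fits_codepage("Функция", "cp1252"))
```

Only a message for which the check succeeds could be passed on the command line at all under the i18n.commitEncoding approach, which illustrates the confinement mentioned above.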

There's no way of safely passing UTF-8 encoded command-line arguments
to a Windows program. The only way to break the limitations of the
system codepage is to use the Unicode (a.k.a. "wide") APIs, which
expect strings in UTF-16 encoding. But that is not currently
supported in Emacs, due to boring technical problems.

> > I suggest to put the commit message in a file and use the -F switch to
> > "git commit". Or use the built-in VC commands, they will do this
> > automatically for you (if you have Emacs 25).
>
> If I put the commit message in a file, even without using -F switch, it
> works as expected.

It will always work from a file, because file I/O doesn't have this
limitation.
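
The file-based route can be sketched as follows. The helper name commit_command_from_file is hypothetical, and the sketch only builds the command rather than running it; the -F switch is the one suggested earlier in the thread.

```python
import os
import tempfile


def commit_command_from_file(message: str) -> list:
    """Write message to a UTF-8 file and return a git argv that reads it."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(message + "\n")
    # -F makes "git commit" read the log message from the named file.
    return ["git", "commit", "-a", "-F", path]


argv = commit_command_from_file("Função pra esvaziar a fila.")
print(argv)
```

The non-ASCII text now travels through file I/O, and the command line carries only a file name. The temporary path itself could still contain non-ASCII characters on some systems, so this is not a complete guarantee.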

Yuri Khan

Jul 26, 2016, 2:27:12 PM
to Eli Zaretskii, help-gn...@gnu.org
On Wed, Jul 27, 2016 at 12:17 AM, Eli Zaretskii <el...@gnu.org> wrote:

> The only way to break the limitations of the
> system codepage is to use the Unicode (a.k.a. "wide") APIs, which
> expect strings in UTF-16 encoding. But that is not currently
> supported in Emacs, due to boring technical problems.

It’s not even clear if using the wide API on the caller side will
suffice. The callee also needs to cooperate, by using the
corresponding wide API to retrieve the command line arguments.

Eli Zaretskii

Jul 26, 2016, 2:36:05 PM
to help-gn...@gnu.org
> From: Yuri Khan <yuri....@gmail.com>
> Date: Wed, 27 Jul 2016 00:26:42 +0600
> Cc: "help-gn...@gnu.org" <help-gn...@gnu.org>

Yes, and that's one of the few reasons why Emacs on Windows doesn't
bother to use the wide APIs: too few programs Emacs users normally
invoke can cooperate like that. But if Emacs did use the wide APIs,
it wouldn't have been a loss, because programs that use ANSI APIs to
access their command-line arguments would have them converted to the
system codepage by Windows, and so it would have worked or not exactly
as it does or doesn't now.

Daniel Bastos

Jul 27, 2016, 7:56:35 AM
Eli Zaretskii <el...@gnu.org> writes:

>> From: Daniel Bastos <dba...@toledo.com>
>> Date: Tue, 26 Jul 2016 13:49:15 -0300
>>
>> > Is this on MS-Windows? If so, you cannot invoke programs from Emacs
>> > with command-line arguments encoded in anything but the system
>> > codepage. And UTF-8 cannot be a system codepage on Windows.
>>
>> You're right. This is MS-Windows. But I thought MS-Windows would not
>> interfere here. Why does it interfere? I thought the messages would go
>> straight into git's ARGV.
>
> How can it go "straight"?

I meant not being messed with. I don't know anything about MS-Windows.
In UNIX the creation of a new process by a shell is likely to call
execve, which won't touch the caller strings passed in through the
argv-argument.

Yuri Khan

Jul 27, 2016, 9:16:12 AM
to Daniel Bastos, help-gn...@gnu.org
On Wed, Jul 27, 2016 at 6:56 PM, Daniel Bastos <dba...@toledo.com> wrote:

> I meant not being messed with. I don't know anything about MS-Windows.
> In UNIX the creation of a new process by a shell is likely to call
> execve, which won't touch the caller strings passed in through the
> argv-argument.

Well Windows is a different beast entirely. The basic premise is the
same, in that the parent invokes CreateProcessW, passing a
UTF-16-encoded command line, and the child process invokes
GetCommandLineW and then optionally CommandLineToArgvW to split the
command line into arguments.

Problem is, most programs prefer to work internally with 8-bit-based
encodings, and the Win32 API makes it very easy by providing backward
compatibility wrapper functions CreateProcessA and GetCommandLineA,
which unfortunately convert from/to the ANSI or OEM encoding defined
by the locale. And there is no Win32 locale for which UTF-8 is either
the ANSI or the OEM encoding.
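
A rough model of the wide-to-ANSI path may help here. This Python sketch does not call any Win32 function; the helper name as_seen_by_ansi_child is hypothetical, cp1252 is an assumed ANSI codepage, and errors="replace" only approximates the conversion (Windows may pick best-fit substitutes instead of "?").

```python
def as_seen_by_ansi_child(command_line: str, ansi_codepage: str) -> str:
    """Model a UTF-16 command line being read through an ANSI-only API."""
    # The parent hands UTF-16 to the wide API (modeled as UTF-16-LE bytes).
    wide = command_line.encode("utf-16-le")
    # A child using the ANSI API gets the text converted to the ANSI codepage.
    # errors="replace" is a stand-in for whatever substitution Windows makes.
    ansi = wide.decode("utf-16-le").encode(ansi_codepage, errors="replace")
    return ansi.decode(ansi_codepage)


# cp1252 is an assumption, not a codepage named in the thread.
print(as_seen_by_ansi_child("Função", "cp1252"))
print(as_seen_by_ansi_child("Привет", "cp1252"))
```

Text that the codepage can represent survives the conversion; anything else is lost before the child program ever sees it.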

This one point makes it very difficult to use Windows in the Unix Way:
you get to worry about encodings on every process boundary.

Eli Zaretskii

Jul 27, 2016, 12:14:41 PM
to help-gn...@gnu.org
> From: Daniel Bastos <dba...@toledo.com>
> Date: Wed, 27 Jul 2016 08:56:31 -0300
>
> >> You're right. This is MS-Windows. But I thought MS-Windows would not
> >> interfere here. Why does it interfere? I thought the messages would go
> >> straight into git's ARGV.
> >
> > How can it go "straight"?
>
> I meant not being messed with. I don't know anything about MS-Windows.
> In UNIX the creation of a new process by a shell is likely to call
> execve, which won't touch the caller strings passed in through the
> argv-argument.

Like I said, Eshell is not a shell, it just pretends to be one. It
will eventually cause execve, or something like it, to be called, but
before it, the command-line arguments will be encoded in the locale's
encoding, since that's what execve expects. This is true on Windows
and on Unix alike. So in this case, the command-line arguments are
always "messed with" in Emacs. If your locale happens to use UTF-8,
then it will _almost_ look as if the arguments were passed to execve
untouched, but that's an illusion, and is certainly incorrect when the
locale's codeset is not UTF-8 (which is always true on Windows).
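
The point about the locale's encoding can be shown with the same argument encoded under two codesets. The codeset names below are illustrative stand-ins for two possible locale settings, not values taken from the thread.

```python
argument = "Função"

# Two illustrative locale codesets; only the first is UTF-8.
encoded = {codeset: argument.encode(codeset)
           for codeset in ("utf-8", "iso-8859-1")}

for codeset, raw in encoded.items():
    print(codeset, raw)
```

Under the UTF-8 codeset the bytes happen to match what a UTF-8 buffer already holds, which is the "illusion" of untouched arguments; under the other codeset the bytes are visibly different.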

Eli Zaretskii

Jul 27, 2016, 12:22:25 PM
to help-gn...@gnu.org
> From: Yuri Khan <yuri....@gmail.com>
> Date: Wed, 27 Jul 2016 19:15:45 +0600
> Cc: "help-gn...@gnu.org" <help-gn...@gnu.org>
>
> On Wed, Jul 27, 2016 at 6:56 PM, Daniel Bastos <dba...@toledo.com> wrote:
>
> > I meant not being messed with. I don't know anything about MS-Windows.
> > In UNIX the creation of a new process by a shell is likely to call
> > execve, which won't touch the caller strings passed in through the
> > argv-argument.
>
> Well Windows is a different beast entirely. The basic premise is the
> same, in that the parent invokes CreateProcessW, passing a
> UTF-16-encoded command line, and the child process invokes
> GetCommandLineW and then optionally CommandLineToArgvW to split the
> command line into arguments.

So it isn't a different beast, really. Both on Unix and on Windows,
Emacs encodes the command line before passing it to system APIs. The
details differ, but not the basic idea.

> Problem is, most programs prefer to work internally with 8-bit-based
> encodings, and the Win32 API makes it very easy by providing backward
> compatibility wrapper functions CreateProcessA and GetCommandLineA,
> which unfortunately convert from/to the ANSI or OEM encoding defined
> by the locale.

Nitpicking: always ANSI, never the OEM.

> And there is no Win32 locale for which UTF-8 is either the ANSI or
> the OEM encoding.

It's actually worse than that: the Windows locale implementation
doesn't support variable-length encodings, so UTF-8 cannot be a
locale's encoding, unless MS change their related runtime libraries in
a radical way.

> This one point makes it very difficult to use Windows in the Unix Way:
> you get to worry about encodings on every process boundary.

Same on Unix, unless you are willing to bet on UTF-8 being the
locale's codeset.

Yuri Khan

Jul 27, 2016, 12:47:29 PM
to Eli Zaretskii, help-gn...@gnu.org
On Wed, Jul 27, 2016 at 11:22 PM, Eli Zaretskii <el...@gnu.org> wrote:

> It's actually worse than that: the Windows locale implementation
> doesn't support variable-length encodings

It sort of does, as long as the variable in question never exceeds 2.
See, for example, cp932.

Eli Zaretskii

Jul 27, 2016, 1:13:10 PM
to help-gn...@gnu.org
> From: Yuri Khan <yuri....@gmail.com>
> Date: Wed, 27 Jul 2016 22:47:01 +0600
> Cc: "help-gn...@gnu.org" <help-gn...@gnu.org>
>
> On Wed, Jul 27, 2016 at 11:22 PM, Eli Zaretskii <el...@gnu.org> wrote:
>
> > It's actually worse than that: the Windows locale implementation
> > doesn't support variable-length encodings
>
> It sort of does, as long as the variable in question never exceeds 2.
> See, for example, cp932.

cp932 is a DBCS character set, so not relevant to the above.
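
For reference, the byte lengths under discussion can be measured directly. This is a small sketch using Python's cp932 codec; the helper name byte_lengths and the sample string are made up for illustration.

```python
def byte_lengths(text: str, encoding: str) -> list:
    """Return the encoded length in bytes of each character of text."""
    return [len(ch.encode(encoding)) for ch in text]


sample = "aあ"  # one ASCII letter, one hiragana character
print(byte_lengths(sample, "cp932"))
print(byte_lengths(sample, "utf-8"))
```

In cp932 a character takes one or two bytes, while the same hiragana character takes three bytes in UTF-8.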

Daniel Bastos

Aug 2, 2016, 9:24:37 AM
Eli Zaretskii <el...@gnu.org> writes:

>> From: Daniel Bastos <dba...@toledo.com>
>> Date: Wed, 27 Jul 2016 08:56:31 -0300
>>
>> >> You're right. This is MS-Windows. But I thought MS-Windows would not
>> >> interfere here. Why does it interfere? I thought the messages would go
>> >> straight into git's ARGV.
>> >
>> > How can it go "straight"?
>>
>> I meant not being messed with. I don't know anything about MS-Windows.
>> In UNIX the creation of a new process by a shell is likely to call
>> execve, which won't touch the caller strings passed in through the
>> argv-argument.
>
> Like I said, Eshell is not a shell, it just pretends to be one. It
> will eventually cause execve, or something like it, to be called, but
> before it, the command-line arguments will be encoded in the locale's
> encoding, since that's what execve expects. This is true on Windows
> and on Unix alike.

That's true of EMACS. You're saying EMACS always encodes the command
line arguments. But what I said about UNIX is that whatever execve
receives in argv[] will remain as such, which apparently is not the
MS-Windows behavior.

Precisely: if on UNIX I use EMACS to call /program/ with argv[] encoded
in X, then /program/ will definitely receive its argv[] as prepared by
EMACS. That does not happen on MS-Windows. EMACS encodes the command
line in utf-8, but /program/ receives it in another encoding.

This surprises me. MS-Windows should not care what a program puts in
argv[]. I think it violates an important principle: an operating system
should help programs to communicate, but it should not care what they're
saying to each other. That's an important principle UNIX has given us.
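
The UNIX half of this claim can be tried out with a short experiment. This is a sketch for POSIX systems only, and it relies on assumptions: the child is a Python interpreter, which decodes its own argv, and os.fsencode is used in the child to recover the raw bytes it was given. The helper name argv_round_trip is hypothetical.

```python
import os
import subprocess
import sys

# Child program: write argv[1] back out as the raw bytes it arrived as.
CHILD = "import os, sys; sys.stdout.buffer.write(os.fsencode(sys.argv[1]))"


def argv_round_trip(raw: bytes) -> bytes:
    """Pass raw bytes as argv[1] to a child Python and return what it saw."""
    result = subprocess.run([sys.executable, "-c", CHILD, raw],
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


if os.name == "posix":
    # Bytes in ISO-8859-1, deliberately not valid UTF-8.
    latin1_bytes = "Função".encode("iso-8859-1")
    print(argv_round_trip(latin1_bytes) == latin1_bytes)
```

The experiment says nothing about Windows, where the command line goes through the conversions described earlier in the thread.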

Even if I'm not totally correct now, I'm certainly better educated.
Thank you.