
NT Emacs 20.7 and Coding Systems?


David Masterson

Nov 9, 2000
This has probably been answered before, but I haven't seen the answer...

How does Emacs decide on what "coding system" to display a buffer in? In
particular, I work on an NT and many, but not all, files that I visit with Emacs
come up with a "(Unix)" coding system which, of course, leads to all the "^M"
chars being displayed. These files are created by NT programs and have never
been previously touched by Emacs. They also don't appear (I could be wrong) to
have any strange characters in them. So how does Emacs decide what "coding
system" to use in displaying the file?

Next question -- if Emacs chooses the "wrong" coding system, can I force the
buffer to be reinterpreted under a different coding system without changing the
file? Using set-buffer-file-coding-system seems to change the buffer which, in
turn, means that the file will be updated which I'd like to avoid (I don't own
some of the files). Is there another mechanism?

Finally -- what is the difference between set-buffer-file-coding-system and
set-buffer-process-coding-system? Is the "process" function only to be used
with buffers associated with a process (as in a "shell" buffer)?

--
David Masterson
dmas...@rational.com
* I speak for myself

Kai Großjohann

Nov 9, 2000
On Thu, 9 Nov 2000, David Masterson wrote:

> How does Emacs decide on what "coding system" to display a buffer
> in?

I think it looks at the beginning of the file.

> In particular, I work on an NT and many, but not all, files that I
> visit with Emacs come up with a "(Unix)" coding system which, of
> course, leads to all the "^M" chars being displayed. These files
> are created by NT programs and have never been previously touched by
> Emacs. They also don't appear (I could be wrong) to have any
> strange characters in them. So how does Emacs decide what "coding
> system" to use in displaying the file?

Maybe not all lines have ^M on them, just most of them? Then it might
help to add ^M to the lines where it is missing. Does that work?

> Next question -- if Emacs chooses the "wrong" coding system, can I
> force the buffer to be reinterpreted under a different coding
> system without changing the file? Using
> set-buffer-file-coding-system seems to change the buffer which, in
> turn, means that the file will be updated which I'd like to avoid (I
> don't own some of the files). Is there another mechanism?

C-x RET c allows you to select a coding system for the next file
operation, including C-x C-v and M-x revert-buffer RET.
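
If you want the same thing from Lisp, something like this untested
sketch should work (the function name is made up; `coding-system-for-read'
and `revert-buffer' are real):

(defun my-reread-with-coding (coding)
  "Re-read the current buffer's file using CODING.
Only the buffer's interpretation changes; the file is not touched."
  (interactive "zCoding system: ")
  (let ((coding-system-for-read coding))
    (revert-buffer t t)))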

> Finally -- what is the difference between
> set-buffer-file-coding-system and set-buffer-process-coding-system?
> Is the "process" function only to be used with buffers associated
> with a process (as in a "shell" buffer)?

When Emacs starts a process, output from the process goes in a
buffer. Other buffers are visiting a file. The buffer file coding
system says how to interpret the file contents, the process coding
system says how to interpret the output from the process.
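
For example, in a shell buffer you could say (a sketch; the two
arguments are the decoding and encoding systems, and C-x RET p is the
interactive equivalent):

;; Decode process output as DOS text, send input with Unix EOLs:
(set-buffer-process-coding-system 'undecided-dos 'undecided-unix)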

kai
--
The arms should be held in a natural and unaffected way and never
be conspicuous. -- Revised Technique of Latin American Dancing

David Masterson

Nov 9, 2000
"Kai Großjohann" <Kai.Gro...@CS.Uni-Dortmund.DE> wrote...

> On Thu, 9 Nov 2000, David Masterson wrote:
>
> > How does Emacs decide on what "coding system" to display a buffer
> > in?
>
> I think it looks at the beginning of the file.

By that, I take it you mean that it looks at a number of characters at the
beginning of the file to see if there are any special characters there (as
opposed to looking for the magic "#!" code). Do you know how many characters?

I just found the find-file-buffer-coding-system function. In the "undecided"
case, I'm not sure I understand the following statements from its doc string:

=========
As the file is read in the DOS case, the coding system will be changed to
`undecided-dos' as CR/LFs are detected. As the file is read in the Unix case,
the coding system will be changed to `undecided-unix' as LFs are detected.
=========

In particular, this doesn't describe how this function decides if it's a "DOS
case" or a "Unix case". Obviously, just being on a DOS system doesn't
automatically make it a DOS case. Does this mean the following?

* Assume an "undecided" find-file.
* Begin reading the file specified.
** if a bare LF is encountered, set to "undecided-unix".
** if a CRLF is encountered, set to "undecided-dos"
* Read in the rest of the file under new coding system.
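
(One way to poke at the detection from Lisp, if I understand it right,
is `detect-coding-string'; the sample strings here are made up:

(detect-coding-string "one\ntwo\n" t)     ; => undecided-unix, or similar
(detect-coding-string "one\r\ntwo\r\n" t) ; => undecided-dos, or similar

The second argument asks for only the highest-priority guess.)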

MShe...@compuserve.com

Nov 10, 2000, 2:35:42 AM
On Thu, 9 Nov 2000 12:14:03 -0500, "David Masterson"
<dmas...@rational.com> wrote:

>How does Emacs decide on what "coding system" to display a buffer in? In
>particular, I work on an NT and many, but not all, files that I visit with Emacs
>come up with a "(Unix)" coding system which, of course, leads to all the "^M"
>chars being displayed. These files are created by NT programs and have never
>been previously touched by Emacs.

Command-line redirection of some programs can cause those symptoms.
For example, if I run ipconfig, I get this.

$ ipconfig

Windows NT IP Configuration

Ethernet adapter HPTX7:

        IP Address. . . . . . . . . : 192.168.1.20
        Subnet Mask . . . . . . . . : 255.255.255.0
        Default Gateway . . . . . . :
[snip]

But if I redirect that to a file, and do a hex dump, I get this.
(Question marks replace the unprintable ASCII chars.)

0D 0D 0A 57 69 6E 64 6F 77 73 20 4E 54 20 49 50 ???Windows NT IP
20 43 6F 6E 66 69 67 75 72 61 74 69 6F 6E 0D 0D Configuration??
0A 0D 0D 0A 45 74 68 65 72 6E 65 74 20 61 64 61 ????Ethernet ada
70 74 65 72 20 48 50 54 58 37 3A 0D 0D 0A 0D 0D pter HPTX7:?????
0A 09 49 50 20 41 64 64 72 65 73 73 2E 20 2E 20 ??IP Address. .
[snip]

Note the (broken) end-of-line sequence 0D 0D 0A.
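
If you need to repair such a capture, an untested sketch (the function
name is made up; run it in a buffer that was read without EOL
conversion, so the CRs are literal):

(defun my-collapse-cr-cr-lf ()
  "Rewrite the broken CR CR LF sequence as a plain CR LF pair."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (while (search-forward "\r\r\n" nil t)
      (replace-match "\r\n" t t))))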

--
Mike Sherrill
Information Management Systems

Kai Großjohann

Nov 10, 2000
On Thu, 9 Nov 2000, David Masterson wrote:

> By that, I take it you mean that it looks at a number of characters
> at the beginning of the file to see if there are any special
> characters there (as opposed to looking for the magic "#!" code).
> Do you know how many characters?

No. I wish I really grokked Mule, but I don't. Maybe it looks at the
whole file.

> I just found the find-file-buffer-coding-system function.

I couldn't find that function. In fact, M-x apropos RET
find.*coding-system RET didn't find a likely candidate, either. Hm.

David Masterson

Nov 10, 2000

<MShe...@compuserve.com> wrote...

> On Thu, 9 Nov 2000 12:14:03 -0500 , "David Masterson"
> <dmas...@rational.com> wrote:
>
> >How does Emacs decide on what "coding system" to display a buffer in? In
> >particular, I work on an NT and many, but not all, files that I visit with Emacs
> >come up with a "(Unix)" coding system which, of course, leads to all the "^M"
> >chars being displayed. These files are created by NT programs and have never
> >been previously touched by Emacs.
>
> Command-line redirection of some programs can cause those symptoms.

That could explain it! 8-)

> For example, if I run ipconfig, I get this.
>
> $ ipconfig
>

> But if I redirect that to a file, and do a hex dump, I get this.
> (Question marks replace the unprintable ASCII chars.)
>
> 0D 0D 0A 57 69 6E 64 6F 77 73 20 4E 54 20 49 50 ???Windows NT IP
>

> Note the (broken) end-of-line sequence 0D 0D 0A.

Whoops, how do you "hex dump"? Inside of Emacs? Could Emacs be adding the
extra CR?

David Masterson

Nov 10, 2000

"Kai Großjohann" <Kai.Gro...@CS.Uni-Dortmund.DE> wrote...

> On Thu, 9 Nov 2000, David Masterson wrote:
> > I just found the find-file-buffer-coding-system function.
>
> I couldn't find that function. In fact, M-x apropos RET
> find.*coding-system RET didn't find a likely candidate, either. Hm.

Whoops, my bad. That should have been find-buffer-file-type-coding-system.

Kai Großjohann

Nov 10, 2000, 6:26:43 PM
On Fri, 10 Nov 2000, David Masterson wrote:

>> I couldn't find that function. In fact, M-x apropos RET
>> find.*coding-system RET didn't find a likely candidate, either.
>

> Whoops, my bad. That should have been
> find-buffer-file-type-coding-system.

Don't have that, either (neither Emacs 20.6 nor 21.0). Are you
running XEmacs?

MShe...@compuserve.com

Nov 11, 2000
On Fri, 10 Nov 2000 17:12:02 -0500, "David Masterson"
<dmas...@rational.com> wrote:

[snip]


>Whoops, how do you "hex dump"? Inside of Emacs? Could Emacs be adding the
>extra CR?

Emacs isn't adding the CR.

I use a port of od. (od, short for "octal dump", is a venerable Unix
utility.) But emacs has a hex editing mode. (I didn't know this
until today. One of the annoying things about emacs is that I find it
hard to figure out what it *does* have in it.)

M-x hexl-find-file RET

David Masterson

Nov 11, 2000
"Kai Großjohann" <Kai.Gro...@CS.Uni-Dortmund.DE> wrote...

> On Fri, 10 Nov 2000, David Masterson wrote:
>
> >> I couldn't find that function. In fact, M-x apropos RET
> >> find.*coding-system RET didn't find a likely candidate, either.
> >
> > Whoops, my bad. That should have been
> > find-buffer-file-type-coding-system.
>
> Don't have that, either (neither Emacs 20.6 nor 21.0). Are you
> running XEmacs?

Hmmm. Nope, this is NT Emacs 20.7.1. It was a binary tarball I downloaded from
one of the mirrors (McAfee VirusScan thinks it has a virus whenever I do an
"ls -l" on the tarball in Cygwin's Bash, but I haven't had any problems). The
defun for this function is in lisp/dos-w32.el. I'm not sure I understand it
all, but the doc string for it is pretty informative.

Galen Boyer

Nov 11, 2000
On Sat, 11 Nov 2000, MShe...@compuserve.com wrote:

> One of the annoying things about emacs is that I find it hard
> to figure out what it *does* have in it

Ouch! I love finding something that I didn't know about since I
started using Emacs. That amazes me to no end, and makes me just
anticipate the next greatest gem.

Like this past week, I learned about occur. I had been using
occur only when a folded buffer got out of whack; then, all of a
sudden, I read a newsgroup posting, thought for a second, and now
I use occur to hop around my buffer all the time. "Where is this
variable used throughout this file?" Hm, M-x occur var, voilà.

Well, I never knew that until last week. I certainly didn't get
annoyed, I just fell in love once again.

Okay, I'll get off my praising soapbox now.
--
Galen Boyer
New Orleans is sink'n man and I don't want to swim.

Kai Großjohann

Nov 12, 2000
On Thu, 9 Nov 2000, David Masterson wrote:

> I just found the find-file-buffer-coding-system function. In the
> "undecided" case, I'm not sure I understand the following statements
> from its doc string:
>
> =========
> As the file is read in the DOS case, the coding system will be
> changed to `undecided-dos' as CR/LFs are detected. As the file is
> read in the Unix case, the coding system will be changed to
> `undecided-unix' as LFs are detected.
> =========
>
> In particular, this doesn't describe how this function decides if
> its a "DOS case" or a "Unix case".

Well, the function doesn't appear to do any detecting based on the
file contents itself. It looks at the file name, and if it is
possible to determine the EOL convention based on that, it does so.
Else it returns an `undecided' EOL convention, such that the normal
MULE machinery gets to decide.

I don't know what the normal MULE machinery does, though.

Anyone?
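
(In the meantime, if the file name alone is enough to decide, the real
`modify-coding-system-alist' lets you pin the answer; the regexp is
just an example:

;; Always read *.bat files as DOS text:
(modify-coding-system-alist 'file "\\.bat\\'" 'undecided-dos)

Files matching the regexp are then read with DOS EOL conversion
regardless of their contents.)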

Alex Schroeder

Nov 14, 2000
"David Masterson" <dmas...@rational.com> writes:
> "Kai Großjohann" <Kai.Gro...@CS.Uni-Dortmund.DE> wrote...

> > Don't have that, either (neither Emacs 20.6 nor 21.0). Are you
> > running XEmacs?
> Hmmm. Nope, this is NT Emacs 20.7.1.

I guess that Kai needs to load dos-w32 before being able to read the
doc string:

find-buffer-file-type-coding-system is a compiled Lisp function in `dos-w32'.
(find-buffer-file-type-coding-system COMMAND)

Alex.
--
http://www.geocities.com/kensanata/emacs.html
"Use M-x customize-face to change the colors used for syntax coloring."

Eli Zaretskii

Nov 16, 2000
David Masterson wrote:
>
> > > How does Emacs decide on what "coding system" to display a buffer
> > > in?
> >
> > I think it looks at the beginning of the file.
>
> By that, I take it you mean that it looks at a number of characters at the
> beginning of the file to see if there are any special characters there (as
> opposed to looking for the magic "#!" code).

Yes.

> Do you know how many characters?

It depends on what it sees. For the particular case that you are asking
about, the end-of-line (EOL) format detection, Emacs examines 3 lines and,
if they all have the same EOL format, Emacs makes the decision.

> =========
> As the file is read in the DOS case, the coding system will be changed to
> `undecided-dos' as CR/LFs are detected. As the file is read in the Unix case,
> the coding system will be changed to `undecided-unix' as LFs are detected.
> =========
>
> In particular, this doesn't describe how this function decides if its a "DOS
> case" or a "Unix case".

It actually looks at how each line ends, i.e. what character(s) are at the
line's end.

> * Assume an "undecided" find-file.
> * Begin reading the file specified.
> ** if a bare LF is encountered, set to "undecided-unix".
> ** if a CRLF is encountered, set to "undecided-dos"
> * Read in the rest of the file under new coding system.

Yes, except that 3 lines are examined before the decision about the EOL
format is made. Also, if, after the decision is made, Emacs finds a line
with a different EOL format, it ``panics'' and reverts the buffer to its
original form on disk (that's when you see ^M characters, if any).
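
If you never want this guesswork at all, there is also the variable
`inhibit-eol-conversion':

;; Show every file exactly as it is on disk, ^M characters and all:
(setq inhibit-eol-conversion t)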

David Masterson

Nov 16, 2000

"Eli Zaretskii" <el...@is.elta.co.il> wrote...

> David Masterson wrote:
> > * Assume an "undecided" find-file.
> > * Begin reading the file specified.
> > ** if a bare LF is encountered, set to "undecided-unix".
> > ** if a CRLF is encountered, set to "undecided-dos"
> > * Read in the rest of the file under new coding system.
>
> Yes, except that 3 lines are examined before the decision about the EOL
> format is made.

Reasonable.

> Also, if, after the decision is made, Emacs finds a line
> with a different EOL format, it ``panics'' and reverts the buffer to its
> original form on disk (that's when you see ^M characters, if any).

What do you mean by "panics"?

* If EOL of new line doesn't agree with picked-coding-system,
** set picked-coding-system to "unix" (ie. binary) format
** (re-)display the buffer in picked-coding-system format

Can this happen on Windows simply because it ran into a bare LF while reading
the file? This seems to be what happens to me on Windows. If so, then this
looks like a bug to me and I would suggest that the algorithm should be:

* If EOL of new line doesn't agree with picked-coding-system,
** set picked-coding-system based upon system-type
*** ie. "dos" for windows systems and "unix" for unix systems
** (re-)display the buffer in picked-coding-system format

I would bet that this would be right in more (but not all) cases than simply
falling back to binary presentation.

David Masterson

Nov 16, 2000
"Kai Großjohann" <Kai.Gro...@CS.Uni-Dortmund.DE> wrote...

> On Thu, 16 Nov 2000, David Masterson wrote:
>
> > * If EOL of new line doesn't agree with picked-coding-system,
> > ** set picked-coding-system based upon system-type
> > *** ie. "dos" for windows systems and "unix" for unix systems
> > ** (re-)display the buffer in picked-coding-system format
>
> How should Emacs display a bare linefeed in DOS files?

Hmmm. I'd forgotten about that. I see two options:

* Display bare LF as "^J" when EOL==CRLF.
* Force LF to CRLF when EOL==CRLF.

I'd prefer the former, but the latter may be what Emacs code requires.

What happens when you read a file with a bare LF (ie. a UNIX file) using the
following?

C-x RET c dos RET C-x C-f

After doing this and saving the file, the new file had LF translated to CRLF.
This may be the best approach given the current architecture. However, going
back to the previous message, if Emacs "panics", it should default to the
natural coding-system for the system-type and not to a binary format (ie.
"unix").

Stefan Monnier <foo@acm.com>

Nov 16, 2000
>>>>> "David" == David Masterson <dmas...@rational.com> writes:
> After doing this and saving the file, the new file had LF translated to CRLF.
> This may be the best approach given the current architecture. However, going
> back to the previous message, if Emacs "panics", it should default to the
> natural coding-system for the system-type and not to a binary format (ie.
> "unix").

But if that means "silently changes LF to CRLF", you should replace your
"it should" with an "I'd rather that". Other people would prefer Emacs not
to silently change their files.


Stefan

David Masterson

Nov 16, 2000
Stefan Monnier <monnier+comp.emacs/news/@flint.cs.yale.edu> wrote...

Hmmm. I think defaulting to the "natural" coding system on a "panic" would be
right in a majority of the cases, but your point is well taken. This is why I
would prefer that Emacs display (on Windows) the bare LF as "^J" in the buffer
rather than force the buffer into an unnatural coding-system, but apparently
Emacs internals don't allow this. On Windows, I'd much rather see the one bare
"^J" in the buffer rather than see the "^M" of every CRLF in the file. Too many
files that I don't own (ie. can't change) seem to cause the "panic".

Perhaps, if Emacs "panics", it could go with the natural coding-system and mark
the buffer as modified and read-only...? This would force people to
toggle-read-only to make changes to the file (signal 1 that something has
changed) and/or force them to save the buffer if they try to kill-buffer (signal
2 that something has changed). Just an off-the-cuff idea.

Kai Großjohann

Nov 16, 2000, 6:05:51 PM
On Thu, 16 Nov 2000, David Masterson wrote:

> * If EOL of new line doesn't agree with picked-coding-system,
> ** set picked-coding-system based upon system-type
> *** ie. "dos" for windows systems and "unix" for unix systems
> ** (re-)display the buffer in picked-coding-system format

How should Emacs display a bare linefeed in DOS files?

Emacs always displays a linefeed as a newline, in DOS mode the
`superfluous' CR characters are just removed when reading a file into
a buffer and re-added for writing the buffer to a file.

It would be a fair amount of work to change Emacs' behavior to
distinguish between linefeed and newline. (I think.)
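
Which is also why changing the conversion does touch the file on save:

;; From now on, save this buffer with CRLF line endings
;; (interactively: C-x RET f undecided-dos RET):
(set-buffer-file-coding-system 'undecided-dos)

The buffer text stays the same; the next save rewrites the EOLs.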

Kai Großjohann

Nov 17, 2000
On Thu, 16 Nov 2000, David Masterson wrote:

> Hmmm. I think defaulting to the "natural" coding system on a
> "panic" would be right in a majority of the cases, but your point is
> well taken. This is why I would prefer that Emacs display (on
> Windows) the bare LF as "^J" in the buffer rather than force the
> buffer into an unnatural coding-system, but apparently Emacs
> internals don't allow this.

If your opinion is based on what I'm saying -- I know nothing about
the Emacs internals. If you looked at the code -- thanks for the
information.

Another problem with displaying LF as ^J is: how do we search for it?
I mean, a newline in a buffer is represented as a LF, so C-s C-q C-j
searches for a newline. But then you can't use that to search for
^J. Hm.

This is opening a can of worms, I'm afraid.

David Masterson

Nov 17, 2000
"Kai Großjohann" <Kai.Gro...@CS.Uni-Dortmund.DE> wrote...

> On Thu, 16 Nov 2000, David Masterson wrote:
> > Hmmm. I think defaulting to the "natural" coding system on a
> > "panic" would be right in a majority of the cases, but your point is
> > well taken. This is why I would prefer that Emacs display (on
> > Windows) the bare LF as "^J" in the buffer rather than force the
> > buffer into an unnatural coding-system, but apparently Emacs
> > internals don't allow this.
>
> If your opinion is based on what I'm saying -- I know nothing about
> the Emacs internals. If you looked at the code -- thanks for the
> information.

Only partly based upon what you said, but I think I've seen it elsewhere (but
memory is the second thing to go as you get older... ;-)

> Another problem with displaying LF as ^J is: how do we search for it?
> I mean, a newline in a buffer is represented as a LF, so C-s C-q C-j
> searches for a newline. But then you can't use that to search for
> ^J. Hm.

Hmm. You do understand that when I say display "^J", this is a single character
as far as Emacs is concerned (like "^M" now). Also, bare LFs would only be
displayed when it is *not* the representation of EOL. In this case, searches
would still be natural as follows:

Unix coding system
* C-s C-q C-j -- finds EOL (which is an LF).
* C-s C-q C-m -- finds any CR.

DOS coding system
* C-s C-q C-m C-q C-j -- finds EOL.
* C-s C-q C-m -- finds any CR (including in EOL).
* C-s C-q C-j -- finds any LF (including in EOL).

Both
* M-x isearch-forward-regexp RET X$ RET -- finds X at EOL
* M-x isearch-forward-regexp RET [^ C-q C-m ] C-q C-j RET -- finds bare LF
* M-x isearch-forward-regexp RET C-q C-m [^ C-q C-j ] RET -- finds bare CR

> This is opening a can of worms, I'm afraid.

Seems natural to me. (Of course, "worms" are a product of the natural world
:-)

If I remember correctly now, this was discussed a few years ago (give or take a
decade). The basic problem is that, internally, Emacs needed to represent the
characters in a file as an array of single bytes. This meant that the special
character EOL that Emacs used to determine when to start a new line also had to
be represented as a single character. On UNIX, this was no problem as LF (a
single byte) was also being used to represent EOL. However, when Emacs moved to
other systems with different EOLs, it converted the EOL to LF on reading (and
converted it back on writing) so that the rest of the code internally continued to
work. This leads to the problems we're discussing now where Emacs can't
distinguish between EOL and a bare LF.

I don't know how much (if any) the internals of Emacs changed when Mule and
coding systems were added to the mix. If Emacs now represents a file as an
array of multi-byte characters, perhaps it would be possible to have Emacs
process the characters as it reads the file and convert system EOL into a better
multi-byte EOL so that it could be treated distinctly from LF. I don't know how
much coding that would involve, though...

Jason Rumney

Nov 17, 2000
"David Masterson" <dmas...@rational.com> writes:

> * If EOL of new line doesn't agree with picked-coding-system,
> ** set picked-coding-system to "unix" (ie. binary) format

Yes, binary is the ONLY safe option in these cases.

> ** (re-)display the buffer in picked-coding-system format
>
> Can this happen on Windows simply because it ran into a bare LF while reading
> the file? This seems to be what happens to me on Windows. If so, then this
> looks like a bug to me and I would suggest that the algorithm should be:
>
> * If EOL of new line doesn't agree with picked-coding-system,
> ** set picked-coding-system based upon system-type
> *** ie. "dos" for windows systems and "unix" for unix systems
> ** (re-)display the buffer in picked-coding-system format

How would you suggest we display a line that has a bare LF in
DOS/Windows?

> I would bet that this would be right in more (but not all) cases than simply
> falling back to binary presentation.

It is NEVER right for a line in a DOS text file to contain a bare LF
(or CR). They always appear as a pair. Such files are either binary or
broken.


--
Jason Rumney <jas...@altavista.net>

Kai Großjohann

Nov 17, 2000
On Fri, 17 Nov 2000, David Masterson wrote:

> I don't know how much (if any) the internals of Emacs changed when
> Mule and coding systems were added to the mix. If Emacs now
> represents a file as an array of multi-byte characters, perhaps it
> would be possible to have Emacs process the characters as it reads
> the file and convert system EOL into a better multi-byte EOL so that
> it could be treated distinctly from LF. I don't know how much
> coding that would involve, though...

That sounds like it could work.

For some reason I don't like it, but oh, well.

David Masterson

Nov 17, 2000
"Jason Rumney" <jas...@altavista.net> wrote...

> "David Masterson" <dmas...@rational.com> writes:
> > * If EOL of new line doesn't agree with picked-coding-system,
> > ** set picked-coding-system to "unix" (ie. binary) format
>
> Yes, binary is the ONLY safe option in these cases.

No, it's the only safe option because of the way Emacs works internally.
However, that doesn't mean that it's the natural and correct option.

> > * If EOL of new line doesn't agree with picked-coding-system,
> > ** set picked-coding-system based upon system-type
> > *** ie. "dos" for windows systems and "unix" for unix systems
> > ** (re-)display the buffer in picked-coding-system format
>
> How would you suggest we display a line that has a bare LF in
> DOS/Windows?

As a "^J"? (like the "^M" is displayed when in "Unix" coding system).

> It is NEVER right for a line in a DOS text file to contain a bare LF
> (or CR). They always appear as a pair. Such files are either binary or
> broken.

Whether it's right or not, such files do occur. The thing is that Emacs defaults
to displaying such binary files in "Unix" style (ie. CRs are displayed as "^M")
whereas other DOS programs (like Notepad) display such files in "Dos" style (ie.
bare LFs are displayed as an unprintable character).

The question is not whether these are text files or binary files. The question
is how should Emacs display them by default. Currently, the default is "Unix"
style -- I think it should be the base style for the operating system (ie.
"unix" on Unix systems and "dos" on Windows systems).

Will Mengarini

Nov 18, 2000
Jason Rumney <jas...@altavista.net> writes:

>How would you suggest we display a line that has a bare LF in
>DOS/Windows?

As the original meaning of linefeed, somet
hing like that.

--
Will Mengarini <sel...@eskimo.com>
Free software: the Source will be with you, always.

Eli Zaretskii

Nov 19, 2000
David Masterson wrote:
>
> > Also, if, after the decision is made, Emacs finds a line
> > with a different EOL format, it ``panics'' and reverts the buffer to its
> > original form on disk (that's when you see ^M characters, if any).
>
> What do you mean by "panics"?

I mean that Emacs abandons its guesswork and doesn't convert the EOLs
anymore.

> * If EOL of new line doesn't agree with picked-coding-system,
> ** set picked-coding-system to "unix" (ie. binary) format
> ** (re-)display the buffer in picked-coding-system format

Yes. (Except that "-unix" and "binary" are not the same thing; but that's
irrelevant for this discussion.)

> Can this happen on Windows simply because it ran into a bare LF while reading
> the file?

Yes.

> This seems to be what happens to me on Windows. If so, then this
> looks like a bug to me

No, it's a feature.

> and I would suggest that the algorithm should be:
>
> * If EOL of new line doesn't agree with picked-coding-system,
> ** set picked-coding-system based upon system-type
> *** ie. "dos" for windows systems and "unix" for unix systems

This is wrong: the main motivation to revert to no-EOL-conversion is that
the file is not a text file at all. In that case, removing CR characters
will have a disastrous effect, and adding a CR to each LF when saving the
buffer could have a much more disastrous effect.

> I would bet that this would be right in more (but not all) cases than simply
> falling back to binary presentation.

If you know that the file you are about to visit is a DOS-style text file,
simply use "C-x C-m c undecided-dos RET" before "C-x C-f". That will do
what you want without introducing misfeatures for others.
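
The Lisp equivalent, if you want it in a script (the file name is just
a placeholder):

(let ((coding-system-for-read 'undecided-dos))
  (find-file "file.log"))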

Eli Zaretskii

Nov 19, 2000
Kai Großjohann wrote:
>
> It would be a fair amount of work to change Emacs' behavior to
> distinguish between linefeed and newline. (I think.)

It is simply unimaginable to change the visual effect of a newline
character, as it would break the entire display engine. It would also screw up
the display of Unix-style text files on DOS/Windows, which is a REALLY Bad
Thing.

I don't understand why such a simple problem requires such complicated
solutions.

Eli Zaretskii

Nov 19, 2000
David Masterson wrote:
>
> What happens when you read a file with a bare LF (ie. a UNIX file)
> using the following?
>
> C-x RET c dos RET C-x C-f
>
> After doing this and saving the file, the new file had LF translated to CRLF.

Yes.

> This may be the best approach given the current architecture.

No, this is a Bad Idea as well: it means that one cannot edit files that are
physically located on a Unix filesystem (via the network) without screwing up
those files for Unix users.

> However, going
> back to the previous message, if Emacs "panics", it should default to the
> natural coding-system for the system-type and not to a binary format (ie.
> "unix").

As I explained elsewhere, this could cause more disaster than the current
behavior.

Eli Zaretskii

Nov 19, 2000
David Masterson wrote:
>
> This is why I
> would prefer that Emacs display (on Windows) the bare LF as "^J" in the buffer
> rather than force the buffer into an unnatural coding-system

This would render Emacs virtually unable to edit files in Unix format, just
like Notepad and other ``editors''. Please realize that many people who work
in a mixed Unix/Windows environment edit Unix-style files as a matter of habit!

> Too many
> files that I don't own (ie. can't change) seem to cause the "panic".

This indicates that there's some problem with your Emacs customization.
Please describe the particulars, and maybe post one or two files which cause
this problem. I think this would allow to identify the problem(s) and suggest
solution(s) that don't interfere with what the majority of Emacs users want,
both on Unix and on DOS/Windows.

FWIW, the current behavior hasn't changed since Emacs 20.1 went into pretest,
more than 3 years ago. Evidently, most of the users don't see this as a
problem.

Eli Zaretskii

Nov 19, 2000
David Masterson wrote:
>
> I don't know how much (if any) the internals of Emacs changed when Mule and
> coding systems were added to the mix. If Emacs now represents a file as an
> array of multi-byte characters, perhaps it would be possible to have Emacs
> process the characters as it reads the file and convert system EOL into a better
> multi-byte EOL so that it could be treated distinctly from LF. I don't know how
> much coding that would involve, though...

It won't help.

The problem here is only remotely related to display. The *real* problem is
that Emacs doesn't know where each line in the file ends! Without knowing
that, Emacs will not be able to convert system EOL to the internal EOL, and
you have your original problem again, e.g. because a single CR character (a
valid EOL on Macintosh) will be converted to the multibyte EOL marker and
displayed as a newline.

As things are now, Emacs guesses what is the file's EOL format, and that
allows it to display the lines as users expect. The problem is that this
guesswork is highly suspect when an inconsistent EOL format is found in the
file. When that happens, Emacs decides to play it safe and shows the user the
original, unconverted file contents. It is then up to the human to decide
what to do with that mess.

Now, where's the flaw in this design?

Eli Zaretskii

Nov 19, 2000
David Masterson wrote:
>
> The thing is that Emacs defaults
> to displaying such binary files in "Unix" style (ie. CRs are displayed as "^M")
> whereas other DOS programs (like Notepad) display such files in "Dos" style (ie.
> bare LFs are displayed as an unprintable character).

Do you really mean that Emacs should behave like that God-awful shame called
Notepad?? Notepad also chokes on files larger than 64KB---do you want Emacs
to do the same?

> The question is not whether these are text files or binary files. The question
> is how should Emacs display them by default. Currently, the default is "Unix"
> style -- I think it should be the base style for the operating system (ie.
> "unix" on Unix systems and "dos" on Windows systems).

Please note that Emacs behaves in this case like every other piece of
text-processing software[1] does on DOS/Windows: it treats a single LF
character as a newline. For example, programs that search text files for
strings don't care whether the file is in DOS or Unix format. If you change
that, you will surprise many a programmer.


---------------
[1] I don't consider Notepad to deserve the name ``software''.

Stefan Monnier <foo@acm.com>

Nov 20, 2000
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:
> Now, where's the flaw in this design?

I think the flaw appears when a ^J appears in a Mac file (for example).
Emacs has no easy way to distinguish this ^J from an EOL in its
internal representation so the ^J will be turned into ^M when writing.
There are two solutions:
- turn the ^J into some other char (turned back into ^J upon writing)
but that means that searching for ^J won't do the right thing.
- use a special char for EOL different from any other existing char.
this would most likely imply using a 2-byte char which won't
work in unibyte buffers and will surprise users who expect to be able
to search for ^J to find a newline.


Stefan

David Masterson

Nov 20, 2000
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> David Masterson wrote:

>> The thing is that Emacs defaults to displaying such binary files in
>> "Unix" style (ie. CRs are displayed as "^M") whereas other DOS
>> programs (like Notepad) display such files in "Dos" style (ie.
>> bare LFs are displayed as an unprintable character).

> Do you really mean that Emacs should behave like that God-awful
> shame called Notepad?? Notepad also chokes on files larger than
> 64KB---do you want Emacs to do the same?

No! Come on -- my intent was clearer than that. I was only pointing out
an example where it would be nicer for Emacs to operate like a native
editor rather than like a UNIX editor that's been ported.

>> The question is not whether these are text files or binary files.
>> The question is how should Emacs display them by default.
>> Currently, the default is "Unix" style -- I think it should be the
>> base style for the operating system (ie. "unix" on Unix systems
>> and "dos" on Windows systems).

> Please note that the Emacs behaves in this case like every other
> piece of text-processing software[1] does on DOS/Windows: it treats
> a single LF character as a newline. For example, programs that
> search text files for strings don't care whether the file is in DOS
> or Unix format. If you change that, you will surprise many a
> programmer.

Hmmm. If it truly didn't care like all these other programs, then
shouldn't it be treating CR, LF, and CRLF as an EOL?

But, then again, remember that Emacs is a text-editor and not a text
processor.

--
David Masterson (dmas...@rational.com)
Rational Software (but I don't speak for them)


Jason Rumney

Nov 20, 2000
"Stefan Monnier <f...@acm.com>" <monnier+comp.emacs/news/@flint.cs.yale.edu> writes:

> >>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> > Now, where's the flaw in this design?
>
> I think the flaw appears when a ^J appears in a Mac file (for example).
> Emacs has no easy way to distinguish this ^J from an EOL in its
> internal representation so the ^J will be turned into ^M when writing.

That flaw does not appear in the design that Eli described (the
current one), as Emacs reverts to displaying the file in binary mode
when it detects an anomaly in the line-end characters. The flaw does
appear in the "solution" that David is suggesting, which is to treat
such a case as a native text file even though it quite clearly is not.

--
Jason Rumney <jas...@altavista.net>

David Masterson

Nov 20, 2000
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> Kai Großjohann wrote:
>> It would be a fair amount of work to change Emacs' behavior to
>> distinguish between linefeed and newline. (I think.)

> It is simply unimaginable to change the visual effect of a newline
> character, as it would break the entire display engine.

This may be true -- I don't know.

> It would also screw up the display of Unix-style text files on
> DOS/Windows, which is a REALLY Bad Thing.

Only if Emacs "panics" while reading the file. If a UNIX file has no
CRLFs in it, then Emacs should not panic and should display it as it
does now. If it does have CRLFs in it, then yes, it will screw up the
display of UNIX-style text files, but this may be the more natural thing
for a Windows editor (which Emacs should be when it is on Windows).

> I don't understand why such a simple problem requires such
> complicated solutions.

Nobody said that it does (yet!). I asked how Emacs handles these
issues of LF vs. EOL and the discussion has been around trying to
state what the issue is:

* Emacs displays files with bare LFs in UNIX-style even on Windows
systems whereas other Windows editors display the LFs as strange
characters in the file.

and *possible* solutions:

* When Emacs "panics", try to pick a coding system closer to what is
natural for the operating system.

Ultimate, I suppose its possible to add a function to find-file-hook
that forces a buffer to a specific coding-system based upon the path
from which the file was loaded...
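
Something like this untested sketch, say (the function name and the
directory are hypothetical; note that in 20.x the hook variable is
`find-file-hooks'):

(defun my-maybe-force-dos-eols ()
  "Hypothetical example: re-read files under one directory as DOS text."
  (when (and buffer-file-name
             (string-match "^d:/build/logs/" buffer-file-name)
             ;; don't loop: reverting runs the find-file hooks again
             (not (eq 1 (coding-system-eol-type buffer-file-coding-system))))
    (let ((coding-system-for-read 'undecided-dos))
      (revert-buffer t t))))

(add-hook 'find-file-hooks 'my-maybe-force-dos-eols)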

David Masterson

Nov 20, 2000
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> David Masterson wrote:

>> This is why I would prefer that Emacs display (on Windows) the bare
>> LF as "^J" in the buffer rather than force the buffer into an
>> unnatural coding-system

> This would render Emacs virtually unable to edit files in Unix
> format, like Notepad and other ``editors''. Please realize that
> many people who work in mixed Unix/Windows environment edit
> Unix-style files as a matter of habit!

I only meant this to occur on files that Emacs "panics" over.
Currently, it would only be the case where Emacs has chosen "dos" as
its initial guess and then "panics" over a bare LF. This occurs in
two cases:

* DOS file with embedded LF (for whatever reason).
** Choosing "dos" after the panic would be the right thing.

* UNIX file with CRLF at the end of the first few lines.
** How often does this happen?

>> Too many files that I don't own (ie. can't change) seem to cause
>> the "panic".

> This indicates that there's some problem with your Emacs
> customization. Please describe the particulars, and maybe post one
> or two files which cause this problem. I think this would allow to
> identify the problem(s) and suggest solution(s) that don't interfere
> with what the majority of Emacs users want, both on Unix and on
> DOS/Windows.

I ran into this when I ran a batch script and captured the output to a
file (as in "file.bat > file.log 2>&1"). When I looked at the file,
every line had a "^M" at the end of it and the coding-system was
"Unix". That started me asking the question of why would this be?

> FWIW, the current behavior didn't change since Emacs 20.1 went into
> pretest, more than 3 years ago. Evidently, most of the users don't
> see this as a problem.

Ummm, this argument is equivalent to "everyone else is jumping off the
bridge -- why aren't you?"

David Masterson

Nov 20, 2000
>>>>> "David" == David Masterson <dmas...@rational.com> writes:

>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

>> I don't understand why such a simple problem requires such
>> complicated solutions.

> Nobody said that it does (yet!). I asked how Emacs handles these
> issues of LF vs. EOL and the discussion has been around trying to
> state what the issue is:

> * Emacs displays files with bare LFs in UNIX-style even on Windows
> systems whereas other Windows editors display the LFs as strange
> characters in the file.

> and *possible* solutions:

> * When Emacs "panics", try to pick a coding system closer to what is
> natural for the operating system.

Whoops. Should've also mentioned:

* Emacs should display characters that don't fit the definition of EOL
for the current coding-system. For "dos", this means that bare LFs
should be displayed. For "Unix", this means that CRs should be
displayed.

> Ultimate, I suppose its possible to add a function to find-file-hook
> that forces a buffer to a specific coding-system based upon the path
> from which the file was loaded...

Ultimately, not "Ultimate"... *sigh* (the fingers need a retread)

David Masterson

Nov 20, 2000
>>>>> "Jason" == Jason Rumney <jas...@altavista.net> writes:

> "Stefan Monnier <f...@acm.com>" <monnier+comp.emacs/news/@flint.cs.yale.edu> writes:
>> >>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

>> > Now, where's the flaw in this design?

>> I think the flaw appears when a ^J appears in a Mac file (for
>> example). Emacs has no easy way to distinguish this ^J from an EOL
>> in its internal representation so the ^J will be turned into ^M
>> when writing.

Did somebody say that Mac treats CR as EOL and LF as a normal
character (I don't have a Mac)? Assuming that...

> That flaw does not appear in the design that Eli described (the
> current one), as Emacs reverts to displaying the file in binary mode
> when it detects an anomaly in the line-end characters. The flaw does
> appear in the "solution" that David is suggesting, which is to treat
> such a case as a native text file even though it quite clearly is
> not.

So, when Emacs "panics" on a Mac text file with (say) one stray LF in
the file, it displays the entire file as two lines? Wouldn't it be
more natural to display it as a Mac text file and display the LF as a
"strange" character (for example, "^J").

Eli Zaretskii

Nov 21, 2000
Jason Rumney wrote:
>
> > I think the flaw appears when a ^J appears in a Mac file (for example).
> > Emacs has no easy way to distinguish this ^J from an EOL in its
> > internal representation so the ^J will be turned into ^M when writing.
>
> That flaw does not appear in the design that Eli described (the
> current one), as Emacs reverts to displaying the file in binary mode
> when it detects an anomaly in the line-end characters.

Correct.

> The flaw does
> appear in the "solution" that David is suggesting, which is to treat
> such a case as a native text file even though it quite clearly is not.

Yes, that's exactly what I was trying to point out.

Eli Zaretskii

Nov 21, 2000
David Masterson wrote:
>
> Did somebody say that Mac treats CR as EOL and LF as a normal
> character

Yes, this is so.

> So, when Emacs "panics" on a Mac text file with (say) one stray LF in
> the file, it displays the entire file as two lines?

Yes.

> Wouldn't it be
> more natural to display it as a Mac text file and display the LF as a
> "strange" character (for example, "^J").

No, because the file might not be a text file at all.

Eli Zaretskii

Nov 21, 2000
David Masterson wrote:
>
> * DOS file with embedded LF (for whatever reason).
> ** Choosing "dos" after the panic would be the right thing.
>
> * UNIX file with CRLF at the end of the first few lines.
> ** How often does this happen?

It happens as often as a DOS file with an embedded LF.

You also forgot the third possibility: that the file is not a text file at
all. In that case, defaulting to no EOL conversion is the right thing to do.

So you have two cases out of three which support the current default.

Plus, don't forget that you can always force Emacs to do what you want, with
"C-x RET c".

> >> Too many files that I don't own (ie. can't change) seem to cause
> >> the "panic".
>
> > This indicates that there's some problem with your Emacs
> > customization. Please describe the particulars, and maybe post one
> > or two files which cause this problem. I think this would allow to
> > identify the problem(s) and suggest solution(s) that don't interfere
> > with what the majority of Emacs users want, both on Unix and on
> > DOS/Windows.
>
> I ran into this when I ran a batch script and captured the output to a
> file (as in "file.bat > file.log 2>&1").

I do this all the time, but don't have mixed DOS/Unix EOLs. Are you even
sure there is a mixed EOL format in these files? Could you please look and
tell for sure?

Assuming the mixed EOL format is indeed the reason for the ^M characters,
the next step would be to find out how this mixed format comes into
existence. Are you using a mixture of Windows and Cygwin programs, for
example?

> > FWIW, the current behavior didn't change since Emacs 20.1 went into
> > pretest, more than 3 years ago. Evidently, most of the users don't
> > see this as a problem.
>
> Ummm, this argument is equivalent to "everyone else is jumping off the
> bridge -- why aren't you?"

If everyone else is jumping off the bridge, where are the heaps of corpses?

Eli Zaretskii

Nov 21, 2000
David Masterson wrote:
>
> * Emacs displays files with bare LFs in UNIX-style even on Windows
> systems whereas other Windows editors display the LFs as strange
> characters in the file.

Many people, myself included, don't consider programs which cannot display a
Unix-style text file as readable text to be ``editors''.

We went to great lengths to cause Unix programs to grok DOS-style EOLs,
because not doing so would be a bad idea in this age of interoperability and
networked drives. Heck, even GCC and GNU Make now accept CRLF-style files
and DTRT with them. Now you suggest to go in the reverse direction on the
DOS/Windows part of the equation? That's a huge regression.

> and *possible* solutions:
>
> * When Emacs "panics", try to pick a coding system closer to what is
> natural for the operating system.

It does that already: the natural coding system for a strange file is
no-conversion (as far as the EOL format is concerned).

> Ultimate, I suppose its possible to add a function to find-file-hook
> that forces a buffer to a specific coding-system based upon the path
> from which the file was loaded...

Yes, that, too, is a possibility, if you need it.

Eli Zaretskii

Nov 21, 2000
David Masterson wrote:
>
> > Yes, binary is the ONLY safe option in these cases.
>
> No, it's the only safe option because of the way Emacs works internally.

No, Emacs defaults to binary because whoever designed that thought that it
was the right thing to do. Personally, I agree with that design decision: a
file that has a strange EOL format renders the EOL detection suspect.
Therefore, Emacs should stop trying to be smart and let the human decide.

Eli Zaretskii

Nov 21, 2000
David Masterson wrote:
>
> I was only pointing out
> an example where it would be nicer for Emacs to operate like a native
> editor rather than like a UNIX editor that's been ported.

That would be a step backwards: we already had an Emacs that on DOS and Windows
always converted files to CRLF format (in version 19). We've found that the
current operation is a better default.

If the default is not right for your case, you can always customize Emacs as
you see fit.

However, I still think that a much better way to solve your problem would be
to find out what is causing the problems in the first place, and eliminate
that reason.

David Masterson

Nov 21, 2000
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> David Masterson wrote:

>> Wouldn't it be more natural to display it as a Mac text file and
>> display the LF as a "strange" character (for example, "^J").

> No, because the file might not be a text file at all.

Hmmm. Is this true?

undecided-dos
* convert CRLF to LF(1) on reading
* convert LF(1) to CRLF on writing

undecided-mac
* convert CR to LF(1) on reading
* convert LF(1) to CR on writing

undecided-unix
* convert LF to LF(1) on reading (no-op)
* convert LF(1) to LF on writing (no-op)

This is because Emacs is trying to internally represent EOL as LF
(and, therefore, it is not able to display an LF that is
distinguishable from EOL). Therefore, in the dos or mac case, if
Emacs didn't panic, bare LFs would be (silently?) changed to the
system EOL on write. To prevent that, Emacs has no choice other than
to fall back to a "binary" mode (in this case, undecided-unix).

(1) If this were changed to "special EOL char" *AND* Emacs could
display LF as a separate char, would there be any reason for Emacs
to "panic" when it saw a bare LF?

David Masterson

Nov 21, 2000
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> David Masterson wrote:
>> I ran into this when I ran a batch script and captured the output to a
>> file (as in "file.bat > file.log 2>&1").

> I do this all the time, but don't have mixed DOS/Unix EOLs. Are you even
> sure there is a mixed EOL format in these files? Could you please look and
> tell for sure?

I'm working on that, but I haven't found an easy way of doing it yet
(od -cb output is tough to search).

> Assuming the mixed EOL format is indeed the reason for the ^M characters,
> the next step would be to find out how does this mixed format come into
> existence. Are you using a mixture of Windows and Cygwin programs, for
> example?

Maybe a mixture of Perl, MKS, and Windows... :-\

>> > FWIW, the current behavior didn't change since Emacs 20.1 went into
>> > pretest, more than 3 years ago. Evidently, most of the users don't
>> > see this as a problem.
>>
>> Ummm, this argument is equivalent to "everyone else is jumping off the
>> bridge -- why aren't you?"

> If everyone else is jumping off the bridge, where are the heaps of
> corpses?

Below the water line where they're not visible. (I couldn't resist ;-)

Truthfully, though, in my years, I have seen *LOTS* of users who
silently suffered (a relative term) through using a tool with quirks
until the quirk just became a natural part of using the tool to them
(hmmm, maybe that's how alt.religions get started... :-)

David Masterson

Nov 21, 2000
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> David Masterson wrote:

>> * Emacs displays files with bare LFs in UNIX-style even on Windows
>> systems whereas other Windows editors display the LFs as strange
>> characters in the file.

> Many people, myself included, don't consider programs which cannot
> display a Unix-style text file as readable text to be ``editors''.

Again, you're misinterpreting me. I'm not suggesting that *ALL*
files on Windows should be displayed in "undecided-dos" -- just the
ones that Emacs "panics" over. By implication, I guess I'm suggesting
that Mac files that Emacs "panics" over should (by default) be
displayed in "undecided-mac". However, that is predicated on two
things:

* Emacs needs to be able to display bare LFs as a separate character
from the EOL.
* Choosing these defaults should not cause the file to be changed when
it is saved.

MShe...@compuserve.com

Nov 21, 2000, 9:06:36 PM
On Tue, 21 Nov 2000 22:08:15 +0200, Eli Zaretskii <el...@is.elta.co.il>
wrote:

[snippage]


>I do this all the time, but don't have mixed DOS/Unix EOLs. Are you even
>sure there is a mixed EOL format in these files? Could you please look and
>tell for sure?

Under NT, it depends on what program you redirect. If I redirect
ipconfig, I find each line terminated incorrectly with 0x0D 0x0D 0x0A.
If I redirect dir, I find each line terminated correctly with 0x0D
0x0A.

I can see this with od and with M-x hexl-find-file.

>Assuming the mixed EOL format is indeed the reason for the ^M characters,
>the next step would be to find out how does this mixed format come into
>existence. Are you using a mixture of Windows and Cygwin programs, for
>example?

My ipconfig is identified internally as being from Microsoft. It
ships with NT. Most of the "net" programs terminate lines
incorrectly--ipconfig, ping, tracert. netstat works right.

Running Cygwin programs inside Cygwin's bash shell and redirecting the
output to a file shows each line terminated with 0x0D 0x0A. Running
Cygwin programs through NT's shell and redirecting the output to a
file shows each line terminated with 0x0A.

--
Mike Sherrill
Information Management Systems

Eli Zaretskii

Nov 21, 2000, 10:50:51 PM
David Masterson wrote:
>
> >> Wouldn't it be more natural to display it as a Mac text file and
> >> display the LF as a "strange" character (for example, "^J").
>
> > No, because the file might not be a text file at all.
>
> This is because Emacs is trying to internally represent EOL as LF
> (and, therefore, it is not able to display an LF that is
> distinguishable from EOL). Therefore, in the dos or mac case, if
> Emacs didn't panic, bare LFs would be (silently?) changed to the
> system EOL on write. To prevent that, Emacs has no choice other than
> to fallback to a "binary" mode (in this case, undecided-unix).

We are going in circles. I have already written several times in this
thread that Emacs falls back to binary because of the possibility that the
file is not a text file. It does so because that is the right thing to do
with binary files, not because of some internal representation problem.

> (1) If this were changed to "special EOL char" *AND* Emacs could
> display LF as a separate char, would there be any reason for Emacs
> to "panic" when it saw a bare LF?

I already wrote that, in order to convert an EOL to anything special, Emacs
needs to know how the file represents an EOL externally. The case
around which we are going in circles is a case where the external EOL
representation is inconsistent, as far as Emacs is concerned. So it would
not know what characters to convert to the internal EOL representation.

In other words, the internal representation of an EOL has no real relevance
to the actual problem.

Eli Zaretskii

Nov 21, 2000, 10:53:27 PM
David Masterson wrote:
>
> Truthfully, though, in my years, I have seen *LOTS* of users who
> silently suffered (a relative term) through using a tool with quirks
> until the quirk just became a natural part of using the tool to them

Emacs users are not known to be silent sufferers.

I don't recall any significant complaints about how files with mixed
Unix/DOS EOLs are handled.

Eli Zaretskii

Nov 21, 2000, 10:56:43 PM
MShe...@compuserve.com wrote:
>
> Running Cygwin programs inside Cygwin's bash shell and redirecting the
> output to a file shows each line terminated with 0x0D 0x0A.

This depends on how you mounted the volume where the redirected file
resides. You can do it either in binary or in text mode.

> Running
> Cygwin programs through NT's shell and redirecting the output to a
> file shows each line terminated with 0x0A.

??? The same program redirected to a file on the same disk drive? Are you
sure this happens with all Cygwin programs? What version of Cygwin do you
have?

Eli Zaretskii

Nov 22, 2000, 12:39:59 AM
David Masterson wrote:
>
> > Are you even
> > sure there is a mixed EOL format in these files? Could you please look and
> > tell for sure?
>
> I'm working on that, but I haven't found an easy way of doing it yet
> (od -cb output is tough to search).

Why do you need to OD (pun intended)? What's wrong with Emacs's "M-x
hexl-find-file", or even "M-x find-file-literally"?

Kai Großjohann

unread,
Nov 22, 2000, 3:00:00 AM11/22/00
to
On Tue, 21 Nov 2000, David Masterson wrote:

> Again, you're misinterpreting me. I'm not suggesting that *ALL*
> files on Windows should be displayed in "undecided-dos" -- just the
> ones that Emacs "panics" over.

But Eli has a point: what if the file isn't really a text file with
some b0rken line endings? What if the file is really a binary file?

And if Emacs finds strange line endings in the file, it shows you ^M
characters explicitly, so you can just replace them with nothing, and
soon you have a Unix text file. Then you can easily convert this to a
DOS text file and you can be a happy camper. (Similarly for Mac
files, but there you'll have to replace ^J with nothing, I guess.)
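
Something along these lines would do the replacement in one go (an
untested sketch; the function name is my invention):

  (defun strip-stray-crs ()
    "Delete every carriage return in the current buffer.
  Afterwards the buffer contains Unix-style line endings only."
    (interactive)
    (save-excursion
      (goto-char (point-min))
      (while (search-forward "\r" nil t)
        (replace-match ""))))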

There is a trade-off to think about: if Emacs tries to guess what the
file really is, it might guess wrong and thereby hose your file. If
Emacs assumes binary for all files which it doesn't grok unambiguously,
you immediately see what's wrong.

What do you think?

kai
--
The arms should be held in a natural and unaffected way and never
be conspicuous. -- Revised Technique of Latin American Dancing

Jason Rumney

unread,
Nov 22, 2000, 3:00:00 AM11/22/00
to
Eli Zaretskii <el...@is.elta.co.il> writes:

> David Masterson wrote:
> >
> > * DOS file with embedded LF (for whatever reason).
> > ** Choosing "dos" after the panic would be the right thing.
> >
> > * UNIX file with CRLF at the end of the first few lines.
> > ** How often does this happen?
>
> It happens as often as a DOS file with an embedded LF.

In my experience, more often. Internet Explorer produces files like
this when you choose to save html files - it seems to add its own
lines to the HEAD in DOS format, then output the rest of the file as
is.

What we are talking about here is BROKEN files. Whatever you do with
them is going to be broken in a lot of cases. The solution is not to
add a lot more complexity to Emacs that effectively changes the
current broken behaviour to a different broken behaviour. The solution
is to fix the files.

--
Jason Rumney <jas...@altavista.net>

MShe...@compuserve.com

unread,
Nov 24, 2000, 5:06:24 PM11/24/00
to
On Wed, 22 Nov 2000 05:56:43 +0200, Eli Zaretskii <el...@is.elta.co.il>
wrote:

>??? The same program redirected to a file on the same disk drive?

Yep.

>Are you sure this happens with all Cygwin programs?

It did with the half-dozen I tested. That behavior is what I
expected. Should I have expected something else?

Eli Zaretskii

unread,
Nov 24, 2000, 5:50:07 PM11/24/00
to
MShe...@compuserve.com wrote:
>
> On Wed, 22 Nov 2000 05:56:43 +0200, Eli Zaretskii <el...@is.elta.co.il>
> wrote:
>
> >??? The same program redirected to a file on the same disk drive?
>
> Yep.
>
> >Are you sure this happens with all Cygwin programs?
>
> It did with the half-dozen I tested. That behavior is what I
> expected. Should I have expected something else?

Well, _I_ would expect something else. Let me explain why.

The text vs binary file I/O is not an OS feature, it is entirely in the
``mind'' of the program which does that I/O. Normally, a low-level library
function converts \n to a CR-LF (0x0D 0x0A) pair, just before handing the
data to an appropriate system call, if it's a text I/O, and leaves \n alone
if it's a binary I/O.

Therefore, one program (the shell, in this case) cannot possibly affect how
another program (the Cygwin programs you run from the shell) does its I/O.
The shell performs the redirection, but the redirection cannot impose a
binary or text type on the child program: it's not an OS feature, so the
filesystem layer in the OS doesn't know anything about it and cannot pass
it to the child program.

The only possible way I can understand your description is that Bash sets
some environment variable which causes every Cygwin program to use text-mode
I/O. CMD.EXE probably doesn't set that variable, so you get binary I/O
there. Or the other way around.

If that is the case, I think you should change your Cygwin setup so that
programs use the same I/O mode both inside and outside Bash. I think it's
very confusing to have different behavior that depends on the shell; I'm
sure it will bite you some day, especially if you use Emacs or XEmacs (which
sometimes run programs through a shell and sometimes directly).

MShe...@compuserve.com

unread,
Nov 25, 2000, 1:13:19 AM11/25/00
to
On Sat, 25 Nov 2000 00:50:07 +0200, Eli Zaretskii <el...@is.elta.co.il>
wrote:

>The text vs binary file I/O is not an OS feature, it is entirely in the
>``mind'' of the program which does that I/O. Normally, a low-level library
>function converts \n to a CR-LF (0x0D 0x0A) pair, just before handing the
>data to an appropriate system call, if it's a text I/O, and leaves \n alone
>if it's a binary I/O.

I agree. But running Cygwin isn't "normal", right? In the "mind" of
the Cygwin programs, well, they think they're running under Unix. As
I understand it, the Cygwin DLL does all the mapping from Unix-isms to
Win-isms, including EOL mapping. The behavior I see supports that
mental model.

And before we go any further, are you sure we're on the same page?
The broken behavior I first cited was in reading a file produced by
redirecting a *Microsoft* program to a file using Microsoft's shell.

Eli Zaretskii

unread,
Nov 25, 2000, 3:00:00 AM11/25/00
to
MShe...@compuserve.com wrote:
>
> In the "mind" of
> the Cygwin programs, well, they think they're running under Unix. As
> I understand it, the Cygwin DLL does all the mapping from Unix-isms to
> Win-isms, including EOL mapping. The behavior I see supports that
> mental model.

The aspect I was relating to is why the same Cygwin program when run under
Bash behaves differently than when run under CMD.EXE. This is the same
Cygwin program in both cases, using the same Cygwin DLL that should be doing
the same EOL mapping, right?

> And before we go any further, are you sure we're on the same page?
> The broken behavior I first cited was in reading a file produced by
> redirecting a *Microsoft* program to a file using Microsoft's shell.

If some of your programs behave inconsistently as far as the EOL format is
concerned, you can easily get text files with inconsistent, half DOS half
Unix EOLs, which will confuse Emacs when it tries to guess the EOL format.
This is what this thread is about, as far as I understand.

David Masterson

unread,
Nov 29, 2000, 3:00:00 AM11/29/00
to
The news server here is quirky at best, so this may be old
news... *sigh*

>>>>> "Kai" == Kai Großjohann <Kai.Gro...@CS.Uni-Dortmund.DE> writes:

> On Tue, 21 Nov 2000, David Masterson wrote:

>> Again, you're misinterpreting me. I'm not suggesting that *ALL*
>> files on Windows should be displayed in "undecided-dos" -- just the
>> ones that Emacs "panics" over.

> But Eli has a point: what if the file isn't really a text file with
> some broken line endings? What if the file is really a binary file?

Then Emacs would display it wrong! So what? The same thing happens
now in that a "near" text file is displayed wrong. I'm not saying
that this works for *all* cases, just in most cases that Emacs
"panics" over (and don't forget my assumption that Emacs should have a
way of displaying bare LFs when they are not the EOL).

Regardless of what the EOL character is, Emacs displays all files as
"text" files (ie. what you see on the screen is ASCII/ISO characters
and not binary bits [unless you're in hexl-mode]). Therefore, the
issue of binary vs. text is really a non sequitur. The only question
is how to handle EOL characters. What I'm suggesting is, *WHEN IN
DOUBT* attempt to handle them in the way that is natural for the
operating system (see below for example).

> And if Emacs finds strange line endings in the file, it shows you ^M
> characters explicitly, so you can just replace them with nothing, and
> soon you have a Unix text file. Then you can easily convert this to a
> DOS text file and you can be a happy camper. (Similarly for Mac
> files, but there you'll have to replace ^J with nothing, I guess.)

But this argument has a few problems:

* it scares novice users ("why do I have to convert my file?")
* it is a workaround to something the editor *should* already do
* I may not own the file and, so, cannot do a permanent conversion
(ie. I'd have to convert it each time I read it).

> We have to think about stuff: if Emacs tries to guess what the file
> really is, it might be wrong and thereby hose your file. If Emacs
> assumes binary for all files which it doesn't grok unambiguously, you
> immediately see what's wrong.

Why would Emacs *change* a file on reading it in? Regardless of the
coding system it chooses, I don't think the file should be changed.

I just ran an experiment.

I have a log file from a build I ran last night on my Windows machine.
When I find-file it, it loads as "undecided-unix" and, so, displays
"^M" at the end of all the lines. I haven't been able to find why it
does this (funny control character combinations are hard to search for
with grep, etc.), but its a big file created with a combination of
Windows and UNIX-like (perl, MKS, etc.) programs, so I have to assume
that somewhere in this file is one or more bare LFs.

My experiment was to use universal-coding-system-argument to force
find-file to read the file as "undecided-dos". The file was then
displayed as I expected with no "^M" chars *AND* the file was
apparently not changed by this process as the "modified" indicator on
the modeline was off ("--" not "**"). I then modified the file myself
(added a couple of new lines) and saved it to a new file. When I did
a find-file on this new file, it was also displayed in
"undecided-unix" with all the "^M" characters.

If I assume that this process is essentially the same one that Emacs
does (where I chose correctly and it didn't), this tells me that
choosing even the wrong coding system on reading a file into an Emacs
buffer will not change the file -- it only impacts how the file is
displayed in the buffer. Therefore, esthetically, I think it's more
correct to fall back to the natural coding system for the O/S when
Emacs "panics" on the EOL representation *AND*, if the chosen EOL is
CRLF, handle bare CRs or bare LFs as displayable characters (like any
other control character).

Eli Zaretskii

unread,
Nov 29, 2000, 3:00:00 AM11/29/00
to
David Masterson wrote:
>
> > But Eli has a point: what if the file isn't really a text file with
> > some broken line endings? What if the file is really a binary file?
>
> Then Emacs would display it wrong! So what? The same thing happens
> now in that a "near" text file is displayed wrong.

No, currently such files are not displayed wrong. It's just that you don't
like to see the ^M characters.

> * it scares novice users ("why do I have to convert my file?")

You don't need to convert your file, unless you don't like seeing ^M
characters.

Btw, I'm sure novices will likewise be annoyed and confused by ^J
characters.

> * it is a workaround to something the editor *should* already do

Not as far as the majority of users want it. If you want otherwise, you can
customize Emacs to do that.

> * I may not own the file and, so, cannot do a permanent conversion
> (ie. I'd have to convert it each time I read it).

No, you just need to use "C-x RET c".

> I have a log file from a build I ran last night on my Windows machine.
> When I find-file it, it loads as "undecided-unix" and, so, displays
> "^M" at the end of all the lines. I haven't been able to find why it
> does this (funny control character combinations are hard to search for
> with grep, etc.)

Too bad that you don't want to pursue this further. IMHO you should find
the reason why you get mixed DOS/Unix EOLs, and then you will be able to
forget about this problem forever.

You don't need to grep to look for these characters, just use hexl-find-file
or find-file-literally inside Emacs, and you will see all the characters
verbatim.

> Therefore, esthetically, I think it's more
> correct to fall back to the natural coding system for the O/S when
> Emacs "panics" on the EOL representation

And what if the file is on a networked drive which is mounted on a Unix
machine? What EOL format is then ``natural''?

David Masterson

unread,
Nov 29, 2000, 3:00:00 AM11/29/00
to
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> David Masterson wrote:

>> > But Eli has a point: what if the file isn't really a text file with
>> > some broken line endings? What if the file is really a binary file?

>> Then Emacs would display it wrong! So what? The same thing happens
>> now in that a "near" text file is displayed wrong.

> No, currently such files are not displayed wrong. It's just that
> you don't like to see the ^M characters.

Actually, ^M (or ^J) characters are fine -- when they are not part of
the EOL!

>> * it scares novice users ("why do I have to convert my file?")

> You don't need to convert your file, unless you don't like seeing ^M
> characters.

I did say "novice" here. Most novices would probably wonder why (say)
Notepad displays their file with *almost* no funny characters while
Emacs displays the same file with *lots* of funny characters. At
least Notepad focuses their attention on exactly where the problem
might be in the file (if it's really a problem).

Of course, Notepad doesn't handle Unix files and, so, would be equally
broken when displaying a Unix file with a stray CRLF in it.

> Btw, I'm sure novices will likewise be annoyed and confused by ^J
> characters.

For either ^M or ^J characters, displaying them on every line (because
Emacs chose an EOL that was not the intended EOL of the file) would be
annoying and confusing. H***, it confused me at first (which led to
my starting this thread) and, with 20 years of off and on usage of
Emacs, I'm not a novice anymore (I'm not an expert either).

>> * it is a workaround to something the editor *should* already do

> Not as far as the majority of users want it. If you want otherwise,
> you can customize Emacs to do that.

Mostly agreed in that you can force certain types of files or files in
certain locations (like a UNIX share) to be interpreted under a
particular coding system. However, I don't think there is a general
solution unless you want to write an elaborate Elisp program to
post-read check a file and guess what its EOL should be (sounds like
a performance problem).
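
Something crude like the following is what I have in mind (an
untested sketch; the function name is made up, and I haven't
measured the cost on big files):

  ;; Warn if a buffer decoded with Unix EOLs still contains CRLF
  ;; pairs -- a hint that the file has mixed line endings.
  (defun check-mixed-eols ()
    (save-excursion
      (goto-char (point-min))
      (when (search-forward "\r\n" nil t)
        (message "%s may be a DOS file decoded as Unix" (buffer-name)))))

  (add-hook 'find-file-hooks 'check-mixed-eols)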

>> * I may not own the file and, so, cannot do a permanent conversion
>> (ie. I'd have to convert it each time I read it).

> No, you just need to use "C-x RET c".

Exactly what I meant -- you'd have to do this each and every time you
read the file.

>> I have a log file from a build I ran last night on my Windows machine.
>> When I find-file it, it loads as "undecided-unix" and, so, displays
>> "^M" at the end of all the lines. I haven't been able to find why it
>> does this (funny control character combinations are hard to search for
>> with grep, etc.)

> Too bad that you don't want to pursue this further. IMHO you should find
> the reason why do you get mixed DOS/Unix EOLs, and then you will be able to
> forget about this problem forever.

> You don't need to grep to look for these characters, just use hexl-find-file
> or find-file-literally inside Emacs, and you will see all the characters
> verbatim.

Bingo! That's what I needed (plus a better regexp to search for the
problem). In my case, the problem comes from MS-Robocopy. As
Robocopy does its work, it outputs a series of messages telling what
its percentage of completion is for each (large) file. These messages
have a ^M between them so that they overwrite each other (a standard
technique for such messages -- I could see a lot of programs doing
this). When I captured this to a log file, Emacs saw the bare CRs and
fell back to "undecided-unix" (saying that it fell back to "binary"
doesn't seem right).

BTW, viewing this file in "undecided-unix" made this difficult to spot
when glancing through the file. Viewing it in "undecided-dos" made it
very obvious as to what was going on (if I had just looked far enough
into the file... :-\ )

p.s. I was previously backward about this issue in this case -- it
wasn't a bare LF, it was a bare CR.

> And what if the file is on a networked drive which is mounted on a
> Unix machine? What EOL format is then ``natural''?

Isn't this more likely to be the "special case" that a novice user
would use auto-coding-alist or file-coding-system-alist for? (The
expert user who does UNIX development on a Windows system probably
knows the EOL differences quite well and has probably already set up
these variables.)
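
For completeness, the setup I mean is a one-liner (the share path is
a made-up example, of course):

  ;; Decode everything under the Unix share with Unix EOLs;
  ;; "^//unixhost/" is a placeholder for the real mount point.
  (modify-coding-system-alist 'file "^//unixhost/" 'undecided-unix)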

John Clonts

unread,
Nov 29, 2000, 3:00:00 AM11/29/00
to
Eli Zaretskii wrote:
>
> David Masterson wrote:
[snip]

> > * I may not own the file and, so, cannot do a permanent conversion
> > (ie. I'd have to convert it each time I read it).
>
> No, you just need to use "C-x RET c".
>

Pardon my dumb-newbie question, but what is that supposed to do? It
doesn't seem to be defined on my xemacs 21.1 on Linux....

Cheers,
John

Jason Rumney

unread,
Nov 29, 2000, 3:00:00 AM11/29/00
to
David Masterson <dmas...@rational.com> writes:

> > And if Emacs finds strange line endings in the file, it shows you ^M
> > characters explicitly, so you can just replace them with nothing, and
> > soon you have a Unix text file. Then you can easily convert this to a
> > DOS text file and you can be a happy camper. (Similarly for Mac
> > files, but there you'll have to replace ^J with nothing, I guess.)
>
> But this argument has a few problems:
>

> * it scares novice users ("why do I have to convert my file?")

I disagree that this is a bad thing. If we hide the brokenness of the
files from users, they are likely to run into problems with other
tools that use these files later.

> * it is a workaround to something the editor *should* already do

I am not totally clear on what this something is. As far as I can see,
you want Emacs to display broken files in a different broken way than
the broken way it currently displays them. They are still broken
files, and they will still look broken to the user. Some broken files
(the ones that you seem to strike all the time, but my experience is
different) may look less broken than they did before. IMHO this is a
bad thing, as it means users may not notice that the files are broken.


--
Jason Rumney <jas...@altavista.net>

David Masterson

unread,
Nov 29, 2000, 3:00:00 AM11/29/00
to
>>>>> "Jason" == Jason Rumney <jas...@altavista.net> writes:

> David Masterson <dmas...@rational.com> writes:

>> * it is a workaround to something the editor *should* already do

> I am not totally clear on what this something is. As far as I can see,
> you want Emacs to display broken files in a different broken way than
> the broken way it currently displays them. They are still broken
> files, and they will still look broken to the user. Some broken files
> (the ones that you seem to strike all the time, but my experience is
> different) may look less broken than they did before. IMHO this is a
> bad thing, as it means users may not notice that the files are broken.

IMHO, your definition of "broken" is colored by Emacs/Unix thinking.

Your claims are:

* A text file is *only* a file that has consistent EOLs.

* A file that has something that looks like an EOL, but doesn't agree
with the guessed EOL *must* be broken.

* A broken file *must* be displayed with an "undecided-unix" coding
system (the key problem).

* The fact that Emacs now displays *lots* of control characters (some
of which are meaningful and some of which aren't), alerting the user
to a problem, is considered a Good Thing(tm).

Consider the following:

* Create a file with NT Emacs with a number of lines of text (by
default, it will be created in "undecided-dos"). Save the file.
* Kill the buffer and reload the file into NT Emacs. Note that
there are no ^Ms in the buffer.
* Pick a spot in the buffer and add a CR (C-q C-m). Note that there are
still no ^Ms anywhere else in the buffer. Resave the file.
* Kill the buffer and reload the file. Note that now all lines have a
^M at the end and the buffer is in the "undecided-unix" coding
system.
* Load the file into Notepad and compare the "look" of the two.
Notepad (like all other Windows programs) always displays the file in
(essentially) "undecided-dos" and displays only the bare CR as a
funny character.

Which program has the more "correct" display? (Be careful that your
answer isn't colored by Emacs/Unix thinking.)

Kai Großjohann

unread,
Nov 29, 2000, 9:12:00 PM11/29/00
to
I think you can re-search-forward for "\r." to find ^M characters
which are not at EOL. And you can re-search-forward for "[^\r]$" to
find lines ending in LF rather than CRLF.

I think that a function along the lines of `I thought this was a DOS
text file, so why does Emacs not show it as a DOS text file?' would be
a useful thing to have.

Maybe like this: M-x explain-encoding RET prompts the user with `Which
encoding did you expect?' and then Emacs tells the user where in the
file the encoding didn't fit.
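
A crude first cut might look like this (untested, the name merely
matches the example above, and it only handles the DOS case):

  (defun explain-dos-encoding ()
    "Report the first place the buffer deviates from CRLF line endings.
  Assumes the file was read without EOL conversion, e.g. undecided-unix."
    (interactive)
    (save-excursion
      (goto-char (point-min))
      (cond ((re-search-forward "[^\r]\n" nil t)
             (message "LF without CR near position %d" (point)))
            ((re-search-forward "\r[^\n]" nil t)
             (message "CR without LF near position %d" (point)))
            (t (message "Line endings look like consistent CRLF")))))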

Stephen J. Turnbull

unread,
Nov 29, 2000, 9:57:57 PM11/29/00
to
>>>>> "John" == John Clonts <jcl...@mastnet.net> writes:

John> Eli Zaretskii wrote:
>> David Masterson wrote:

John> [snip]


>> > * I may not own the file and, so, cannot do a permanent
>> conversion > (ie. I'd have to convert it each time I read it).
>>
>> No, you just need to use "C-x RET c".
>>

John> Pardon my dumb-newbie question, but what is that supposed to

Changes the coding system for the next operation.

John> do? It doesn't seem to be defined on my xemacs 21.1 on
John> Linux....

You probably don't have a Mule XEmacs, then. In XEmacs, it's defined
in mule/mule-cmds.el.

--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."

Stephen J. Turnbull

unread,
Nov 29, 2000, 10:44:55 PM11/29/00
to
>>>>> "David" == David Masterson <dmas...@rational.com> writes:

>>>>> "Kai" == Kai Großjohann <Kai.Gro...@CS.Uni-Dortmund.DE> writes:

>> On Tue, 21 Nov 2000, David Masterson wrote:

David> What I'm suggesting is, *WHEN IN DOUBT* attempt to handle
David> them in the method that is natural for the operating system
David> (see below for example).

I don't see a problem with this, as long as Emacs doesn't attempt to
"correct" the "broken" line breaks, and does mark them in an
"in-your-face" manner.

All it requires is some way to indicate the "broken" line breaks.
Treating a "broken" ASCII CR specially is necessary; treating it as a
carriage return will not produce visually obvious brokenness (assuming
you aren't using a paper teletype). Treating a "broken" ASCII LF as a
linefeed (without carriage return) is pretty stunningly obvious.
Alternatively you could display it (like CR) with the octal or caret
notations.

This would require more than customizing the display table, though,
since on DOS a newline is a character pair, either of which is "broken"
alone, while (unfortunately) the buffer representation is a single
character, which happens to be one of the "broken" "lonely" characters.

>> And if Emacs finds strange line endings in the file, it shows
>> you ^M characters explicitly, so you can just replace them with
>> nothing, and soon you have a Unix text file. Then you can
>> easily convert this to a DOS text file and you can be a happy
>> camper. (Similarly for Mac files, but there you'll have to
>> replace ^J with nothing, I guess.)

David> But this argument has a few problems:

David> * it scares novice users ("why do I have to convert my
David> file?")

Well, they should be scared. Either they're editing a binary file,
which novices probably should not do if they can't hack newline
conventions, or they're editing a corrupted file, also, er,
"inadvisable".

David> * it is a workaround to something the editor *should*
David> already do

I agree; it may not be entirely trivial, that's all. It may require
changing the internal buffer representation. Alternatively, a
"binary-dos" coding system could be created which puts a text property
'indecently-exposed on naked LFs at input, and does not convert them
to CRLF on output. I don't think this would be very easy to do
efficiently outside of Mule (ie, in no-mule XEmacs or GNU Emacs
-buffers-are-unibyte or whatever the idiom is). In Mule, it's easy:
just create a private character MULE_INTERNAL_NAKED_LF different from
all other characters and translate naked LFs to that on input, and
vice-versa on output where appropriate. But Mule carries its own
inefficiencies with it.

David> * I may not own the file and, so, cannot do a permanent
David> conversion (ie. I'd have to convert it each time I read
David> it).

`file-coding-system-alist' is your friend. This also handles volatile
files like log files.

>> We have to think about stuff: if Emacs tries to guess what the
>> file really is, it might be wrong and thereby hose your file.
>> If Emacs assumes binary for all files which it doesn't grok
>> unambiguously, you immediately see what's wrong.

David> Why would Emacs *change* a file on reading it in?
David> Regardless of the coding system it chooses, I don't think
David> the file should be changed.

I think you're wrong. _Emacs changes every DOS file on input, by
translating CRLF to LF._ If you read in a file which is mostly DOS,
but contains a few lonely ASCII LFs, and force it to be recognized as
DOS, those LFs will appear in the buffer as newlines. When writing
the file out, there is no mechanism (in XEmacs and older GNU Emacs +
Mule, dunno GNU Emacs NT 20.7) for remembering "lonely LFs" as opposed
to "converted from CRLF LFs", and _all_ will be converted to CRLF on
output. Double plus ungood if the file happens to be a binary file.
This does not happen with undecided-unix because undecided-unix is
identical to the buffer representation.

Your experiment would seem to say this is not so, but at least in
XEmacs the presence of "unusual" control characters, or extremely long
lines, will cause the auto-recognizer to declare the file as binary
(ie, undecided-unix). "Unusual" control characters of course includes
CR and LF when not used according to the native convention, but it
also includes all the non-whitespace control characters, including all
of the C1 set (IIRC). If so, and your file includes any Windows 125x
strings, all bets are off. There are lots of ways to get the behavior
you describe; it's not clear the scenario you are advocating (naked
LFs) is correct.

Please run M-: (re-search-forward "[^\015]$") RET on that log file
from an unholy pottage of utilities to confirm the existence of naked
LFs. Also, a binary diff on the before and after files.

Eli Zaretskii

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
John Clonts wrote:
>
> > No, you just need to use "C-x RET c".
> >
>
> Pardon my dumb-newbie question, but what is that supposed to do? It
> doesn't seem to be defined on my xemacs 21.1 on Linux....

The original poster has NTEmacs, not XEmacs. In NTEmacs, C-x RET c
determines the coding system used by the following command.

Eli Zaretskii

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
David Masterson wrote:
>
> >> * it is a workaround to something the editor *should* already do
>
> > Not as far as the majority of users want it. If you want otherwise,
> > you can customize Emacs to do that.
>
> Mostly agreed in that you can force certain types of files or files in
> certain locations (like a UNIX share) to be interpreted under a
> particular coding system. However, I don't think there is a general
> solution

The general solution is `file-coding-system-alist', as Stephen has pointed
out.

> Bingo! That's what I needed (plus a better regexp to search for the
> problem). In my case, the problem comes from MS-Robocopy. As
> Robocopy does its work, it outputs a series of messages telling what
> its percentage of completion is for each (large) file. These messages
> have a ^M between them so that they overwrite each other (a standard
> technique for such messages -- I could see a lot of programs doing
> this).

That program has a bug, IMHO: it should refrain from using a lone ^M when
its standard output is redirected to a file.

> When I captured this to a log file, Emacs saw the bare CRs and
> fell back to "undecided-unix" (saying that it fell back to "binary"
> doesn't seem right).

undecided-unix _is_ binary as far as the EOLs are considered: it doesn't
change the EOL characters in any way.

> > And what if the file is on a networked drive which is mounted on a
> > Unix machine? What EOL format is then ``natural''?
>
> Isn't this more likely to be the "special case" that a novice user
> would use auto-coding-alist or file-coding-system-alist for?

Not with today's proliferation of LANs and networking, IMHO. For example,
all PCs on my daytime job mount a few networked drives at startup, whether
the user wants it or not (well, you can always pull the Ethernet line from
its socket...).

Eli Zaretskii

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
David Masterson wrote:
>
> * Pick a spot in the buffer and add a CR (C-q C-m). Note that there are
> still no ^Ms anywhere else in the buffer. Resave the file.
> * Kill the buffer and reload the file. Note that now all lines have a
> ^M at the end and the buffer is in the "undecided-unix" coding
> system.

And I agree with Jason that this makes the file ``broken''. If you submit
such a file to programs that process text, then depending on how
sophisticated those programs are regarding I/O, you will see an entire
array of possible behavior types. Most programs will remove the single ^M
character entirely, as if it were not there.

Emacs does a better job by giving users a visual cue that something is
dead wrong with the file. It cannot always say what is wrong, exactly,
but you cannot expect that from a program.

I don't really understand why we have to go through the same arguments,
time and again, when you have simple customization facilities, mentioned
in this thread more than once, to make Emacs behave the way you want.

(Btw, Jason is one of the maintainers of NTEmacs, so he can hardly be
``Unix-biased''...)

Eli Zaretskii

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
Kai Großjohann wrote:
>
> I think that a function along the lines of: I thought this was a DOS
> text file, so why does Emacs not show it as a DOS text file? would be
> a useful thing to have.

Not very easy without rewriting core code that decodes the file. When
Emacs converts the EOLs, the information about the original EOLs is
completely forgotten. To implement this function, you'd need to read the
original file again, with no-conversion, and then do everything on your
own, including large parts of decoding.

Per Abrahamsen

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
David Masterson <dmas...@rational.com> writes:

> Which program has the more "correct" display?

Neither. Both. Emacs behavior is correct for a "multi-format"
editor, Notepad behavior is correct for a "native-only" editor.

Eli Zaretskii

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
"Stephen J. Turnbull" wrote:
>
> >>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:
>
> Eli> That program has a bug, IMHO: it should refrain from using a
> Eli> lone ^M when its standard output is redirected to a file.
>
> Probably.
>
> But this is not something you're going to convince most programmers
> of, especially on Windows. Is it easy (in general) to figure that out
> on Windows? I guess V[ABC...] must have isatty()...?

Yes, isatty is the way to go.

I think that outputting ^M on DOS/Windows is clearly suited for interactive
display: only then does it make sense to prevent the display from
scrolling. It is quite a nasty thing to do when the output goes to a file,
because most native DOS/Windows tools will become hopelessly confused by
such a file.

At least Emacs shows the file with all the ^M characters, instead of
silently removing them and displaying one long line...

> If you have a single shell buffer, and you get standard
> output from DOS, Mac, and Unix systems in succession (eg with telnet
> or the like) you are going to confuse that buffer unmercifully.

I never tried something like that, but doesn't the protocol allow for the
EOLs to be converted on the fly? Otherwise, I'd imagine that most other
telnet clients will become confused as well, no?

> Or,
> alternatively, download files via FTP and then cat them in the shell
> buffer.

FTP does allow text-mode transfers, which maps EOLs. Of course, if you are
talking about compressed archives...

Jason Rumney

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
David Masterson <dmas...@rational.com> writes:

> IMHO, your definition of "broken" is colored by Emacs/Unix thinking.

IMHO, your definition of "broken" is colored by Notepad/Microsoft
thinking. :-)

Notepad is the sort of program that novices write after they get bored
with Hello World. Please stop expecting Emacs to act like it. And
Microsoft has a habit of hiding problems from the user to avoid
scaring them. In my day job, I do web development. I therefore tend to
notice the huge number of WWW sites that are broken out there. "But it
works fine on IE, it must be a bug in Netscape" is the usual response
you get from authors who write broken HTML and Javascript. This is why
I have a strong opinion that brokenness should not be hidden, and I
suspect that your opinions on brokenness have formed in much the same
way as those web authors - from becoming too used to software that hides
brokenness, so you now think of brokenness as normal.

> Your claims are:
>
> * A text file is *only* a file that has consistent EOLs.

Yes.

> * A file that has something that looks like an EOL, but doesn't agree
> with the guessed EOL *must* be broken.

Yes.

> * A broken file *must* be displayed with an "undecided-unix" coding
> system (the key problem).

A broken file may be displayed with any line end convention, but it
will still be a broken file. To display it in any line-end convention
other than -unix would require major changes to the internals of
Emacs, which I really do not think are worth it right now - unless you
want to volunteer to make these changes, as you seem to be the only
one who is passionate about them.


> * The fact that Emacs now displays *lots* of control characters (some
> which are meaningful and some which aren't) alerts the user to a
> problem is considered a Good Thing(tm).

The number of control characters that are displayed depends on the
nature of the brokenness of the file. Emacs has no control over this,
and different ways of displaying broken files will produce different
results for different files.

[snip instructions on creating a broken file with one Mac line end,
and many DOS ones].

> * Load the file into Notepad and compare the "look" of the two.
> Notepad (as all other Windows programs) always displays the file in
> (essentially) "undecided-dos" and displays only the bare CR as a
> funny character.
>
> Which program has the more "correct" display? (Be careful that your
> answer isn't colored by Emacs/Unix thinking.)

Neither has a "correct" display, as there is no correct display. In
this particular case, the results on Notepad look more aesthetically
pleasing. In other cases (the cases I find most often, where a Unix
text file has some DOS line-end pollution in it), the results on Emacs
will be less annoying.

--
Jason Rumney <jas...@altavista.net>

Jason Rumney

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
turn...@sk.tsukuba.ac.jp (Stephen J. Turnbull) writes:

> Because it's arguable that "the default sucks" in this case. I don't
> think it does, for all the same reasons given so far.
>
> It would however be worth polling Windows users on the subject ... if
> we could be confident that defaulting to 'undecided-dos on Windows is
> not a data corruption risk.

With the current internal representation of Emacs buffers, it is a data
corruption risk. I can see some merit in changing the internal
representation of line-ends to a platform neutral one to eliminate
this data corruption risk so that this can become an option, but I can
also see a lot more areas of Emacs that can be improved and extended
that are much higher priority IMHO.

--
Jason Rumney <jas...@altavista.net>

David Masterson

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
>>>>> "Per" == Per Abrahamsen <abr...@dina.kvl.dk> writes:

> David Masterson <dmas...@rational.com> writes:
>> Which program has the more "correct" display?

> Neither. Both. Emacs behavior is correct for a "multi-format"
> editor, Notepad behavior is correct for a "native-only" editor.

Understood. However, as your first two sentences imply, "correct" is
in the eye of the beholder. Since there are a lot of different
"beholders" out there, there probably is no behavior that is "correct"
in everyone's eyes.

*Sigh* I hate shades of grey...

David Masterson

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> I don't really understand why we have to go through the same arguments,
> time and again, when you have simple customization facilities, mentioned
> in this thread more than once, to make Emacs behave the way you want.

Agreed. This thread has gone on much further than I ever originally
intended. At various times, I thought there was a point to be made
and sometimes I got hard-headed about it. For instance, I still
don't clearly see the distinction between "broken" and how files are
displayed or why one character in a file can cause the whole file to
be labelled "broken".

However, I see that this thread duplicates another thread from a few
years ago which you were a key part in. I'm still reading this thread
at http://www.gnu.org/software/emacs/windows/ntemacs/todo/translate to
see if I understand the points being made.

> (Btw, Jason is one of the maintainers of NTEmacs, so he can hardly be
> ``Unix-biased''...)

Yeah, I was probably the one being "unix-biased" in that. I suddenly
got it in my head that files with EOL=LF are Unix-like and EOL=CRLF
are DOS-like.

Silly me...

David Masterson

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

> John Clonts wrote:

Hmmm. Sometime I need to look at how XEmacs handles these things...

Ilya Zakharevich

unread,
Nov 30, 2000, 3:00:00 AM11/30/00
to
[A complimentary Cc of this posting was sent to Eli Zaretskii
<el...@is.elta.co.il>],
who wrote in article <3A262883...@is.elta.co.il>:

> Kai Großjohann wrote:
> >
> > I think that a function along the lines of: I thought this was a DOS
> > text file, so why does Emacs not show it as a DOS text file? would be
> > a useful thing to have.
>
> Not very easy without rewriting core code that decodes the file. When
> Emacs converts the EOLs, the information about the original EOLs is
> completely forgotten.

So the BIG question is: WHY? Have Emacs developers ever heard about
text attributes? ;-)

Ilya

Stephen J. Turnbull

unread,
Nov 30, 2000, 10:37:30 AM11/30/00
to
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

Eli> Kai Großjohann wrote:

>> I think that a function along the lines of: I thought this was
>> a DOS text file, so why does Emacs not show it as a DOS text
>> file? would be a useful thing to have.

Eli> Not very easy without rewriting core code that decodes the
Eli> file. When Emacs converts the EOLs, the information about
Eli> the original EOLs is completely forgotten.

Not if Emacs is not too aggressive about guessing alternate coding
systems. When Emacs is in "panic mode", he defaults to binary, and
the buffer contains exactly what the file did (modulo certain Mule
representation issues).

Stephen J. Turnbull

unread,
Nov 30, 2000, 10:44:21 AM11/30/00
to
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

Eli> I don't really understand why do we have to go through the
Eli> same arguments, time and again, when you have a simple
Eli> customization facilities, mentioned in this thread more than
Eli> once, to make Emacs behave like you want.

Because it's arguable that "the default sucks" in this case. I don't
think it does, for all the same reasons given so far.

It would however be worth polling Windows users on the subject ... if
we could be confident that defaulting to 'undecided-dos on Windows is
not a data corruption risk. I still think it is highly risky, because
(as you point out to Kai) "naked" linefeeds can't be distinguished
from "converted CRLF" ones after the conversion is done. (Modulo the
"private character" trick I described for Mule.)

Stephen J. Turnbull

unread,
Nov 30, 2000, 10:53:08 AM11/30/00
to
>>>>> "Eli" == Eli Zaretskii <el...@is.elta.co.il> writes:

Eli> That program has a bug, IMHO: it should refrain from using a
Eli> lone ^M when its standard output is redirected to a file.

Probably.

But this is not something you're going to convince most programmers
of, especially on Windows. Is it easy (in general) to figure that out
on Windows? I guess V[ABC...] must have isatty()...?

It's a general problem with handling process output in heterogeneous
environments. If you have a single shell buffer, and you get standard
output from DOS, Mac, and Unix systems in succession (eg with telnet
or the like) you are going to confuse that buffer unmercifully. Or,
alternatively, download files via FTP and then cat them in the shell
buffer. You just can't win without some kind of adaptive file-coding
recognizer.

Eli Zaretskii

unread,
Dec 1, 2000, 12:54:08 AM12/1/00
to
David Masterson wrote:
>
> However, I see that this thread duplicates another thread from a few
> years ago which you were a key part in. I'm still reading this thread
> at http://www.gnu.org/software/emacs/windows/ntemacs/todo/translate to
> see if I understand the points being made.

As far as I could tell, the current setup works according to what was
discussed there.

Eli Zaretskii

unread,
Dec 1, 2000, 12:59:41 AM12/1/00
to
Ilya Zakharevich wrote:
>
> > Not very easy without rewriting core code that decodes the file. When
> > Emacs converts the EOLs, the information about the original EOLs is
> > completely forgotten.
>
> So the BIG question is: WHY?

Here's the BIG answer to that: WHY NOT? ;-)

> Have Emacs developers ever heard about text attributes? ;-)

Yes. ;-)

(If you want to discuss this seriously, please tell me what it is that
you suggest using text properties--this is what I think you meant--for,
and why. I seem to be too dense this morning to be able to guess ;-)

Stephen J. Turnbull

unread,
Dec 1, 2000, 12:23:28 AM12/1/00
to
>>>>> "Ilya" == Ilya Zakharevich <il...@math.ohio-state.edu> writes:

Ilya> [A complimentary Cc of this posting was sent to Eli
Ilya> Zaretskii <el...@is.elta.co.il>], who wrote in article
Ilya> <3A262883...@is.elta.co.il>:

>> Not very easy without rewriting core code that decodes the
>> file. When Emacs converts the EOLs, the information about the
>> original EOLs is completely forgotten.

Ilya> So the BIG question is: WHY? Have Emacs developers ever
Ilya> heard about text attributes? ;-)

Yes, they have, and adding them during I/O operations would require
changing the core, because the core currently doesn't do that. There
are also issues like what the API should be; since they're basically
intended for error recovery, it may make sense to keep them _out_ of
the normally accessible text properties for efficiency reasons. Or
not; the point is that it's not a trivial decision.

(Note that for the vast majority of cases where the text file is
"unbroken", we can compute exactly what the external representation is
supposed to be.)

Ben Wing made exactly this proposal for XEmacs a while back, but in
the context of Mule, where the problem is much more hairy. It has not
yet been implemented, though.

Stephen J. Turnbull

unread,
Dec 1, 2000, 12:32:46 AM12/1/00
to
>>>>> "David" == David Masterson <dmas...@rational.com> writes:

David> Hmmm. Sometime I need to look at how XEmacs handles these
David> things...

Basically the same as Emacs does. In particular, the linebreak
handling is the same.

The main difference is that XEmacs comes in three versions (controlled
by compile-time configuration):

o "unibyte"; no detection of encoding of data
o "unibyte"; newline convention detection (the default now, IIRC, in
the devel branch; not available in 21.1 "stable")
o Mule; detects both newlines and certain aspects of charset
encoding (what I use); XEmacs Mule does not have TEXT-as-multibyte
functions, since the developers are philosophically opposed to
exposing the text representation to the Lisp level.

Stephen J. Turnbull

unread,
Dec 1, 2000, 12:12:00 AM12/1/00
to
>>>>> "David" == David Masterson <dmas...@rational.com> writes:

David> For instance, I still don't clearly see the distinction
David> between "broken" and how files are displayed or why one
David> character in a file can cause the whole file to be labelled
David> "broken".

"Broken" means that Emacs cannot have a good idea of how to display or
save the file; the defaults will be wrong, and perhaps dangerous, in
some circumstances. Only the user can know, and so she should be
informed.

Why is one character so important? If it's the null at the end of a
command string submitted to system() that goes missing, uh-oh!

True, few human-oriented text applications have such demanding
requirements on consistency. But to Emacs, "text" is simply a string
of characters, which can be uninterpreted bytes, that someone wants to
manipulate. People can and do write modes to directly edit such
things as tar files. It would require suid helper applications, I
suppose, but you could use Emacs to edit disk partition tables or (in
DOS, anyway) to debug or patch a running kernel in memory.

Ilya Zakharevich

unread,
Dec 1, 2000, 2:56:24 AM12/1/00
to
[A complimentary Cc of this posting was sent to Stephen J. Turnbull
<turn...@sk.tsukuba.ac.jp>],
who wrote in article <87elzsw...@turnbull.sk.tsukuba.ac.jp>:

> >> Not very easy without rewriting core code that decodes the
> >> file. When Emacs converts the EOLs, the information about the
> >> original EOLs is completely forgotten.
>
> Ilya> So the BIG question is: WHY? Have Emacs developers ever
> Ilya> heard about text attributes? ;-)
>
> Yes, they have, and adding them during I/O operations would require
> changing the core, because the core currently doesn't do that.

Well, in the simplest implementation, all the I/O functions need to do
is set a flag for a follow-up function to convert ^M-before-^J to an
attribute on ^J (needed only if mixed line endings are detected). The
same function would set a hook to do the reverse conversion when saving.
Ilya

Eli Zaretskii

unread,
Dec 1, 2000, 3:00:00 AM12/1/00
to
Ilya Zakharevich wrote:
>
> Well, in the simplest implementation all the I/O functions need to do
> is setting a flag for a followup function to convert ^M-before^J to an
> attribute-on-^J (needed only if mixed-lineendings are detected).

This would require the I/O functions to do their own detection of EOL
format, which would slow down I/O.

Eli Zaretskii

unread,
Dec 1, 2000, 3:00:00 AM12/1/00
to
"Stephen J. Turnbull" wrote:
>
> The main difference is that XEmacs comes in three versions (controlled
> by compile-time configuration):
>
> o "unibyte"; no detection of encoding of data
> o "unibyte"; newline convention detection (the default now, IIRC, in
> the devel branch; not available in 21.1 "stable")
> o Mule; detects both newlines and certain aspects of charset
> encoding (what I use); XEmacs Mule does not have TEXT-as-multibyte
> functions, since the developers are philosophically opposed to
> exposing the text representation to the Lisp level.

The last two configurations (modulo the philosophy-related aspects ;-) are
available in Emacs 20.x and later as run-time options: you get the second
configuration above by invoking "emacs --unibyte", while the third
configuration is the default.

(There's also an environment variable to set, in case you want --unibyte to
be the default.)

You can get the first configuration by both specifying --unibyte and
customizing the option inhibit-eol-conversion.
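
For example, the Lisp equivalent of that customization is one line
(a sketch for your init file):

  ;; Never convert EOLs when decoding files.
  (setq inhibit-eol-conversion t)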

Ilya Zakharevich

unread,
Dec 1, 2000, 10:25:42 PM12/1/00
to
[A complimentary Cc of this posting was sent to Eli Zaretskii
<el...@is.elta.co.il>],
who wrote in article <3A275DC7...@is.elta.co.il>:

> > Well, in the simplest implementation, all the I/O functions need to do
> > is set a flag for a follow-up function to convert ^M-before-^J to an
> > attribute on ^J (needed only if mixed line endings are detected).
>
> This would require the I/O functions to do their own detection of EOL
> format, which would slow down I/O.

Eh??? The I/O functions already detect line-ending type, and already
undo these decisions if conflicting types are found. All that is
needed is to set a buffer-local variable, so that a hook
(mixed-line-endings-hook) may be run.
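
To spell out the shape of it (everything below is hypothetical;
none of these names exist in Emacs today):

  ;; Hypothetical sketch of the proposal -- not a real Emacs API.
  (defvar mixed-line-endings-hook nil
    "Hook run after visiting a file whose EOL conventions conflicted.")

  (defvar buffer-has-mixed-eols nil
    "Set by the decoder when it sees conflicting EOLs in one file.")
  (make-variable-buffer-local 'buffer-has-mixed-eols)

  ;; At the end of insert-file-contents, something like:
  (when buffer-has-mixed-eols
    (run-hooks 'mixed-line-endings-hook))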

Ilya

Eli Zaretskii

unread,
Dec 2, 2000, 2:10:03 AM12/2/00
to
Ilya Zakharevich wrote:
>
> > > Well, in the simplest implementation, all the I/O functions need to do
> > > is set a flag for a follow-up function to convert ^M-before-^J to an
> > > attribute on ^J (needed only if mixed line endings are detected).
> >
> > This would require the I/O functions to do their own detection of EOL
> > format, which would slow down I/O.
>
> Eh??? The I/O functions already detect line-ending type, and already
> undo these decisions if conflicting types are found.

We have terminology problems: you seem to call ``I/O functions'' everything
that runs from the insert-file-contents primitive (otherwise what you wrote
is simply not true). Perhaps you could explain which functions you mean,
specifically.

What I wanted to point out was that conversion of EOL format is performed
roughly in the same place where EOL format detection is done. Your
suggestion seemed to try to separate them, or else I misunderstood what you
were saying, exactly. Detection of the EOL format is done by
general-purpose code that doesn't always know why, or indeed on what object,
it was called. So asking that code to do something that is only required in
specific situations is possible, but not easy.

Stephen J. Turnbull

unread,
Dec 2, 2000, 1:57:20 AM12/2/00
to
>>>>> "Ilya" == Ilya Zakharevich <il...@math.ohio-state.edu> writes:

Ilya> Eh??? The I/O functions already detect line-ending type,
Ilya> and already undo these decisions if conflicting types are
Ilya> found. All that is needed is to set a buffer-local
Ilya> variable, so that a hook (mixed-line-endings-hook) may be
Ilya> run.

Great! You got it analyzed, you know what needs to be done, I guess
we're just waiting for your patch and legal papers! And test results....

Ilya Zakharevich

unread,
Dec 2, 2000, 9:07:48 PM12/2/00
to
[A complimentary Cc of this posting was sent to Stephen J. Turnbull
<turn...@sk.tsukuba.ac.jp>],
who wrote in article <87sno7u...@turnbull.sk.tsukuba.ac.jp>:

> Ilya> Eh??? The I/O functions already detect line-ending type,
> Ilya> and already undo these decisions if conflicting types are
> Ilya> found. All that is needed is to set a buffer-local
> Ilya> variable, so that a hook (mixed-line-endings-hook) may be
> Ilya> run.
>
> Great! You got it analyzed, you know what needs to be done, I guess
> we're just waiting for your patch and legal papers! And test results....

Of course, feel free to wait for it! But until Emacs uses a sane
internationalization scheme, there will be very little incentive for third
parties to fix Emacs's bugs...

Ilya

Eli Zaretskii

unread,
Dec 3, 2000, 2:48:59 AM12/3/00
to
Ilya Zakharevich wrote:
>
> But until Emacs uses a sane
> internationalization scheme, there will be very little incentive for third
> parties to fix Emacs's bugs...

Care to tell what would make the i18n scheme in Emacs sane?

Ilya Zakharevich

unread,
Dec 3, 2000, 3:00:00 AM12/3/00
to
[A complimentary Cc of this posting was sent to Eli Zaretskii
<el...@is.elta.co.il>],
who wrote in article <3A29FAEB...@is.elta.co.il>:

Eric told it in very fine details. I told it many many times already
(mostly repeating Eric's arguments).

Why should we do it again and again?

But if you want it: there *is* a notion of "a character". This notion
is independent of the source this character is read from. MULE's
insistence that an "a" read from an ASCII document is a different character
than an "a" read from a document with a different encoding has no excuse.

Ilya

P.S. On the user-interface front: Emacs may want to detect "holes" in
fonts, like the one in the 128..160 range in Latin-1 fonts. Something
like this: if a char's bitmap has width 0, behave as if this
char is not present in the font (print \203 or whatever).

In some cases Emacs is in a better position than the user to
determine such holes...

Eli Zaretskii

unread,
Dec 4, 2000, 3:00:00 AM12/4/00
to
Ilya Zakharevich wrote:
>
> [A complimentary Cc of this posting was sent to Eli Zaretskii
> <el...@is.elta.co.il>],
> who wrote in article <3A29FAEB...@is.elta.co.il>:
> Eric told it in very fine details. I told it many many times already
> (mostly repeating Eric's arguments).

The only constructive element in what Eric said was that the internal
encoding should be changed to be based on Unicode; Emacs slowly moves in
that direction (volunteers are welcome to help doing that faster).

But I think you are wrong if you think that Unicode-based internal
encoding will magically solve all, or even most, of the problems. The
basic issues which complicate i18n will not be removed by that: we will
still have problems with unibyte vs multibyte buffers, display and
keyboard encoding, problems with determining the encoding of files, etc.

> Why should we do it again and again?

Because I thought you might have new ideas. It's been two years or so
since these issues were discussed publicly.

> P.S. On the user-interface front: Emacs may want to detect "holes" in
> fonts, like the one in the 128..160 range in Latin-1 fonts. Something
> like this: if a char's bitmap has width 0, behave as if this
> char is not present in the font (print \203 or whatever).

This is already so in Emacs 21. Well, sort of (it's a long story).

Kai Großjohann

unread,
Dec 4, 2000, 3:00:00 AM12/4/00
to
On Thu, 30 Nov 2000, Eli Zaretskii wrote:

> Kai Großjohann wrote:
>>
>> I think that a function along the lines of: I thought this was a
>> DOS text file, so why does Emacs not show it as a DOS text file?
>> would be a useful thing to have.
>

> Not very easy without rewriting core code that decodes the file.
> When Emacs converts the EOLs, the information about the original
> EOLs is completely forgotten.

?? When you have a file with 99 CRLFs in it and one lone LF, Emacs
will create a buffer with 99 CRs in it, no? The 99 CRs are still
there...

Maybe some buffer-local variable could be used that contains a trace
of the decisions Emacs made when reading the file. The variable could
contain a list of actions. For example, the first action could be to
default to an encoding because of file-coding-system-alist. The
second action could be to panic at position 4711 because of a wrong
EOL string. And then you need a M-x describe-foo RET function which
does a pretty display of all this, and you're done.

Ah, yes, that would of course amount to ``rewriting core code that
decodes the file''. Oops.

kai
--
The arms should be held in a natural and unaffected way and never
be conspicuous. -- Revised Technique of Latin American Dancing

Stefan Monnier <foo@acm.com>

unread,
Dec 4, 2000, 3:00:00 AM12/4/00
to
>>>>> "Ilya" == Ilya Zakharevich <il...@math.ohio-state.edu> writes:
> insistence that an "a" read from an ASCII document is a different character
> than an "a" read from a document with a different encoding has no excuse.

I don't see where MULE insists on that. As a matter of fact, there's only
one `a' in MULE that I know of. I do believe that there are several
instances of some chars (like `e' with an accent, maybe), but `a' was
a poor choice,


Stefan
