Reload-at-prompt after file modified outside of Vim does not reapply the 'fileencodings' heuristics


Tony Mechelynck

Jan 1, 2021, 5:03:34 PM
to vim_dev
gvim 8.2.2267 (Big) with GTK3 GUI; 'encoding' = utf-8; 'fileencodings'
(plural) = ucs-bom,utf-8,latin1

Steps to reproduce:
1. Load a Latin1 file containing only characters in the range
0x00-0x7F. (In my case this was a corrupt version of the file.)
-- 'fileencoding' is set to utf-8, which at this point is acceptable,
since the de-facto encoding is us-ascii, which is byte-compatible with
both Latin1 and UTF-8.
2. Replace the file-on-disk (by non-Vim methods) by a version
containing one or more (Latin1) characters in the range 0x80-0xFF. (In
my case this was the correct version of the file, after fetching it
over the Net.)
-- Vim gives a prompt, with options [O]K, [L]oad file
3. Answer l (Load).
-- File is reloaded, but the 'fileencodings' heuristic is not
reapplied: 'fileencoding' (singular) is still utf-8, any Latin1
characters above 0x7F (which are not valid UTF-8 byte sequences) are
changed to question marks. No error for invalid byte sequences (I
didn't notice any at the time, and none is recorded in the :mess
messages list).
4. Make some more changes inside Vim, adding more characters in the
range 0x80-0xFF, then save.
-- File is saved as UTF-8; if read as Latin1 outside of Vim, weird
characters appear where changes were made at step 4. <-- bad
5. :setl fenc=latin1 | w
-- If reloaded outside of Vim, the weird characters have now
disappeared; but the question marks, if not replaced by what they
should be, are still there.
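
The steps above can be replayed at the byte level. This is only a rough sketch in Python, not Vim's actual code: `detect_encoding` is a hypothetical helper that imitates trying the 'fileencodings' candidates in order, the sample bytes are made up, and the `?` substitution is imitated by hand (Python itself substitutes U+FFFD).

```python
import codecs

def detect_encoding(data: bytes) -> str:
    """Hypothetical stand-in for 'fileencodings' = ucs-bom,utf-8,latin1."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # rough stand-in for the ucs-bom entry
    try:
        data.decode("utf-8")
        return "utf-8"          # first candidate that decodes cleanly wins
    except UnicodeDecodeError:
        return "latin-1"        # latin1 accepts every byte value

ascii_only = b"<!-- ===== -->\n"                 # step 1: bytes 0x00-0x7F only
with_high = b"<!-- \xf7\xf7\xf7\xf7\xf7 -->\n"   # step 2: five 0xF7 bytes

print(detect_encoding(ascii_only))
print(detect_encoding(with_high))

# Step 3: the reload keeps utf-8 instead of re-detecting, so the invalid
# bytes are substituted. The replace() imitates the '?' seen in the report.
reloaded = with_high.decode("utf-8", errors="replace").replace("\ufffd", "?")
print(reloaded, end="")

# Step 4: text added afterwards is written as UTF-8 and misread as Latin1.
saved = (reloaded + "\u00e9\n").encode("utf-8")
print(saved.decode("latin-1"), end="")

# Step 5: converting to Latin1 repairs the new text, but the '?' marks stay.
resaved = saved.decode("utf-8").encode("latin-1")
print(resaved)
```

Under these assumptions a fresh detection would pick latin1 for the replaced file, while the forced utf-8 reload loses the original bytes for good.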

Best regards,
Tony.

Bram Moolenaar

Jan 2, 2021, 8:03:53 AM
to vim...@googlegroups.com, Tony Mechelynck

Tony wrote:

> gvim 8.2.2267 (Big) with GTK3 GUI; 'encoding' = utf-8; 'fileencodings'
> (plural) = ucs-bom,utf-8,latin1
>
> Steps to reproduce:
> 1. Load a Latin1 file containing only characters in the range
> 0x00-0x7F. (In my case this was a corrupt version of the file.)
> -- 'fileencoding' is set to utf-8, which at this point is acceptable,
> since the de-facto encoding is us-ascii, which is byte-compatible with
> both Latin1 and UTF-8.
> 2. Replace the file-on-disk (by non-Vim methods) by a version
> containing one or more (Latin1) characters in the range 0x80-0xFF. (In
> my case this was the correct version of the file, after fetching it
> over the Net.)
> -- Vim gives a prompt, with options [O]K, [L]oad file
> 3. Answer l (Load).
> -- File is reloaded, but the 'fileencodings' heuristic is not
> reapplied: 'fileencoding' (singular) is still utf-8, any Latin1
> characters above 0x7F (which are not valid UTF-8 byte sequences) are
> changed to question marks. No error for invalid byte sequences (I
> didn't notice any at the time, and none is recorded in the :mess
> messages list).

Hmm, I would expect some warning being given.

> 4. Make some more changes inside Vim, adding more characters in the
> range 0x80-0xFF, then save.
> -- File is saved as UTF-8; if read as Latin1 outside of Vim, weird
> characters appear where changes were made at step 4. <-- bad
> 5. :setl fenc=latin1 | w
> -- If reloaded outside of Vim, the weird characters have now
> disappeared; but the question marks, if not replaced by what they
> should be, are still there.

This is a very specific sequence of events, which should not happen very
often. I'm sure that if we re-detected the encoding on reload, it would be
wrong in some other situation. I think that if you noticed the wrong
encoding and used ":edit", that would redo the detection.

--
hundred-and-one symptoms of being an internet addict:
90. Instead of calling you to dinner, your spouse sends e-mail.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Tony Mechelynck

Jan 3, 2021, 9:54:51 AM
to Bram Moolenaar, vim_dev
On Sat, Jan 2, 2021 at 2:03 PM Bram Moolenaar <Br...@moolenaar.net> wrote:
>
>
> Tony wrote:
>
> > gvim 8.2.2267 (Big) with GTK3 GUI; 'encoding' = utf-8; 'fileencodings'
> > (plural) = ucs-bom,utf-8,latin1
> >
> > Steps to reproduce:
> > 1. Load a Latin1 file containing only characters in the range
> > 0x00-0x7F. (In my case this was a corrupt version of the file.)
> > -- 'fileencoding' is set to utf-8, which at this point is acceptable,
> > since the de-facto encoding is us-ascii, which is byte-compatible with
> > both Latin1 and UTF-8.
> > 2. Replace the file-on-disk (by non-Vim methods) by a version
> > containing one or more (Latin1) characters in the range 0x80-0xFF. (In
> > my case this was the correct version of the file, after fetching it
> > over the Net.)
> > -- Vim gives a prompt, with options [O]K, [L]oad file
> > 3. Answer l (Load).
> > -- File is reloaded, but the 'fileencodings' heuristic is not
> > reapplied: 'fileencoding' (singular) is still utf-8, any Latin1
> > characters above 0x7F (which are not valid UTF-8 byte sequences) are
> > changed to question marks. No error for invalid byte sequences (I
> > didn't notice any at the time, and none is recorded in the :mess
> > messages list).
>
> Hmm, I would expect some warning being given.

Well, there wasn't.
>
> > 4. Make some more changes inside Vim, adding more characters in the
> > range 0x80-0xFF, then save.
> > -- File is saved as UTF-8; if read as Latin1 outside of Vim, weird
> > characters appear where changes were made at step 4. <-- bad
> > 5. :setl fenc=latin1 | w
> > -- If reloaded outside of Vim, the weird characters have now
> > disappeared; but the question marks, if not replaced by what they
> > should be, are still there.
>
> This is a very specific sequence of events, which should not happen very
> often. I'm sure that if we re-detected the encoding on reload, it would be
> wrong in some other situation. I think that if you noticed the wrong
> encoding and used ":edit", that would redo the detection.

Originally, the only characters above 0x7F were part of a line of
divide-by signs (÷, 0xF7) in a comment near the top, in order to make
sure that the file was interpreted as Latin1 and not UTF-8. After
replacing the corrupt file by the correct one I didn't notice that
these ÷÷÷÷÷ had been replaced by ?????. After saving the file in
UTF-8 it was too late for :edit, and it was only then that the weird
characters in the browser caught my attention.

This was an HTML page, originally written with entities for everything
not in ASCII: &eacute; &egrave; &agrave; etc. I'm busy replacing all
these by é è à etc. in Latin1; and the few codepoints above U+00FF by
symbolic entities: &#8212; → &mdash;, &#339; → &oelig; etc. The result
is that the length of these files is reduced by about 5% on average.
But I keep the 'encoding' setting for the editor at utf-8 globally
because it normally won't mishandle other files that may be present in
other split-windows.
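
A rough illustration of where the size reduction comes from, as a Python sketch: the sample fragment is invented for the example, and the 5% figure above depends on how many entities a real page contains.

```python
import html

# Invented sample: three named entities in a short fragment.
fragment = "caf&eacute; &agrave; la cr&egrave;me"
text = html.unescape(fragment)          # entities become é, à, è

print(len(fragment.encode("ascii")))    # size with entities
print(len(text.encode("latin-1")))      # Latin1: one byte per character
print(len(text.encode("utf-8")))        # UTF-8: two bytes per accented letter

# Codepoints above U+00FF cannot be stored in Latin1 at all, which is why
# they stay as entities such as &mdash; in this scheme.
try:
    "\u2014".encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)
```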

I don't see how re-detecting the fileencoding would be wrong in other
situations; but in any case, *please* give a message, not just a
warning but either a prompt or a red error (with Error or ErrorMsg
highlighting), if invalid byte sequences exist in the file on
reloading. That would have saved me all this trouble.
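
The kind of check being asked for could look roughly like this. It is a Python sketch only: `first_invalid_utf8_offset` and `reload_message` are hypothetical names, and nothing here reflects how Vim's reload code is actually structured.

```python
from typing import Optional

def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

def reload_message(data: bytes) -> str:
    """Hypothetical reload hook: report the problem instead of writing '?'."""
    offset = first_invalid_utf8_offset(data)
    if offset is None:
        return "reloaded as utf-8"
    return f"ERROR: invalid UTF-8 byte 0x{data[offset]:02X} at offset {offset}"

print(reload_message(b"<!-- ===== -->\n"))
print(reload_message(b"<!-- \xf7\xf7\xf7 -->\n"))
```

With a message like the second one, the silent substitution at step 3 would have been visible before the file was saved.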

Best regards,
Tony.