On Sat, Jan 2, 2021 at 2:03 PM Bram Moolenaar <
Br...@moolenaar.net> wrote:
>
>
> Tony wrote:
>
> > gvim 8.2.2267 (Big) with GTK3 GUI; 'encoding' = utf-8; 'fileencodings'
> > (plural) = ucs-bom,utf-8,latin1
> >
> > Steps to reproduce:
> > 1. Load a Latin1 file containing only characters in the range
> > 0x00-0x7F. (In my case this was a corrupt version of the file.)
> > -- 'fileencoding' is set to utf-8, which at this point is acceptable,
> > since the de-facto encoding is us-ascii, which is byte-compatible with
> > both Latin1 and UTF-8.
> > 2. Replace the file-on-disk (by non-Vim methods) by a version
> > containing one or more (Latin1) characters in the range 0x80-0xFF. (In
> > my case this was the correct version of the file, after fetching it
> > over the Net.)
> > -- Vim gives a prompt, with options [O]K, [L]oad file
> > 3. Answer l (Load).
> > -- File is reloaded, but the 'fileencodings' heuristic is not
> > reapplied: 'fileencoding' (singular) is still utf-8, any Latin1
> > characters above 0x7F (which are not valid UTF-8 byte sequences) are
> > changed to question marks. No error for invalid byte sequences (I
> > didn't notice any at the time, and none is recorded in the :mess
> > messages list).
>
> Hmm, I would expect some warning being given.
Well, there wasn't.
>
> > 4. Make some more changes inside Vim, adding more characters in the
> > range 0x80-0xFF, then save.
> > -- File is saved as UTF-8; if read as Latin1 outside of Vim, weird
> > characters appear where changes were made at step 4. <-- bad
> > 5. :setl fenc=latin1 | w
> > -- If reloaded outside of Vim, the weird characters have now
> > disappeared; but the question marks, if not replaced by what they
> > should be, are still there.
>
> This is a very specific sequence of events, which should not happen very
> often. I'm sure that if we re-detect the encoding that it will be wrong
> in another situation. I think that if you would notice the wrong
> encoding and used ":edit" that it would do the detection.
Originally, the only characters above 0x7F were part of a line of
divide-by signs (÷, 0xF7) in a comment near the top, in order to make
sure that the file was interpreted as Latin1 and not UTF-8. After
replacing the corrupt file by the correct one I didn't notice that
these ÷÷÷÷÷ had been replaced by ?????. After saving the file in
UTF-8 it was too late for :edit, and it's only then that the weird
characters in the browser caught my attention.
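As an illustration only (a minimal Python sketch, not Vim's actual implementation), the 'fileencodings' heuristic can be approximated as "try each candidate in order and keep the first one that decodes without error". The name detect_fenc and the sample byte strings are hypothetical, and the ucs-bom step is simplified to a UTF-8 BOM check. The last two assignments mimic the stale utf-8 reload (Python substitutes U+FFFD where Vim showed question marks) and the "weird characters" seen when UTF-8 output is read back as Latin1.

```python
import codecs


def detect_fenc(data: bytes, candidates=("ucs-bom", "utf-8", "latin1")) -> str:
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        if name == "ucs-bom":
            # Simplification (assumption): only the UTF-8 BOM is recognised.
            if data.startswith(codecs.BOM_UTF8):
                return "utf-8"
            continue
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # invalid byte sequence for this encoding, try the next
        return name
    return candidates[-1]


# Hypothetical file contents: first the ASCII-only version, then the
# version with a Latin1 divide-by sign (0xF7) in a comment.
ascii_only = b"<!-- divider -->\n"
with_latin1 = b"<!-- \xf7\xf7\xf7 -->\n"

first_load = detect_fenc(ascii_only)    # "utf-8": pure ASCII is valid UTF-8
fresh_load = detect_fenc(with_latin1)   # "latin1": 0xF7 is not valid UTF-8

# Reloading with the stale utf-8 choice loses the divide-by signs.
stale_reload = with_latin1.decode("utf-8", errors="replace")

# Saving a divide-by sign as UTF-8 and reading it as Latin1 gives two
# characters instead of one.
mojibake = "\u00f7".encode("utf-8").decode("latin1")
```

Re-running the detection on the changed bytes picks latin1, which is the behaviour asked for on reload.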
This was an HTML page, originally written with entities for everything
not in ASCII: &eacute; &egrave; &agrave; etc. I'm busy replacing all
these by é è à etc. in Latin1; and the few codepoints above U+00FF by
symbolic entities: — → &mdash;, œ → &oelig; etc. The result
is that the length of these files is reduced by about 5% on average.
But I keep the 'encoding' setting for the editor at utf-8 globally
because it normally won't mishandle other files that may be present in
other split-windows.
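The conversion described above can be sketched in Python roughly as follows. This is a hedged illustration, not the procedure actually used on these pages: the function names and the sample string are hypothetical, and the sketch assumes that named entities in the range U+00A0 to U+00FF become literal characters while anything above U+00FF becomes a symbolic (or, failing that, numeric) entity. Markup entities such as &amp; are left alone.

```python
import html
import re
from html.entities import codepoint2name

# Matches named entities such as &eacute; or &frac12;
ENTITY = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def entities_to_latin1(text: str) -> str:
    """Replace named entities by the literal character when it fits in Latin1."""
    def repl(match):
        char = html.unescape(match.group(0))
        if len(char) == 1 and 0xA0 <= ord(char) <= 0xFF:
            return char
        return match.group(0)  # keep &amp;, &mdash;, unknown names, etc.
    return ENTITY.sub(repl, text)


def non_latin1_to_entities(text: str) -> str:
    """Replace characters above U+00FF by a symbolic entity if one exists."""
    out = []
    for char in text:
        cp = ord(char)
        if cp > 0xFF:
            name = codepoint2name.get(cp)
            out.append(f"&{name};" if name else f"&#{cp};")
        else:
            out.append(char)
    return "".join(out)


# Hypothetical sample line mixing entities and raw non-Latin1 characters.
sample = "caf&eacute; d&eacute;j&agrave; &amp; c\u0153ur \u2014 fin"
converted = non_latin1_to_entities(entities_to_latin1(sample))
latin1_bytes = converted.encode("latin-1")  # succeeds: nothing above U+00FF is left
```

After both passes every remaining character fits in one Latin1 byte, so the file can be written with 'fileencoding' set to latin1.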
I don't see how re-detecting the fileencoding would be wrong in other
situations; but in any case at least, *please* give a message, not
just a warning but either a prompt or a red error (with Error or
ErrorMsg highlighting), if invalid byte sequences exist in the file on
reloading. That would have saved me all this trouble.
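For illustration, here is a minimal Python sketch of such a check, reporting where the first invalid UTF-8 byte sits. The helper name find_invalid_utf8 and the sample bytes are hypothetical; this is not something Vim provides in this form, only an indication of what an error on reload could report.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, column, byte) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the offending sequence.
        line = data.count(b"\n", 0, err.start) + 1
        line_start = data.rfind(b"\n", 0, err.start) + 1
        column = err.start - line_start + 1  # 1-based byte column
        return line, column, data[err.start]
    return None


# Hypothetical reloaded content: a Latin1 divide-by sign on the second line.
reloaded = b"ok\n<!-- \xf7 -->\n"
problem = find_invalid_utf8(reloaded)
if problem is not None:
    line, column, byte = problem
    print(f"invalid UTF-8 byte 0x{byte:02X} at line {line}, column {column}")
```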
Best regards,
Tony.