On 08/12/2020 17.47, Bram Moolenaar wrote:
>> This works:
>> :set fencs=utf8
>> :%!cat
>> although "fenc" remains "latin1".
> Yeah, for an existing buffer and filtering the first entry in 'fencs' is
> used to read the filter output, but 'fenc' isn't set. That's a bit
> strange, but I'm not sure what would break if we change this. It might
> actually be good to fix this, since if you write that file it might get
> messed up.
I performed a couple of tests trying to write the result to a file after
doing the above (using a correct UTF-8 file as source):
- if you leave fenc set to latin1 the new file will be in latin1 (with all
the characters correctly encoded)
- if you set fenc to utf8 *after* the %!cat (but of course before
writing the file) the new file will be in UTF-8 with all the characters
correctly encoded
- if you set fenc to utf8 *before* the %!cat (and of course before
writing the file) the new file will be... a mess: by all appearances Vim
thinks that the individual bytes of the UTF-8 file are individual latin1
characters, and it then converts them to UTF-8; so you'll get a UTF-8
encoded file with the wrong characters, e.g. a "C3 B2" sequence in the
original file, which stands for a UTF-8 encoded "ò", (Unicode code point
F2) will become a "C3 83 C2 B2" sequence in the written file: "C3" is a
"Ã" in latin1 (and yes, in Unicode too), and "Ã" is encoded as "C3 83"
in UTF-8, "B2" is a "²" in latin1 (and Unicode) and "²" is encoded as
"C2 B2" in UTF-8 (in case someone noticed it, don't let yourself get
confused by the fact that C3 and B2 occur both in the source and the
translated sequence; that's largely just an unfortunate coincidence of
my example).
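For anyone who wants to reproduce the byte sequence of the third case
outside Vim, here is a minimal Python sketch. It only models the result
I observed (UTF-8 bytes read as single-byte characters, then encoded as
UTF-8 again); it is not a description of Vim's actual code path, and the
helper name reinterpret_and_encode is made up for the illustration.

```python
def reinterpret_and_encode(text: str, wrong_charset: str) -> bytes:
    """Hypothetical helper: mimic reading UTF-8 bytes as `wrong_charset`
    and then writing the resulting characters back out as UTF-8."""
    raw = text.encode("utf-8")           # the bytes actually in the file
    misread = raw.decode(wrong_charset)  # each byte taken as one character
    return misread.encode("utf-8")       # what ends up in the written file


mangled = reinterpret_and_encode("ò", "latin-1")
print(mangled.hex(" ").upper())  # prints: C3 83 C2 B2
```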
Given that Unicode is identical to latin1 in the first 256 characters,
to better confirm what happened I also tried using another charset
(cp850) instead of latin1 in the above tests (fencs=cp850 in my vimrc
and setting fenc=cp850 in the second and third tests), still using a
correct UTF-8 file as a source; the results are analogous, with a
correct cp850 file in the first test, a correct UTF-8 one in the second
and a UTF-8 one with the original file's bytes interpreted as cp850 and
then converted to UTF-8 in the third (the original "ò", "C3 B2", becomes
a "E2 94 9C E2 96 93" sequence, given that "C3" is a "├" symbol in
cp850, Unicode code point 251C -> "E2 94 9C" UTF-8, and "B2" is a "▓",
Unicode code point 2593 -> "E2 96 93" UTF-8).
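The cp850 mapping can be checked byte by byte with a few lines of
Python. Again this is just a sketch that relies on Python's own cp850
codec table, which I assume matches the conversion Vim ends up using; the
variable names are mine.

```python
source_bytes = "ò".encode("utf-8")  # the two bytes C3 B2

rows = []
for byte in source_bytes:
    char = bytes([byte]).decode("cp850")              # byte read as cp850
    utf8_hex = char.encode("utf-8").hex(" ").upper()  # re-encoded as UTF-8
    rows.append((f"{byte:02X}", f"U+{ord(char):04X}", utf8_hex))

for cp850_byte, code_point, utf8_hex in rows:
    print(cp850_byte, "->", code_point, "->", utf8_hex)
```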
Yes, I... ahem, had a lot of fun this afternoon :D
Cheers