Encoding and Fileencoding of a latin1 file


rameo

Jul 3, 2014, 11:20:45 AM
to vim...@googlegroups.com
I've written this in my _vimrc file

if has("multi_byte")
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  set encoding=utf-8
  set fileencoding=utf-8
  set fileencodings=ucs-bom,utf-8,latin1
endif

When I open a latin1 file in my editor VIM indicates [CONVERTED] after the file name under the statusline.

Fileencoding has been converted to Latin1. Correct. The file will be saved in Latin1.
But my problem is that the Encoding is still in UTF-8: I see many squares in the latin1 file.

1) Why doesn't Vim also let me read the file in Latin1 (changes the encoding to latin1)?

To temporarily work around this problem, I set the encoding manually to latin1:
:setlocal enc=latin1 | :e

But I noticed that this changes the global encoding to latin1, and now UTF-8 files in other tabs are displayed with the wrong (latin1) encoding.

2) How can I set encoding only to the local buffer?

What did I do wrong?


Ben Fritz

Jul 4, 2014, 11:48:48 AM
to vim...@googlegroups.com
On Thursday, July 3, 2014 10:20:45 AM UTC-5, rameo wrote:
> I've written this in my _vimrc file
>
> if has("multi_byte")
> if &termencoding == ""
> let &termencoding = &encoding
> endif
> set encoding=utf-8
> set fileencoding=utf-8
> set fileencodings=ucs-bom,utf-8,latin1
> endif
>
> When I open a latin1 file in my editor VIM indicates [CONVERTED] after the file name under the statusline.
>
> Fileencoding has been converted to Latin1. Correct. The file will be saved in Latin1.

I assume you verified this, using ":set fileencoding?", or you have some other method of viewing the fileencoding option? "fileencoding" is not only the way to control what encoding is used when writing; "fileencoding" also controls how Vim *reads* the file data. If Vim guessed the "fileencoding" wrongly, then you will either see a conversion error, or the wrong characters (or missing glyphs) shown in your file.

> But my problem is that the Encoding is still in UTF-8: I see many squares in the latin1 file.
>

"encoding" is Vim's internal representation of character strings. There is NO problem that the global encoding is still utf-8. This is a good thing. Furthermore, since every single character in Latin1 is also representable in utf-8, this CANNOT be the cause of the squares in your latin1 file.

> 1) Why doesn't Vim also let me read the file in Latin1 (changes the encoding to latin1)?
>

If "fileencoding" actually is set to "latin1" automatically by Vim after reading the file, then Vim *did* read the file in "latin1". I think more likely, Vim didn't actually use latin1 to read the file, or your font is missing some glyphs for characters in your file (unlikely if Vim actually used latin1).

> To temporarily work around this problem, I set the encoding manually to latin1:
> :setlocal enc=latin1 | :e
>

This won't work for a couple reasons. First of all, "encoding" is a global option ONLY. It controls how Vim internally represents character data; nothing more. So not only does the "setlocal" not work, but also you've corrupted the data Vim already has stored internally, because Vim doesn't do any conversion of existing data when you set "encoding".

> But I noticed that this changes the global encoding to latin1, and now UTF-8 files in other tabs are displayed with the wrong (latin1) encoding.
>
> 2) How can I set encoding only to the local buffer?
>
> What did I do wrong?

The correct way to force a single file to load using a given encoding is, for example:

:e ++enc=latin1

This is confusing, because it actually sets the *fileencoding* option, not the encoding option.

Note that your current "fileencodings" option will actually prefer to load a file in utf-8 over latin1. If a file is valid in both utf-8 and latin1, it will be loaded as utf-8.
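
As a rough sketch of that workflow (these are ordinary Ex commands typed in the buffer in question; nothing here is taken from rameo's config), one could first check what Vim detected and then force a reload:

```vim
" Show the internal encoding, the encoding detected for this buffer,
" and the list of candidates Vim tries when reading a file.
:set encoding? fileencoding? fileencodings?

" Re-read the current file, forcing it to be decoded as latin1.
" This sets the buffer-local 'fileencoding'; 'encoding' is left alone.
:e ++enc=latin1

" Confirm the result.
:setlocal fileencoding?
```

Note that :e refuses to reload a buffer with unsaved changes, and ":e! ++enc=latin1" discards them, so this is best done right after opening the file.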

rameo

Jul 5, 2014, 4:06:50 AM
to vim...@googlegroups.com
Ben,

Try to write these french words in a file with a latin1 fileencoding:
bœuf, cœur, manœuvre, œil
(beef, heart, manoeuvre, eye)

Close this file.
Set encoding to utf-8 in your vimrc.
Open the file.

Encoding is utf-8
Fileencoding is latin1 (:set fileencoding?), converted is written after the file name.
But all words have squares.
(The same file is displayed correctly in Notepad++, which recognizes it as latin1)

Btw you asked me how I check encoding and fileencoding of a file?
I have this in my statusline:
set statusline+=%2*\ E:%{&fileencoding?&fileencoding:&encoding}
set statusline+=%2*\ F:%{&fileencoding?&fileencoding:&fileencoding}

Years ago I also had problems with utf-8 and switched back to latin1 encoding.
Recently I switched to utf-8 again, and after a while it messed up my files again (e.g. my vimrc file).

A question:
Why should there be an encoding and fileencoding? Why not put them together?
If a file is a latin1 file: encoding and fileencoding has to be in latin1.
If a file is an utf-8 file: encoding and fileencoding has to be in utf-8.
Without "Conversion" written after a file name.
And in the Config file a user can then indicate whether a new file should be in utf-8 or any other encoding, something like this:
let NewFileEncoding = "utf-8"


Ben Fritz

Jul 5, 2014, 11:46:51 AM
to vim...@googlegroups.com
On Saturday, July 5, 2014 3:06:50 AM UTC-5, rameo wrote:
> Ben,
>
> Try to write these french words in a file with a latin1 fileencoding:
> bœuf, cœur, manœuvre, œil
> (beef, heart, manoeuvre, eye)
>

When I try writing this in latin1, I get:


"test.txt"
"test.txt" CONVERSION ERROR in line 1; 1L, 30C written
"test.txt" CONVERSION ERROR in line 1; 1L, 30C written

It looks like this is not latin1 at all.

Indeed, looking it up at http://en.wikipedia.org/wiki/ISO/IEC_8859-1 shows that
the 'œ' character is not in Latin1 at all. It is in the Windows-1252 encoding, a
superset of Latin1 ( http://en.wikipedia.org/wiki/Windows-1252 ).

> Close this file.
> Set encoding to utf-8 in your vimrc.
> Open the file.
>
> Encoding is utf-8
> Fileencoding is latin1 (:set fileencoding?), converted is written after the file name.
> But all words have squares.
> (The same file is displayed correctly in Notepad++, which recognizes it as latin1)

Here fileencoding is set to latin1 because your "fileencodings" option has
latin1 as the final fallback if UTF-8 fails (which it will). The file itself is
not in latin1, so reading it as latin1 stumbled on byte 156 (0x9C), which has no
printable character in latin1 but is 'œ' in Windows-1252. Thus the blank squares.

>
> Btw you asked me how I check encoding and fileencoding of a file?
> I have this in my statusline:
> set statusline+=%2*\ E:%{&fileencoding?&fileencoding:&encoding}
> set statusline+=%2*\ F:%{&fileencoding?&fileencoding:&fileencoding}
>

That's good, it will tell you what encoding the file will be saved in, and also
what encoding it got read in. I said earlier 'encoding' doesn't affect the
writing of a file, but that's a simplification. It is used in place of
'fileencoding' if that option is not set, as you apparently have learned.
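
A minimal sketch of a statusline item that makes this fallback visible, for a vimrc. The function name BufferEncodingInfo is a hypothetical helper and not something from the thread; it only reads the two options and does not change them:

```vim
" Hypothetical helper (not from the thread): report the internal encoding
" and the encoding that will be used when the buffer is written.
function! BufferEncodingInfo() abort
  " An empty 'fileencoding' means Vim falls back to 'encoding' on write.
  let l:fenc = empty(&fileencoding) ? &encoding . '(from enc)' : &fileencoding
  return 'E:' . &encoding . ' F:' . l:fenc
endfunction

" %{...} is re-evaluated for each window, using that window's buffer.
set statusline+=\ %{BufferEncodingInfo()}
```

Keeping the logic in a function avoids having to escape spaces and quotes inside the :set value.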

> Years ago I also had problems with utf-8 and switched back to latin1 encoding.
> Recently I switched to utf-8 again, and after a while it messed up my files again (e.g. my vimrc file).
>
> A question:
> Why should there be an encoding and fileencoding? Why not put them together?

Because they are two different concepts. 'encoding' is a Vim internal thing.
Really I have no idea why Vim has this option at all. Every other program out
there gives you no control (and you need no control) over the encoding used
internally to represent data. 'fenc' is really the only thing you should be
using for manipulating files.

> If a file is a latin1 file: encoding and fileencoding has to be in latin1.

Wrong. You can use latin1 (or Windows-1252) files just fine regardless of your
encoding option, as long as the characters within the file can all be
represented in your chosen encoding. UTF-8 should be pretty much universally
usable.

> If a file is an utf-8 file: encoding and fileencoding has to be in utf-8.

Probably correct, I wouldn't want to mess with weird encodings that still
support all the characters in utf-8. Vim uses utf-8 internally for ANY unicode
encoding.

> Without "Conversion" written after a file name.

That "conversion" message just tells you the fileencoding differs from the
internal encoding, so Vim had to convert the bytes. It is never a problem.

> And in the Config file a user can then indicate whether a new file should be
> in utf-8 or any other encoding, something like this:
> let NewFileEncoding = "utf-8"

That would be "setglobal fileencoding=utf-8"

So, here's the real question:

Why does Vim pretend like it read a Windows-1252 file in Latin1 fileencoding,
when 'encoding' is set to "Latin1"?

Windows-1252 is commonly mistaken for Latin1. Windows systems use it by default
in place of Latin1 actually. Vim is set so that a default Windows installation
will "just work". Thus when Vim reads files that Windows pretends are Latin1,
Vim also must pretend they are Latin1 when using default settings.

When Vim must actually do encoding conversions however, it does NOT treat
Windows-1252 as Latin1.

I'm not sure if it was an oversight, or if it is just assumed that users know
what they are doing when they set their 'encoding' to a non-default value, but
when 'encoding' is UTF-8, Vim actually pays strict attention to the file
encoding. Probably it is because Vim must actually convert the file content to
its internal encoding and writing dozens of exceptions and special cases would
be prohibitive. Regardless, in your case, I would change your 'fileencodings'
option to include the Windows-1252 encoding rather than Latin1. Or, you could
manually override the encoding selection for that file.

Using Windows-1252 depends on your system. For Windows, the proper value for
your 'fileencoding' and 'fileencodings' options would be simply "cp1252". On
Linux systems, it changes to "8bit-cp1252".
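
A possible vimrc sketch of that suggestion, assuming the two encoding names are exactly as stated above ("cp1252" on Windows, "8bit-cp1252" elsewhere). The variable s:cp1252 is a hypothetical name used only for illustration, and keeping utf-8 in the list is an assumption that matches rameo's original setting:

```vim
" Pick the Windows-1252 name for this platform, as described above.
if has('win32')
  let s:cp1252 = 'cp1252'
else
  let s:cp1252 = '8bit-cp1252'
endif

" Try a BOM first, then strict UTF-8, then fall back to Windows-1252.
let &fileencodings = 'ucs-bom,utf-8,' . s:cp1252
```

Using "let &fileencodings" rather than "set" makes it possible to build the value from a variable.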

Ben Fritz

Jul 5, 2014, 11:55:15 AM
to vim...@googlegroups.com
On Saturday, July 5, 2014 10:46:51 AM UTC-5, Ben Fritz wrote:
> Regardless, in your case, I would change your 'fileencodings'
> option to include the Windows-1252 encoding rather than Latin1. Or, you could
> manually override the encoding selection for that file.
>
> Using Windows-1252 depends on your system. For Windows, the proper value for
> your 'fileencoding' and 'fileencodings' options would be simply "cp1252". On
> Linux systems, it changes to "8bit-cp1252".

By the way, I'm more of a purist and want my Latin1 files to actually be Latin1,
using cp1252 only occasionally when I know it will work.

For that reason, my Vim config contains this encoding logic (actually this is
simplified from my full config) that will detect files as cp1252 normally, but
reload them as latin1 if none of the "special" characters defined in 1252 but
not Latin1 are used:

if has('multi_byte')
  set encoding=utf-8
  setglobal fenc=latin1

  " Don't detect utf-8 without a BOM by default, I don't use UTF-8 normally
  " and any files in latin1 will detect as UTF. Detect cp1252 rather than
  " latin1 so files are read in correctly.
  set fileencodings=ucs-bom,8bit-cp1252,latin1
  if has('autocmd')
    augroup fenc_detect
      au!

      " Detect when a buffer should actually be latin1 (i.e. there are no cp1252
      " bytes in the buffer). cp1252 is a superset of latin1. See
      " http://en.wikipedia.org/wiki/Cp1252 for details.
      "
      " Since latin1 is a subset of cp1252, this does not ACTUALLY modify the
      " buffer, so bypass the modifiable option.
      let cp1252_latin1_diff =
            \ '\u20AC'. '\u201A'. '\u0192'. '\u201E'. '\u2026'. '\u2020'. '\u2021'. '\u02C6'. '\u2030'. '\u0160'. '\u2039'. '\u0152'. '\u017D'.
            \ '\u2018'. '\u2019'. '\u201C'. '\u201D'. '\u2022'. '\u2013'. '\u2014'. '\u02DC'. '\u2122'. '\u0161'. '\u203A'. '\u0153'. '\u017E'. '\u0178'
      autocmd BufReadPost * let s:oldmod = &modifiable | if !s:oldmod | setlocal modifiable | endif
      autocmd BufReadPost * if &fenc=~?'cp1252$' && search('['.cp1252_latin1_diff.']', 'nw') == 0 | setlocal fenc=latin1 nomodified | endif
      autocmd BufReadPost * if !s:oldmod | setlocal nomodifiable | endif
    augroup END
  endif
endif


For some file types (notably HTML) I use the "autofenc" plugin: http://www.vim.org/scripts/script.php?script_id=2721

rameo

Jul 6, 2014, 6:32:32 AM
to vim...@googlegroups.com
Thank you very much Ben for your great explanation.
Not easy to understand.

I still don't understand why my vimrc and menu.vim, both containing French characters such as "œu", could be read in latin1 in the past, without any problem or error.
(The only encoding line I had in my vimrc file at that time was "set encoding=latin1")

What I also don't understand is that with the above setting, files in latin1 were encoded in latin1 but the fileencoding in my statusline was empty (no fileencoding was indicated by Vim).
If I changed the above setting to "set encoding=utf8", encoding and fileencoding both indicated utf-8. Does Vim take the default Windows encoding as its default encoding?

Every now and then I write something in Russian, which is why it might be better to change the default encoding to utf8, isn't it? I also had trouble using a plugin that uses latin1 as the default encoding.
If I set my default encoding to utf-8, what should "fileencodings" be?
set fileencodings=ucs-bom,utf8,cp1252,latin1?
utf8 at the end or after ucs-bom?
Btw I'm on a Windows OS, so 8bit-cp1252 has to be cp1252, doesn't it?

If my default encoding will be utf-8, is it better to convert vimrc and menu.vim to utf8 as well, so that I don't see "Converted" after the filename every time?
Do you know any good software to convert cp1252 files to utf-8? (I used iconv in the past)

Btw Ben, you noted in your reply "setglobal fileencoding=utf-8"
What difference is there between "set fileencoding=utf-8" and "setglobal fileencoding=utf-8". I thought there was no local buffer encoding setting?

Dominique Pellé

Jul 6, 2014, 9:07:39 AM
to Vim List
rameo wrote:

> Thank you very much Ben for your great explanation.
> Not easy to understand.
>
> I still don't understand why my vimrc and menu.vim,
> both containing French characters such as "œu", could
> be read in latin1 in the past, without any problem
> or error.
> (The only encoding line I had in my vimrc file at
> that time was "set encoding=latin1")

The ligature œ is missing in latin1.
So it's not suitable for French, especially
since œ appears in frequent words.

ISO-8859-15 is almost like latin1 but contains
all French characters. Windows-1252 also
contains œ. But Unicode is preferable
nowadays.

Regards
Dominique

Tony Mechelynck

Jul 6, 2014, 9:09:09 AM
to vim...@googlegroups.com
See also http://vim.wikia.com/wiki/Working_with_Unicode

You seem to have read that article, which I wrote myself, so I'll try to
explain in more detail (I hope not in boring detail) the logic behind
it. Be sure to check the Vim help for anything which would still be unclear.


'encoding' is a global option determining how Vim represents characters
in memory. The right place to set it is in your vimrc, BEFORE loading
any editfile. Once you have started opening a file, changing 'encoding'
makes the contents of ALL your current editfiles invalid, because it is
not possible to convert all the contents of all your loaded buffers from
one encoding to another as a result of your changing that option.


The :scriptencoding ex-command (not mentioned in that wiki page) tells
Vim to override 'encoding' for the purpose of reading the current
script. For instance if your vimrc is encoded in Windows-1252 you can use
scriptencoding Windows-1252
and any bytes between 0x80 and 0xFF in your script will be interpreted
as in Windows-1252 even after you set 'encoding' to UTF-8.
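
As a small sketch of how the top of such a vimrc might look (assuming the vimrc file itself is saved as Windows-1252, and using the "cp1252" name from earlier in the thread, which is the form used on Windows):

```vim
" Choose Vim's internal encoding first, before any file is loaded.
set encoding=utf-8

" Then declare the encoding of this script file itself. Non-ASCII bytes
" in the lines that follow are converted from cp1252 to 'encoding'.
scriptencoding cp1252
```

The declaration applies only to the script that contains it, so each script with non-ASCII content needs its own line.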


'fileencoding' (singular) is a local option. It says how the file in
question will be represented on disk. If 'encoding' is UTF-8
(recommended) and if your Vim can use iconv (i.e., has('iconv') returns 1,
which means you either have +iconv linked in statically, or +iconv/dyn
compiled in and the iconv or libiconv library found at runtime), then
any encoding can be translated to and from UTF-8, and Vim can do just
that when reading and writing. But note that if 'encoding' is set to
UTF-8, and you modify a file to put in it characters not acceptable for
that file's 'fileencoding', Vim will give you no error signal as long as
you don't save the file; so you can change the 'fileencoding' before or
after you change the file contents: as long as they agree when you write
the file it's OK.

If the file contains only bytes less than 0x80, it will be interpreted
identically in any of the following encodings (where those I'm writing
on one line are synonyms, equivalent for Vim with iconv), and in a
number of others:
- us-ascii
- latin1, iso-8859-1
- cp1252, Windows-1252
- latin9, iso-8859-15
- utf-8
so don't be afraid if Vim detects one of your Latin1 files (with no
accented characters, French guillemets, etc.) as being UTF-8. In fact,
with those contents, it could just as well be any of the encodings
mentioned above (or a number of others). If you want to be sure that a
given file remains Latin1 even if you add accented characters to it in
the future, be sure to add some non-ASCII characters in it now (e.g.,
for text, underline the main heading with a line of ÷÷÷÷÷÷÷÷÷÷÷ American
divided-by signs), then save it immediately with one of
:x ++enc=latin1
or
:setl fenc=latin1
:w
Similarly for Windows-1252 or iso-8859-15, but use a different non-ASCII
character, since they both are supersets of Latin1. On a side note,
sometimes I notice that I send an email with headers declaring it to be
8bit utf-8 and that it comes back to me as 7bit us-ascii; the body, in
that case, is byte-for-byte identical. (This one won't, because of the
divided-by signs above. Maybe it'll come back as quoted-printable utf-8,
or even as quoted-printable iso-8859-1.)

To convert a file from one encoding to another (e.g. Windows-1252 to
UTF-8, and assuming that both can be represented in your present
'encoding'), it is extremely easy to do it with Vim (if has('iconv')
returns 1 of course), as follows:
:e ++enc=Windows-1252 filename
:setl fenc=utf-8
:w

You ask what it means to use ":setglobal fileencoding=utf-8". That tells
Vim what 'fileencoding' value to use when you create a new file which
didn't exist before. Or you could use ":setglobal
fileencoding=Windows-1252" which will create files by default in
Windows-1252 encoding, but of course in that case you will get a signal
at write-time (and not before) if you write in the file something that
has no representation in Windows-1252. See ":help local-options".


++enc=something (before the filename in a file-read or file-write
command such as :e or :saveas) tells Vim the 'fileencoding' to use for
this read or write. When reading, it also sets 'fileencoding' (locally)
for the file regardless of the 'fileencodings' heuristics. In spite of
its name, this ++enc modifier has NOTHING TO DO with 'encoding' but only
with 'fileencoding'.


'fileencodings' (plural) is a comma-separated list of values of
'fileencoding' (singular) to be tried when opening an editfile without
the ++enc modifier. They are tested from left to right in sequence:

- ucs-bom (if present) should be first. It will test the first few bytes
of the file against the possible representations of U+FEFF in the
various Unicode encodings. If found, and the rest of the file agrees
with that particular encoding, it will set 'bomb' to true and
'fileencoding' to the corresponding encoding. In that case the
heuristics ends there. Otherwise 'bomb' is set to false and the next
encoding is tried.

- Any multibyte encoding (for instance utf-8) tests the contents of the
file against the admissible character values for that encoding. If an
error is found, the test ends there (gives a "fail" result) and the next
encoding in sequence is tested. If the end of the file is reached with
no error (all bytes and byte sequences are acceptable for that
encoding), 'fileencoding' is set and the heuristics ends.

- An 8-bit encoding can never fail: it will set 'fileencoding' with no
test. IOW there should be at most one 8-bit encoding, and it should be
last. If there is more than one 8-bit encoding, Vim won't give an
error, it will just never try anything (not even a multibyte encoding,
if present) after the first 8-bit encoding.

- The value "default" is special: it means the value from your OS
locale, i.e. the value which 'encoding' had before sourcing any startup
script, even the system vimrc. It may be useful to put it last if you
don't already try an 8-bit encoding before that.
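
The ordering rules above can be sketched as a vimrc fragment. The choice of latin1 as the single 8-bit fallback is only an assumption for illustration; substitute whichever 8-bit encoding your files actually use:

```vim
" Pitfall (left commented out): latin1 is an 8-bit encoding, so it always
" "succeeds" and the utf-8 entry after it is never tried.
" set fileencodings=ucs-bom,latin1,utf-8

" Ordering that follows the rules above: BOM check first, then the
" multibyte encoding that can fail, then exactly one 8-bit fallback.
set fileencodings=ucs-bom,utf-8,latin1

" After opening a file, this shows which candidate was chosen and
" whether a BOM was found:
" :setlocal fileencoding? bomb?
```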


Conclusion:
Vim has no built-in mechanism to sort Windows-1252, iso-8859-15 and
Latin1 apart from each other. They are all 8-bit encodings, and
sometimes one of the former two is used for the latter. You will have,
for each of your files, to know which is which and, if necessary, use
the appropriate ++enc modifier when reading it. This will set
'fileencoding' to what you tell Vim, and the same encoding will be used
when writing. Just make sure that if you guess wrong, you notice it
immediately, and read the file again in another 'fileencoding' before
you modify it.



Best regards,
Tony.

Ben Fritz

Jul 6, 2014, 1:11:45 PM
to vim...@googlegroups.com
On Sunday, July 6, 2014 5:32:32 AM UTC-5, rameo wrote:
> Thank you very much Ben for your great explanation.
>
> Not easy to understand.
>
>

Agreed, encoding stuff is hard to understand in the best of cases. I think Vim's mix of 4 options (enc, fenc, fencs, tenc), one of which (fenc) is "global-local" (it has both a global "default" value and also a buffer-local value) makes it even more confusing. Read Tony's post for a great detailed explanation.

In summary, I think the best method is:

1. Set "encoding" to utf-8 and forget the option even exists
2. Set "fileencodings" to detect the files you edit most. Consider a plugin like autofenc for any rough edges.
3. Pay attention to the intended encoding of your file and set "fileencoding" accordingly if Vim guesses wrong.

You can forget about "termencoding" unless you use Vim in a terminal a lot and you have encoding/display problems.

>
> I still don't understand why my vimrc and menu.vim, both containing French characters such as "œu", could be read in latin1 in the past, without any problem or error.
>
> (The only encoding line I had in my vimrc file at that time was "set encoding=latin1")
>

Again, Windows likes to pretend cp1252 (a.k.a. Windows-1252) and Latin1 are synonymous. They are not, but Vim treats them as such if Vim's encoding is set to Latin1 (default value on most English Windows installations).

>
>
> What I also don't understand is that with the above setting, files in latin1 were encoded in latin1 but the fileencoding in my statusline was empty (no fileencoding was indicated by Vim).
>
> If I changed the above setting to "set encoding=utf8", encoding and fileencoding both indicated utf-8. Does Vim take the default Windows encoding as its default encoding?
>
>

The default encoding when saving a file, if the 'fileencoding' option is empty, matches Vim's 'encoding' option. Like I said above though, you should really forget that the 'encoding' option even exists, so your fileencoding should normally be set to something.

That's what the "setglobal fileencoding=utf-8" command I suggested is for. Or substitute your preferred default encoding.

>
> Every now and then I write something in Russian, which is why it might be better to change the default encoding to utf8, isn't it? I also had trouble using a plugin that uses latin1 as the default encoding.
>

I suggest "encoding" as utf-8 regardless of whether you are writing files with special characters. With a utf-8 encoding, you can set various Vim options like 'listchars' and 'showbreak' to fancy Unicode characters instead of boring ASCII characters, plugins can show fancy arrows and the like for their UI, and other such niceties. True Latin1 should not give you any trouble with a Vim running in utf-8 mode, and if it does, a simple "scriptencoding" added at the top of the plugin as Tony details will fix that.

> If I set my default encoding to utf-8, what should "fileencodings" be?
>
> set fileencodings=ucs-bom,utf8,cp1252,latin1?
>
> utf8 at the end or after ucs-bom?
>

utf-8 must come before any of the fixed 8-bit encodings as Tony says. I left it out of my config entirely because I didn't want my Latin1 files detected as UTF-8. I usually set the "bomb" option on my UTF-8 files so that the "ucs-bom" portion will detect my UTF-8 files. For files where a BOM is not valid (e.g. HTML) I have the AutoFenc plugin.

> Btw I'm on a Windows OS, so 8bit-cp1252 has to be cp1252, doesn't it?
>

Yes, sorry about that. I'm normally on Windows too, it didn't occur to me I would have tweaked my config for the Linux system I'm on at the moment.

>
>
> If my default encoding will be utf-8, is it better to convert vimrc and menu.vim to utf8 as well, so that I don't see "Converted" after the filename every time?

"converted" is for files you edit that don't match your global encoding. It doesn't have anything to do with scripts.

You could convert all your scripts, but it can be easier (and clearer as to the intended encoding) to add a "scriptencoding cp1252" or something to the top of each file.

>
> Do you know any good software to convert cp1252 files to utf-8? (I used iconv in the past)
>
>

Vim can do it, if compiled with multibyte support (I have yet to see a Windows Vim without it).

gvim -N -u NONE -i NONE
:set encoding=utf-8
:e ++enc=cp1252 blah.txt
:setlocal fileencoding=utf-8
:wq

>
> Btw Ben, you noted in your reply "setglobal fileencoding=utf-8"
>
> What difference is there between "set fileencoding=utf-8" and "setglobal fileencoding=utf-8". I thought there was no local buffer encoding setting?

There is no buffer-local "encoding" option. "fileencoding" (the one that actually exists for daily use ;-) ) is almost entirely buffer-local but using "set" instead of "setlocal" will affect the default value for new buffers. The setglobal command just sets the default value without changing the value for the current buffer.
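
A short sketch of the difference, as Ex commands typed in an open buffer (cp1252 is just an example value here):

```vim
" Default for buffers created from now on; the current buffer is untouched.
:setglobal fileencoding=utf-8

" This buffer only; the next :w writes it as cp1252.
:setlocal fileencoding=cp1252

" Compare the two values.
:setglobal fileencoding?
:setlocal fileencoding?
```

Changing 'fileencoding' on a loaded buffer marks it as modified, since the bytes on disk would differ once it is written.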