opening a Unicode file


msorens

Sep 13, 2007, 6:13:21 PM
to vim_multibyte
I have been using gvim for several years and vi before that for a
couple decades. I thought I understood how to wade through the rather
terse documentation, though there are still quite a few features I
have not touched.

Recently I wanted to try to read a Unicode file in vim. What could be
simpler, I thought? Well, I was unsuccessful with vim help. So I
searched the web and came across variations of the same settings to
add to _vimrc, as shown below. But I still was unsuccessful in just
opening an existing Unicode file.

I guess I should make sure that my file is Unicode: I opened it in vim
in binary mode, then ran xxd, and observed a null byte after every
character. (In non-binary mode I see an up-arrow followed by an @-sign
after each character.) This *is* Unicode, right?

The file opens just fine in Notepad; what is the secret about having
it "just open right" in vim?

===============================================
if has("multi_byte")        " if not, we need to recompile
  if &enc !~? '^u'          " if the locale 'encoding' starts with u or U
                            " then Unicode is already set
    if &tenc == ''
      let &tenc = &enc      " save the keyboard charset
    endif
    set enc=utf-8           " to support Unicode fully, we need to be able
                            " to represent all Unicode codepoints in memory
  endif
  set fencs=ucs-bom,utf-8,latin1
  setg bomb                 " default for new Unicode files
  setg fenc=latin1          " default for files created from scratch
else
  echomsg 'Warning: Multibyte support is not compiled-in.'
endif
===============================================
if has("multi_byte")
  if &termencoding == ""
    let &termencoding = &encoding
  endif
  set encoding=utf-8
  setglobal fileencoding=utf-8 bomb
  set fileencodings=ucs-bom,utf-8,latin1
endif
===============================================

Tony Mechelynck

Sep 13, 2007, 6:45:59 PM
to vim_mu...@googlegroups.com
msorens wrote:
> I have been using gvim for several years and vi before that for a
> couple decades. I thought I understood how to wade through the rather
> terse documentation, though there are still quite a few features I
> have not touched.
>
> Recently I wanted to try to read a Unicode file in vim. What could be
> simpler, I thought? Well, I was unsuccessful with vim help. So I
> searched the web and came across variations of the same settings to
> add to _vimrc, as shown below. But I still was unsuccessful in just
> opening an existing Unicode file.
>
> I guess I should make sure that my file is Unicode: I opened it in vim
> in binary mode, then ran xxd, and observed a null byte after very
> character. (In non-binary mode I see an up-arrow followed by an @-sign
> after each character.) This *is* Unicode, right?

This is just one of the possible encodings of Unicode. From what you
describe, it could be either UCS-2 (which is not a UTF, i.e. a Unicode
Transformation Format: it represents U+0000 to U+FFFF as one 16-bit word
each but cannot represent anything above that), or UTF-16 (which represents
U+0000 to U+FFFF as one 16-bit word each, and U+10000 to U+10FFFF by means
of pairs of "surrogate" code units in the range U+D800 to U+DFFF).

There are other Unicode encodings: UTF-32 (which represents each
Unicode codepoint by one 32-bit doubleword), UTF-8 (which represents Unicode
codepoints by a variable number of 8-bit bytes each) and even GB18030 (which
can represent all Unicode codepoints, but is optimized in favour of Chinese,
while UTF-8 is optimized in favour of West-European Latin scripts, especially
English).

>
> The file opens just fine in Notepad; what is the secret about having
> it "just open right" in vim?

:e ++enc=utf-16 filename

This does the equivalent of ":setlocal fileencoding=utf-16" at the same time
as reading the file (after reading the file would be too late). If you still
see garbled gobbledygook, it may mean that the file's endianness (i.e., which
byte comes first in a 16-bit word) is not the same as whatever Vim uses as
default. In that case, replace "utf-16" above by either "utf-16be" (big
endian: high byte first) or "utf-16le" (little endian: low byte first).
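
For illustration only, the sequence might look like the sketch below. It
assumes the file turns out to be little-endian, that 'encoding' is already
utf-8 so nothing is lost in conversion, and that "filename" and
"filename.utf8" are placeholder names. If the buffer has unsaved changes,
":e!" would be needed instead of ":e".

```vim
" Re-read the file, decoding it as little-endian UTF-16.
:e ++enc=utf-16le filename
" If the text now looks right, optionally write a UTF-8 copy
" under a new name; the original file is left untouched.
:w ++enc=utf-8 filename.utf8
```

The ++enc argument applies only to that one read or write, so nothing
in the vimrc has to change just to try a different encoding.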

See
:help ++opt
:help 'fileencoding'
:help mbyte-encoding


Best regards,
Tony.
--
It is too bad that the speed of light hasn't kept pace with the
changes in CPU speed and network bandwidth. -- <wie...@porcupine.org>

John (Eljay) Love-Jensen

Sep 14, 2007, 7:37:20 AM
to vim_mu...@googlegroups.com
Hi Tony,

> :e ++enc=utf-16 filename

Thanks Tony! I've been wondering how to do that!

Note: if the utf-16 file contains a BOM (which, often, it should/will), then it should not be necessary to specify utf-16le or utf-16be explicitly (and, indeed, would be incorrect according to Unicode standards to do so -- Vim probably does the friendly thing anyway).

I say this not for Tony's edification, because I'm sure that he already knows this, but for everyone else who may be in msorens's situation.

Also if you need to make sure the file is written with BOM you can use:

:set bomb

Or without the BOM:

:set nobomb

For some light reading on Unicode 5.0:

http://www.amazon.com/dp/0321480910/

HTH,
--Eljay


Tony Mechelynck

Sep 14, 2007, 11:46:42 AM
to vim_mu...@googlegroups.com
John (Eljay) Love-Jensen wrote:
> Hi Tony,
>
>> :e ++enc=utf-16 filename
>
> Thanks Tony! I've been wondering how to do that!
>
> Note: if the utf-16 file contains a BOM (which, often, it should/will), then it should not be necessary to specify utf-16le or utf-16be explicitly (and, indeed, would be incorrect according to Unicode standards to do so -- Vim probably does the friendly thing anyway).

If any Unicode file (here I mean UTF-8, UTF-16le, UTF-16be, UTF-32le or
UTF-32be -- I'll leave out GB18030 for the moment) starts with a BOM, Vim will
recognise it _provided_ that your 'fileencodings' (plural) starts with
"ucs-bom". In order for it to work properly, though, 'encoding' should already
be UTF-8 (or UTF-16 or UTF-32, which Vim handles internally as UTF-8 to avoid
problems with null bytes terminating C strings).

Specifying explicitly that a file is, for instance, UTF-16le is IMHO not
"wrong" (unless the file is actually in some other encoding, of course); it is
just "unnecessary" if the file starts with a BOM.
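
When detection does not seem to work, one way to see what Vim actually did
is to ask it. This is a rough diagnostic sketch, not a fix; it only uses
the options discussed in this thread and assumes it is typed after the
file has been opened.

```vim
" Show the current values and which script last set them.
:verbose set encoding? fileencodings?
" Show what Vim decided for the current buffer:
" the encoding it read the file as, and whether a BOM was found.
:setlocal fileencoding? bomb?
```

If 'fileencodings' does not start with "ucs-bom", or 'encoding' is not a
Unicode encoding, the vimrc settings were probably not applied.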

>
> I say this not for Tony's edification, because I'm sure that he already knows this, but for everyone else who may be in msorens's situation.

:-)

>
> Also if you need to make sure the file is written with BOM you can use:
>
> :set bomb
>
> Or without the BOM:
>
> :set nobomb

...and if you want to make sure that "newly created" Unicode files will (or
won't) have a BOM by default you can write

setglobal bomb
or
setglobal nobomb

in your vimrc. (I use ":setglobal bomb" but YMMV.) This setting has no
influence on non-Unicode files such as those in Latin1.

>
> For some light reading on Unicode 5.0:
>
> http://www.amazon.com/dp/0321480910/

For serious reading, see also http://www.unicode.org/ -- and others.

>
> HTH,
> --Eljay

Best regards,
Tony.
--
99 blocks of crud on the disk,
99 blocks of crud!
You patch a bug, and dump it again:
100 blocks of crud on the disk!

100 blocks of crud on the disk,
100 blocks of crud!
You patch a bug, and dump it again:
101 blocks of crud on the disk! ...

mbbill

Sep 14, 2007, 12:53:40 PM
to vim_multibyte

Camillo Särs

Sep 15, 2007, 11:33:50 AM
to vim_mu...@googlegroups.com
Tony Mechelynck wrote:
> ...and if you want to make sure that "newly created" Unicode files will (or
> won't) have a BOM by default you can write
>
> setglobal bomb
> or
> setglobal nobomb
>
> in your vimrc. (I use ":setglobal bomb" but YMMV.) This setting has no
> influence on non-Unicode files such as those in Latin1.

Beware, though, that if your environment defaults to utf-8 file
encoding, then setting "bomb" will cause the BOM to be written to all
new files. This can become a problem when dealing with some legacy
applications that don't expect to see those extra bytes at the
beginning. Examples range from *nix shells and hashbang (#!) processing
to Windows .ini file headings [...].

So this setting may indeed cause some legacy apps to "bomb" on you.
Pardon the pun, but I thought it was hilarious once I got over the "duh"
factor after debugging.
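
One possible way around it, as a sketch only: keep "setglobal bomb" as the
default but switch it off for new files of the sensitive kinds. The group
name and the file patterns below are made up for illustration; scripts
without an extension would not be matched by them.

```vim
" Hypothetical vimrc fragment: no BOM for new files that legacy
" tools are known to choke on. Adjust the patterns to taste.
if has("autocmd")
  augroup NoBomForLegacy
    autocmd!
    autocmd BufNewFile *.sh,*.ini setlocal nobomb
  augroup END
endif
```

For an existing buffer, ":setlocal nobomb" followed by ":w" removes the BOM
the next time the file is written.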

Regards,
Camillo
--
Camillo Särs <g...@iki.fi> Aim for the impossible and you
http://www.ged.fi will achieve the improbable
