Trouble getting started with vim and utf-8 file


DanKegel

Apr 8, 2011, 1:33:07 AM
to vim_multibyte
The file http://winetricks.org/winetricks is, I hope, a utf-8 file,
but is not recognized as such in the vim that comes
with ubuntu 11.04 (with german locale, even).
It's mostly ascii, with just a few non-ascii lines, e.g.

# If you do not see an o with two dots over it here [ö], stop!
...
mymenu="$HOME/.local/share/applications/wine/Programs/
Electronic Arts/The Sims Medieval/The Sims™ Medieval.desktop"

That first line contains an o umlaut, and the second line contains the
trademark symbol.

Opening the file with vi winetricks shows

# If you do not see an o with two dots over it here [ö], stop!
...
mymenu="$HOME/.local/share/applications/wine/Programs/
Electronic Arts/The Sims Medieval/The Simsâ<84>¢ Medieval.desktop"

which isn't right. Just opening up vi with no arguments, and doing
!!cat winetricks
brings the file in great, and the utf-8 chars look good, but then
saving it complains
"winetricks" CONVERSION ERROR in line 12328; 14640 lines, 496509
characters written
and yields a very corrupt file.

So what's going on? It seems that vim has decided the file Is Not
UTF-8. :se shows
fileencoding=latin1
fileencodings=ucs-bom,utf-8,default,latin1
even if I put
set encoding=utf8 fileencoding=utf8
in ~/.vimrc.

Help...

Thanks,
Dan

Aleksey

Apr 8, 2011, 3:07:59 AM
to vim_multibyte
Here's what I've found.

Opening this file in gVim doesn't display it correctly. The detected
encoding is cp1251 (with my config).

issuing this command
:e ++enc=utf-8

worked and displayed the TM symbol, but it also gave a warning about an
illegal byte at line 7388, which looked like this:
title="?Torrent 3.0" \
The previous section had µ, so I just replaced the illegal char with it.

Saving and reopening from the command line then works fine, with the
encoding detected.

This doesn't answer your question; it's just a workaround.

On Apr 8, 9:33 am, DanKegel <daniel.r.ke...@gmail.com> wrote:
> The file http://winetricks.org/winetricks is, I hope, a utf-8 file,

Dan Kegel

Apr 8, 2011, 5:51:30 PM
to Tony Mechelynck, vim_mu...@googlegroups.com
Thanks very much, guys!

Tony Mechelynck

Apr 8, 2011, 3:22:49 AM
to vim_mu...@googlegroups.com, DanKegel
On 08/04/11 07:33, DanKegel wrote:
> The file http://winetricks.org/winetricks is, I hope, a utf-8 file,
> but is not recognized as such in the vim that comes
> with ubuntu 11.04 (with german locale, even).
> It's mostly ascii, with just a few non-ascii lines, e.g.
>
> # If you do not see an o with two dots over it here [ö], stop!

> ...
> mymenu="$HOME/.local/share/applications/wine/Programs/
> Electronic Arts/The Sims Medieval/The Sims™ Medieval.desktop"

>
> That first line contains an o umlaut, and the second line contains the
> trademark symbol.
>
> Opening the file with vi winetricks shows
>
> # If you do not see an o with two dots over it here [ö], stop!
> ...
> mymenu="$HOME/.local/share/applications/wine/Programs/
> Electronic Arts/The Sims Medieval/The Simsâ<84>¢ Medieval.desktop"

>
> which isn't right. Just opening up vi with no arguments, and doing
> !!cat winetricks
> brings the file in great, and the utf-8 chars look good, but then
> saving it complains
> "winetricks" CONVERSION ERROR in line 12328; 14640 lines, 496509
> characters written
> and yields a very corrupt file.
>
> So what's going on? It seems that vim has decided the file Is Not
> UTF-8. :se shows
> fileencoding=latin1
> fileencodings=ucs-bom,utf-8,default,latin1
> even if I put
> set encoding=utf8 fileencoding=utf8
> in ~/.vimrc.
>
> Help...
>
> Thanks,
> Dan
>

I've downloaded that file in my browser, then tried to open it in Vim,
which does not see it as UTF-8 even though I have 'enc' set to utf-8 and
'fencs' set to ucs-bom,utf-8,latin1

Intrigued, I hit 8g8 which brings me to line 7388 column 11 where the
character µ ("micro" prefix, similar to Greek mu, 0xB5) cannot be UTF-8
(bytes in the range 0x80 to 0xBF can only exist in UTF-8 as "trailing
bytes" in a multibyte sequence whose first byte is 0xC0 or higher).
Moving the cursor one position right and repeating gives me only a beep,
so this is AFAICT the only illegal character in the file -- but one
illegal byte in the whole file is enough to reject UTF-8 as the file's
'fileencoding'.
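
For anyone who wants to run the same check outside Vim, here is a minimal Python sketch of roughly what 8g8 reports: the position of the first byte that cannot be decoded as UTF-8. The function name `find_invalid_utf8` and the sample bytes are made up for illustration; they are not taken from the winetricks file, and the column is counted in bytes, starting at 1.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, byte_column, byte_value) of the first byte that
    breaks UTF-8 decoding, or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start                               # index of the offending byte
        line = data.count(b"\n", 0, offset) + 1          # 1-based line number
        line_start = data.rfind(b"\n", 0, offset) + 1    # index where that line begins
        column = offset - line_start + 1                 # 1-based byte column
        return line, column, data[offset]
    return None


# Hypothetical sample: one valid UTF-8 line, then a line holding a raw 0xB5 byte.
sample = "ok [ö]\n".encode("utf-8") + b'title="\xb5Torrent 3.0" \\\n'

result = find_invalid_utf8(sample)
if result is None:
    print("valid UTF-8")
else:
    line, column, value = result
    print(f"line {line}, byte column {column}: 0x{value:02X}")
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in.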

Rereading the file with

:view ++enc=utf-8

reads it as UTF-8 at the cost of an error message about line 7388, where
the µ is now replaced by a question mark (but the o-umlaut at line 71
appears as ö).

It seems that your file is in UTF-8 at line 71 but in Latin1 at line
7388, which means that it is the file's fault, not Vim's fault, that
such a file cannot be displayed correctly.
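
If the goal is to repair such a mixed file rather than just view it, one possible approach (a sketch, not something Vim does for you) is to decode as UTF-8 and treat only the undecodable bytes as Latin1, then write the result back as clean UTF-8. The handler name `latin1_fallback` and the sample bytes below are my own assumptions for illustration. It only makes sense when you already know the stray bytes really are Latin1.

```python
import codecs


def latin1_fallback(err):
    """Error handler: reinterpret the undecodable bytes as Latin-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("latin-1"), err.end


# The handler name is arbitrary; it just has to match the errors= argument below.
codecs.register_error("latin1_fallback", latin1_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 for invalid bytes only."""
    return data.decode("utf-8", errors="latin1_fallback")


# Hypothetical sample: UTF-8 lines around one line with a raw Latin-1 0xB5 byte.
mixed = (
    "here [ö]\n".encode("utf-8")
    + b'title="\xb5Torrent 3.0"\n'
    + "The Sims™\n".encode("utf-8")
)

fixed = decode_mixed(mixed)
print(fixed)
clean_bytes = fixed.encode("utf-8")  # now valid UTF-8 throughout
```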

See
:help 8g8
:help ++opt


Best regards,
Tony.
--
Never hit a man with glasses. Hit him with a baseball bat.

John Beckett

Apr 8, 2011, 8:13:41 PM
to vim_mu...@googlegroups.com
DanKegel wrote:
> The file http://winetricks.org/winetricks is, I hope, a utf-8
> file, but is not recognized as such in the vim that comes
> with ubuntu 11.04 (with german locale, even).

It looks like you created that file, so you need to fix it
because it is not UTF-8.

Downloading the file with wget and dumping the bytes shows that
the character which I have shown as "?" in the following is not
valid UTF-8:
title="?Torrent 3.0" \

That single byte is hex B5 or binary 10110101. That starts with
"10" which is never valid as the first byte of a character in
UTF-8.

BTW you can find that in Vim by opening the file and typing 8g8
which jumps to the next illegal byte sequence, then typing ga to
show the value.
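
The bit-pattern argument can be written out as a small Python sketch. `utf8_byte_role` is a hypothetical helper, and it classifies by leading bits only: a few values such as 0xC0, 0xC1 and 0xF5 and above never occur in well-formed UTF-8 even where they match a lead pattern here.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify a single byte by its leading bits, as UTF-8 would read it."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx: complete one-byte character
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx: only valid after a lead byte
    if byte < 0xE0:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if byte < 0xF0:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    if byte < 0xF8:
        return "lead-4"         # 11110xxx: starts a four-byte sequence
    return "invalid"            # 11111xxx: never used in UTF-8


for value in (0xB5, 0xC3, 0xE2):
    print(f"0x{value:02X} = {value:08b} -> {utf8_byte_role(value)}")
```

A lone 0xB5 lands in the "continuation" range with no lead byte before it, which is why it cannot begin a character.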

John

Dan Kegel

Apr 8, 2011, 8:58:10 PM
to vim_mu...@googlegroups.com, John Beckett
On Sat, Apr 9, 2011 at 12:13 AM, John Beckett <johnb....@gmail.com> wrote:
> It looks like you created that file, so you need to fix it
> because it is not UTF-8.
>
> Downloading the file with wget and dumping the bytes shows that
> the character which I have shown as "?" in the following is not
> valid UTF-8:
>   title="?Torrent 3.0" \
>
> That single byte is hex B5 or binary 10110101. That starts with
> "10" which is never valid as the first byte of a character in
> UTF-8.
>
> BTW you can find that in Vim by opening the file and typing 8g8
> which jumps to the next illegal byte sequence, then typing ga to
> show the value.

Yeah, that's what I gathered from the other replies.
Thanks!
- Dan
