":set enc=utf-16" causes encoding conversion problem.


Yukihiro Nakadaira

Jun 8, 2009, 10:58:31 AM
to vim...@googlegroups.com
When 'encoding' is utf-16 (or ucs-2 or ucs-4) and 'fileencoding' is
utf-8, Vim converts encoding wrongly.

Steps to reproduce:
  :set enc=utf-16
  :e ++enc=utf-8 utf8-multi-byte-text.txt
    (At this time, file is properly loaded)
  :w
Then, Vim breaks the file.

It seems that Vim converts the internal utf-8 text from latin1 to utf-8.

This problem doesn't occur when fenc is not utf-8.
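
To make the suspected failure concrete, here is a minimal Python sketch of what a "latin1 to utf-8" conversion does to text that is already UTF-8. It only simulates the symptom described above and says nothing about Vim's actual write path; the sample string is an assumption.

```python
# Illustration only: simulates the suspected double conversion in Python.
# Three CJK characters, written as escapes; any multi-byte text would do.
original = "\u65e5\u672c\u8a9e"
utf8_bytes = original.encode("utf-8")      # 3 bytes per character

# Treat every UTF-8 byte as one latin1 character, then encode as UTF-8 again.
broken = utf8_bytes.decode("latin-1").encode("utf-8")

print(len(utf8_bytes), len(broken))        # the byte count doubles
print(broken.decode("utf-8") == original)  # the text no longer matches

# The damage can be undone by reversing the two steps, provided nothing
# else has modified the bytes in between.
repaired = broken.decode("utf-8").encode("latin-1")
print(repaired == utf8_bytes)
```

If a broken file shows every non-ASCII byte expanded into two bytes like this, that would be consistent with the latin1 guess.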

--
Yukihiro Nakadaira - yukihiro....@gmail.com

Bram Moolenaar

Jun 8, 2009, 3:35:09 PM
to Yukihiro Nakadaira, vim...@googlegroups.com

Yukihiro Nakadaira wrote:

> When 'encoding' is utf-16 (or ucs-2 or ucs-4) and 'fileencoding' is
> utf-8, Vim converts encoding wrongly.
>
> Steps to reproduce:
> :set enc=utf-16
> :e ++enc=utf-8 utf8-multi-byte-text.txt
> (At this time, file is properly loaded)
> :w
> Then, Vim breaks the file.
>
> It seems that Vim converts the internal utf-8 text from latin1 to utf-8.
>
> This problem doesn't occur when fenc is not utf-8.

I'll put it in the todo list. Using utf-16 for 'encoding' is rather
unusual, but we must not destroy the file contents even in rare
situations.

--
The goal of science is to build better mousetraps.
The goal of nature is to build better mice.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Matt Wozniski

Jun 8, 2009, 4:43:33 PM
to vim...@googlegroups.com
On Mon, Jun 8, 2009 at 3:35 PM, Bram Moolenaar wrote:
>
> Yukihiro Nakadaira wrote:
>
>> When 'encoding' is utf-16 (or ucs-2 or ucs-4) and 'fileencoding' is
>> utf-8, Vim converts encoding wrongly.
>>
>> Steps to reproduce:
>>   :set enc=utf-16
>>   :e ++enc=utf-8 utf8-multi-byte-text.txt
>>     (At this time, file is properly loaded)
>>   :w
>> Then, Vim breaks the file.
>>
>> It seems that Vim converts the internal utf-8 text from latin1 to utf-8.
>>
>> This problem doesn't occur when fenc is not utf-8.
>
> I'll put it in the todo list.  Using utf-16 for 'encoding' is rather
> unusual, but we must not destroy the file contents even in rare
> situations.

Well, keeping in mind that vim will use utf-8 internally even if you
explicitly :set enc=utf-16, maybe the best fix would be to always
change &encoding to 'utf-8' whenever doing a :set
encoding=SomethingUnicode? It seems like it would fix this bug. This
bug, as far as I can tell from a quick glance, is because vim tries to
convert from UTF-16 (&enc) to UTF-8 (&fenc) when writing the file, and
since the buffer is being internally stored as UTF-8 this is the wrong
thing to do.
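
A rough Python sketch of that kind of mismatch: the bytes are UTF-8, but the converter is told they are UTF-16. This is an illustration of the hypothesis, not Vim's code; the sample text and the little-endian byte order are arbitrary assumptions.

```python
# Illustration only: a UTF-16 -> UTF-8 conversion applied to bytes that
# are in fact already UTF-8.
text = "\u00e9\u00e9"                    # two accented letters (sample text)
buffer_bytes = text.encode("utf-8")      # 4 bytes: c3 a9 c3 a9

# Wrong: reinterpret the UTF-8 bytes as UTF-16 code units, then transcode.
wrong = buffer_bytes.decode("utf-16-le").encode("utf-8")

# Right: the bytes are already in the target encoding, so write them as-is.
right = buffer_bytes

print(wrong.hex(), right.hex())
print(wrong == right)
```

With an odd number of bytes the wrong path would not even decode, which matches the kind of failure a strict converter reports.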

~Matt

Bram Moolenaar

Jun 10, 2009, 7:24:38 AM
to Matt Wozniski, vim...@googlegroups.com

Matt Wozniski wrote:

> On Mon, Jun 8, 2009 at 3:35 PM, Bram Moolenaar wrote:
> >
> > Yukihiro Nakadaira wrote:
> >
> >> When 'encoding' is utf-16 (or ucs-2 or ucs-4) and 'fileencoding' is
> >> utf-8, Vim converts encoding wrongly.
> >>
> >> Steps to reproduce:
> >>   :set enc=utf-16
> >>   :e ++enc=utf-8 utf8-multi-byte-text.txt
> >>     (At this time, file is properly loaded)
> >>   :w
> >> Then, Vim breaks the file.
> >>
> >> It seems that Vim converts the internal utf-8 text from latin1 to utf-8.
> >>
> >> This problem doesn't occur when fenc is not utf-8.
> >

> > I'll put it in the todo list. Â Using utf-16 for 'encoding' is rather


> > unusual, but we must not destroy the file contents even in rare
> > situations.
>
> Well, keeping in mind that vim will use utf-8 internally even if you
> explicitly :set enc=utf-16, maybe the best fix would be to always
> change &encoding to 'utf-8' whenever doing a :set
> encoding=SomethingUnicode? It seems like it would fix this bug. This
> bug, as far as I can tell from a quick glance, is because vim tries to
> convert from UTF-16 (&enc) to UTF-8 (&fenc) when writing the file, and
> since the buffer is being internally stored as UTF-8 this is the wrong
> thing to do.

The main reason one would set 'encoding' to utf-16 is when this should
be the default file format. On MS-Windows some files are utf-16, if you
are editing a whole bunch of them this could be useful (even though
using utf-8 should work).

I don't think finding one bug is a good reason to drop support for this.
It's probably easy to fix.

--
hundred-and-one symptoms of being an internet addict:
15. Your heart races faster and beats irregularly each time you see a new WWW
site address in print or on TV, even though you've never had heart
problems before.

Matt Wozniski

Jun 11, 2009, 9:14:22 AM
to vim...@googlegroups.com
Bram Moolenaar wrote:

>
> Matt Wozniski wrote:
>
> >
> > Well, keeping in mind that vim will use utf-8 internally even if you
> > explicitly :set enc=utf-16, maybe the best fix would be to always
> > change &encoding to 'utf-8' whenever doing a :set
> > encoding=SomethingUnicode? It seems like it would fix this bug. This
> > bug, as far as I can tell from a quick glance, is because vim tries to
> > convert from UTF-16 (&enc) to UTF-8 (&fenc) when writing the file, and
> > since the buffer is being internally stored as UTF-8 this is the wrong
> > thing to do.
>
> The main reason one would set 'encoding' to utf-16 is when this should
> be the default file format. On MS-Windows some files are utf-16, if you
> are editing a whole bunch of them this could be useful (even though
> using utf-8 should work).

Well, that's another thing that has never worked, then. When 'enc' is
'utf-16' and 'fenc' is unset, files are written out in utf-8, not
utf-16.

Simple testcase:

  vim -u NONE -N --cmd 'set enc=utf-16 fenc= | exe "normal! i\<C-k>`e" | w !iconv -f utf-16' -c 'q!'
  iconv: incomplete character or shift sequence at end of buffer
  shell returned 1

Change the '-f utf-16' to '-f utf-8' and iconv confirms that it's being
passed valid utf-8.
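
The same check can be approximated in Python without iconv. This is only a sketch: it assumes the piped data is the UTF-8 encoding of U+00E8 followed by a newline, and `is_valid` is a hypothetical helper, not part of any tool mentioned here.

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Assumed output of the :w command above: one accented character plus
# the trailing newline, encoded as UTF-8 (3 bytes in total).
written = "\u00e8\n".encode("utf-8")

# Three bytes cannot be a whole number of 16-bit code units, so a UTF-16
# decoder stops with an incomplete character, much like iconv does.
print(is_valid(written, "utf-16"))
print(is_valid(written, "utf-8"))
```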

Is the desired behavior even well defined? The docs seem to contradict;
:help 'encoding' says:

    When "unicode", "ucs-2" or "ucs-4" is used, Vim internally uses utf-8.

but :help 'fileencoding' says:

    When 'fileencoding' is empty, the same value as 'encoding' will be
    used (no conversion when reading or writing a file).

In this case, 'fileencoding' is empty, but conversion *is* supposed to
occur when writing the file (from the internal utf-8 buffer to the
'encoding' utf-16).

> I don't think finding one bug is a good reason to drop support for this.
> It's probably easy to fix.

~Matt

Tony Mechelynck

Jun 15, 2009, 6:36:36 AM
to vim...@googlegroups.com
On 11/06/09 15:14, Matt Wozniski wrote:
>
> Bram Moolenaar wrote:
>>
>> Matt Wozniski wrote:
>>
>>>
>>> Well, keeping in mind that vim will use utf-8 internally even if you
>>> explicitly :set enc=utf-16, maybe the best fix would be to always
>>> change &encoding to 'utf-8' whenever doing a :set

I'm not Bram, so take my opinions below with a grain of salt; however,
after attentively reading the Vim multibyte docs for years, I believe
that the "desired" (or at least the "least surprising") behaviour would be:

- If 'encoding' is one of ucs-2, ucs-2le, utf-16, utf-16le, ucs-4,
ucs-4le (or utf-32, utf-32le which are aliases for ucs-4 ucs-4le; or the
*be aliases for ucs-? utf-??), use utf-8 internally, but convert between
utf-8 and 'encoding' when reading and writing if 'fileencoding' is
empty. Vim ought to be able to do these conversions without calling
iconv, they are trivial (the "least trivial" of them, I think, is when
converting between UTF-16 surrogate pairs and UTF-8 representation for
codepoints in the range U+10000 - U+10FFFF, but even that is systematic,
and documented unambiguously on the Unicode site and, IIRC, even on
Wikipedia).

- With the same values of 'encoding', when 'fileencoding' is nonempty,
always pass UTF-8 to represent the "internal encoding" when invoking
iconv for reading or writing. The same of course applies when
"bypassing" iconv, e.g. when 'fileencoding' is latin1.

- With other values of 'encoding' (including utf-8), 'encoding'
represents the actual memory representation. This is the "general case"
and is what is documented wherever the Vim help doesn't explicitly
mention the opposite.
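
As an illustration of how systematic the surrogate-pair case in the first point is, here is a small Python sketch of the arithmetic, cross-checked against Python's own codecs. The function names are hypothetical and this is not how Vim implements it.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a code point U+10000..U+10FFFF."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_utf8_4byte(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    return bytes([
        0xF0 | (cp >> 18),            # leading byte carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # each continuation byte carries 6 bits
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# U+1D11E (MUSICAL SYMBOL G CLEF) is the pair D834 DD1E in UTF-16.
cp = decode_surrogate_pair(0xD834, 0xDD1E)
utf8 = encode_utf8_4byte(cp)

# Cross-check against the built-in UTF-8 encoder.
print(hex(cp), utf8.hex(), utf8 == chr(cp).encode("utf-8"))
```

Code points below U+10000 need no surrogate handling at all, which is why this pair case is the least trivial part of the conversion.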


Best regards,
Tony.
--
Line Printer paper is strongest at the perforations.
