Steps to reproduce:
:set enc=utf-16
:e ++enc=utf-8 utf8-multi-byte-text.txt
(At this time, file is properly loaded)
:w
Then, Vim breaks the file.
It seems that Vim converts the internal utf-8 text from latin1 to utf-8.
This problem doesn't occur when fenc is not utf-8.
--
Yukihiro Nakadaira - yukihiro....@gmail.com
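A minimal sketch of the suspected failure, assuming the report's guess is right and the buffer's utf-8 bytes are read as latin1 and then encoded to utf-8 a second time. The sample text and variable names are illustrative only; this models the symptom, not Vim's code.

```python
# Illustration only: models the symptom described above, not Vim's code.
original = "日本語"                      # sample multi-byte text (assumption)
utf8_bytes = original.encode("utf-8")   # what a correct utf-8 file contains

# Misread every utf-8 byte as one latin1 character, then encode again.
misread = utf8_bytes.decode("latin-1")
broken = misread.encode("utf-8")

# Every byte >= 0x80 turns into two bytes, so the file doubles in size here.
print(len(utf8_bytes), len(broken))     # 9 18

# This kind of damage is reversible by undoing the two steps in order.
recovered = broken.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == original)            # True
```

If a file broken this way can be repaired by the reverse steps shown at the end, that supports the latin1 hypothesis; if not, the conversion going wrong is a different one.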
> When 'encoding' is utf-16 (or ucs-2 or ucs-4) and 'fileencoding' is
> utf-8, Vim converts encoding wrongly.
>
> Steps to reproduce:
> :set enc=utf-16
> :e ++enc=utf-8 utf8-multi-byte-text.txt
> (At this time, file is properly loaded)
> :w
> Then, Vim breaks the file.
>
> It seems that Vim converts the internal utf-8 text from latin1 to utf-8.
>
> This problem doesn't occur when fenc is not utf-8.
I'll put it in the todo list. Using utf-16 for 'encoding' is rather
unusual, but we must not destroy the file contents even in rare
situations.
--
The goal of science is to build better mousetraps.
The goal of nature is to build better mice.
/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///
Well, keeping in mind that vim will use utf-8 internally even if you
explicitly :set enc=utf-16, maybe the best fix would be to always
change &encoding to 'utf-8' whenever doing a :set
encoding=SomethingUnicode? It seems like it would fix this bug. This
bug, as far as I can tell from a quick glance, is because vim tries to
convert from UTF-16 (&enc) to UTF-8 (&fenc) when writing the file, and
since the buffer is being internally stored as UTF-8 this is the wrong
thing to do.
~Matt
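A rough sketch of that diagnosis, assuming the write path simply reinterprets the buffer as the source encoding and re-encodes it. `convert_for_write` is a hypothetical helper, not a Vim function, and utf-16-le is picked only to keep the byte order fixed.

```python
# Hypothetical model of a write-time conversion; not Vim's actual code.
def convert_for_write(buffer_bytes: bytes, source: str, target: str) -> bytes:
    """Reinterpret buffer_bytes as `source` and re-encode the text as `target`."""
    return buffer_bytes.decode(source, errors="replace").encode(target)

text = "日本語"                    # sample text (assumption)
internal = text.encode("utf-8")   # buffer contents when utf-8 is used internally

# Using 'encoding' (utf-16) as the source misreads the utf-8 bytes.
wrong = convert_for_write(internal, "utf-16-le", "utf-8")

# Using the real internal encoding (utf-8) as the source round-trips cleanly.
right = convert_for_write(internal, "utf-8", "utf-8")

print(wrong == internal)   # False: the written file no longer matches the text
print(right == internal)   # True
```

The point is only that the source side of the conversion has to name what is actually in memory, whatever 'encoding' says.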
> On Mon, Jun 8, 2009 at 3:35 PM, Bram Moolenaar wrote:
> >
> > Yukihiro Nakadaira wrote:
> >
> >> When 'encoding' is utf-16 (or ucs-2 or ucs-4) and 'fileencoding' is
> >> utf-8, Vim converts encoding wrongly.
> >>
> >> Steps to reproduce:
> >> :set enc=utf-16
> >> :e ++enc=utf-8 utf8-multi-byte-text.txt
> >> (At this time, file is properly loaded)
> >> :w
> >> Then, Vim breaks the file.
> >>
> >> It seems that Vim converts the internal utf-8 text from latin1 to utf-8.
> >>
> >> This problem doesn't occur when fenc is not utf-8.
> >
> > I'll put it in the todo list. Using utf-16 for 'encoding' is rather
> > unusual, but we must not destroy the file contents even in rare
> > situations.
>
> Well, keeping in mind that vim will use utf-8 internally even if you
> explicitly :set enc=utf-16, maybe the best fix would be to always
> change &encoding to 'utf-8' whenever doing a :set
> encoding=SomethingUnicode? It seems like it would fix this bug. This
> bug, as far as I can tell from a quick glance, is because vim tries to
> convert from UTF-16 (&enc) to UTF-8 (&fenc) when writing the file, and
> since the buffer is being internally stored as UTF-8 this is the wrong
> thing to do.
The main reason one would set 'encoding' to utf-16 is when this should
be the default file format. On MS-Windows some files are utf-16, if you
are editing a whole bunch of them this could be useful (even though
using utf-8 should work).
I don't think finding one bug is a good reason to drop support for this.
It's probably easy to fix.
--
hundred-and-one symptoms of being an internet addict:
15. Your heart races faster and beats irregularly each time you see a new WWW
site address in print or on TV, even though you've never had heart
problems before.
Well, that's another thing that has never worked, then. When 'enc' is
'utf-16' and 'fenc' is unset, files are written out in utf-8, not
utf-16.
Simple testcase:
vim -u NONE -N --cmd 'set enc=utf-16 fenc= | exe "normal! i\<C-k>`e" | w !iconv -f utf-16' -c 'q!'
iconv: incomplete character or shift sequence at end of buffer
shell returned 1
Change the '-f utf-16' to '-f utf-8' and iconv confirms that it's being
passed valid utf-8.
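The same check can be sketched without iconv, assuming the digraph in the testcase inserts è (U+00E8) and that Vim writes it followed by a newline. `decodes_as` is a hypothetical helper name.

```python
# Assumed output of the testcase: "è" plus a newline, written as utf-8.
written = "è\n".encode("utf-8")   # b'\xc3\xa8\n', three bytes

def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a complete, valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(decodes_as(written, "utf-8"))    # True
print(decodes_as(written, "utf-16"))   # False: one byte is left over at the end
```

Three bytes can never be complete utf-16, since every utf-16 code unit is two bytes, which matches iconv's "incomplete character" complaint.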
Is the desired behavior even well defined? The docs seem to contradict each other;
:help 'encoding' says:
When "unicode", "ucs-2" or "ucs-4" is used, Vim internally uses utf-8.
but :help 'fileencoding' says:
When 'fileencoding' is empty, the same value as 'encoding' will be
used (no conversion when reading or writing a file).
In this case, 'fileencoding' is empty, but conversion *is* supposed to
occur when writing the file (from the internal utf-8 buffer to the
'encoding' utf-16).
~Matt
I'm not Bram, so take my opinions below with a grain of salt; however,
after attentively reading the Vim multibyte docs for years, I believe
that the "desired" (or at least the "least surprising") behaviour would be:
- If 'encoding' is one of ucs-2, ucs-2le, utf-16, utf-16le, ucs-4,
ucs-4le (or utf-32, utf-32le which are aliases for ucs-4 ucs-4le; or the
*be aliases for ucs-? utf-??), use utf-8 internally, but convert between
utf-8 and 'encoding' when reading and writing if 'fileencoding' is
empty. Vim ought to be able to do these conversions without calling
iconv, they are trivial (the "least trivial" of them, I think, is when
converting between UTF-16 surrogate pairs and UTF-8 representation for
codepoints in the range U+10000 - U+10FFFF, but even that is systematic,
and documented with no ambiguity somewhere on the Unicode site, and even
IIRC on the Wikipedia).
- With the same values of 'encoding', when 'fileencoding' is nonempty,
always pass UTF-8 to represent the "internal encoding" when invoking
iconv for reading or writing. The same of course applies when
"bypassing" iconv, e.g. when 'fileencoding' is latin1.
- With other values of 'encoding' (including utf-8), 'encoding'
represents the actual memory representation. This is the "general case"
and is what is documented wherever the Vim help doesn't explicitly
mention the opposite.
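The three rules above can be sketched as a small decision function. This is only my reading of the proposal, not Vim's code: the set name and both function names are hypothetical, and the *be entries are my expansion of "the *be aliases".

```python
from typing import Optional, Tuple

# 'encoding' values for which the buffer is held as utf-8 in memory
# (names taken from the list above; the *be entries are assumed expansions).
UNICODE_STORED_AS_UTF8 = {
    "ucs-2", "ucs-2le", "ucs-2be",
    "utf-16", "utf-16le", "utf-16be",
    "ucs-4", "ucs-4le", "ucs-4be",
    "utf-32", "utf-32le", "utf-32be",
}

def internal_encoding(enc: str) -> str:
    """Encoding of the text in memory for a given 'encoding' value."""
    enc = enc.lower()
    return "utf-8" if enc in UNICODE_STORED_AS_UTF8 else enc

def conversion_for_write(enc: str, fenc: str) -> Optional[Tuple[str, str]]:
    """Return (source, target) for writing a file, or None if no conversion."""
    source = internal_encoding(enc)
    # An empty 'fileencoding' means "same as 'encoding'".
    target = fenc.lower() if fenc else enc.lower()
    return None if source == target else (source, target)

print(conversion_for_write("utf-16", ""))        # ('utf-8', 'utf-16')
print(conversion_for_write("utf-16", "utf-8"))   # None
print(conversion_for_write("utf-16", "latin1"))  # ('utf-8', 'latin1')
print(conversion_for_write("latin1", ""))        # None
```

The second call is the case from the original report: the buffer is already utf-8, so nothing should be converted at all.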
Best regards,
Tony.
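For reference, a sketch of the surrogate-pair case mentioned above, using the standard UTF-16 and UTF-8 arithmetic for codepoints U+10000 - U+10FFFF. The function names are hypothetical and this is not taken from Vim's source.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def encode_utf8_4byte(cp: int) -> bytes:
    """Encode a codepoint in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint outside the supplementary planes")
    return bytes([
        0xF0 | (cp >> 18),            # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),           # 10xxxxxx: low 6 bits
    ])

cp = decode_surrogate_pair(0xD83D, 0xDE00)
print(hex(cp))                          # 0x1f600
print(encode_utf8_4byte(cp).hex())      # f09f9880
```

Codepoints below U+10000 need no surrogates, so they only go through the shorter one- to three-byte UTF-8 forms.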
--
Line Printer paper is strongest at the perforations.