[RFC] Default 'encoding' to UTF-8


James Vega

Mar 2, 2009, 7:40:34 PM
to Bram Moolenaar, vim...@googlegroups.com
With Vim's current behavior, 'encoding' is derived from the environment
and 'fileencoding'/'termencoding' derive from 'encoding' (modulo
'fileencodings' effect on 'fenc'). This seems sub-optimal for various
reasons.

1) Vim is using an internal encoding derived from the environment which
may or may not be able to represent the different file encodings
encountered when editing various files.
2) The encoding Vim uses for interpreting input from the user and
determining how to display to the user is not directly derived from
the user's environment.
3) File encoding detection ('fencs') defaults to a value that is
unlikely to correctly work with most interesting (non-ascii) files.

Defaulting 'enc' to UTF-8 helps address these problems.

1) This is now a non-issue as Vim can internally represent all
characters by converting them to their Unicode counterparts.
2) This can be addressed by making 'tenc' derive its value from the
environment instead of from 'enc', which is more in line with the
behavior implied by the name.
3) File encoding detection now has a sane default value which means new
users are less likely to encounter problems when editing files of
various encodings.
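
In vimrc terms, the proposal roughly amounts to defaults equivalent to
the following sketch (the exact 'fileencodings' value shown here is only
an illustration, not part of the proposal):

" Terminal I/O keeps using the locale-derived encoding, the internal
" encoding becomes UTF-8, and encoding detection tries a Unicode BOM,
" then UTF-8, then an 8-bit fallback.
let &termencoding = &encoding   " 'enc' was set from the locale at startup
set encoding=utf-8
set fileencodings=ucs-bom,utf-8,latin1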

This change would also allow eliminating 'encoding' as an option or,
less drastically, disallowing changes to 'enc' once the startup files have
been sourced.

Changing 'enc' in a running Vim session is a very common mistake among
new Vim users who are trying to get their file written out in a specific
encoding or to edit a file that's not in their environment's encoding.
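
What those users usually want is to change 'fenc' (or use ++enc), not
'enc'. For example, something along these lines usually does what they
are after:

" Write the current file out in a specific encoding:
:set fileencoding=utf-8
:w
" ... or, as a one-off:
:w ++enc=utf-8
" Re-read a file whose encoding was detected incorrectly:
:e ++enc=latin1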

The help already states that changing 'enc' in a running session is a
bad idea, and I know from experience that it can cause Vim to crash[0].
Taking the next logical step and preventing users from doing that
(unless someone can provide a compelling reason to continue allowing it)
makes sense and helps prevent potential data loss.

--
James
GPG Key: 1024D/61326D40 2003-09-02 James Vega <jame...@jamessan.com>

[0] vim -u NONE --cmd 'set enc=utf8 list' -c 'let &lcs="nbsp:".nr2char("8215")'
:put =nr2char("160")    (inserts a no-break space)
:set enc=latin1         (switching 'enc' now triggers the crash)


Tony Mechelynck

Mar 2, 2009, 9:32:45 PM
to Bram Moolenaar, vim...@googlegroups.com

I have the following remarks:

1) When using gvim with GTK2 GUI, setting 'encoding' to UTF-8 is the
preferred option, though not enforced. However in that case,
'termencoding' is fixed as UTF-8 (unchangeable) in the GUI. I wonder
whether it is possible to configure a GTK2 build with --disable-multibyte.
2) Vim compiled with the --disable-multibyte configure option cannot use
UTF-8, or any other multibyte encoding; in fact it doesn't even accept
the 'encoding' option as valid.
3) 'termencoding' (the encoding used for the keyboard and, in Console
mode, for the display) defaults to empty (which means, fall back to
'encoding') except when running in GUI mode with GTK2. This means that,
by default, communication between Vim and the user is done in the system
locale.
4) It _is_ possible to set 'encoding' to UTF-8 in the vimrc, with
appropriate safeguards, if used at the right spot in the "chronology" of
successive actions (and in particular, before defining mappings or
setting string option values including characters above 0x7F). On this
Linux box, my locale encoding is UTF-8, but that was not the case when I
acquired a serious interest in Vim: the latest version at the time was
some patchlevel of Vim 6.1 and I was using Win98. A compelling reason
for doing so would be a desire to create or edit files using characters
not supported by your system locale, for instance multi-charset files in
UTF-8 when the Windows locale is Windows-1252, as it was (IIRC) on that
W98 system mentioned above.
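
A minimal sketch of such a safeguarded fragment (the details are only
illustrative; what matters is that it runs before any mapping or string
option value containing bytes above 0x7F is defined):

if has("multi_byte")
  if &termencoding == ""
    " Remember the locale's encoding for terminal I/O before changing
    " Vim's internal encoding.
    let &termencoding = &encoding
  endif
  set encoding=utf-8
  " Declare how the rest of this vimrc file itself is encoded, so that
  " non-ASCII literals defined below it are converted correctly.
  scriptencoding utf-8
endif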

OTOH, changing the 'encoding' _after_ the end of startup, when you
already have one or more buffers loaded, is not something I would
recommend; it may lead to data loss or file corruption, depending on
how you do it. However, I believe that forbidding it by means of
something in the C code would probably be too harsh, and how would you
do it? It _is_ useful to test the value of 'encoding' at any time, or to
use the value to set something else (IOW, to use &encoding in an
expression), so the option should still exist after startup. I don't
think there is a precedent (is there?) for an option that can be
changed, but only until the last VimEnter autocommand (if any) terminates.


Best regards,
Tony.
--
BEDEVERE: Stand by for attack!!
[CUT TO enormous army forming up. Trebuchets, rows of PIKEMEN, siege
towers, pennants flying, shouts of "Stand by for attack!" Traditional
army build-up shots. The shouts echo across the ranks of the army.
We see various groups reacting, and stirring themselves in readiness.]
ARTHUR: Who are they?
BEDEVERE: Oh, just some friends!
"Monty Python and the Holy Grail" PYTHON (MONTY)
PICTURES LTD

James Vega

Mar 3, 2009, 12:40:27 AM
to vim...@googlegroups.com, Bram Moolenaar
On Tue, Mar 03, 2009 at 03:32:45AM +0100, Tony Mechelynck wrote:
>
> On 03/03/09 01:40, James Vega wrote:
> > ...

> > 3) File encoding detection ('fencs') defaults to a value that is
> > unlikely to correctly work with most interesting (non-ascii) files.
> >
> > Defaulting 'enc' to UTF-8 helps address these problems.
> >
> > ...

> > 3) File encoding detection now has a sane default value which means new
> > users are less likely to encounter problems when editing files of
> > various encodings.
> > ...

>
> 1) When using gvim with GTK2 GUI, setting 'encoding' to UTF-8 is the
> preferred option, though not enforced. However in that case,
> 'termencoding' is fixed as UTF-8 (unchangeable) in the GUI. I wonder
> whether it is possible to configure a GTK2 build with --disable-multibyte.

According to the help, "utf-8" hasn't been made the default for
'encoding' in GTK2 builds to prevent different behavior of the terminal
and GUI versions. Since supporting multibyte is pretty much standard on
any relatively recent OS, trending towards UTF-8 instead of the other
way around seems more logical.

> 2) Vim compiled with the --disable-multibyte configure option cannot use
> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
> the 'encoding' option as valid.

Is there a reason to allow building Vim without multibyte support?
Always having multibyte support would make the code simpler/smaller.

> 3) 'termencoding' (the encoding used for the keyboard and, in Console
> mode, for the display) defaults to empty (which means, fall back to
> 'encoding') except when running in GUI mode with GTK2. This means that,
> by default, communication between Vim and the user is done in the system
> locale.

Unless 'encoding' is set in the user's ~/.vimrc, which in my experience is
pretty common. I'm not sure how closely that aligns with the overall usage
patterns, though.

> 4) It _is_ possible to set 'encoding' to UTF-8 in the vimrc, with
> appropriate safeguards, if used at the right spot in the "chronology" of
> successive actions (and in particular, before defining mappings or
> setting string option values including characters above 0x7F).

As per my response to your previous point, that means 'termencoding' is
less likely to be based on the user's locale, even though it should
always be.

> On this Linux box, my locale encoding is UTF-8, but that was not the
> case when I acquired a serious interest in Vim: the latest version at
> the time was some patchlevel of Vim 6.1 and I was using Win98. A
> compelling reason for doing so would be a desire to create or edit
> files using characters not supported by your system locale, for
> instance multi-charset files in UTF-8 when the Windows locale is
> Windows-1252, as it was (IIRC) on that W98 system mentioned above.

Right, point 3 from my initial mail.

> OTOH, changing the 'encoding' _after_ the end of startup, when you
> already have one or more buffers loaded, is not something I would
> recommend; it may lead to dataloss or file data corruption, depending on
> how you do it.

Exactly.

> However, I believe that forbidding it by means of something in the C
> code would probably be too harsh, and how would you do it? It _is_
> useful to test the value of 'encoding' at any time, or to use the
> value to set something else (IOW, to use &encoding in an expression),
> so the option should still exist after startup.

I'm not suggesting removing read access to the option. I'm purely
suggesting that write access be disabled after the startup scripts are
sourced. Making this change to the source would be fairly trivial,
especially if support for using :lockvar on options were implemented.

> I don't think there is a precedent (is there?) for an option that can
> be changed, but only until the last VimEnter autocommand (if any)
> terminates.

No, there isn't yet but 'encoding' seems like a good one to set the
precedent.


Tony Mechelynck

Mar 3, 2009, 1:52:37 AM
to vim...@googlegroups.com, Bram Moolenaar
On 03/03/09 06:40, James Vega wrote:
> On Tue, Mar 03, 2009 at 03:32:45AM +0100, Tony Mechelynck wrote:
>> On 03/03/09 01:40, James Vega wrote:
>>> ...
>>> 3) File encoding detection ('fencs') defaults to a value that is
>>> unlikely to correctly work with most interesting (non-ascii) files.
>>>
>>> Defaulting 'enc' to UTF-8 helps address these problems.
>>>
>>> ...
>>> 3) File encoding detection now has a sane default value which means new
>>> users are less likely to encounter problems when editing files of
>>> various encodings.
>>> ...
>> 1) When using gvim with GTK2 GUI, setting 'encoding' to UTF-8 is the
>> preferred option, though not enforced. However in that case,
>> 'termencoding' is fixed as UTF-8 (unchangeable) in the GUI. I wonder
>> whether it is possible to configure a GTK2 build with --disable-multibyte.
>
> According to the help, "utf-8" hasn't been made the default for
> 'encoding' in GTK2 builds to prevent different behavior of the terminal
> and GUI versions. Since supporting multibyte is pretty much standard on
> any relatively recent OS, trending towards UTF-8 instead of the other
> way around seems more logical.

UTF-8 support is pretty much standard on any recent Unix-like OS, though
its use by default is not necessarily universal. I don't know about
Vista, but on XP the default was _not_ to have UTF-8 as the system
default encoding.

>
>> 2) Vim compiled with the --disable-multibyte configure option cannot use
>> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
>> the 'encoding' option as valid.
>
> Is there a reason to allow building Vim without multibyte support?
> Always having multibyte support would make the code simpler/smaller.

A build with +multi_byte is always bigger than one with -multi_byte: one reason could be
making the Vim binary really "lean and mean". Personally I keep two Vim
builds on this computer: a Huge build named vim, with GTK2/Gnome2 GUI
(and +multi_byte), used via softlinks for most possible executable
names, and a Tiny build named vi (with no GUI and -multi_byte).

>
>> 3) 'termencoding' (the encoding used for the keyboard and, in Console
>> mode, for the display) defaults to empty (which means, fall back to
>> 'encoding') except when running in GUI mode with GTK2. This means that,
>> by default, communication between Vim and the user is done in the system
>> locale.
>
> Unless 'encoding' is set in the user's ~/.vimrc, which in my experience is
> pretty common. I'm not sure how closely that aligns with the overall usage
> patterns, though.

I recommend it for users who need or want to use various encodings, and
possibly plurilingual files mixing them. Users with simpler needs may
quite validly leave 'encoding' at whatever their OS locale sets, and
never stray away from it.

>
>> 4) It _is_ possible to set 'encoding' to UTF-8 in the vimrc, with
>> appropriate safeguards, if used at the right spot in the "chronology" of
>> successive actions (and in particular, before defining mappings or
>> setting string option values including characters above 0x7F).
>
> As per my response to your previous point, 'termencoding' is less likely to
> be based on their locale even though it should always be based on their
> locale.
>
>> On this Linux box, my locale encoding is UTF-8, but that was not the
>> case when I acquired a serious interest in Vim: the latest version at
>> the time was some patchlevel of Vim 6.1 and I was using Win98. A
>> compelling reason for doing so would be a desire to create or edit
>> files using characters not supported by your system locale, for
>> instance multi-charset files in UTF-8 when the Windows locale is
>> Windows-1252, as it was (IIRC) on that W98 system mentioned above.
>
> Right, point 3 from my initial mail.
>
>> OTOH, changing the 'encoding' _after_ the end of startup, when you
>> already have one or more buffers loaded, is not something I would
>> recommend; it may lead to dataloss or file data corruption, depending on
>> how you do it.
>
> Exactly.
>
>> However, I believe that forbidding it by means of something in the C
>> code would probably be too harsh, and how would you do it? It _is_
>> useful to test the value of 'encoding' at any time, or to use the
>> value to set something else (IOW, to use &encoding in an expression),
>> so the option should still exist after startup.
>
> I'm not suggesting removing read access to the option. I'm purely
> suggesting that write access is disabled after the startup scripts are
> sourced. Making this change to the source would be fairly trivial,
> especially if support for using :lockvar on options were implemented.
>
>> I don't think there is a precedent (is there?) for an option that can
>> be changed, but only until the last VimEnter autocommand (if any)
>> terminates.
>
> No, there isn't yet but 'encoding' seems like a good one to set the
> precedent.
>

Hm, to use one of your earlier arguments, it might make the code more
complex, and thus add some bloat and possibly some bugs, where the
present code cannot really be said to be malfunctioning. "If it ain'
broke, don' fix it."


Best regards,
Tony.
--
Why is "abbreviation" such a long word?

Dennis Benzinger

Mar 3, 2009, 1:54:34 AM
to vim...@googlegroups.com
Hi!

On 03.03.2009 06:40, James Vega wrote:
> [...]


>> 2) Vim compiled with the --disable-multibyte configure option cannot use
>> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
>> the 'encoding' option as valid.
>
> Is there a reason to allow building Vim without multibyte support?
> Always having multibyte support would make the code simpler/smaller.

It would make the code smaller but compiling without multibyte support
probably makes the resulting binary smaller. That can make a big
difference for users on resource constrained systems.

>> 3) 'termencoding' (the encoding used for the keyboard and, in Console
>> mode, for the display) defaults to empty (which means, fall back to
>> 'encoding') except when running in GUI mode with GTK2. This means that,
>> by default, communication between Vim and the user is done in the system
>> locale.
>
> Unless 'encoding' is set in the user's ~/.vimrc, which in my experience is
> pretty common. I'm not sure how closely that aligns with the overall usage
> patterns, though.

> [...]

FWIW, I don't explicitly set it in my .vimrc. My Ubuntu (8.10) system
uses a UTF-8 locale and Vim detects it. Because this just works, I
suppose it's not that common to set it explicitly.


Dennis Benzinger

Markus Heidelberg

Mar 3, 2009, 5:14:25 AM
to vim...@googlegroups.com
Dennis Benzinger, 03.03.2009:

>
> Hi!
>
> On 03.03.2009 06:40, James Vega wrote:
> > [...]
> >> 2) Vim compiled with the --disable-multibyte configure option cannot use
> >> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
> >> the 'encoding' option as valid.
> >
> > Is there a reason to allow building Vim without multibyte support?
> > Always having multibyte support would make the code simpler/smaller.
>
> It would make the code smaller but compiling without multibyte support
> probably makes the resulting binary smaller. That can make a big
> difference for users on resource constrained systems.

What do you mean exactly by "resource constrained systems"?
On an old PC, Vim with multibyte should still run fast.
On embedded devices people normally use vi from the busybox package.
Development is not done on these devices, mostly just editing config
files. No need for a featureful editor like Vim.

But now that multibyte support is optional and people are using versions
without it, it should of course not be thrown out unnecessarily.

Markus

Markus Heidelberg

Mar 3, 2009, 5:20:45 AM
to vim...@googlegroups.com
Tony Mechelynck, 03.03.2009:

>
> On 03/03/09 06:40, James Vega wrote:
> > On Tue, Mar 03, 2009 at 03:32:45AM +0100, Tony Mechelynck wrote:
> >> 2) Vim compiled with the --disable-multibyte configure option cannot use
> >> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
> >> the 'encoding' option as valid.
> >
> > Is there a reason to allow building Vim without multibyte support?
> > Always having multibyte support would make the code simpler/smaller.
>
> With +multi_byte is always bigger than -multi_byte: one reason could be
> making the Vim binary really "lean and mean". Personally I keep two Vim
> builds on this computer: a Huge build named vim, with GTK2/Gnome2 GUI
> (and +multi_byte), used via softlinks for most possible executable
> names, and a Tiny build named vi (with no GUI and -multi_byte).

Why the tiny build without multibyte? Is it only a fallback in case of
system problems, when root has to edit config files that you know don't
contain multibyte characters?

Markus

Dennis Benzinger

Mar 3, 2009, 7:12:36 AM
to vim...@googlegroups.com
Hi Markus!

On 03.03.2009 11:14, Markus Heidelberg wrote:
> Dennis Benzinger, 03.03.2009:
>>
>> Hi!
>>
>> On 03.03.2009 06:40, James Vega wrote:
>> > [...]
>> >> 2) Vim compiled with the --disable-multibyte configure option cannot use
>> >> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
>> >> the 'encoding' option as valid.
>> >
>> > Is there a reason to allow building Vim without multibyte support?
>> > Always having multibyte support would make the code simpler/smaller.
>>
>> It would make the code smaller but compiling without multibyte support
>> probably makes the resulting binary smaller. That can make a big
>> difference for users on resource constrained systems.
>
> What do you mean exactly with "resource constrained systems"?
> On an old PC, Vim with multibyte should still run fast.

> [...]

I meant systems which have or can use only a small amount of memory. For
example, (16-bit) MS-DOS, where you can only use 640KB. These systems may
be rare nowadays, but if you encounter one you'd probably be happy to
be able to minimize the size of the binary. But I didn't try out how
much the size differs between a multibyte and a non-multibyte build.
Therefore I wrote "_probably_ makes the resulting binary smaller" ;-)

So if ripping out non-multibyte support does not make the code much
simpler or smaller, I'd simply keep it. Do you have any idea how much
simpler or smaller the code would be?


Dennis Benzinger

Markus Heidelberg

Mar 3, 2009, 8:57:13 PM
to vim...@googlegroups.com
Dennis Benzinger, 03.03.2009:

>
> Hi Markus!
>
> On 03.03.2009 11:14, Markus Heidelberg wrote:
> > Dennis Benzinger, 03.03.2009:
> >>
> >> Hi!
> >>
> >> On 03.03.2009 06:40, James Vega wrote:
> >> > [...]
> >> >> 2) Vim compiled with the --disable-multibyte configure option cannot use
> >> >> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
> >> >> the 'encoding' option as valid.
> >> >
> >> > Is there a reason to allow building Vim without multibyte support?
> >> > Always having multibyte support would make the code simpler/smaller.
> >>
> >> It would make the code smaller but compiling without multibyte support
> >> probably makes the resulting binary smaller. That can make a big
> >> difference for users on resource constrained systems.
> >
> > What do you mean exactly with "resource constrained systems"?
> > On an old PC, Vim with multibyte should still run fast.
> > [...]
>
> I meant systems which have or can use only a small amount of memory. For
> example (16bit) MS-DOS where you can only use 640KB. These systems may
> be rare nowadays but if you'll encounter one you'd probably be happy to
> be able to minimize the size of the binary. But I didn't try it out how
> much the size differs between a multibyte and a non-multibyte build.
> Therefore I wrote "_probably_ makes the resulting binary smaller" ;-)

No, that's for sure :)

> So if ripping out non-multibyte support does not make the code much
> simpler or smaller I'd simply keep it. Do you have any idea much simpler
> or smaller the code would be?

Not sure, a lot of #ifdef would vanish.

Markus

Tony Mechelynck

Mar 3, 2009, 9:48:53 PM
to vim...@googlegroups.com

That, and also a "sanity check" that the latest patches also work with a
minimal config, so if they don't I can warn Bram immediately. Once I was
very happy to have it, in order to be able to intervene halfway through a
system install run, when my Huge GTK2/Gnome2 build wouldn't load because
of missing libraries.


Best regards,
Tony.
--
"Even nowadays a man can't step up and kill a woman without feeling
just a bit unchivalrous ..."
-- Robert Benchley

Tony Mechelynck

Mar 3, 2009, 10:11:21 PM
to vim...@googlegroups.com

I did try:
- vim (gvim with all bells and whistles except +mzscheme) 3370388 bytes.
- vi (vim with minimal features) 508048 bytes
6.63 times smaller

Both compiled on the same Linux-i686 system with the same 7.2.130
sources (but different config options), and both binaries "stripped" of
their debug info. The difference consists not only of +multi_byte but of
everything which I knew how to enable/disable at compile-time. These are
32-bit binaries; I suspect 16-bit builds would be smaller -- hopefully
they would, because 508k is still big for a Dos machine without Extended
Memory.


Best regards,
Tony.
--
Remember, UNIX spelled backwards is XINU.

Tony Mechelynck

Mar 3, 2009, 10:17:53 PM
to vim...@googlegroups.com
On 04/03/09 02:57, Markus Heidelberg wrote:
> Dennis Benzinger, 03.03.2009:
[...]

>> So if ripping out non-multibyte support does not make the code much
>> simpler or smaller I'd simply keep it. Do you have any idea much simpler
>> or smaller the code would be?
>
> Not sure, a lot of #ifdef would vanish.
>
> Markus

Making the source smaller and simpler, but not the object code, since
"false #ifdef" sections are removed by the preprocessor before the C
code is parsed.

Best regards,
Tony.
--
The shortest distance between two points is under construction.
-- Noelie Alito

James Vega

Mar 4, 2009, 1:27:29 AM
to vim...@googlegroups.com
On Tue, Mar 03, 2009 at 01:12:36PM +0100, Dennis Benzinger wrote:
>
> Hi Markus!
>
> On 03.03.2009 11:14, Markus Heidelberg wrote:
> > Dennis Benzinger, 03.03.2009:
> >>
> >> Hi!
> >>
> >> On 03.03.2009 06:40, James Vega wrote:
> >> > [...]
> >> >> 2) Vim compiled with the --disable-multibyte configure option cannot use
> >> >> UTF-8, or any other multibyte encoding; in fact it doesn't even accept
> >> >> the 'encoding' option as valid.
> >> >
> >> > Is there a reason to allow building Vim without multibyte support?
> >> > Always having multibyte support would make the code simpler/smaller.
> >>
> >> It would make the code smaller but compiling without multibyte support
> >> probably makes the resulting binary smaller. That can make a big
> >> difference for users on resource constrained systems.
> >
> > What do you mean exactly with "resource constrained systems"?
> > On an old PC, Vim with multibyte should still run fast.
> > [...]
>
> I meant systems which have or can use only a small amount of memory. For
> example (16bit) MS-DOS where you can only use 640KB. These systems may
> be rare nowadays but if you'll encounter one you'd probably be happy to
> be able to minimize the size of the binary.

Indeed, but there are currently checks that prevent Vim from building
with multibyte support on such systems (ints that are smaller than 32
bit). I guess supporting such OSes would be a reason not to disallow
building without multibyte entirely.

That does raise the question of where the trade-off lies between keeping
legacy, mostly unused code and dropping support for it.

> But I didn't try it out how
> much the size differs between a multibyte and a non-multibyte build.
> Therefore I wrote "_probably_ makes the resulting binary smaller" ;-)
>
> So if ripping out non-multibyte support does not make the code much
> simpler or smaller I'd simply keep it. Do you have any idea much simpler
> or smaller the code would be?

Well, since supporting 16bit systems is still desirable, there'd be no
change in code size.

Just for the sake of argument, though, it would remove 933
'#ifdef FEAT_MBYTE' (or equivalent) conditional parts of the code and 4
'#ifndef FEAT_MBYTE' (or equivalent). How many of the #ifdef scenarios
have a paired #else would require more investigation than I'm willing to
do for the sake of argument. :)

As for the resulting binary sizes:

features=tiny, with multibyte: 560.9k
features=tiny, w/out multibyte: 493.4k
67k or 12% saving

features=small, with multibyte: 618.7k
features=small, w/out multibyte: 551.1k
67k or 11% saving

features=normal, with multibyte: 1390.3k
features=normal, w/out multibyte: 1279.0k
111k or 8% saving


James Vega

Mar 4, 2009, 2:24:03 AM
to vim...@googlegroups.com
On Wed, Mar 04, 2009 at 01:27:29AM -0500, James Vega wrote:
> On Tue, Mar 03, 2009 at 01:12:36PM +0100, Dennis Benzinger wrote:
> > I meant systems which have or can use only a small amount of memory. For
> > example (16bit) MS-DOS where you can only use 640KB. These systems may
> > be rare nowadays but if you'll encounter one you'd probably be happy to
> > be able to minimize the size of the binary.
>
> Indeed, but there are currently checks that prevent Vim from building
> with multibyte support on such systems (ints that are smaller than 32
> bit). I guess supporting such OSes would be a reason not to disallow
> building without multibyte entirely.
>
> That does raise the question of where the trade-off between keeping
> legacy, mostly unused code versus dropping support occurs.

Actually, according to <http://www.vim.org/download.php>, the 16-bit DOS
executable stopped being provided as of Vim 7.2 because 7.2 was too
large for DOS' memory model.

> > But I didn't try it out how
> > much the size differs between a multibyte and a non-multibyte build.
> > Therefore I wrote "_probably_ makes the resulting binary smaller" ;-)
> >
> > So if ripping out non-multibyte support does not make the code much
> > simpler or smaller I'd simply keep it. Do you have any idea much simpler
> > or smaller the code would be?
>
> Well, since supporting 16bit systems is still desirable, there'd be no
> change in code size.

Since 16-bit DOS is out of the picture, are there any other supported
OSes which *don't* have 32-bit integers? If so, that changes the weight
behind supporting the ability to build Vim without multibyte support.

Of course, this whole tangent is just about speculative advantages to
only supporting multibyte-capable Vim builds.

The primary point of my original post is still to determine whether
there are any impediments preventing Vim from using UTF-8 for the
default 'encoding' and determining 'termencoding' from the user's
locale. Anything else that would happen because of that is just icing
on the cake.


Tony Mechelynck

Mar 4, 2009, 9:22:34 PM
to vim...@googlegroups.com

I don't know how large integers are in zOS (with EBCDIC); I guess large
enough, since this is a Unix-like OS (but not Linux) for IBM mainframes.
However, according to the latest os_390.txt (under |zOS-weaknesses|),
that port of Vim has no multibyte support. The zOS port is apparently a
port made by IBM software engineers in their spare time, "just for fun
because they liked Vim", and I don't know how active it might still be.
Bram might know, but don't ask IBM.

Best regards,
Tony.
--
Famous, adj.:
Conspicuously miserable.
-- Ambrose Bierce

Antonio Colombo

Mar 5, 2009, 5:19:57 AM
to vim...@googlegroups.com
Hi everybody,

> I don't know how large integers are in zOS (with EBCDIC), I
> guess large
> enough, since this is a Unix-like OS (but not Linux) for IBM
> mainframes,

zOS has 32-bit and 64-bit integers. It never really had 16-bit integers
(going back to 1964 or 1965). You could use them, but the hardware
registers have always been 32 bits long, so the related 16-bit hardware
instructions just blanked out the leftmost part of those registers.

zOS itself is NOT Unix-like, but the underlying architecture
can support Linux as well. I think we are speaking here of the
mainframe part of zOS, which can support a kind of Unix, more
or less in the same way Cygwin is supported under Windows.

Cheers, Antonio
--
/||\ | Antonio Colombo
/ || \ | ant...@geekcorp.com
/ () \ | azc...@gmail.com
(___||___) | az...@yahoo.com

Matt Wozniski

Mar 13, 2009, 11:36:23 AM
to vim...@googlegroups.com, Bram Moolenaar

Yeah. We regularly see people in #vim who don't realize that they
should be changing 'fenc' instead of 'enc', and I've seen it come up
on vim-use a few times as well...

> The help already states that changing 'enc' in a running session is a
> bad idea, and I know from experience that it can cause Vim to crash[0].
> Taking the next logical step and preventing users from doing that
> (unless someone can provide a compelling reason to continue allowing it)
> makes sense and helps prevent potential data loss.

This sounds like a very good idea to me. I don't know of any other
programs that allow you to change the encoding used internally, and we
would be in good company if we chose to always use a Unicode encoding
internally: Java uses UTF-16 internally, and I believe Python does as
well. Is there any time when it would be desirable to use a
non-Unicode 'encoding' (assuming, of course, that +multi_byte is
available)? I can't think of any.

~Matt

Mike Williams

Mar 13, 2009, 12:01:39 PM
to vim...@googlegroups.com, Bram Moolenaar

Yes, editing very large (say a few 100MB) data files that are in a
single-byte encoding. For my day job I regularly enjoy having to spelunk my
way around large files containing a mix of readable ASCII and binary
data. Using a Unicode encoding could make this prohibitive. Yes, this
is essentially a raw file edit mode, perhaps that should be an option -
or would it be part of setting binary mode?

TTFN

Mike
--
I am not young enough to know everything.

Matt Wozniski

Mar 13, 2009, 12:10:00 PM
to vim...@googlegroups.com
On Fri, Mar 13, 2009 at 12:01 PM, Mike Williams wrote:

>
> Matt Wozniski wrote:
>> This sounds like a very good idea to me.  I don't know of any other
>> programs that allow you to change encoding used internally, and we
>> would be in good company if we chose to always use a unicode encoding
>> internally: Java uses UTF-16 internally, and I believe python does as
>> well.  Is there any time when it would be desirable to use a
>> non-unicode 'encoding' (assuming, of course, that +multi_byte is
>> available)?  I can't think of any.
>
> Yes, editing very large (say a few 100MB) data files that in a single
> byte encoding.  For my day job I regularly enjoy having to spelunk my
> way around large files containing a mix of readable ASCII and binary
> data.  Using a Unicode encoding could make this prohibitive.  Yes, this
> is essentially a raw file edit mode, perhaps that should be an option -
> or would it be part of setting binary mode?

How would using Unicode for 'enc' in any way affect this? Sure, you'd
want to use a single-byte 'fenc', but no one is suggesting that the
'fenc' option should be removed. If there is a reason why editing
binary files should be affected at all by what encoding the editor
uses for storing the buffer text internally, I don't see it and you'll
need to elaborate.

~Matt

Mike Williams

Mar 13, 2009, 12:22:51 PM
to vim...@googlegroups.com

With a UTF-16 internal encoding a 250MB data file blossoms into a nice
round 500MB. For all the cheap memory these days this will still have
an effect on system performance - time to allocate, paging out of idle
apps to disk, etc.

And will VIM internally use a canonical Unicode form? What happens if I
want to insert some 8-bit data whose Unicode character has multiple
forms? Which one is used? How will I know that the 8-bit value I
intend does not appear as a composed sequence? I haven't used VIM for
editing Unicode with composing characters (damn my native English
country) - I see there is some discussion on composing, but at first
glance it is not clear whether it is automatic or not. In my case I
would not want deletion of a data byte to result in other bytes being
deleted as well.

At the moment I cannot see how supporting Unicode semantics maps to
editing binary data files. Not saying it is impossible, I'd just like
to see the possible way out of the woods if we did go this way.

TTFN

Mike
--
Imagination is more important than knowledge.

Tony Mechelynck

Mar 14, 2009, 8:10:12 AM
to vim...@googlegroups.com

Vim doesn't use UTF-16 internally but UTF-8 -- even if you set
'encoding' to, let's say, utf-16le, because Vim cannot tolerate actual
nulls in the middle of lines. This also means there is no space loss for
7-bit ASCII, which is represented identically in ASCII, Latin1, UTF-8,
and indeed also in most iso-8859 encodings.
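
For a quick sanity check of that internal representation (strlen()
counts bytes as Vim stores them), assuming 'encoding' is already utf-8:

" Plain ASCII still takes one byte per character, while a Latin1
" character above 0x7F is stored as a two-byte UTF-8 sequence:
echo strlen("a")
" echoes 1
echo strlen(nr2char(0xE1))
" echoes 2 (a-acute, U+00E1)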

>
> And will VIM internally use a canonical Unicode form? What happens if I
> want to insert some 8-bit data whose unicode character has multiple
> forms? Which one is used? How will I know that the 8-bit value I
> intend does not appear as composed sequence? I haven't used VIM for
> editing unicode with composing characters (damn my native english
> country) - I see there is some discussion on composing but a first
> glance it is not clear whether it is automatic or not. In my case I
> would not want deletion of data byte to result in other bytes to deleted
> as well.
>
> At the moment I cannot see how supporting Unicode semantics maps to
> editing binary data files. Not saying it is impossible, I'd just like
> to see the possible way out of the woods if we did go this way.
>
> TTFN
>
> Mike

IMHO, binary data should be read "as if" 8-bit because in an 8-bit
'fileencoding' there are no "invalid" byte sequences -- and probably
Latin1 because the conversion Latin1 <=> UTF-8 is trivial and requires
no iconv library. An alternate possibility (but to be used only at
user's explicit request IMHO) is to convert binary to hex and vice-versa
via xxd.

However, this is not what Vim does if you read a file with ++bin: what
it does is "no conversion", which means that if 'encoding' is set to
UTF-8 you'll probably get invalid UTF-8 sequences at many places in your
code. For instance an a-acute in Spanish Latin1 text will appear as <e1>
instead of á and an e-circumflex in French Latin1 text will appear as
<ea> instead of ê. Not very convenient if they happen to be within text
strings -- messages, maybe, to be typed out on the screen. So even if
you know that the code is binary you might prefer to use

:e ++enc=latin1 ++ff=unix foobar.bin

and omit the 'binary' setting. The result, if you make changes and save
them, could be an extra 0x0A at the very end if there wasn't one
already, but I don't expect trouble even if it happens. (Overlong lines
might be split if you were on a 16-bit machine, but on 32-bit machines
the maximum line size and the maximum file size are both 2GB, and even
on a 64-bit machine I don't expect you'll often have to edit a binary
file containing a 2GB stretch of code without a single 0x0A in it.)

Of course, the utmost care should be used when editing binary files
because, if it is e.g. program code,
- the code can contain displacements in binary, which will become
invalid if the length of the intervening text is modified
- executable code should in general not be touched
- compressed binaries are probably not editable in any way
- and what if the program includes a binary hash of its ASCII text
somewhere?

As for canonical forms: I don't think Vim will spontaneously convert
either way between a spacing character + combining character(s) combo
and a precomposed character. If you type a then Ctrl-v u 0301 you'll get
a spacing a and a combining acute. If your keyboard allows "keying" an
a-acute character, or if you type Ctrl-V x e1, you'll get a precomposed
a-acute. The two results will be indistinguishable if you have a "good"
font but Vim doesn't know that, and searching for the precomposed
character will not match the ASCII + accent two-codepoint combo.
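
A quick way to see that the two forms really are different to Vim (a
sketch building each form with nr2char()):

" Precomposed a-acute vs. 'a' followed by a combining acute accent:
echo nr2char(0xE1) ==# "a" . nr2char(0x301)
" echoes 0: the strings are different byte sequences, so a search for
" one form will not find the other.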


Best regards,
Tony.
--
"Can you hammer a 6-inch spike into a wooden plank with your penis?"

"Uh, not right now."

"Tsk. A girl has to have some standards."
-- "Real Genius"
