[patch] Add an option to specify filename encoding.


kiku...@uranus.dti.ne.jp

Oct 1, 2010, 12:39:56 AM
to vim...@googlegroups.com
Hello.

I'm currently using Vim with encoding=utf8 and termencoding=euc-jp,
but I cannot open a file that has an EUC-JP encoded filename by using :e.

Then I realized that Vim should have an option to specify filename encoding.
What do you think about this?


I attached a quick hacky patch that adds a 'systemencoding' option to
specify the filename encoding used on the filesystem.

The option is also used to filter some commands before passing them to a shell,
so that ":!echo '(mbyte words)' > (mbyte filename).txt" works correctly.

This is just PoC code. All I need is filename handling that works naturally,
without having to think about it.

The current implementation of the patch adds a thin layer that converts
filenames between 'encoding' and 'systemencoding'.
Internally, it holds all filenames in their converted 'encoding' form, instead
of as raw filenames.
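
The layer described above can be sketched roughly as follows. This is a minimal Python illustration, not the patch's C code: the names `convert`, `fs_to_internal` and `internal_to_fs` are hypothetical, and Python's codecs stand in for whatever conversion routine Vim would actually use. An empty 'systemencoding' is assumed to mean that no conversion takes place.

```python
def convert(name: bytes, from_enc: str, to_enc: str) -> bytes:
    """Re-encode a filename; raises UnicodeError if the bytes do not fit."""
    return name.decode(from_enc).encode(to_enc)


def fs_to_internal(raw: bytes, encoding: str, systemencoding: str = "") -> bytes:
    """Filesystem bytes -> bytes in 'encoding' (hypothetical helper)."""
    if not systemencoding:  # option unset: leave the name untouched
        return raw
    return convert(raw, systemencoding, encoding)


def internal_to_fs(name: bytes, encoding: str, systemencoding: str = "") -> bytes:
    """Bytes in 'encoding' -> filesystem bytes (hypothetical helper)."""
    if not systemencoding:  # option unset: leave the name untouched
        return name
    return convert(name, encoding, systemencoding)
```

With `encoding="utf-8"` and `systemencoding="euc-jp"`, a name read from the filesystem goes through `fs_to_internal` before it is stored, and back through `internal_to_fs` before any real file operation.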

I thought this was a bad idea because it loses the original information,
but it was easy to implement.

So this patch causes some problems, for example:
* When you change 'systemencoding' at runtime, you will get into trouble with
swap files.
* When you pass a sequence of bytes that is invalid for 'systemencoding' as a
filename, you will get unexpected results, because the filename
conversion fails.
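
The second problem can be illustrated with a short Python sketch. It is only an illustration under assumptions: `convert_or_none` and `lossy_roundtrip` are hypothetical names, and Python's codecs stand in for the real conversion. Either the conversion fails outright, or a replacement strategy silently produces a name that no longer matches the bytes on disk.

```python
from typing import Optional


def convert_or_none(raw: bytes, systemencoding: str,
                    encoding: str = "utf-8") -> Optional[bytes]:
    """Strict conversion: return None when raw is invalid for systemencoding."""
    try:
        return raw.decode(systemencoding).encode(encoding)
    except UnicodeError:
        return None


def lossy_roundtrip(raw: bytes, systemencoding: str) -> bytes:
    """Decode with replacement characters, then encode back again.

    For invalid input the result differs from raw, so the original
    name on disk can no longer be reached through the converted one.
    """
    text = raw.decode(systemencoding, errors="replace")
    return text.encode(systemencoding, errors="replace")
```

For instance, the single byte 0xA9 is not valid UTF-8, so `convert_or_none(b"\xa9.txt", "utf-8")` fails, and `lossy_roundtrip(b"\xa9.txt", "utf-8")` returns different bytes than it was given.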

Maybe we should add a new field to buf_T that holds the multibyte filename
shown to the user, and keep the original filenames untouched for actual reads
and writes.
# but I think this will be hard to implement.
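
As a rough picture of that alternative, here is a small Python sketch. It is not Vim code: `BufferName` and `make_buffer_name` are hypothetical names standing in for an extra field in buf_T, and escaping undecodable bytes for display is just one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferName:
    raw: bytes     # exact bytes from the filesystem, used for real I/O
    display: str   # decoded form, only for showing to the user


def make_buffer_name(raw: bytes, systemencoding: str) -> BufferName:
    """Keep the raw name untouched and derive a display name from it."""
    # Undecodable bytes become visible escapes instead of raising,
    # so an invalid name can still be opened through the raw field.
    display = raw.decode(systemencoding, errors="backslashreplace")
    return BufferName(raw=raw, display=display)
```

Because `raw` is never rewritten, changing the encoding later would only require recomputing `display`.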

I have no idea which way is best, so I made this patch to see what happens.

I repeat: this is just PoC code.
The patch has not been tested enough yet, and it may be wrong.
Moreover, I use neither gvim nor Windows; I tested only on FreeBSD.
And I'm not familiar with the Vim source code, so I hope someone can fix it.

Best regards,
Kikuchan

vim-systemencoding.patch

kiku...@uranus.dti.ne.jp

Oct 5, 2010, 12:32:10 PM
to vim...@googlegroups.com
Hi, All.
# I changed the subject.
# (it was: [patch] Add an option to specify filename encoding.)

I'm very sad because nobody has replied to this topic... ;p
This is not only a problem for CJK people.

You can't handle filenames properly in Vim
when the filename encoding is different from 'encoding'.
# at least on Unix-like systems.

This problem affects people who use latin1 (or whatever) encoding
for filenames and UTF-8 for Vim's internal 'encoding'.

... Or does no one use a non-Unicode encoding for filenames in the 21st century?


The patch I posted here before adds a new feature that converts filenames
between the filename encoding on the filesystem and the internal 'encoding'.

I'm not sure, but this problem may not happen on Windows,
because I found a lot of Unicode stuff in os_win32.c,
which maybe does the same thing in a different way (Win32 wide APIs).

The feature should convert filenames only when 'systemencoding' is set;
if 'systemencoding' is not set, it should do nothing, as usual.

So, if people don't use non-Unicode encodings for filenames anymore,
the feature doesn't affect them.

What do you think about this feature / problem, everyone?

Bram, could you add this *feature* to Vim if you think it's good, please?
# of course, you don't have to use the patch.
And let me know if there's something I can do.

Thanks in advance.

Best Regards,
Kikuchan

Benjamin R. Haskell

Oct 5, 2010, 1:44:49 PM
to vim...@googlegroups.com
On Wed, 6 Oct 2010, kiku...@uranus.dti.ne.jp wrote:

> Hi, All.
> # I changed the subject.
> # (it was: [patch] Add an option to specify filename encoding.)
>
> I'm very sad because nobody has replied to this topic... ;p
> This is not only a problem for CJK people.
>
> You can't handle filenames properly in Vim when the filename encoding
> is different from 'encoding'.
> # at least on Unix-like systems.
>
> This problem affects people who use latin1 (or whatever) encoding
> for filenames and UTF-8 for Vim's internal 'encoding'.
>
> ... Or does no one use a non-Unicode encoding for filenames in the
> 21st century?

Pretty much. :-)

Can you describe your system setup? Trying to create a test case for
this, I made a tiny vfat filesystem, but then found that the mount
options for 'VFAT' state that long filenames are stored in Unicode (I
believe as UTF-16). Short filenames depend on a codepage, but the
utility of a feature like this, just to fix 8.3 filenames seems dubious.

Not that I fully support UTF-8 monoculture[1], but right now, it's by
far the best option. Better to use something like "convmv"[2] to
convert non-UTF-8 filenames to UTF-8.

[1] http://web.archive.org/web/20070213204402/modeemi.fi/~tuomov/b/archives/2006/08/26/T20_16_06/
[*] above written by the "controversial" Tuomo Valkonen -- his writings
tend to be a bit over-the-top, but present a good counterpoint to the
"FOSScracy" [his term]
http://en.wikipedia.org/wiki/Ion_(window_manager)#Controversy

[2] http://www.j3e.de/linux/convmv/man/


> The patch I posted here before adds a new feature that converts
> filenames between the filename encoding on the filesystem and the
> internal 'encoding'.
>
> I'm not sure, but this problem may not happen on Windows, because I
> found a lot of Unicode stuff in os_win32.c, which maybe does the same
> thing in a different way (Win32 wide APIs).
>
> The feature should convert filenames only when 'systemencoding' is
> set; if 'systemencoding' is not set, it should do nothing, as usual.

I think 'filenameencoding' (though long) would be a better name.
'systemencoding' sounds like what the current 'encoding' option does.
("system" to me implies the computing environment, not the filesystem.)

Bram Moolenaar

Oct 5, 2010, 2:35:12 PM
to kiku...@uranus.dti.ne.jp, vim...@googlegroups.com

Kikuchan -

> Hi, All.
> # I changed the subject.
> # (it was: [patch] Add an option to specify filename encoding.)
>
> I'm very sad because nobody has replied to this topic... ;p
> This is not only a problem for CJK people.
>
> You can't handle filenames properly in Vim
> when the filename encoding is different from 'encoding'.
> # at least on Unix-like systems.
>
> This problem affects people who use latin1 (or whatever) encoding
> for filenames and UTF-8 for Vim's internal 'encoding'.
>
> ... Or does no one use a non-Unicode encoding for filenames in the 21st century?

I have not seen this subject before.

> The patch I posted here before adds a new feature that converts filenames
> between the filename encoding on the filesystem and the internal 'encoding'.
>
> I'm not sure, but this problem may not happen on Windows,
> because I found a lot of Unicode stuff in os_win32.c,
> which maybe does the same thing in a different way (Win32 wide APIs).
>
> The feature should convert filenames only when 'systemencoding' is set;
> if 'systemencoding' is not set, it should do nothing, as usual.
>
> So, if people don't use non-Unicode encodings for filenames anymore,
> the feature doesn't affect them.
>
> What do you think about this feature / problem, everyone?
>
> Bram, could you add this *feature* to Vim if you think it's good, please?
> # of course, you don't have to use the patch.
> And let me know if there's something I can do.

It makes sense to me. For Windows we indeed have code to convert
between 'encoding' and UTF-16, which is supported by the library
functions. On Unix there is no such translation. I actually do not
know what Unix standard specifies what encoding file names are in. It
might depend on the mounted drive, in which case it may differ per
device.

- Bram

--
hundred-and-one symptoms of being an internet addict:
243. You unsuccessfully try to download a pizza from www.dominos.com.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Yue Wu

Oct 5, 2010, 10:06:50 PM
to vim...@googlegroups.com
On Wed, Oct 06, 2010 at 01:32:10AM +0900, kiku...@uranus.dti.ne.jp wrote:
> Hi, All.
> # I changed the subject.
> # (it was: [patch] Add an option to specify filename encoding.)
>
> I'm very sad because nobody has replied to this topic... ;p
> This is not only a problem for CJK people.
>
> You can't handle filenames properly in Vim
> when the filename encoding is different from 'encoding'.
> # at least on Unix-like systems.
>
> What do you think about this feature / problem, everyone?
>

I can't say more because of my limited knowledge of the Vim source, but
I strongly support this feature; CJK people need it very much.

--
Regards,
Yue Wu

Key Laboratory of Modern Chinese Medicines
Department of Traditional Chinese Medicine
China Pharmaceutical University
No.24, Tongjia Xiang Street, Nanjing 210009, China

kiku...@uranus.dti.ne.jp

Oct 14, 2010, 1:29:58 PM
to vim...@googlegroups.com, Br...@moolenaar.net
Thank you all for your replies.

Bram wrote:
> It makes sense to me. For Windows we indeed have code to convert
> between 'encoding' and UTF-16, which is supported by the library
> functions. On Unix there is no such translation. I actually do not
> know what Unix standard specifies what encoding file names are in. It
> might depend on the mounted drive, in which case it may differ per
> device.

Yes, it may differ per device on Unix-like systems.
This happens when a user mounts a USB pen drive formatted on Windows.

But modern Unix-like systems have a translation layer for filesystem encoding.
# This is done by the kernel.

So usually the user mounts the pen drive as a UTF-8 encoded filesystem,
regardless of the actual filename encoding on the pen drive.

But some users, including me, still mainly use a non-Unicode filesystem
encoding, for backward compatibility.


Bram wrote:
> I have not seen this subject before.

and, Benjamin wrote:
> Can you describe your system setup? Trying to create a test case for
> this, I made a tiny vfat filesystem, but then found that the mount
> options for 'VFAT' state that long filenames are stored in Unicode (I
> believe as UTF-16). Short filenames depend on a codepage, but the
> utility of a feature like this, just to fix 8.3 filenames seems dubious.

I think this problem does not happen on Windows, because of os_win32.c.
But I'm not sure.

There is no standard encoding for file names on Unix-like systems.
Any byte other than '/' and NUL is accepted in a file name,
even a non-printable control code.
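
A small Python sketch of that rule, for illustration only (the helper name `is_valid_component` is hypothetical, and it ignores length limits and the special names "." and ".."). It also shows why the bytes alone do not tell you the encoding: the same two bytes are one character in UTF-8 and two in latin1.

```python
def is_valid_component(name: bytes) -> bool:
    """A single path component may hold any byte except '/' and NUL."""
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


def interpretations(name: bytes) -> dict:
    """Decode the same bytes under two encodings to compare the results."""
    return {
        "utf-8": name.decode("utf-8"),
        "latin1": name.decode("latin1"),
    }
```

Here `is_valid_component(b"\x01\xa9")` holds even though neither byte is printable ASCII, and `interpretations(b"\xc3\xa9")` yields two different strings for one and the same filename.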

You can reproduce this problem with the following steps.
# I assume you are on a Unix-like system.
----------
# First, launch an old xterm without Unicode support (or whatever; maybe a console).

# Create a file with a latin1 filename (a copyright mark).
% touch `printf '\251'`

# Run Vim with latin1.
% vim -u NONE --cmd 'set encoding=latin1' `printf '\251'`

# You can see the copyright mark as the filename, as expected.
# Then, let's write some text in latin1.
# e.g. just type: i<C-v>xa9<ESC>:wq<CR>

# Run Vim with utf-8, but imagine the user still uses a latin1 terminal.
% vim -u NONE --cmd 'set encoding=utf-8' \
--cmd 'set termencoding=latin1' `printf '\251'`

# or, just use gVim with 'encoding=utf-8'

% gvim -u NONE --cmd 'set encoding=utf-8' `printf '\251'`

# Then, you can see the proper file contents, as expected.
# But the filename in the status line is a mess (a non-printable <a9>).
----------
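
What the steps above show can be modelled in a few lines of Python. This is only a model of the symptom, not of Vim's actual display code: `display_name` is a hypothetical helper that renders every byte it cannot decode as `<xx>`, similar to the `<a9>` seen in the status line.

```python
def display_name(raw: bytes, encoding: str) -> str:
    """Decode a filename for display, showing undecodable bytes as <xx>."""
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN,
    # which lets us find those bytes again after decoding.
    text = raw.decode(encoding, errors="surrogateescape")
    parts = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            parts.append("<%02x>" % (code - 0xDC00))
        else:
            parts.append(ch)
    return "".join(parts)
```

The file created by `touch` has the one-byte name 0xA9. `display_name(b"\xa9", "latin1")` gives the copyright mark, while `display_name(b"\xa9", "utf-8")` gives `<a9>`, because that byte on its own is not valid UTF-8.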

This happens to ALL CJK users who use a non-Unicode filesystem encoding with
'encoding=utf-8'.


Benjamin wrote:
> I think 'filenameencoding' (though long) would be a better name.
> 'systemencoding' sounds like what the current 'encoding' option does.
> ("system" to me implies the computing environment, not the filesystem.)

I named the proposed option 'systemencoding' because
this translation is also needed when executing a shell.

For example (on Vim with 'encoding=utf-8' and 'termencoding=latin1'):
:w [latin1_filename] # This should be a latin1 filename on the filesystem too.
:e [latin1_filename] # ditto
:!echo 'some_message' > [latin1_filename] # ditto
:r! cat -n [latin1_filename] # ditto

# The above 'some_message' may be latin1 too ;)

If there is no encoding translation support for shell execution,
:w and :e work fine with latin1, but :!echo and :r! do not.

This confuses the user, especially when using completion
on the ex command line.

Furthermore, the user was using a latin1 shell before starting Vim.
Executing "echo 'some_message' > [latin1_filename]" in that shell
creates a file with a latin1 filename and a latin1 message in it.

But if there is no encoding translation support for shell execution,
the user gets different results on the Vim ex command line.

So, with this translation, let's make sure the user gets the same results
when executing the same command on the Vim ex command line.

As a side effect of this translation, 'some_message' is treated as latin1.
But this is very natural, isn't it?
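
The shell side of the translation can be sketched like this in Python. It is an illustration under assumptions, not the patch itself: `shell_command_bytes` is a hypothetical helper, and the whole command line is assumed to be converted as one string, which is why the message is converted along with the filename.

```python
def shell_command_bytes(cmd: bytes, encoding: str,
                        systemencoding: str = "") -> bytes:
    """Convert a command held in 'encoding' into bytes for the shell."""
    if not systemencoding:  # option unset: pass the command through as-is
        return cmd
    return cmd.decode(encoding).encode(systemencoding)
```

With `encoding="utf-8"` and `systemencoding="latin1"`, the command `echo '©' > ©.txt` reaches the shell with a single 0xA9 byte in both places, which is what a latin1 shell would have received if the user had typed it there directly.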


Yue wrote:
> I strongly support this feature; CJK people need it very much.

Thanks!


P.S.
The attached patch adds the thin translation layer described above.
The translation is done every time Vim tries to interact with the filesystem
or to execute a shell command.

# I'm sorry, the previous version of the patch had a double-free bug...

Bram, could you include the patch for beta testing?
This translation won't be enabled unless the 'systemencoding' option is set.
# Currently, this may work only on Unix-like systems.

vim-systemencoding.patch

Bram Moolenaar

Oct 14, 2010, 4:46:23 PM
to kiku...@uranus.dti.ne.jp, vim...@googlegroups.com

Kikuchan wrote:

I'll add a remark in the todo list. I hope a few people will try it
out, so that we know whether it works on different systems.

--
They now pass three KNIGHTS impaled to a tree. With their feet off the
ground, with one lance through the lot of them, they are skewered up
like a barbecue.
"Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD
