ft=diff causes an error when encoding=cp932 on Cygwin

mattn

Aug 31, 2014, 10:11:00 PM
to vim...@googlegroups.com
Hi list.

When encoding=utf-8 on Cygwin, setting ft=diff causes an error,
because characters that can't be converted are translated to '?' by iconv.

Below is a patch that skips the sections for unrelated languages.

https://gist.github.com/mattn/d9432fbd6eadca91e1d6

Thanks.
- Yasuhiro Matsumoto

mattn

Aug 31, 2014, 11:32:00 PM
to vim...@googlegroups.com
Sorry, the URL I pointed to is not a patch, because it's too difficult to make a patch that contains multi-byte characters.

Bram Moolenaar

Sep 1, 2014, 3:38:31 PM
to mattn, vim...@googlegroups.com

Yasuhiro Matsumoto wrote:

> When encoding=utf-8 on Cygwin, setting ft=diff causes an error,
> because characters that can't be converted are translated to '?'
> by iconv.
>
> Below is a patch that skips the sections for unrelated languages.
>
> https://gist.github.com/mattn/d9432fbd6eadca91e1d6

Sorry, it's unclear to me what the problem is. As you mention, the link
points to a new version, not a patch (can't make a diff easily right
now).

--
From "know your smileys":
@:-() Elvis Presley

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

mattn

Sep 1, 2014, 9:42:33 PM
to vim...@googlegroups.com, matt...@gmail.com
> Sorry, it's unclear to me what the problem is. As you mention, the link
> points to a new version, not a patch (can't make a diff easily right
> now).

diff.vim contains some characters that cannot be represented in ANSI encodings,
so I added if/endif around each part.
The link is the whole diff.vim file rather than a diff, because I was worried about characters being dropped while diffing. Below is a patch.

https://gist.github.com/mattn/fddeba4a9aade3067fa5

If you have a problem with this diff, please use the whole diff.vim file linked above.

Bram Moolenaar

Sep 2, 2014, 5:42:20 PM
to mattn, vim...@googlegroups.com
Thanks for the explanation. I'll put it on the todo list.

--
I AM THANKFUL...
...for the mess to clean after a party because it means I have
been surrounded by friends.

Bram Moolenaar

Sep 19, 2014, 11:33:55 AM
to mattn, vim...@googlegroups.com
This makes the highlighting depend on the current language. However,
the diff files I edit may be in any language, not necessarily matching
the current language.

Can we think of another way to avoid the problems you experienced?
Perhaps only skip the lines that have characters that don't work in the
current encoding?



--
If you feel lonely, try schizophrenia.

mattn

Sep 21, 2014, 9:15:17 PM
to vim...@googlegroups.com, matt...@gmail.com
For example:

syn match diffOnly "^.*だけに発見: .*"

This message is printed by the "diff" tool itself, so it does not depend on the language set in Vim. For multi-byte users this causes many problems.

C:\SomeRepo\>svn diff | vim -

When some files exist only on the left (or right) side, the message above is displayed, but it is encoded in the current locale, e.g. cp932 on Japanese Windows. Vim often misjudges the encoding, so I often do the following:


C:\SomeRepo\>set LANG=C
C:\SomeRepo\>svn diff | vim -

It is not possible to prevent the input from containing several encodings.
One way to avoid this problem is to not match the string at all.

- Yasuhiro Matsumoto

mattn

Nov 5, 2014, 12:30:03 PM
to vim...@googlegroups.com, matt...@gmail.com
Bram, you seem to have removed this issue from the todo list.
But I think merging the patch above is better than keeping the current state.

There are two problems.

1. diff.vim contains several encodings, so if a DBCS encoding is used in Vim, Vim may have to handle invalid characters.
2. The localized messages of svn are encoded in the system locale encoding, so they may not match Vim's encoding.

The first problem will be fixed by my patch.
To fix the second problem, I suggest removing the 'diffOnly' syntax for multi-byte encodings.

- Yasuhiro Matsumoto

Bram Moolenaar

Nov 5, 2014, 3:11:26 PM
to mattn, vim...@googlegroups.com
If I remember correctly, your patch breaks recognizing diff headers if
the text does not match the current locale. E.g., when my locale is
German and I edit a diff file generated by someone in Italy, I still
expect the headers to be recognized.

When the file's encoding differs from what Vim has detected then all
bets are off, it will be impossible to compare the text correctly.
Unless we have a regexp that works around it, it's probably very
difficult.

What is the error that is reported when using a DBCS encoding?
A reproducible example is useful.

--
FATHER: You killed eight wedding guests in all!
LAUNCELOT: Er, Well ... the thing is ... I thought your son was a lady.
FATHER: I can understand that.
"Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

Ken Takata

Nov 6, 2014, 5:51:13 AM
to vim...@googlegroups.com, matt...@gmail.com
Hi,

2014/11/6 Thu 5:11:26 UTC+9 Bram Moolenaar wrote:
> Yasuhiro Matsumoto wrote:
>
> > Bram, you seem to have removed this issue from the todo list.
> > But I think merging the patch above is better than keeping the current state.
> >
> > There are two problems.
> >
> > 1. diff.vim contains several encodings, so if a DBCS encoding is used in
> > Vim, Vim may have to handle invalid characters.
> > 2. The localized messages of svn are encoded in the system locale
> > encoding, so they may not match Vim's encoding.
> >
> > The first problem will be fixed by my patch.
> > To fix the second problem, I suggest removing the 'diffOnly' syntax
> > for multi-byte encodings.
>
> If I remember correctly, your patch breaks recognizing diff headers if
> the text does not match the current locale. E.g., when my locale is
> German and I edit a diff file generated by someone in Italy, I still
> expect the headers to be recognized.
>
> When the file's encoding differs from what Vim has detected then all
> bets are off, it will be impossible to compare the text correctly.
> Unless we have a regexp that works around it, it's probably very
> difficult.
>
> What is the error that is reported when using a DBCS encoding?
> A reproducible example is useful.

It occurs when enc=cp932 on Cygwin/MSYS/Linux.
E.g.:

$ vim -u NONE -N -c "set enc=cp932" -c "syntax on" -c "set ft=diff"
Error detected while processing /usr/local/share/vim/vim74/syntax/diff.vim:
line 128:
E401: Pattern delimiter not found: "^\\ ????????????????????????????????????? ??
?? ???????
E475: Invalid argument: diffNoEOL^I"^\\ ????????????????????????????????????? ??
?? ???????

It doesn't occur on Win32. Maybe it occurs only when libiconv is used.
libiconv fails to convert the encoding of diff.vim from utf-8 to cp932, so Vim
opens diff.vim without converting the encoding.

The root cause of this problem is handling of invalid characters.
The last two bytes of the line 128 are 0x97 0x22 (").
0x97 can be a lead byte in cp932, but 0x22 cannot be a trail byte in cp932.
However, Vim wrongly handles the byte sequence 0x97 0x22 as one character.
Thus Vim cannot find the ending double quotation mark (0x22).
Maybe we also need to check the trail byte (not only the lead byte), but it
might be a little bit slow. BTW, I think enc=cp932 is a legacy setting
(especially on Cygwin/Linux), so I don't want to make an effort to fix this.
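
Checking the trail byte as suggested above could look roughly like the following Python sketch. It is only a model of the idea, not Vim's code: the function names are made up for illustration, and the byte ranges are the commonly documented cp932 ranges (lead 0x81-0x9F and 0xE0-0xFC, trail 0x40-0x7E and 0x80-0xFC).

```python
def is_cp932_lead(b: int) -> bool:
    """True if b can start a double-byte character in cp932."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_cp932_trail(b: int) -> bool:
    """True if b can be the second byte of a double-byte character."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def char_len_lead_only(data: bytes, i: int) -> int:
    """Length of the character at data[i], looking only at the lead byte."""
    return 2 if is_cp932_lead(data[i]) and i + 1 < len(data) else 1


def char_len_checked(data: bytes, i: int) -> int:
    """Same, but a lead byte with an invalid trail byte counts as 1 byte."""
    if (is_cp932_lead(data[i]) and i + 1 < len(data)
            and is_cp932_trail(data[i + 1])):
        return 2
    return 1


tail = b"\x97\x22"                    # last two bytes of line 128
print(char_len_lead_only(tail, 0))    # 2: the quote is swallowed
print(char_len_checked(tail, 0))      # 1: the quote survives
```

The extra comparison per character is the cost referred to above as "a little bit slow".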

Instead of fixing Vim itself, I have two ideas to work around this problem:

1. Add a dummy ending quotation ( | ") at the end of the line 128.

--- a/runtime/syntax/diff.vim
+++ b/runtime/syntax/diff.vim
@@ -125,7 +125,7 @@
syn match diffDiffer "^הזמ הז םינוש `.*'-ו `.*' םיצבקה$"
syn match diffBDiffer "^הזמ הז םינוש `.*'-ו `.*' םיירניב םיצבק$"
syn match diffIsA "^.* .*-ל .* .* תוושהל ןתינ אל$"
-syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח"
+syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח" | "
syn match diffCommon "^.*-ו .* :תוהז תויקית-תת$"

" hr


2. Add an empty pattern "\%(\)" at the end of the pattern in the line 128.

--- a/runtime/syntax/diff.vim
+++ b/runtime/syntax/diff.vim
@@ -125,7 +125,7 @@
syn match diffDiffer "^הזמ הז םינוש `.*'-ו `.*' םיצבקה$"
syn match diffBDiffer "^הזמ הז םינוש `.*'-ו `.*' םיירניב םיצבק$"
syn match diffIsA "^.* .*-ל .* .* תוושהל ןתינ אל$"
-syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח"
+syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח\%(\)"
syn match diffCommon "^.*-ו .* :תוהז תויקית-תת$"

" hr


Regards,
Ken Takata

Bram Moolenaar

Nov 10, 2014, 3:56:06 PM
to Ken Takata, vim...@googlegroups.com, matt...@gmail.com
Yes, libiconv is strict about rejecting characters it cannot convert.

> The root cause of this problem is handling of invalid characters.

I suppose you mean characters that are valid in utf-8 but cannot be
converted to cp932.

> The last two bytes of the line 128 are 0x97 0x22 (").
> 0x97 can be a lead byte in cp932, but 0x22 cannot be a trail byte in cp932.
> However, Vim wrongly handles the byte sequence 0x97 0x22 as one character.
> Thus Vim cannot find the ending double quotation mark (0x22).
> Maybe we also need to check the trail byte (not only the lead byte), but it
> might be a little bit slow. BTW, I think enc=cp932 is a legacy setting
> (especially on Cygwin/Linux), so I don't want to make an effort to fix this.
>
> Instead of fixing Vim itself, I have two ideas to work around this problem:
>
> 1. Add a dummy ending quotation ( | ") at the end of the line 128.
>
> --- a/runtime/syntax/diff.vim
> +++ b/runtime/syntax/diff.vim
> @@ -125,7 +125,7 @@
> syn match diffDiffer "^הזמ הז םינוש `.*'-ו `.*' םיצבקה$"
> syn match diffBDiffer "^הזמ הז םינוש `.*'-ו `.*' םיירניב םיצבק$"
> syn match diffIsA "^.* .*-ל .* .* תוושהל ןתינ אל$"
> -syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח"
> +syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח" | "

The quotes seem wrong here. Is it supposed to be:

syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח| "

Note that one can also use "." to match any character. So long as there
are enough characters left to avoid a false match.

> syn match diffCommon "^.*-ו .* :תוהז תויקית-תת$"
>
> " hr
>
>
> 2. Add an empty pattern "\%(\)" at the end of the pattern in the line 128.
>
> --- a/runtime/syntax/diff.vim
> +++ b/runtime/syntax/diff.vim
> @@ -125,7 +125,7 @@
> syn match diffDiffer "^הזמ הז םינוש `.*'-ו `.*' םיצבקה$"
> syn match diffBDiffer "^הזמ הז םינוש `.*'-ו `.*' םיירניב םיצבק$"
> syn match diffIsA "^.* .*-ל .* .* תוושהל ןתינ אל$"
> -syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח"
> +syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח\%(\)"
> syn match diffCommon "^.*-ו .* :תוהז תויקית-תת$"
>
> " hr

--
Q: What do you call a fish without an eye?
A: fsh!
Q: What do you call a deer with no eyes?
A: no eye deer.
Q: What do you call a deer with no eyes and no legs?
A: still no eye deer.

Ken Takata

Nov 11, 2014, 5:49:23 AM
to vim...@googlegroups.com, ktakat...@gmail.com, matt...@gmail.com
Hi,
No, it doesn't matter whether the characters are valid or not in utf-8. It
only matters when the characters are invalid in cp932.

The last two characters of the line 128 are <U+05d7><U+0022>, the byte
sequence is <d7><97><22>. The character <U+05d7> cannot be converted to cp932,
so Vim loads diff.vim without converting. So the line is loaded as is. Then
the last two bytes <97><22> are handled as one character <9722> in cp932, so
the last " disappears. But actually <9722> is not a valid character in cp932.
This is the cause of E401.

BTW, when you open diff.vim with setting the encoding explicitly
(:e ++enc=utf-8), the character sequence <U+05d7><U+0022> is converted to '?"'.
In this case, the problem doesn't occur.
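
The byte values above can be checked with a few lines of Python. This is only an illustration using Python's built-in codecs, which are not what Vim or libiconv use internally; the `errors="replace"` call merely imitates the '?' substitution described for `++enc=utf-8`.

```python
text = "\u05d7\""                 # <U+05d7> followed by a double quote

utf8 = text.encode("utf-8")
print(utf8.hex(" "))              # d7 97 22

# <U+05d7> has no cp932 representation, so a strict conversion fails.
try:
    text.encode("cp932")
    convertible = True
except UnicodeEncodeError:
    convertible = False
print(convertible)                # False

# A lossy conversion replaces the character with '?' and keeps the quote.
lossy = text.encode("cp932", errors="replace")
print(lossy)                      # b'?"'
```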


> > The last two bytes of the line 128 are 0x97 0x22 (").
> > 0x97 can be a lead byte in cp932, but 0x22 cannot be a trail byte in cp932.
> > However, Vim wrongly handles the byte sequence 0x97 0x22 as one character.
> > Thus Vim cannot find the ending double quotation mark (0x22).
> > Maybe we also need to check the trail byte (not only the lead byte), but it
> > might be a little bit slow. BTW, I think enc=cp932 is a legacy setting
> > (especially on Cygwin/Linux), so I don't want to make an effort to fix this.
> >
> > Instead of fixing Vim itself, I have two ideas to work around this problem:
> >
> > 1. Add a dummy ending quotation ( | ") at the end of the line 128.
> >
> > --- a/runtime/syntax/diff.vim
> > +++ b/runtime/syntax/diff.vim
> > @@ -125,7 +125,7 @@
> > syn match diffDiffer "^הזמ הז םינוש `.*'-ו `.*' םיצבקה$"
> > syn match diffBDiffer "^הזמ הז םינוש `.*'-ו `.*' םיירניב םיצבק$"
> > syn match diffIsA "^.* .*-ל .* .* תוושהל ןתינ אל$"
> > -syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח"
> > +syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח" | "
>
> The quotes seem wrong here. Is it supposed to be:
>
> syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח| "

No, it's exactly what I intended. This is a trick. | is a command separator
and the last " is start of a comment:

syn match diffNoEOL "pattern<U+05d7>" | "
                                        ^ start of a comment

But the line is handled as the following in cp932:

syn match diffNoEOL "pattern<d7><9722> | "
                                         ^ ending quotation

Now Vim can find an ending quotation, so the error disappears.


> Note that one can also use "." to match any character. So long as there
> are enough characters left to avoid a false match.

I don't know the output of the diff command in Hebrew, but comparing with
other translations, the line might end with <U+05d7>, so the last "." won't
match. ".\?" would be better.

So, there are three workarounds:

1. A tricky way.
syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח" | "

2. Exactly the same meaning as before.
syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח\%(\)"

3. Not exactly the same, but easier.
syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח.\?"

No.3 is the best?
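
Why all three variants avoid E401 can be seen with a small Python model of a lead-byte-only scanner. It is a deliberately simplified sketch, not Vim's parser: `find_closing_quote` is a hypothetical helper, backslash escapes are ignored, and the pattern is shortened to its last three Hebrew letters.

```python
def is_cp932_lead(b: int) -> bool:
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_closing_quote(line: bytes) -> int:
    """Index of the quote that closes the first quoted string, or -1.

    A lead byte always consumes the following byte, which mimics the
    lead-byte-only character handling described in this thread.
    """
    i = line.index(b'"') + 1
    while i < len(line):
        if is_cp932_lead(line[i]):
            i += 2
        elif line[i] == 0x22:
            return i
        else:
            i += 1
    return -1


# Last three letters of the pattern; the final one is <U+05d7> (d7 97).
tail = "\u05e8\u05e1\u05d7".encode("utf-8")
prefix = b'syn match diffNoEOL "'

variants = {
    "original": prefix + tail + b'"',
    "1. dummy quote": prefix + tail + b'" | "',
    "2. empty group": prefix + tail + b'\\%(\\)"',
    "3. optional dot": prefix + tail + b'.\\?"',
}

for name, line in variants.items():
    found = find_closing_quote(line) != -1
    print(f"{name}: closing quote found = {found}")
```

In this model only the original line ends up without a closing quote. In each workaround the byte swallowed by 0x97 is something other than the final quote.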

Regards,
Ken Takata

Bram Moolenaar

Nov 12, 2014, 7:08:32 AM
to Ken Takata, vim...@googlegroups.com, matt...@gmail.com
Yes, that's what I meant. The original file is valid utf-8, but because
of the failing conversion you end up with something invalid.
Aha, clever.

> > Note that one can also use "." to match any character. So long as there
> > are enough characters left to avoid a false match.
>
> I don't know the output of the diff command in Hebrew, but comparing with
> other translations, the line might end with <U+05d7>, so the last "." won't
> match. ".\?" would be better.
>
> So, there are three workarounds:
>
> 1. A tricky way.
> syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח" | "
>
> 2. Exactly the same meaning as before.
> syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח\%(\)"
>
> 3. Not exactly the same, but easier.
> syn match diffNoEOL "^\\ ץבוקה ףוסב השדח-הרוש ות רסח.\?"
>
> No.3 is the best?

I was thinking of dropping the character that causes the conversion error:

syn match diffNoEOL "^\\ ץבוקה ףוסב השד.-הרוש ות רס."

Are there any other characters that can't be converted?


--
ARTHUR: Ni!
BEDEVERE: Nu!
ARTHUR: No. Ni! More like this. "Ni"!
BEDEVERE: Ni, ni, ni!
"Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

Ken Takata

Nov 12, 2014, 7:22:44 AM
to vim...@googlegroups.com, ktakat...@gmail.com, matt...@gmail.com
Hi,
Ah, now I understand. Seems good.


> Are there any other characters that can't be converted?

There are still many other characters that can't be converted, but
none of the others causes an error. So this fix is enough.
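
A rough way to tell the two kinds of characters apart is sketched below in Python. It assumes Python's cp932 codec is a reasonable stand-in for libiconv's conversion, and `unconvertible` and `swallows_next_byte` are illustrative names, not Vim functions.

```python
def unconvertible(text: str) -> list:
    """Characters of text that cannot be encoded in cp932."""
    bad = []
    for ch in text:
        try:
            ch.encode("cp932")
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


def swallows_next_byte(ch: str) -> bool:
    """True if the last UTF-8 byte of ch looks like a cp932 lead byte."""
    last = ch.encode("utf-8")[-1]
    return 0x81 <= last <= 0x9F or 0xE0 <= last <= 0xFC


tail = "\u05e8\u05e1\u05d7"     # last three letters of the line 128 pattern

# All three letters fail to convert ...
print([f"U+{ord(c):04X}" for c in unconvertible(tail)])
# ... but only one of them ends in a byte that can swallow what follows.
print([f"U+{ord(c):04X}" for c in tail if swallows_next_byte(c)])
```

Such a character is harmful only when the byte it swallows is a delimiter, such as the closing quote of the pattern.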

Regards,
Ken Takata