[BUG] Passing special characters to &listchars and &fillchars causes screen corruption


ZyX

Aug 6, 2011, 9:31:38 AM
to vim...@googlegroups.com
Consider the following code:

vim -u NONE -c $'set list lcs=nbsp:\x0D' \
-c $'normal! i\u00A0\u00A0\u00A0a\e'

You will see the cursor placed on the second virtual `M' (from `^M'), but `ga'
will show that you are on the letter `a'. Passing special characters to the
`tab' suboption causes more corruption, but is less reproducible. With fillchars
the results are better: highlighting partially disappears, but that's all
(tested only with vert and stl).


Tony Mechelynck

Aug 6, 2011, 10:17:32 AM
to vim...@googlegroups.com, ZyX

That code is invalid, see :help 'listchars'

UTF-8 characters can be used when 'encoding' is "utf-8",
otherwise only printable characters are allowed. All characters
must be single width

I suppose a more adequate formulation would be:

Only single-width printable characters are allowed.
Multibyte characters are allowed only if 'encoding' is "utf-8".

The bug, if there is one, is that
:set list lcs=nbsp:\x0D
(with a non-printable character) does not generate an error (I get "E474
Invalid argument", which IMO is no bug.)
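The rule as reformulated above (single-width, printable characters only) can be sketched outside Vim. This is a hypothetical check in Python, not Vim's actual option parser: the function names are made up for illustration, "single width" is approximated with East Asian Width, and zero-width combining characters are not handled.

```python
import unicodedata


def is_acceptable_listchar(ch: str) -> bool:
    """Hypothetical check: one printable character that fits in one screen cell."""
    if len(ch) != 1:
        return False
    # Control characters such as "\r" (0x0D) are not printable.
    if not ch.isprintable():
        return False
    # Wide ("W") and fullwidth ("F") characters occupy two screen cells.
    return unicodedata.east_asian_width(ch) not in ("W", "F")


def check_listchars(value: str) -> list:
    """Return the names of the suboptions whose characters fail the check."""
    bad = []
    for item in value.split(","):
        name, _, chars = item.partition(":")
        if not chars or not all(is_acceptable_listchar(c) for c in chars):
            bad.append(name)
    return bad


print(check_listchars("nbsp:\r"))
print(check_listchars("eol:¶,tab:|_,nbsp:~,conceal:*"))
```

With a check of this kind, the carriage return in the original report would be rejected before it ever reached the screen-drawing code.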

I'm on gvim 7.3.269, Huge build with GTK2/Gnome GUI, under utf-8
'encoding'. My "usual" 'list'/'listchars' setting is

:set list lcs=eol:¶,tab:\|_,nbsp:~,conceal:*

but even temporarily trying to set only

:set list=\x0D

gives me the above-mentioned error, and 'listchars' is not modified.


Best regards,
Tony.
--
"I never met a piece of chocolate I didn't like."

ZyX

Aug 6, 2011, 10:55:30 AM
to vim...@googlegroups.com
Reply to message «Re: [BUG] Passing special characters to &listchars and
&fillchars causes screen corruption»,
sent 18:17:32 06 August 2011, Saturday
by Tony Mechelynck:

> The bug, if there is one, is that
>
> :set list lcs=nbsp:\x0D
>
> (with a non-printable character) does not generate an error (I get "E474
> Invalid argument", which IMO is no bug.)

Yes, error is the desired behavior. It is not generated in my case (you missed
that these quotes are $'...', not '...', so it is `:set list lcs=nbsp:^M',
equivalent to `set list|let &lcs="nbsp:\x0D"').

I tested on vim-7.3.269 from mercurial repository with huge features, gui and
perl, tcl, lua, ruby and python3 support.

> but even temporarily trying to set only
>
> :set list=\x0D

You can't set boolean arguments this way, so I don't understand what is this
remark for.


Tony Mechelynck

Aug 6, 2011, 10:57:07 AM
to vim...@googlegroups.com, ZyX
> :set list lcs=eol:ś,tab:\|_,nbsp:~,conceal:*

>
> but even temporarily trying to set only
>
> :set list=\x0D
>
> gives me the above-mentioned error, and 'listchars' is not modified.
>
>
> Best regards,
> Tony.


...and for some reason that f???ing bl??dy st??id googlegroups interface
changed my Pilcrow mark to an s-acute. Well, the exact character used
there is irrelevant in this case but still, I don't like it. The copy in
my "Sent" folder is in 8bit ISO-8859-1 with the correct Pilcrow mark;
after the [me (SMTP) relay.skynet.be (ESMTP) googlegroups.com (SMTP)
gmail.com (POP3) me] round-trip it comes back in quoted-printable UTF-8
as =C5=9B (equal Charlie Pantafayf equal Noveniner Bravo) which means
U+015B SMALL LATIN LETTER S WITH ACUTE instead of the 0xB6 (U+00B6
PILCROW MARK) which I had sent. Ah, why couldn't Google simply
understand that Latin1 0xB6 means UTF-8 U+00B6? You don't need iconv to
know that. Ah, Google pisses me off. >:-(
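The byte-level difference described above can be checked with a few lines of Python. This is only a sketch of the two decodings; it says nothing about which hop in the mail path did the conversion.

```python
import quopri

# What came back from the list: quoted-printable "=C5=9B", declared as UTF-8.
received = quopri.decodestring(b"=C5=9B").decode("utf-8")

# What a correct Latin-1 to UTF-8 conversion of the byte 0xB6 would produce.
sent = b"\xb6".decode("iso-8859-1")
expected_utf8 = sent.encode("utf-8")

print(f"received: U+{ord(received):04X}")
print(f"sent:     U+{ord(sent):04X}, UTF-8 bytes {expected_utf8.hex()}")
```

A faithful conversion would have yielded the bytes C2 B6, not C5 9B.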

Best regards,
Tony.
--
Seminars, n.:
From "semi" and "arse", hence, any half-assed discussion.

ZyX

Aug 6, 2011, 11:06:21 AM
to vim...@googlegroups.com
Reply to message «Re: [BUG] Passing special characters to &listchars and
&fillchars causes screen corruption»,
sent 18:57:07 06 August 2011, Saturday
by Tony Mechelynck:

I don't use google groups and I received U+00B6, not U+015B. Anyway, I also
really *like* google groups changing my « into << and doing other bad things
when I type Russian text.


Tony Mechelynck

Aug 6, 2011, 11:25:05 AM
to vim...@googlegroups.com, ZyX, Bram Moolenaar
On 06/08/11 16:55, ZyX wrote:
> Reply to message «Re: [BUG] Passing special characters to &listchars and
> &fillchars causes screen corruption»,
> sent 18:17:32 06 August 2011, Saturday
> by Tony Mechelynck:
>
>> The bug, if there is one, is that
>>
>> :set list lcs=nbsp:\x0D
>>
>> (with a non-printable character) does not generate an error (I get "E474
>> Invalid argument", which IMO is no bug.)
> Yes, error is the desired behavior. It is not generated in my case (you missed
> that these quotes are $'...', not '...', so it is `:set list lcs=nbsp:^M',
> equivalent to `set list|let &lcs="nbsp:\x0D"').

Ah, yes, sorry, I had overlooked these dollar signs. Now I can reproduce
the problem:

:set list lcs=nbsp:^M

(entering ^M at the keyboard as Ctrl-V Enter)

does not give an error (I guess that's the bug): then typing

:normal! i~~~a^[

where each ~ represents a non-breaking space, typed as Ctrl-V x A0
(without the spaces), and ^[ an escape (entered as Ctrl-V Esc), Vim displays

^M^M^Ma

The cursor is shown on the second ^M but ga displays

<a> 97, Hex 61, Octal 141

Redrawing (Ctrl-L) doesn't change the display, and moving the cursor
left and right (in Normal mode without virtual editing) is only possible
in the first four screen cells, where ^M^M is shown in blue. Even o (to
create a second line) followed by <Esc><Up> then left-right moves, moves
only over the first four screen cells, and one by one.

>
> I tested on vim-7.3.269 from mercurial repository with huge features, gui and
> perl, tcl, lua, ruby and python3 support.

yeah, me too, except python2, not python3. FWIW I'm on linux-x86_64
(openSUSE 11.4).

>
>> but even temporarily trying to set only
>>
>> :set list=\x0D
> You can't set boolean arguments this way, so I don't understand what is this
> remark for.

I meant listchars


Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
151. You find yourself engaged to someone you've never actually met,
except through e-mail.

Tony Mechelynck

Aug 6, 2011, 11:32:20 AM
to vim...@googlegroups.com, ZyX
On 06/08/11 17:06, ZyX wrote:
> Reply to message «Re: [BUG] Passing special characters to &listchars and
> &fillchars causes screen corruption»,
> sent 18:57:07 06 August 2011, Saturday
> by Tony Mechelynck:
>
> I don't use google groups and I received U+00B6, not U+015B. Anyway, I also
> really *like* google groups changing my « into << and doing other bad things
> when I type Russian text.

Yeah, I had you as CC so you must have got that bypassing the list,
which would then also have bypassed the Google groups "beautifier".
Otherwise you'd have got it from the list (maybe gmail noticed that you
got two copies of a single email, one directly and one via either
vim...@googlegroups.com or vim...@vim.org, which in actuality are one
and the same) and by luck (as it were) the only one you saw was the
"good" one (or maybe not so much by luck, since it would have followed a
less roundabout route and probably arrived in your inbox a few seconds
before the list-post).

Best regards,
Tony.
--
Show respect for age. Drink good Scotch for a change.

Bram Moolenaar

Aug 7, 2011, 7:37:48 AM
to Tony Mechelynck, vim...@googlegroups.com, ZyX

Tony Mechelynck wrote:

Last time I asked about this someone said it's probably your ISP that
does this. I have no clue why though.

--
Vi beats Emacs to death, and then again!
http://linuxtoday.com/stories/5764.html

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Benjamin R. Haskell

Aug 7, 2011, 11:57:54 AM
to vim...@googlegroups.com, ZyX
On Sat, 6 Aug 2011, Groups munged Tony Mechelynck's mail into:

>
>> :set list lcs=eol:ś,tab:\|_,nbsp:~,conceal:*

And he followed up:


>
> ...and for some reason that f???ing bl??dy st??id googlegroups
> interface changed my Pilcrow mark to an s-acute. Well, the exact
> character used there is irrelevant in this case but still, I don't
> like it. The copy in my "Sent" folder is in 8bit ISO-8859-1 with the
> correct Pilcrow mark; after the [me (SMTP) relay.skynet.be (ESMTP)
> googlegroups.com (SMTP) gmail.com (POP3) me] round-trip it comes back
> in quoted-printable UTF-8 as =C5=9B (equal Charlie Pantafayf equal
> Noveniner Bravo) which means U+015B SMALL LATIN LETTER S WITH ACUTE
> instead of the 0xB6 (U+00B6 PILCROW MARK) which I had sent. Ah, why
> couldn't Google simply understand that Latin1 0xB6 means UTF-8 U+00B6?
> You don't need iconv to know that. Ah, Google pisses me off. >:-(

In both this thread and the last time I discussed this¹, it appears that
the only charset that survives roundtripping to Groups when using
codepoints outside of ASCII is UTF-8.

Also as before, though, it's recipient-dependent. ZyX's response² to
the initial, munged mail seems to have it correctly quoted as:

> :set list lcs=eol:¶,tab:\|_,nbsp:~,conceal:*


In the Groups web interface, all of the broken characters are replaced
(for me, using a default charset of UTF-8 everywhere) by the three
characters:

ï¿½

That means that, in the old thread { å, æ, ø, «, » } and in the
new thread { ¶ } were all replaced by ï¿½.
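One plausible reading of that three-character sequence (described later in the thread as i-diaeresis, inverted question mark, one-half) is that the bad byte was first replaced by U+FFFD REPLACEMENT CHARACTER, and its UTF-8 encoding was then displayed as Latin-1. That is an assumption, not something the thread confirms; the sketch below only shows that the bytes match.

```python
replacement = "\ufffd"                       # U+FFFD REPLACEMENT CHARACTER
utf8_bytes = replacement.encode("utf-8")     # three bytes: EF BF BD
misread = utf8_bytes.decode("iso-8859-1")    # each byte read as one Latin-1 character

print(utf8_bytes.hex(), misread)
```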

ZyX appears to have received the old thread correctly, too. His
response there³ has them correctly quoted, but Ben Fritz's response⁴
indicates that the erroneously converted characters were simply absent.

All that said, it's unclear how 0xB6 was misinterpreted as 0xC5,0x9B...
But, alas. Unless you have good reason to stick to explicit Latin-1,
you're probably better off using UTF-8. In the current HTML specs⁵, for
example, even stating that something is ISO-8859-1 is now
*intentionally* treated as CP1252 (Microsoft's version of Latin-1). So,
the number of places in which using ISO-8859-1 instead of UTF-8 will
bite you is only going to increase.

--
Best,
Ben

¹: https://groups.google.com/d/msg/vim_use/UY8vGwc3kvo/QPMZXlptOioJ
²: https://groups.google.com/d/msg/vim_dev/A0Q_z0OksxQ/H-zuwNjtOM4J
³: https://groups.google.com/d/msg/vim_use/UY8vGwc3kvo/P3yr3kNpMBMJ
⁴: https://groups.google.com/d/msg/vim_use/UY8vGwc3kvo/7Vs-BlvtHsQJ
⁵: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0

Tony Mechelynck

Aug 8, 2011, 6:52:10 PM
to Bram Moolenaar, vim...@googlegroups.com, ZyX
On 07/08/11 13:37, Bram Moolenaar wrote:
>
> Tony Mechelynck wrote:
[...]

>> ...and for some reason that f???ing bl??dy st??id googlegroups interface
>> changed my Pilcrow mark to an s-acute. Well, the exact character used
>> there is irrelevant in this case but still, I don't like it. The copy in
>> my "Sent" folder is in 8bit ISO-8859-1 with the correct Pilcrow mark;
>> after the [me (SMTP) relay.skynet.be (ESMTP) googlegroups.com (SMTP)
>> gmail.com (POP3) me] round-trip it comes back in quoted-printable UTF-8
>> as =C5=9B (equal Charlie Pantafayf equal Noveniner Bravo) which means
>> U+015B SMALL LATIN LETTER S WITH ACUTE instead of the 0xB6 (U+00B6
>> PILCROW MARK) which I had sent. Ah, why couldn't Google simply
>> understand that Latin1 0xB6 means UTF-8 U+00B6? You don't need iconv to
>> know that. Ah, Google pisses me off. >:-(
>
> Last time I asked about this someone said it's probably your ISP that
> does this. I have no clue why though.
>

If my ISP does it, it should do it to all recipients. However ZyX (who
got a copy that didn't go through googlegroups, by virtue of being on
the Cc list) saw the (correct) Pilcrow mark, while I (who only got the
googlegroups version) saw the (wrong) s-acute.

Best regards,
Tony.
--
They also surf who only stand on waves.

Tony Mechelynck

Aug 8, 2011, 8:21:04 PM
to vim...@googlegroups.com, Benjamin R. Haskell, ZyX

In this message of yours (which I received in quoted-printable UTF-8)
all these characters arrived (AFAICT) correct: a-ball, ae-ligature,
o-bar, open-French-quote, close-French-quote, Pilcrow-mark, and, at the
end, i-diaeresis, Spanish-inverted-question-mark, one-half.

>
> ZyX appears to have received the old thread correctly, too. His response
> there³ has them correctly quoted, but Ben Fritz's response⁴ indicates
> that the erroneously converted characters were simply absent.
>
> All that said, it's unclear how 0xB6 was misinterpreted as 0xC5,0x9B...
> But, alas. Unless you have good reason to stick to explicit Latin-1,
> you're probably better off using UTF-8. In the current HTML specs⁵, for
> example, even stating that something is ISO-8859-1 is now
> *intentionally* treated as CP1252 (Microsoft's version of Latin-1). So,
> the number of places in which using ISO-8859-1 instead of UTF-8 will
> bite you is only going to increase.
>

The only difference between ISO-8859-1 and Windows-1252 is that in the
former, 0x80 to 0x9F are non-printing control characters (which I don't
use), while in the latter most of them are printable characters (for
which I use UTF-8 if I need them: in fact, my mailer is set to fall back
to UTF-8 if the message contains characters not supported by the charset
in which I would otherwise send it). In ISO-8859-15 (another common
replacement for Latin1) 0x80 to 0x9F are the same nonprinting controls,
but some of 0xA0 to 0xBF are /different/ printing characters, to wit,
the Euro sign €, the French oe and OE digraphs œ Œ, the uppercase
Y-diaeresis Ÿ, and the upper- and lowercase z-caron Ž ž.
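The differences between these three charsets can be listed mechanically. A small Python sketch, relying on Python's bundled codec tables (the helper name `decode_table` is made up for illustration):

```python
def decode_table(codec: str) -> list:
    """Decode each byte 0x00-0xFF on its own; None marks an undefined byte."""
    table = []
    for b in range(256):
        try:
            table.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            table.append(None)
    return table


latin1 = decode_table("iso-8859-1")
cp1252 = decode_table("cp1252")
latin9 = decode_table("iso-8859-15")

# Byte values at which each charset disagrees with ISO-8859-1.
diff_1252 = [b for b in range(256) if latin1[b] != cp1252[b]]
diff_latin9 = [b for b in range(256) if latin1[b] != latin9[b]]

print("Latin-1 vs Windows-1252:", [hex(b) for b in diff_1252])
print("Latin-1 vs ISO-8859-15:", {hex(b): latin9[b] for b in diff_latin9})
```

In these tables the Windows-1252 differences all fall in 0x80 to 0x9F, and the ISO-8859-15 differences are eight positions between 0xA4 and 0xBE; besides the characters named above, the S-caron pair Š š also shows up there.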

One advantage of Latin1 over UTF-8 is that it uses one byte rather than
two for every codepoint in the range [U+0080-U+00FF]. That may or may
not be much of an advantage depending on the proportion of non-ASCII
characters in a "Western-text" message. IOW it would be "least"
advantageous for English text.
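The size argument is easy to measure. A minimal sketch; the two sample strings are made up for illustration and say nothing about real-world averages:

```python
def sizes(text: str) -> tuple:
    """Return (bytes as ISO-8859-1, bytes as UTF-8) for the same text."""
    return len(text.encode("iso-8859-1")), len(text.encode("utf-8"))


english = "The quick brown fox"
french = "Élève à l'école, déjà très fatigué"

for sample in (english, french):
    latin1_len, utf8_len = sizes(sample)
    print(f"{sample!r}: Latin-1 {latin1_len} bytes, UTF-8 {utf8_len} bytes")
```

Pure ASCII text comes out the same size in both encodings; each character in U+0080 to U+00FF costs one extra byte in UTF-8.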

I'll send this reply in UTF-8, just to see if it makes a difference. I
also checked my character-encoding preferences, and changed the
"encoding to use when replying" from ISO-8859-1 to "whatever the sender
used" (subject, in both cases, to UTF-8 fallback if the message text
doesn't fit). If it isn't good enough I'll change it again.

As for HTML specs, last time I checked they didn't apply to email, and
it's email which gives me problems; with HTML I usually have no problem,
except when the page is badly set up, let's say a page sent in some
bizarre charset with no charset mentioned in an HTML Content-Type header
and also not in any <meta http-equiv="Content-Type"> element.

Oh, and about your reference 5, I thought the normative authority for
HTML was the W3C, in whose Standards I don't find what your whatwg page
displays, and sometimes even the opposite, see for instance items C030
and C076 under "Character Model for the World Wide Web (latest
revision)" which I reached from "HTML for User Agents": namely,
http://www.w3.org/TR/charmod/#C030 and http://www.w3.org/TR/charmod/#C076


Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:

153. You find yourself staring at your "inbox" waiting for new e-mail
to arrive.

Benjamin R. Haskell

Aug 8, 2011, 9:38:12 PM
to Tony Mechelynck, vim...@googlegroups.com, ZyX
On Tue, 9 Aug 2011, Tony Mechelynck wrote:

> On 07/08/11 17:57, Benjamin R. Haskell wrote:
>> That means that, in the old thread { å, æ, ø, «, » } and in the new
>> thread { ¶ } were all replaced by ï¿½.
>
> In this message of yours (which I received in quoted-printable UTF-8)
> all these characters arrived (AFAICT) correct: a-ball, ae-ligature,
> o-bar, open-French-quote, close-French-quote, Pilcrow-mark, and, at
> the end, i-diaeresis, Spanish-inverted-question-mark, one-half.

Yep. As input.


>> All that said, it's unclear how 0xB6 was misinterpreted as
>> 0xC5,0x9B... But, alas. Unless you have good reason to stick to
>> explicit Latin-1, you're probably better off using UTF-8. In the
>> current HTML specs⁵, for example, even stating that something is
>> ISO-8859-1 is now *intentionally* treated as CP1252 (Microsoft's
>> version of Latin-1). So, the number of places in which using
>> ISO-8859-1 instead of UTF-8 will bite you is only going to increase.
>
> The only difference between ISO-8859-1 and Windows-1252 is that in the
> former, 0x80 to 0x9F are non-printing control characters (which I
> don't use), while in the latter most of them are printable characters
> (for which I use UTF-8 if I need them: in fact, my mailer is set to
> fall back to UTF-8 if the message contains characters not supported by
> the charset in which I would otherwise send it). In ISO-8859-15
> (another common replacement for Latin1) 0x80 to 0x9F are the same
> nonprinting controls, but some of 0xA0 to 0xBF are /different/
> printing characters, to wit, the Euro sign €, the French oe and OE
> digraphs œ Œ, the uppercase Y-diaeresis Ÿ, and the upper- and
> lowercase z-caron Ž ž.

Right, of course. I was thinking -15 when writing -1.


> One advantage of Latin1 over UTF-8 is that it uses one byte rather
> than two for every codepoint in the range [U+0080-U+00FF]. That may or
> may not be much of an advantage depending on the proportion of
> non-ASCII characters in a "Western-text" message. IOW it would be
> "least" advantageous for English text.

So, pros: possibly, maybe saves a couple of bytes.
Cons: is more likely to be misinterpreted.


> I'll send this reply in UTF-8, just to see if it makes a difference. I
> also checked my character-encoding preferences, and changed the
> "encoding to use when replying" from ISO-8859-1 to "whatever the
> sender used" (subject, in both cases, to UTF-8 fallback if the message
> text doesn't fit). If it isn't good enough I'll change it again.

Seems properly encoded.


> As for HTML specs, last time I checked they didn't apply to email,

My point wasn't about HTML or email, it was about the outmoded nature of
ISO-8859-n ∀n ∈ { x | x ≥ 1 & x ≤ 15 }. ( for all n belonging to the
set { x, where x >= 1 and x <= 15 } if your font's missing any of those
chars)

UTF-8, since it can encode anything in any of those charsets, but has
fewer interoperability problems, is virtually always preferable (at this
point).


> and it's email which gives me problems; with HTML I usually have no
> problem, except when the page is badly set up, let's say a page sent
> in some bizarre charset with no charset mentioned in an HTML
> Content-Type header and also not in any <meta
> http-equiv="Content-Type"> element.

Part of the reason you usually have no problem is that browsers have a
long "tradition" of having to be better at guessing the proper encoding
in the face of bad data (hence HTML is the first major spec [AFAIK] to
break from accepting what's provided as charset).


> Oh, and about your reference 5, I thought the normative authority for HTML
> was the W3C, in whose Standards I don't find what your whatwg page displays,
> and sometimes even the opposite, see for instance items C030 and C076 under
> "Character Model for the World Wide Web (latest revision)" which I reached
> from "HTML for User Agents": namely, http://www.w3.org/TR/charmod/#C030 and
> http://www.w3.org/TR/charmod/#C076

Yes, sorry. WHATWG = Web Hypertext Application Technology Working
Group. The current editor, Ian Hickson, is also the current editor of
the HTML5 spec¹, so I mistook it for official.

The official spec and my original link point out² that the character
override is a "willful violation"³ of the specs that you pointed to.
Which also points to the fact that you're only going to have more
problems in the future should you stick with ISO-8859-n.

--
Best,
Ben

¹: HTML5 spec
current: http://www.w3.org/TR/html5/parsing.html
latest draft: http://dev.w3.org/html5/spec/Overview.html

²: § 8.2.2.1 (last ¶, just above the link below)
current: http://www.w3.org/TR/html5/parsing.html#character-encodings-0
latest draft: http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

³: § 1.5.2 "Compliance with other specifications"
http://www.w3.org/TR/html5/introduction.html#compliance-with-other-specifications

Tony Mechelynck

Aug 9, 2011, 5:07:41 PM
to Benjamin R. Haskell, vim...@googlegroups.com, ZyX
On 09/08/11 03:38, Benjamin R. Haskell wrote:
> On Tue, 9 Aug 2011, Tony Mechelynck wrote:
[...]

>> The only difference between ISO-8859-1 and Windows-1252 is that in the
>> former, 0x80 to 0x9F are non-printing control characters (which I
>> don't use), while in the latter most of them are printable characters
>> (for which I use UTF-8 if I need them: in fact, my mailer is set to
>> fall back to UTF-8 if the message contains characters not supported by
>> the charset in which I would otherwise send it). In ISO-8859-15
>> (another common replacement for Latin1) 0x80 to 0x9F are the same
>> nonprinting controls, but some of 0xA0 to 0xBF are /different/
>> printing characters, to wit, the Euro sign €, the French oe and OE
>> digraphs œ Œ, the uppercase Y-diaeresis Ÿ, and the upper- and
>> lowercase z-caron Ž ž.
>
> Right, of course. I was thinking -15 when writing -1.

Ah. I only rarely use Latin9 (ISO-8859-15) anyway. When the document you
mentioned (and the HTML5 specs you mentioned here in a footnote) say to
use Windows-1252 as a willful violation of previous specs when Latin1
(ISO-8859-1) is requested, there is no big risk of failure since these
nonprinting controls are practically never used in Latin1.

>
>
>> One advantage of Latin1 over UTF-8 is that it uses one byte rather
>> than two for every codepoint in the range [U+0080-U+00FF]. That may or
>> may not be much of an advantage depending on the proportion of
>> non-ASCII characters in a "Western-text" message. IOW it would be
>> "least" advantageous for English text.
>
> So, pros: possibly, maybe saves a couple of bytes.
> Cons: is more likely to be misinterpreted.

For English the balance is probably in favour of UTF-8, but languages
like French, Spanish, German, Danish, etc. use comparatively many more
"accented" characters above 0x7F.

Thanks for the links in your footnotes (but since they are after the
dash-dash-space my mailer removes them when replying); but they all
apply only to HTML5, don't they? When I publish web pages, I use UTF-8
but also HTML 4.01.

For email, I would expect that the Content-Type header be respected (and
that any translation along the way be done in such a way as not to corrupt
the data as interpreted at every step according to the Content-Type
header accompanying it); and anyway, when a Pilcrow mark (in Latin1)
comes back in UTF-8 as an s-acute (which doesn't exist in _either_
ISO-8859-1 _or_ Windows-1252), I wonder how that s-acute could have been
injected. I couldn't even imagine any consensus of vendors of email
agents and servers which would approve such a "wilful violation" of the
standards.

Hm, it seems that ISO-8859-2 (a Central- or East-European Latin encoding
AFAICT) has s-acute where ISO-8859-1, -15 and Windows-1252 all have a
Pilcrow mark. Still doesn't explain why or how it got in.
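That observation can be verified directly: reading the byte 0xB6 as ISO-8859-2 and re-encoding it as UTF-8 reproduces exactly the bytes that came back. The Python sketch below only shows that the bytes are consistent with such a mislabelling; where in the mail path it would have happened remains unknown.

```python
raw = b"\xb6"                             # the byte that was sent

as_latin1 = raw.decode("iso-8859-1")      # what the sender meant
as_latin2 = raw.decode("iso-8859-2")      # the same byte read as ISO-8859-2
observed = as_latin2.encode("utf-8")      # re-encoded as UTF-8

print(f"ISO-8859-1: U+{ord(as_latin1):04X}")
print(f"ISO-8859-2: U+{ord(as_latin2):04X}, as UTF-8: {observed.hex()}")
```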


Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:

154. You fondle your mouse.
