Search for character that doesn't have a combining character?


Ben Fritz

Jun 23, 2015, 4:35:50 PM
to vim...@googlegroups.com
I'm working on a custom command to add strikethrough to text, using the Unicode COMBINING LONG STROKE OVERLAY, 0x0336.

In this command, I want to apply a strikethrough to a character only if that character is not already struck through.

This pattern fails in several ways: it doesn't match *anything* with 'regexpengine' set to 2, it does not match an unadorned character immediately before a struck-through base character, and, for some reason, it *does* match the last combining character in a word:

[^\u0336]\%u0336\@!

This pattern also fails, because it matches already struck-through base characters for some reason (although it does the same thing in both engines):

[^\u0336][^\u0336]\@=

What is the correct way to do this?

Full command (attempted):

'<,'>s;\%#=1\%V[^\u0336]\%u0336\@!;\=submatch(0)."\u0336";g

Note that I'm also limiting the change to a visual selection, which is why I'm trying to use the :s command for simplicity.

Ben Fritz

Jun 23, 2015, 5:02:49 PM
to vim...@googlegroups.com, fritzo...@gmail.com

My next attempt is to do two passes, first to remove the combining character from everywhere in the visual selection, and then to add it to the entire visual selection.

But my patterns for this task either don't match at all, or they remove the base character along with the combining character! Even this doesn't work; it removes the base character:

echo join(split(getline('.'), "\u0336"),"")

Nikolay Pavlov

Jun 23, 2015, 5:16:29 PM
to vim...@googlegroups.com, Benjamin Fritz
I was about to suggest using tr(), but:

echo tr("o\u0336", "o", "t") is# "o\u0336"
echo tr("o\u0336", "\u0336", "t") is# "o\u0336"
echo tr("o\u0336", "o\u0336", "t") is# "t"

Apparently tr() treats a character as a Unicode codepoint *together
with* all of the combining characters that follow it.

I would say here that

1. The regexp engines need proper `\p` support, as in Perl/PCRE. Or, at
least, the opposite of \Z: something that tells the RE engine to treat
all Unicode codepoints in the same way.
2. tr() must *always* use “one character is one Unicode codepoint”
when &encoding is Unicode. It is too low-level a tool to care about
character classes, and especially to join codepoints together.


Nikolay Pavlov

Jun 23, 2015, 5:20:32 PM
to vim...@googlegroups.com, Benjamin Fritz
2015-06-24 0:02 GMT+03:00 Ben Fritz <fritzo...@gmail.com>:
Though there is always one hack to get exactly one Unicode codepoint
from a *valid* UTF-8 string:

echo nr2char(char2nr(string[position :]))

You can use `len(nr2char(…))` to get the byte length of the first
codepoint and thus get to the second. I think this will allow you to
construct the needed \= expression, but the result would most likely be
the definition of a new function, due to its complexity.
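
A minimal sketch of that loop, assuming 'encoding' is utf-8 and the input is valid UTF-8. The function name SplitCodepoints is made up for illustration; only char2nr(), nr2char() and len() come from the hack above.

```vim
" Split a valid UTF-8 string into a list of single codepoints.
" SplitCodepoints is a hypothetical name; assumes 'encoding' is utf-8.
function! SplitCodepoints(str) abort
  let l:result = []
  let l:pos = 0
  while l:pos < len(a:str)
    " char2nr() looks only at the first codepoint of the remaining
    " bytes, and nr2char() re-encodes that codepoint on its own.
    let l:cp = nr2char(char2nr(a:str[l:pos :]))
    call add(l:result, l:cp)
    " len() counts bytes, so this steps to the next codepoint.
    let l:pos += len(l:cp)
  endwhile
  return l:result
endfunction
```

With this, a base character and its combining characters end up as separate list items, so each one can be inspected with char2nr() individually.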

Ben Fritz

Jun 23, 2015, 6:12:09 PM
to vim...@googlegroups.com, zyx...@gmail.com, fritzo...@gmail.com
On Tuesday, June 23, 2015 at 4:20:32 PM UTC-5, ZyX wrote:
> Though there is always one hack to get exactly one Unicode codepoint
> from a *valid* UTF-8 string:
>
> echo nr2char(char2nr(string[position :]))
>
> You can use `len(nr2char(…))` to get the byte length of the first
> codepoint and thus get to the second. I think this will allow you to
> construct the needed \= expression, but the result would most likely be
> the definition of a new function, due to its complexity.
>

Thanks! I agree this needs to be better supported in Vim's regex engines and tr() function. For my purposes I can pretty much always assume that any combining characters are strikethrough characters, which makes the replacement function trivial to implement with a nr2char(char2nr(submatch(0))) hack. Obviously this is not a good general solution, as it will strip off all other combining characters when adding or removing the one character I'm actually interested in. I guess I could loop through the input string as you suggest if I want to make a general solution at some point.
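
For reference, a rough sketch of what such a loop might look like, meant to live in a script file and to be called from a \= expression with submatch(0). It assumes 'encoding' is utf-8 and valid UTF-8 input, and as a simplification it only treats U+0300..U+036F (the Combining Diacritical Marks block) as combining. The names StrikeOnce and s:IsMark are made up for illustration.

```vim
" Sketch only: assumes 'encoding' is utf-8 and valid UTF-8 input.
" StrikeOnce, s:IsMark and s:STRIKE are hypothetical names.
let s:STRIKE = 0x0336

" Simplification: only U+0300..U+036F counts as combining here;
" combining characters from other blocks are treated as base characters.
function! s:IsMark(nr) abort
  return a:nr >= 0x0300 && a:nr <= 0x036F
endfunction

" Add U+0336 after every base character that lacks one, keeping any
" other combining characters in place.
function! StrikeOnce(str) abort
  let l:out = ''
  let l:pos = 0
  " l:pending is 1 while the current base character has no strike yet.
  let l:pending = 0
  while l:pos < len(a:str)
    let l:nr = char2nr(a:str[l:pos :])
    let l:cp = nr2char(l:nr)
    if !s:IsMark(l:nr)
      " A new base character starts: finish the previous one first.
      if l:pending
        let l:out .= nr2char(s:STRIKE)
      endif
      let l:pending = 1
    elseif l:nr == s:STRIKE
      let l:pending = 0
    endif
    let l:out .= l:cp
    let l:pos += len(l:cp)
  endwhile
  if l:pending
    let l:out .= nr2char(s:STRIKE)
  endif
  return l:out
endfunction
```

The strike is appended after any other marks on the same base character rather than directly after the base, which keeps the loop to a single pass.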