[vim] Jumping from current Unicode string to next/prev appearance

Janis Papanagnou

Dec 27, 2023, 8:52:58 PM
In Vim I frequently jump from string to the next equal string using the
commands '*' (forward search'n'jump) and '#' (backward search'n'jump).

With Unicode characters that doesn't always seem to work (at least not
by default).

In the following (UTF-8 encoded) test sample there is one subset of
Omega words where * and # work correctly and one where they don't
(starting with the cursor on the first letter of any word):

Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega

The difference is only the encoding of the first character of that
word (U+03A9 versus U+2126). For words with Ω = U+03A9 it works, but
not for words with Ω = U+2126.

Is there a way to fix or achieve that function for all UTF-8 encoded
words?

Janis

Eli the Bearded

Dec 27, 2023, 9:37:03 PM
In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
> In Vim I frequently jump from string to the next equal string using the
> commands '*' (forward search'n'jump) and '#' (backward search'n'jump).
>
> With Unicode characters that doesn't seem to always work (at least not
> per default).
>
> In the following (UTF-8 encoded) test sample there is one subset of
> Omega words where * and # works correctly and one where it doesn't
> (starting with the cursor on the first letter of any word)
>
> Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega

This is like complaining that a search for "MISS" does not also match
"МІЅЅ". They are completely different strings that just happen to look
alike with certain font choices. Some of those are "ohm sign", "Latin
small letter m", "Latin small letter e", "Latin small letter g", "Latin
small letter a" and the others are "Greek capital letter omega",
"Latin small letter m", "Latin small letter e", "Latin small letter g",
"Latin small letter a".

Your "difference is only the encoding" fails to grasp that Unicode is
semiotics aware, even if users might not be.

Elijah
------
https://www.unicode.org/reports/tr36/#visual_spoofing

Julieta Shem

Dec 27, 2023, 9:45:13 PM
Eli the Bearded <*@eli.users.panix.com> writes:

> In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
>> In Vim I frequently jump from string to the next equal string using the
>> commands '*' (forward search'n'jump) and '#' (backward search'n'jump).
>>
>> With Unicode characters that doesn't seem to always work (at least not
>> per default).
>>
>> In the following (UTF-8 encoded) test sample there is one subset of
>> Omega words where * and # works correctly and one where it doesn't
>> (starting with the cursor on the first letter of any word)
>>
>> Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
>
> This is like complaining that a search for "MISS" does not also match
> "МІЅЅ". They are completely different strings that just happen to look
> alike with certain font choices.

They look very much alike in ``Fira Code''.

> Some of those are "ohm sign", "Latin small letter m", "Latin small
> letter e", "Latin small letter g", "Latin small letter a" and the
> others are "Greek capital letter omega", "Latin small letter m",
> "Latin small letter e", "Latin small letter g", "Latin small letter
> a".
>
> Your "difference is only the encoding" fails to grasp that Unicode is
> semiotics aware, even if users might not be.

There's a package for the GNU EMACS that implements the search as the OP
desires. You can invoke it with saying

C-u 42 S E M I O T I C A W A R E RET C-c A I RET A W Y E A H RET

to the minibuffer. (Then press * and # as you wish.)

Janis Papanagnou

Dec 27, 2023, 10:40:50 PM
On 28.12.2023 03:36, Eli the Bearded wrote:
> In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
>> In Vim I frequently jump from string to the next equal string using the
>> commands '*' (forward search'n'jump) and '#' (backward search'n'jump).
>>
>> With Unicode characters that doesn't seem to always work (at least not
>> per default).
>>
>> In the following (UTF-8 encoded) test sample there is one subset of
>> Omega words where * and # works correctly and one where it doesn't
>> (starting with the cursor on the first letter of any word)
>>
>> Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
>
> This is like complaining that a search for "MISS" does not also match
> "МІЅЅ". They are completely different strings that just happen to look
> alike with certain font choices.

No, unfortunately you seem to have MISSed the point. It's not about
strings that look the same but differ. It's about different behavior of
the same Vim operations (* and #) on _two types_ of words.

Try to copy/paste the line into a Vim session, then move the cursor
onto the first character of the first word, then type * repeatedly.
Then do the same starting with the first character of the third word,
and observe the difference! - Tell me what you think about that.

(You can adjust the test-case to use these two letters in different
contexts, or work on single characters.)

Janis

Janis Papanagnou

Dec 27, 2023, 10:56:04 PM
On 28.12.2023 04:40, Janis Papanagnou wrote:
>
> Try to copy/paste the line into a Vim session, then move the cursor
> onto the first character of the first word, then type * repeatedly.
> Then do the same starting with the first character of the third word,
> and observe the difference! - Tell me what you think about that.

Here's the effect visualized, where ^ indicates the cursor position
after a '*' operation


Case 1 (cursor starting at first character of the _third_ word):

Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
^ ^ ^ ^

(All okay, the four matching words are addressed correctly.)


Case 2 (cursor starting at first character of the _first_ word):

Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
^ ^ ^ ^ first turn
^ ^ ^ ^ second turn

(Not okay: in all subsequent words the first character is skipped.)


This is what annoys me and where I am looking for a solution (or a
hint that this is, maybe, an unavoidable flaw).

Janis

Janis Papanagnou

Dec 27, 2023, 11:14:22 PM
I noticed that the effect does not depend on Unicode characters but
behaves similarly to this ASCII-only test case:

'help' 'help' 'help'

If the cursor starts at the first quote we see the same effect

'help' 'help' 'help'
^ ^ ^ first turn
^ ^ ^ second turn

The quote seems to be excluded from consideration of the * command,
and the cursor jumps to the next word part. - Can this be explained?

So one of the Unicode characters mentioned above is not considered
part of the word while the other one is. And only words seem to be
considered, at least in this case.

But on the other hand, I can navigate with * also within non-alpha
characters like

§%" §%" §%" §%"
^ ^ ^ ^

So this also works.

I'm not pleased by that behavior. Looks also inconsistent to me.

Janis

Eli the Bearded

Dec 28, 2023, 3:13:27 AM
In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
> Case 2 (cursor starting at first character of the _first_ word):
>
> Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega Ωmega
> ^ ^ ^ ^ first turn
> ^ ^ ^ ^ second turn
>

:help *

                                        *star* *E348* *E349*
*               Search forward for the [count]'th occurrence of the
                word nearest to the cursor.  The word used for the
                search is the first of:
                    1. the keyword under the cursor |'iskeyword'|
                    2. the first keyword after the cursor, in the
                       current line
                    ...

:help iskeyword
                                        *'iskeyword'* *'isk'*
'iskeyword' 'isk'   string (Vim default for MS-DOS and Win32:
                                "@,48-57,_,128-167,224-235"
                            otherwise:  "@,48-57,_,192-255"
                            Vi default: "@,48-57,_")
                    local to buffer
        Keywords are used in searching and recognizing with many commands:
        "w", "*", "[i", etc.  It is also used for "\k" in a |pattern|.  See
        'isfname' for a description of the format of this option.  For '@'
        characters above 255 check the "word" character class.
        For C programs you could use "a-z,A-Z,48-57,_,.,-,>".
        ...

I think it is a bug that "word" is not a link to somewhere in pattern.txt

In any case, it is clear that # and * recognize alphabetic characters
like Greek capital *letter* omega differently from non-alphabet symbol
characters like ohm *sign*. If you move along the line with "w" to jump
between "words" you see the differences. The # and * searches use word
boundaries, so word definitions are very important there.

You are still looking at an ohm sign and thinking of a letter which is
the trap of Unicode "look alikes", not something vim is doing wrong.

Elijah
------
has vim's * remapped to _ and nearly used that writing this

Janis Papanagnou

Dec 28, 2023, 10:54:26 AM
On 28.12.2023 09:13, Eli the Bearded wrote:
> [snip]
>
> In any case, it is clear that # and * recognize alphabetic characters
> like Greek capital *letter* omega differently from non-alphabet symbol
> characters like ohm *sign*. If you move along the line with "w" to jump
> between "words" you see the differences. The # and * searches use word
> boundaries, so word definitions are very important there.

Right.

>
> You are still looking at an ohm sign and thinking of a letter which is
> the trap of Unicode "look alikes", not something vim is doing wrong.

Erm, no. (I already explained elsethread that it's not about characters
that are looking alike; the issue turned out to not be about Unicode,
although it got apparent there. That's why I changed the test sample to
a plain ASCII test case.)

Your quotes (from the Vim help) help explain the behavior with the
'help' sample I posted: 'help' 'help' 'help'

I still think the behavior of Vim's * command is counterintuitive and
inconsistent. See this example (a file with two lines):

§%" §%" *+*+ §%" §%"
§%" a §%" a *+*+ §%" a §%" a

Starting from the first character of the first word we see the command
'*' jump words as depicted by the ^ symbols:

§%" §%" *+*+ §%" §%"
^ ^ ^ ^ # search-jumps on first line
§%" a §%" a *+*+ §%" a §%" a
^ ^ ^ ^ # continuing/changing on second line
^ ^ ^ ^

It means that * is first identifying the §%" string, and it continues
the search on the next line. But after it located the first §%" on the
second line it ad hoc changes the search pattern. - I would call that
undesired and inconsistent behavior.

We can "explain" (sort of) what happens. As in, say,
"If no alpha character is on the line * tries to match the next string
that matches the current one, but as soon as this search reaches or is
on a line that contains an alpha character the search pattern changes
and * jumps to the next alpha character on that line."
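
The behavior described above can be modelled roughly as follows. This is a sketch, not Vim's implementation: `is_keyword_char` is a crude stand-in for 'iskeyword' (it does not reproduce Vim's handling of characters above 255), and the fallback to a whitespace-delimited word when the line has no keyword is an assumption taken from the behavior observed on the first sample line. The point it illustrates is that the target is chosen anew from the cursor position on every press, so the pattern is not "changed ad hoc" so much as re-selected per line.

```python
from typing import Optional


def is_keyword_char(ch: str) -> bool:
    # Assumption: rough approximation of 'iskeyword' (letters, digits, "_").
    return ch.isalnum() or ch == "_"


def star_target(line: str, col: int) -> Optional[str]:
    """Return the text a '*'-like command would pick with the cursor at line[col]."""
    n = len(line)
    # 1. Keyword under the cursor: extend left and right over keyword characters.
    if col < n and is_keyword_char(line[col]):
        start = col
        while start > 0 and is_keyword_char(line[start - 1]):
            start -= 1
        end = col
        while end < n and is_keyword_char(line[end]):
            end += 1
        return line[start:end]
    # 2. First keyword after the cursor, in the current line.
    for i in range(col, n):
        if is_keyword_char(line[i]):
            end = i
            while end < n and is_keyword_char(line[end]):
                end += 1
            return line[i:end]
    # 3. Assumed fallback: the whitespace-delimited word under or after the cursor.
    i = col
    while i < n and line[i].isspace():
        i += 1
    if i >= n:
        return None
    start = i
    while start > 0 and not line[start - 1].isspace():
        start -= 1
    end = i
    while end < n and not line[end].isspace():
        end += 1
    return line[start:end]


if __name__ == "__main__":
    print(star_target('§%" §%" *+*+ §%" §%"', 0))          # no keyword on this line
    print(star_target('§%" a §%" a *+*+ §%" a §%" a', 0))  # keyword "a" follows the cursor
    print(star_target("'help' 'help' 'help'", 0))           # cursor on the quote
```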

Okay, it is as it is. But shouldn't that feature be straightened out? It's
not the first time that I missed a more coherent behavior in contexts
of non-alpha character strings, and I think that it would be generally
useful. - Is there, on the other hand, some sensible use-case for that
current [inconsistent] behavior (of ad hoc changing the pattern)?

Janis

Eli the Bearded

Dec 28, 2023, 8:53:38 PM
In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
> Is there, on the other hand, some sensible use-case for that
> current [inconsistent] behavior (of ad hoc changing the pattern)?

It is a keyword search tool, not a random object search tool. The word
boundaries should be the indicator.

Elijah
------
printf, eg, is different than sprintf

Janis Papanagnou

Dec 29, 2023, 10:36:46 AM
On 29.12.2023 02:53, Eli the Bearded wrote:
> In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
>> Is there, on the other hand, some sensible use-case for that
>> current [inconsistent] behavior (of ad hoc changing the pattern)?
>
> It is a keyword search tool, not a random object search tool.

Yes, obviously. And that's IMO an unnecessary restriction.
YMMV, of course.

And even as an artificially restricted "keyword search tool"
it's not working consistently if applied to the two lines of
test data that I posted.

I suppose there's little use in discussing that since it won't
change if not widely accepted as a useful generalization of
the * and # commands.

In my book it was certainly often a nuisance in the restricted
and inconsistent form and I would have appreciated it if it worked
also on other (non-alphanumeric) keywords (i.e. on strings).

> The word boundaries should be the indicator.

Janis

PS: Historically (IIRC), in Vi, there was just the # command
(but not the * which I saw later in Vim). A typical use was to
jump from a C function call backwards to find its declaration.
Application of Vi(m) broadened since then, and yet more useful
features and changes entered the Vim command base.

Eli the Bearded

Dec 30, 2023, 2:00:18 AM
In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
> PS: Historically (IIRC), in Vi, there was just the # command
> (but not the * which I saw later in Vim).

I do not believe you. For starters, nvi has a completely different
function bound to #, and nvi tries to be backwards compatible with vi.

> jump from a C function call backwards to find its declaration.
> Application of Vi(m) broadened since then, and yet more useful
> features and changes entered the Vim command base.

It occurs to me that you may like the boundary free versions of * and #:
prefix them with a g.

:noremap * g*
:noremap # g#
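
The difference between the bounded and the boundary-free search can be illustrated outside Vim. The sketch below uses Python's `re` module, with `\b` standing in for Vim's `\<` and `\>`; the two are similar but not identical, and the function names are made up for this example.

```python
import re


def star_matches(word, text):
    """Whole-word matches only, in the spirit of '*' (word boundaries on both sides)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return [m.start() for m in re.finditer(pattern, text)]


def gstar_matches(word, text):
    """Plain substring matches, in the spirit of 'g*' (no word boundaries)."""
    return [m.start() for m in re.finditer(re.escape(word), text)]


if __name__ == "__main__":
    text = "printf sprintf printf"
    print(star_matches("printf", text))   # skips the occurrence inside "sprintf"
    print(gstar_matches("printf", text))  # also finds it inside "sprintf"
```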

Elijah
------
uses very few of the g_ library of commands

Janis Papanagnou

Dec 30, 2023, 1:35:53 PM
On 30.12.2023 08:00, Eli the Bearded wrote:
> In comp.editors, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
>> PS: Historically (IIRC), in Vi, there was just the # command
>> (but not the * which I saw later in Vim).
>
> I do not believe you. For starters, nvi has a completely different
> function bound to #, and nvi tries to be backwards compatible with vi.

I don't think that the '#' command (with the current semantics) was in
the _original_ Vi. (If that is how you interpreted "historically"). I
observed the command # with the current behavior when I regularly used
Vi starting around 1990 on AIX (and HPUX). And I'm positive - since I
recall having looked for it - that in those days there was no
'*' (as counterpart that matches in the opposite direction). - But
please correct me if I am wrong.

>
>> jump from a C function call backwards to find its declaration.
>> Application of Vi(m) broadened since then, and yet more useful
>> features and changes entered the Vim command base.
>
> It occurs to me that you may like the boundary free versions of * and #:
> prefix them with a g.
>
> :noremap * g*
> :noremap # g#

I didn't know of the 'g' variants, but 'g*' seems to behave equivalently
to '*' on my two-line test sample; i.e. when reaching the second line
it jumps from the punctuation character block to the letter a.

§%" §%" *+*+ §%" §%"
^ ^ ^ ^
§%" a §%" a *+*+ §%" a §%" a
^ ^ ^ ^
^ ^ ^ ^

So 'g*' doesn't address the issue, and it is actually even worse since
without the \< and \> it then also matches other occurrences of 'a' in
the text.


I want to provide two more examples to explain my desire for a "better"
behavior with non-alpha character blocks.[*]

1) Matching (non-alpha) shell keywords (or other non-alpha constructs
that are so typical in shells)

f() {
: ${1:?}
}
: ${1:?}
echo "a: b"

Positioning at the first colon I want to find other standalone ones.

2) Matching ASN.1 identifiers (or other not pure-alpha identifiers)

direct-reference OBJECT IDENTIFIER OPTIONAL,
indirect-reference INTEGER OPTIONAL,

Positioning it in one of the "reference" substrings I want to find
the whole identifier (e.g. "direct-reference"), but not any string
with the substring reference.

In other words, keywords and identifiers (beyond C and the like)
generally have a broader definition, and a quick-match for non-alpha
strings would be very convenient as I regularly observe in various
editing contexts.

I am aware that we cannot cover all matching combinations - e.g. how
should "an-id: 'a value'" be parsed; it might get non-trivial - but
a quick-search for space-separated entities would already be very
convenient as I've often experienced in my editing contexts.

Vim already supports a lot of such settings (breakat, isfname, isident,
iskeyword, and yet more even language specifics), so maybe there's a
not too complex way to achieve that.
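
The wished-for "space-separated entity" search can be sketched with an ordinary regular expression. The Python below is only an illustration of the matching rule, not a Vim feature: a token counts as a hit when it has whitespace or a line edge on both sides. The function name `find_standalone` and the sample constants are made up for this example; the sample text is copied from the two cases above.

```python
import re


def find_standalone(token, text):
    """Return (line_number, column) pairs where token is delimited by whitespace."""
    # (?<!\S) and (?!\S): no non-blank character directly before or after the token.
    pattern = re.compile(r"(?<!\S)" + re.escape(token) + r"(?!\S)")
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in pattern.finditer(line):
            hits.append((lineno, m.start()))
    return hits


SHELL_SAMPLE = """f() {
: ${1:?}
}
: ${1:?}
echo "a: b"
"""

ASN1_SAMPLE = """direct-reference OBJECT IDENTIFIER OPTIONAL,
indirect-reference INTEGER OPTIONAL,
"""

if __name__ == "__main__":
    # Only the two standalone colons, not the ones in ${1:?} or "a: b".
    print(find_standalone(":", SHELL_SAMPLE))
    # The whole identifier, but not the tail of "indirect-reference".
    print(find_standalone("direct-reference", ASN1_SAMPLE))
```

In Vim itself the whitespace-delimited word under the cursor is available as expand('<cWORD>'), which could serve as the starting point for a custom mapping along these lines.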

Janis

[*] Note: Of course all searching can be done with regular search/regexp
but as I use * for quick match convenience I'd like to have it not only
for alpha sequences.
