LSP: cursor positioning on a multi-byte character with composing characters

Yegappan Lakshmanan

Jun 3, 2023, 6:06:47 PM
to vim_dev
Hi,

I am updating the Vim9 LSP plugin to support various position
encodings (utf-8, utf-16 and utf-32).
I ran into a problem with positioning the cursor on a multibyte
character with composing characters.
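
For readers unfamiliar with LSP position encodings, here is a minimal Python sketch (an illustration only, not code from the plugin) of how the column of the same character differs when measured in utf-8, utf-16 or utf-32 code units. The helper name `lsp_column` is hypothetical.

```python
def lsp_column(line: str, char_index: int, encoding: str) -> int:
    """Column of the code point at char_index, in code units of `encoding`.

    Hypothetical helper for illustration; char_index counts code points.
    """
    prefix = line[:char_index]
    if encoding == "utf-8":
        return len(prefix.encode("utf-8"))           # bytes
    if encoding == "utf-16":
        return len(prefix.encode("utf-16-le")) // 2  # 16-bit units
    if encoding == "utf-32":
        return len(prefix)                           # code points
    raise ValueError(f"unsupported position encoding: {encoding}")


# A line with two emoji (U+1F60A) before the "=" sign.
line = 'printf("\U0001F60A\U0001F60A = %d\\n", aVar);'
idx = line.index("=")  # code point index of "="

for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, lsp_column(line, idx, enc))
```

Each emoji takes four utf-8 bytes, two utf-16 units and one utf-32 unit, so the three encodings report different columns for the same "=" sign.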

The LSP plugin uses the Vim function setcursorcharpos() to position
the cursor. This function ignores composing characters. The LSP
server counts the composing characters separately from the base
character. So when using the character index returned by the LSP
server to position the cursor, the cursor is placed in an incorrect
column.

e.g.:

void fn(int aVar)
{
    printf("aVar = %d\n", aVar);
    printf("😊😊😊😊 = %d\n", aVar);
    printf("áb́áb́ = %d\n", aVar);
    printf("ą́ą́ą́ą́ = %d\n", aVar);
}

I have tried this test with the clangd, pyright and gopls language
servers, and all of them count the composing characters as separate
characters.

One approach to solve this issue is to add an optional argument to the
setcursorcharpos() function that specifies whether composing
characters are counted or ignored. The default would be to ignore
them, as the function does now. Another approach is to add a function
that takes a character offset that includes the composing characters
and computes the corresponding offset that ignores them.

Any suggestions?

Thanks,
Yegappan

Bram Moolenaar

Jun 3, 2023, 6:48:07 PM
to vim...@googlegroups.com, Yegappan Lakshmanan

Yegappan wrote:

> I am updating the Vim9 LSP plugin to support various position
> encodings (utf-8, utf-16 and utf-32).
> I ran into a problem with positioning the cursor on a multibyte
> character with composing characters.
>
> The LSP plugin uses the Vim function setcursorcharpos() to position
> the cursor. This function ignores composing characters. The LSP
> server counts the composing characters separately from
> the base character. So when using the character index returned by the
> LSP server to
> position the cursor, the cursor is placed in an incorrect column.
>
> e.g:
>
> void fn(int aVar)
> {
> printf("aVar = %d\n", aVar);
> printf("😊😊😊😊 = %d\n", aVar);
> printf("áb́áb́ = %d\n", aVar);
> printf("ą́ą́ą́ą́ = %d\n", aVar);
> }
>
> I have tried this test with clangd, pyright and gopls language servers
> and all of them count the
> composing characters as separate characters.
>
> One approach to solve this issue is to add an optional argument to the
> setcursorcharpos() function
> that either counts or ignores composing characters. The default is to
> ignore the composing
> characters. Another approach is to add a function that computes the
> character offset ignoring the composing characters from a character
> offset that includes the composing characters.
>
> Any suggestions?

Whether to count composing characters separately or not applies to many
functions. Adding a flag to each function to specify how composing
characters are to be handled is going to require a lot of changes. And
even for setcursorcharpos() I don't see a good way to add this flag.

Assuming we have the text, using a separate function to ignore composing
characters would be a separate step and a universal solution. I suppose
it could be something like:

idx_without = charpos_dropcomposing({text}, {idx_with})

It may not be needed now, but the opposite should be possible:

idx_with = charpos_addcomposing({text}, {idx_without})

Hopefully we can think of better (shorter) names.
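
To make the two proposed conversions concrete, here is a minimal Python sketch of what such a pair could compute. It is not the Vim implementation: the names `drop_composing` and `add_composing` are hypothetical, and `unicodedata.combining()` is only an approximation of Vim's notion of a composing character.

```python
import unicodedata


def _starts_char(text: str, i: int) -> bool:
    """True if text[i] starts a new character (a base, or the first code point)."""
    return i == 0 or unicodedata.combining(text[i]) == 0


def drop_composing(text: str, idx_with: int) -> int:
    """Index counting composing chars separately -> index ignoring them."""
    idx_without = -1
    for i in range(len(text)):
        if _starts_char(text, i):
            idx_without += 1
        if i == idx_with:
            return idx_without
    return idx_without + 1  # idx_with is at or past the end of the text


def add_composing(text: str, idx_without: int) -> int:
    """Index ignoring composing chars -> index counting them separately."""
    current = -1
    for i in range(len(text)):
        if _starts_char(text, i):
            current += 1
            if current == idx_without:
                return i
    return len(text)  # idx_without is at or past the end of the text


text = "a\u0301b\u0301 = 1"          # "a" + U+0301, "b" + U+0301, " = 1"
print(drop_composing(text, 5))       # position of "=" without composing chars
print(add_composing(text, 3))        # and back again
```

An index that points at a composing character itself maps to the index of its base character, which is one possible choice; the thread does not specify this case.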

It can possibly already be done with a combination of byteidxcomp() and
charidx(), since these have a choice of counting composing characters or
not. That does require two function calls though.
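
The two-step route goes through a byte offset: first turn the index that counts composing characters into a byte offset, then turn that byte offset into an index that does not count them. The Python sketch below mimics the idea only. The two helpers are hypothetical stand-ins, not the Vim functions, and they assume UTF-8 text and use `unicodedata.combining()` as an approximation of Vim's composing characters.

```python
import unicodedata


def byte_index_with_composing(text: str, idx_with: int) -> int:
    """Step 1: index counting composing chars separately -> UTF-8 byte offset."""
    return len(text[:idx_with].encode("utf-8"))


def char_index_from_byte(text: str, byte_idx: int, count_composing: bool = False) -> int:
    """Step 2: UTF-8 byte offset -> character index."""
    data = text.encode("utf-8")
    # A byte offset inside a multibyte sequence resolves to the containing code point.
    cp_idx = len(data[:byte_idx].decode("utf-8", errors="ignore"))
    if count_composing:
        return cp_idx
    bases = sum(1 for ch in text[:cp_idx] if unicodedata.combining(ch) == 0)
    # A composing character belongs to the base character before it.
    if cp_idx < len(text) and unicodedata.combining(text[cp_idx]) != 0:
        return max(bases - 1, 0)
    return bases


text = "a\u0301b\u0301 = 1"                  # "=" is code point 5
byte_idx = byte_index_with_composing(text, 5)
print(byte_idx)                              # byte offset of "="
print(char_index_from_byte(text, byte_idx))  # index of "=" ignoring composing chars
```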

--
Corduroy pillows: They're making headlines!

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// \\\
\\\ sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Yegappan Lakshmanan

Jun 7, 2023, 1:56:55 AM
to Bram Moolenaar, vim...@googlegroups.com
Hi Bram,

Yes. I ended up implementing the two helper functions that you suggested
above, to convert a character index between the forms with and without
composing characters:

https://github.com/yegappan/lsp/blob/main/autoload/lsp/util.vim#L189
https://github.com/yegappan/lsp/blob/main/autoload/lsp/util.vim#L224

Using these two functions, the Vim9 LSP plugin can now properly support
multibyte characters with composing characters. But as you mentioned above,
this involves calling two functions (byteidxcomp() and charidx()). I will
create a PR to add the two functions you described above, so that this
can be done more efficiently.

Regards,
Yegappan

Yegappan Lakshmanan

Jun 10, 2023, 12:05:31 AM
to Bram Moolenaar, vim...@googlegroups.com
Hi Bram,

I have created PR https://github.com/vim/vim/pull/12513 to add these two
new functions. Should we merge them into a single function with an
argument that specifies whether composing characters are counted?

Regards,
Yegappan

Bram Moolenaar

Jun 10, 2023, 9:26:38 AM
to vim...@googlegroups.com, Yegappan Lakshmanan
Thanks for working on this. My main concern at first is that the user
will be confused by seeing three functions:

charidx({string}, {idx} [, {countcc} [, {utf16}]])
charidx_addcc({string}, {idx})
charidx_dropcc({string}, {idx})

Only by reading the details can one find out that the {idx} of
charidx() is a byte index, while for the other two it is a character
index. Changing the argument name to {byteidx} would help. We may have
to do that for other functions as well, to keep consistency.

Having the {countcc} argument for charidx() and a separate function name
for the other two is confusing. Also because "addcc" and "dropcc" can
be seen as an alternative for {countcc} (and that's not really
incorrect), but there is no hint that the {idx} argument is used
differently.

Alternatively there would be a function that does have the {countcc}
argument and the name indicating that {idx} is a character index:

charidx_XXX({string}, {idx}, {countcc})

However, is this {countcc} argument really doing the same thing? The
help for charidx() says:

When {countcc} is omitted or |FALSE|, then composing characters
are not counted separately, their byte length is added to the
preceding base character.
When {countcc} is |TRUE|, then composing characters are
counted as separate characters.

We can't use exactly the same for charidx_XXX(), since the index is not
in bytes. And using a character index, we would have to mention whether
composing characters are counted separately. This gets confusing: an
argument {countcc} that actually means something else, depending on
whether you look at the input or the result.

It's probably better to use two separate functions. I hope we find
better names though.

The help for the new functions should be extra clear, since it's easy to
misunderstand. We can discuss that on the PR.

--
Drink wet cement and get really stoned.